Calculating probabilities

A good way to start thinking about calculating probabilities is:

\[ \frac{\mbox{number of times an event can occurs}}{\mbox{size of sample space}} \]

There are two methods for calculating probabilities:

The theoretical probability can be solved mathematically or numerically. Sometimes the math needed to solve a problem is too complex, or intractable that solving it using tools such as algebra and calculus is impossible or relies on certain theories and understandings that you haven’t encountered yet.

In this class we will explore probabilities both numerically (calculating theoretical probabilities), and also estimating probabilities using simulation. Simulation “simulates” a mathematical problem by using repeated sampling from a sample space and observing what occurs.

Example

How would you find the probability of rolling a 4 on a six sided die?

Theoretical Probability:

Count the number of ways a 4 can be rolled in a six sided die and divide by the number of all possible outcomes.

Estimated Probability:

Roll a 6-sided die 100 times and count the number of times a 4 appears. Then divide that number by 100 to estimate the probability that the die will roll a 4.

Example using R

Continuing with the deck of cards example, find the probability of selected events using R code.

numbers <- rep(c(1:10, "J", "Q", "K", "A"), 4)
suits <- rep(c("H", "C", "D", "S"), each = 13)
deck <- paste0(numbers, suits) # Sample Space
aces <- c("AH", "AC", "AD", "AS") # Event A
hearts <- paste0(c(1:10, "J", "Q", "K", "A"), "H") # Event B
  • \(P(A)\)
length(aces) / length(deck)
## [1] 0.07142857
  • \(A\cup B\):
(aces.and.hearts <- union(aces, hearts))
##  [1] "AH"  "AC"  "AD"  "AS"  "1H"  "2H"  "3H"  "4H"  "5H"  "6H"  "7H"  "8H" 
## [13] "9H"  "10H" "JH"  "QH"  "KH"
length(aces.and.hearts) / length(deck)
## [1] 0.3035714

Simulations with sample()

The R function sample can be used to simulate sampling from a bunch of outcomes. Think of it as putting names in a hat and individually drawing the names out of the hat. When you use the function sample you need to give a vector to sample from and also the sample size. The default is to sample without replacement with each item having equal probability of being selected.

❓ What does sampling without replacement mean?

once the outcome has been observed, it is removed from the sample space before the next draw. That is, the same outcome can’t be drawn more than once.

Example

Take a sample of 2 from the numbers 1 through 10 without replacement.

x <- 1:10   # Define the space
sample(x, size = 2) # Sample 2 items from the space
## [1] 7 9

You try it

Sample without replacement three months from the list of months

months <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug",
            "Sept","Oct","Nov","Dec")
sample(months,size = 3)
## [1] "Nov" "Jan" "May"

The defaults can be changed. For instance, if you want to sample with replacement you would just add replace=TRUE to the command. This allows you to sample more values than the size of the sample space

Example

x <- 1:5 
sample(x, size = 10, replace = TRUE)
##  [1] 5 2 1 3 4 4 2 1 2 2

You try it

Create a sample of 10 random days of the week.

days <- c("m", "t", "w", "r", "f", "sa", "su")
sample(days,size = 10, replace = TRUE)
##  [1] "m"  "t"  "r"  "f"  "w"  "su" "w"  "r"  "t"  "su"

Using simulation to compute probabilities

The goal of simulation is to compute probabilities of an event. We can do this in three steps:

  1. Simulate the experiment many times storing the results in a vector of observations.
  2. Use a logical statement to test whether or not each element in the vector of observations meets a criteria defined by an “event”. Store these results in a new vector.
  3. Compute the proportion of TRUEs in this new vector to figure out the probability.

Example

Suppose that two six-sided dice are rolled and the numbers appearing on the dice are added. Calculate the probabilities of the two events listed below using both theoretical methods and simulation.

  • Event D: The sum of the two dice is 6.
  • Event E: At least 1 die is a 2.

First we simulate the results from two die rolls, and the sample space that represents the sum of the two numbers added together.

  1. Create two vectors, each length 10,000 by sampling the numbers from 1 to 6. These represent the rolls on each of two dice.
die_1 <- sample(x=1:6, size=10000, replace=TRUE)
die_1[1:10] # look at what the first 10 rolls looks like for die 1
##  [1] 1 5 2 3 5 3 3 1 1 5
die_2 <- sample(x=1:6, size=10000, replace=TRUE)
die_2[1:10] # look at what the first 10 rolls looks like for die 2
##  [1] 4 5 5 4 6 5 3 4 6 4
  1. Add these two vectors of dice results together to create the sample space.
sum.of.2.dice <- die_1 + die_2
sum.of.2.dice[1:10] # confirm they add up as intended
##  [1]  5 10  7  7 11  8  6  5  7  9

Let’s look at event D: The sum of the two dice is 6.

Theoretical Probability: Using the sample space defined in Example 2.8 in the textbook, there are 5 ways the sum of two dice equals 6, out of 36 total combinations. So \(P(D) = 5/36 = 0.1389\)

Simulation: Use a logical statement to identify if sum.of.2.dice is equal to 6, and compare the TRUE and FALSE results to the original values. (trust but verify)

D <- sum.of.2.dice == 6
D[1:10] #just checking
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

Now calculate \(P(D)\) as:

sum(D)/length(D)
## [1] 0.1388
mean(D) # exact same formula
## [1] 0.1388

Let’s look at event E: At least 1 die is a 2.

Theoretical: Again using the sample space diagrammed out in the textbook, there are 11 ways either die 1 or die 2 will roll a 2, and this is out of 36 total possibilities, so \(P(E) = 11/36 = 0.306\)

Simulation: Create our vector of TRUE and FALSES, and take the mean.

E <- die_1==2 | die_2==2
mean(E)
## [1] 0.3054

You try it

Roll three six-sided dice and add all the face up numbers. Use simulation to estimate the probability that the sum of the three dice is at least 10.

die_1 <- sample(x=1:6, size=10000, replace=TRUE)
die_2 <- sample(x=1:6, size=10000, replace=TRUE)
die_3 <- sample(x=1:6, size=10000, replace=TRUE)

sum3d6 <- die_1 + die_2 + die_3

gteq10 <- (sum3d6>=10)
mean(gteq10)
## [1] 0.6294

Using replicate to repeat experiments

What if we wanted to repeat an experiment several times? We can keep clicking the run button and record the results each time. However, the replicate() function will do this for us and much more efficiently. To use the function replicate we just wrap the simulation code in brackets and tell R how many times we want to repeat (or replicate) the experiment.

Example

Simulate rolling a dice 7 times and computing the sum of all rolls and recording if the sum is greater then 30.

dice <- sample(1:6, size=7,replace=TRUE)
sum(dice)>30
## [1] FALSE

Let’s run this experiment 5 times.

replicate(n=5, {
  dice <- sample(1:6, size=7,replace=TRUE)
  sum(dice)>30
  })
## [1] FALSE FALSE FALSE  TRUE FALSE

We will almost always want to replicate things a large number of times, say \(n=10000\). We then store the output in a vector.

results_dice <- replicate(n = 10000,{
  dice <- sample(1:6, size=7,replace=TRUE)
  sum(dice)>30
  })

Now, calculate the probability of rolling a die 7 times and getting a sum larger than 30.

mean(results_dice)
## [1] 0.088

The following sequence is how you should approach writing code that uses the function replicate:

  1. Write code that performs the experiment a single time.
  2. Replicate the experiment a small number of times and check the results.
  3. Replicate the experiment a large number of times and store the results into a vector.
  4. Compute probability using the mean of the results vector.

Purpose of replication?

Why repeat experiments thousands of times?

To investigate the value of the parameter being estimated converges to the true probability in a specific setting.

Example 2.13 in Speegle: https://mathstat.slu.edu/~speegle/_book/probchapter.html#fig:llnplot

You try it

For both questions, compute the theoretical probability and then use simulation to confirm your results. Write all your work for both theoretical AND your simulation code in the space below.

  1. If two die are rolled, what is the probability that the difference between the two numbers is less than 3?

Simulation

Step 1: Write code that performs the experiment a single time. You may need to use the function abs which takes the absolute value of a number.

one_roll <- sample(1:6, size = 2, replace=TRUE)
abs(one_roll[1] - one_roll[2]) < 3
## [1] TRUE

Step 2: Replicate the above experiment a small number of times, say 20

replicate(20,{
  one_roll <- sample(1:6,2,replace=TRUE)
  abs(one_roll[1]-one_roll[2])<3
  })
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
## [13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Step 3: Up the replicates and store them in a vector

diff_2d6_lt3 <- replicate(10000,{
  one_roll <- sample(1:6,2,replace=TRUE)
  abs(one_roll[1]-one_roll[2])<3
  })

Step 4: Take mean of results to determine probability

mean(diff_2d6_lt3)
## [1] 0.6642

Theoretical

With two six-sided dice, there are 36 possibilities.

(two.dice <- expand.grid(1:6, 1:6)) # create all possibilities
##    Var1 Var2
## 1     1    1
## 2     2    1
## 3     3    1
## 4     4    1
## 5     5    1
## 6     6    1
## 7     1    2
## 8     2    2
## 9     3    2
## 10    4    2
## 11    5    2
## 12    6    2
## 13    1    3
## 14    2    3
## 15    3    3
## 16    4    3
## 17    5    3
## 18    6    3
## 19    1    4
## 20    2    4
## 21    3    4
## 22    4    4
## 23    5    4
## 24    6    4
## 25    1    5
## 26    2    5
## 27    3    5
## 28    4    5
## 29    5    5
## 30    6    5
## 31    1    6
## 32    2    6
## 33    3    6
## 34    4    6
## 35    5    6
## 36    6    6
two.dice[,3] <- abs(two.dice[,1] - two.dice[,2]) # add a column of the difference
two.dice # visualize
##    Var1 Var2 V3
## 1     1    1  0
## 2     2    1  1
## 3     3    1  2
## 4     4    1  3
## 5     5    1  4
## 6     6    1  5
## 7     1    2  1
## 8     2    2  0
## 9     3    2  1
## 10    4    2  2
## 11    5    2  3
## 12    6    2  4
## 13    1    3  2
## 14    2    3  1
## 15    3    3  0
## 16    4    3  1
## 17    5    3  2
## 18    6    3  3
## 19    1    4  3
## 20    2    4  2
## 21    3    4  1
## 22    4    4  0
## 23    5    4  1
## 24    6    4  2
## 25    1    5  4
## 26    2    5  3
## 27    3    5  2
## 28    4    5  1
## 29    5    5  0
## 30    6    5  1
## 31    1    6  5
## 32    2    6  4
## 33    3    6  3
## 34    4    6  2
## 35    5    6  1
## 36    6    6  0
sum(two.dice[,3] <3) # count the number of elements in the event
## [1] 24
mean(two.dice[,3] <3) # calculate the proportion
## [1] 0.6666667
  1. A fair coin is repeatedly tossed ten times. Compute the probability that the last three coin tosses results in heads. (Hint, review Example 2.13 in the textbook for an example)

Simulation

results_3H <- replicate( 10000,{
  cointoss <- sample(c("H","T"), 10, replace=TRUE)
  sum(cointoss[8:10]=="H")==3
  })
mean(results_3H)
## [1] 0.128

Theoretical

Let’s simplify and only consider all possible options for 3 coins. Let h mean heads was observed and t means a tail was observed. The sample space then is:

\[S = \{hhh, hht, hth, thh, htt, tht, tth, ttt\}\]

and the event “all three coins are heads” is \(A = \{hhh\}\).

So the theoretical probability is 1/8 = .125. Our simulated result was pretty close!