A good way to start thinking about calculating probabilities is:
\[ \frac{\mbox{number of times an event can occurs}}{\mbox{size of sample space}} \]
There are two methods for calculating probabilities:
The theoretical probability can be solved mathematically or numerically. Sometimes the math needed to solve a problem is too complex, or intractable that solving it using tools such as algebra and calculus is impossible or relies on certain theories and understandings that you haven’t encountered yet.
In this class we will explore probabilities both numerically (calculating theoretical probabilities), and also estimating probabilities using simulation. Simulation “simulates” a mathematical problem by using repeated sampling from a sample space and observing what occurs.
How would you find the probability of rolling a 4 on a six sided die?
Theoretical Probability:
Count the number of ways a 4 can be rolled in a six sided die and divide by the number of all possible outcomes.
Estimated Probability:
Roll a 6-sided die 100 times and count the number of times a 4 appears. Then divide that number by 100 to estimate the probability that the die will roll a 4.
Continuing with the deck of cards example, find the probability of selected events using R code.
<- rep(c(1:10, "J", "Q", "K", "A"), 4)
numbers <- rep(c("H", "C", "D", "S"), each = 13)
suits <- paste0(numbers, suits) # Sample Space
deck <- c("AH", "AC", "AD", "AS") # Event A
aces <- paste0(c(1:10, "J", "Q", "K", "A"), "H") # Event B hearts
length(aces) / length(deck)
## [1] 0.07142857
<- union(aces, hearts)) (aces.and.hearts
## [1] "AH" "AC" "AD" "AS" "1H" "2H" "3H" "4H" "5H" "6H" "7H" "8H"
## [13] "9H" "10H" "JH" "QH" "KH"
length(aces.and.hearts) / length(deck)
## [1] 0.3035714
sample()
The R
function sample
can be used to simulate sampling from a bunch of outcomes. Think of it as putting names in a hat and individually drawing the names out of the hat. When you use the function sample
you need to give a vector to sample from and also the sample size. The default is to sample without replacement with each item having equal probability of being selected.
❓ What does sampling without replacement mean?
once the outcome has been observed, it is removed from the sample space before the next draw. That is, the same outcome can’t be drawn more than once.
Take a sample of 2 from the numbers 1 through 10 without replacement.
<- 1:10 # Define the space
x sample(x, size = 2) # Sample 2 items from the space
## [1] 7 9
Sample without replacement three months from the list of months
<- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug",
months "Sept","Oct","Nov","Dec")
sample(months,size = 3)
## [1] "Nov" "Jan" "May"
The defaults can be changed. For instance, if you want to sample with replacement you would just add replace=TRUE
to the command. This allows you to sample more values than the size of the sample space
<- 1:5
x sample(x, size = 10, replace = TRUE)
## [1] 5 2 1 3 4 4 2 1 2 2
Create a sample of 10 random days of the week.
<- c("m", "t", "w", "r", "f", "sa", "su")
days sample(days,size = 10, replace = TRUE)
## [1] "m" "t" "r" "f" "w" "su" "w" "r" "t" "su"
The goal of simulation is to compute probabilities of an event. We can do this in three steps:
TRUE
s in this new vector to figure out the probability.Suppose that two six-sided dice are rolled and the numbers appearing on the dice are added. Calculate the probabilities of the two events listed below using both theoretical methods and simulation.
First we simulate the results from two die rolls, and the sample space that represents the sum of the two numbers added together.
<- sample(x=1:6, size=10000, replace=TRUE)
die_1 1:10] # look at what the first 10 rolls looks like for die 1 die_1[
## [1] 1 5 2 3 5 3 3 1 1 5
<- sample(x=1:6, size=10000, replace=TRUE)
die_2 1:10] # look at what the first 10 rolls looks like for die 2 die_2[
## [1] 4 5 5 4 6 5 3 4 6 4
2.dice <- die_1 + die_2
sum.of.2.dice[1:10] # confirm they add up as intended sum.of.
## [1] 5 10 7 7 11 8 6 5 7 9
Let’s look at event D: The sum of the two dice is 6.
Theoretical Probability: Using the sample space defined in Example 2.8 in the textbook, there are 5 ways the sum of two dice equals 6, out of 36 total combinations. So \(P(D) = 5/36 = 0.1389\)
Simulation: Use a logical statement to identify if sum.of.2.dice
is equal to 6, and compare the TRUE and FALSE results to the original values. (trust but verify)
<- sum.of.2.dice == 6
D 1:10] #just checking D[
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Now calculate \(P(D)\) as:
sum(D)/length(D)
## [1] 0.1388
mean(D) # exact same formula
## [1] 0.1388
Let’s look at event E: At least 1 die is a 2.
Theoretical: Again using the sample space diagrammed out in the textbook, there are 11 ways either die 1 or die 2 will roll a 2, and this is out of 36 total possibilities, so \(P(E) = 11/36 = 0.306\)
Simulation: Create our vector of TRUE and FALSES, and take the mean.
<- die_1==2 | die_2==2
E mean(E)
## [1] 0.3054
Roll three six-sided dice and add all the face up numbers. Use simulation to estimate the probability that the sum of the three dice is at least 10.
<- sample(x=1:6, size=10000, replace=TRUE)
die_1 <- sample(x=1:6, size=10000, replace=TRUE)
die_2 <- sample(x=1:6, size=10000, replace=TRUE)
die_3
<- die_1 + die_2 + die_3
sum3d6
<- (sum3d6>=10)
gteq10 mean(gteq10)
## [1] 0.6294
replicate
to repeat experimentsWhat if we wanted to repeat an experiment several times? We can keep clicking the run button and record the results each time. However, the replicate()
function will do this for us and much more efficiently. To use the function replicate
we just wrap the simulation code in brackets and tell R how many times we want to repeat (or replicate) the experiment.
Simulate rolling a dice 7 times and computing the sum of all rolls and recording if the sum is greater then 30.
<- sample(1:6, size=7,replace=TRUE)
dice sum(dice)>30
## [1] FALSE
Let’s run this experiment 5 times.
replicate(n=5, {
<- sample(1:6, size=7,replace=TRUE)
dice sum(dice)>30
})
## [1] FALSE FALSE FALSE TRUE FALSE
We will almost always want to replicate things a large number of times, say \(n=10000\). We then store the output in a vector.
<- replicate(n = 10000,{
results_dice <- sample(1:6, size=7,replace=TRUE)
dice sum(dice)>30
})
Now, calculate the probability of rolling a die 7 times and getting a sum larger than 30.
mean(results_dice)
## [1] 0.088
The following sequence is how you should approach writing code that uses the function replicate
:
mean
of the results vector.Why repeat experiments thousands of times?
To investigate the value of the parameter being estimated converges to the true probability in a specific setting.
Example 2.13 in Speegle: https://mathstat.slu.edu/~speegle/_book/probchapter.html#fig:llnplot
For both questions, compute the theoretical probability and then use simulation to confirm your results. Write all your work for both theoretical AND your simulation code in the space below.
Simulation
Step 1: Write code that performs the experiment a single time. You may need to use the function abs
which takes the absolute value of a number.
<- sample(1:6, size = 2, replace=TRUE)
one_roll abs(one_roll[1] - one_roll[2]) < 3
## [1] TRUE
Step 2: Replicate the above experiment a small number of times, say 20
replicate(20,{
<- sample(1:6,2,replace=TRUE)
one_roll abs(one_roll[1]-one_roll[2])<3
})
## [1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [13] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Step 3: Up the replicates and store them in a vector
<- replicate(10000,{
diff_2d6_lt3 <- sample(1:6,2,replace=TRUE)
one_roll abs(one_roll[1]-one_roll[2])<3
})
Step 4: Take mean of results to determine probability
mean(diff_2d6_lt3)
## [1] 0.6642
Theoretical
With two six-sided dice, there are 36 possibilities.
<- expand.grid(1:6, 1:6)) # create all possibilities (two.dice
## Var1 Var2
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 1 2
## 8 2 2
## 9 3 2
## 10 4 2
## 11 5 2
## 12 6 2
## 13 1 3
## 14 2 3
## 15 3 3
## 16 4 3
## 17 5 3
## 18 6 3
## 19 1 4
## 20 2 4
## 21 3 4
## 22 4 4
## 23 5 4
## 24 6 4
## 25 1 5
## 26 2 5
## 27 3 5
## 28 4 5
## 29 5 5
## 30 6 5
## 31 1 6
## 32 2 6
## 33 3 6
## 34 4 6
## 35 5 6
## 36 6 6
3] <- abs(two.dice[,1] - two.dice[,2]) # add a column of the difference
two.dice[,# visualize two.dice
## Var1 Var2 V3
## 1 1 1 0
## 2 2 1 1
## 3 3 1 2
## 4 4 1 3
## 5 5 1 4
## 6 6 1 5
## 7 1 2 1
## 8 2 2 0
## 9 3 2 1
## 10 4 2 2
## 11 5 2 3
## 12 6 2 4
## 13 1 3 2
## 14 2 3 1
## 15 3 3 0
## 16 4 3 1
## 17 5 3 2
## 18 6 3 3
## 19 1 4 3
## 20 2 4 2
## 21 3 4 1
## 22 4 4 0
## 23 5 4 1
## 24 6 4 2
## 25 1 5 4
## 26 2 5 3
## 27 3 5 2
## 28 4 5 1
## 29 5 5 0
## 30 6 5 1
## 31 1 6 5
## 32 2 6 4
## 33 3 6 3
## 34 4 6 2
## 35 5 6 1
## 36 6 6 0
sum(two.dice[,3] <3) # count the number of elements in the event
## [1] 24
mean(two.dice[,3] <3) # calculate the proportion
## [1] 0.6666667
Simulation
<- replicate( 10000,{
results_3H <- sample(c("H","T"), 10, replace=TRUE)
cointoss sum(cointoss[8:10]=="H")==3
})mean(results_3H)
## [1] 0.128
Theoretical
Let’s simplify and only consider all possible options for 3 coins. Let h
mean heads was observed and t
means a tail was observed. The sample space then is:
\[S = \{hhh, hht, hth, thh, htt, tht, tth, ttt\}\]
and the event “all three coins are heads” is \(A = \{hhh\}\).
So the theoretical probability is 1/8 = .125. Our simulated result was pretty close!