Consider a random experiment with sample space \(S\). A function that assigns one and only one real number to each element of \(S\) is called a random variable.
When the sample space has only two outcomes, the random variable is called a binary random variable. The two values chosen for a binary random variable are usually 0 and 1. For example, if each outcome of an experiment is either a dog or a cat, we could let \(Y=0\) denote a dog and \(Y=1\) denote a cat.
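As a small illustration (a hypothetical sketch, not an example from the text), we can represent such a binary random variable in R by mapping each outcome in the sample space to a number:

animals <- sample(c("dog", "cat"), size = 10, replace = TRUE)  # ten hypothetical outcomes
Y <- ifelse(animals == "cat", 1, 0)  # the random variable assigns 1 to cats and 0 to dogs
Y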
When a probability distribution has been specified on the sample space of an experiment, we can determine a probability distribution for the possible values of each random variable \(X\).
This section focuses on probability distributions for discrete random variables. We say that \(X\) has a discrete distribution if \(X\) can take only a finite number \(k\) of different values \(x_{1},x_{2},\dots,x_{k}\), or at most an infinite sequence of different values \(x_{1},x_{2},\dots\). Random variables that can take any value in an interval are called continuous and will be discussed in a later chapter. Working with discrete random variables requires summation, while continuous random variables require integration.
Discrete variables usually take integer values and represent a count of something, while continuous variables take values in an interval of real numbers and often measure something.
A discrete random variable is a variable that takes integer values and is characterized by a probability mass function (pmf). The pmf \(p\) of a random variable \(X\) is given by:
\[ p(x) = P(X=x) \]
The above equation can be read as: the probability that the random variable \(X\) is equal to some value \(x\). The pmf satisfies two properties: \(p(x) \geq 0\) for every value \(x\), and \(\sum_{x} p(x) = 1\).
The term probability distribution is a more generic term that describes the probabilities for each different value a random variable can take on. This holds for both discrete and continuous random variables. We will use the term probability distribution for all random variables, but the pmf is specific to discrete random variables, and pdf (chapter 4) is specific to continuous random variables.
Consider a crooked die where the cube has been shortened in the one-six direction. This has the effect that 1’s and 6’s each have probability 1/4 of being rolled, while the other four faces each have probability 1/8.
Let \(X\) be the number on the face-up side of the die after it is rolled.
x | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
p(x) | 1/4 | 1/8 | 1/8 | 1/8 | 1/8 | 1/4 |
This is a valid pmf because all of the probabilities are non-negative and they sum to 1.
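As a quick check (a small sketch added here, not part of the original example), we can verify both pmf properties in R:

crooked <- c(1/4, 1/8, 1/8, 1/8, 1/8, 1/4)  # p(x) for x = 1, 2, ..., 6
all(crooked >= 0)  # every probability is non-negative
sum(crooked)       # the probabilities sum to 1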
Suppose you roll 2 dice. Let \(X\) be the sum of the two dice. Write out the pmf. Don’t forget you can refer to Example 2.8 in the textbook to visualize the sample space.
x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
p(x) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
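Although the exercise asks you to write the pmf out by hand, the exact probabilities can also be computed in R by enumerating all 36 equally likely outcomes (a sketch added here; the object names are ours):

die <- 1:6
all_sums <- outer(die, die, "+")  # 6-by-6 grid of every possible sum of two dice
proportions(table(all_sums))      # exact pmf of the sum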
In the cases we’ve encountered so far, the sample space and the values of the random variable have been discrete, that is, whole numbers. We will get into continuous random variables in the next chapter.
Suppose you roll 2 dice. Let \(X\) be the sum of the two dice. Use simulation to estimate the probability distribution.
die <- 1:6
d1 <- sample(die, 1000, replace = TRUE)
d2 <- sample(die, 1000, replace = TRUE)
sum.2d6 <- d1 + d2
The estimated pmf of \(X\) is:
proportions(table(sum.2d6))
## sum.2d6
## 2 3 4 5 6 7 8 9 10 11 12
## 0.027 0.052 0.081 0.106 0.140 0.175 0.128 0.107 0.090 0.061 0.033
We can use the function plot to plot the estimate of the pmf using the following code.
plot(proportions(table(sum.2d6)),
main="Sum of two dice", ylab="Probability")
Let \(X\) be the number of heads observed when three coins are tossed.
x | 0 | 1 | 2 | 3 |
---|---|---|---|---|
p(x) | 1/8 | 3/8 | 3/8 | 1/8 |
nheads <- replicate(10000, {
  coin3 <- sample(c("H", "T"), size=3, replace=TRUE)
  sum(coin3=="H")
})
proportions(table(nheads))
## nheads
## 0 1 2 3
## 0.1226 0.3765 0.3764 0.1245
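For comparison (an added note, not part of the original solution), the exact probabilities in the table above can be computed directly from the number of ways to choose which tosses are heads:

choose(3, 0:3) / 2^3  # exact P(X = 0), ..., P(X = 3): 1/8, 3/8, 3/8, 1/8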
Consider drawing two balls without replacement from a bag of seven balls numbered 1 through 7, and let \(X\) be the sum of the two numbers drawn. We can estimate the pmf of \(X\) by simulation.
sum2balls <- replicate(10000, {
  balls <- sample(1:7, 2, replace=FALSE)
  x <- sum(balls)
  x
})
proportions(table(sum2balls))
## sum2balls
## 3 4 5 6 7 8 9 10 11 12 13
## 0.0439 0.0484 0.0917 0.0987 0.1411 0.1441 0.1409 0.0985 0.1001 0.0465 0.0461
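The exact pmf can also be computed by listing all 21 equally likely pairs of distinct balls (a sketch added for comparison; the object name pairs is ours):

pairs <- combn(1:7, 2)              # all 21 unordered pairs of distinct balls
proportions(table(colSums(pairs)))  # exact pmf of the sum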
What are the least likely outcomes of \(X\)?
The least likely outcomes are \(X=3\), \(X=4\), \(X=12\), and \(X=13\); each of these sums can be formed from only one pair of balls, so each has probability 1/21.
Suppose you have a bag full of marbles; 50 are red and 50 are blue. You are standing on a number line, and you draw a marble out of the bag without replacing it. If you get red, you go left one unit. If you get blue, you go right one unit. This is called a random walk. You draw marbles up to 100 times, each time moving left or right one unit. Let \(X\) be the number of marbles drawn from the bag until you return to 0 for the first time. The random variable \(X\) is called the first return time, since it is the number of steps it takes to return to your starting position.
Estimate the pmf of \(X\).
bag <- c(rep(-1, 50), rep(1, 50))
steps <- sample(bag)
steps[1:10]
## [1] 1 1 1 -1 -1 -1 -1 1 1 1
walk <- cumsum(steps)
walk[1:10]
## [1] 1 2 3 2 1 0 -1 0 1 2
(where.zero <- which(walk==0))
## [1] 6 8 12 48 52 62 64 66 68 82 84 86 88 100
min(where.zero)
## [1] 6
x <- replicate(10000, {
  steps <- sample(bag)
  walk <- cumsum(steps)
  where.zero <- which(walk==0)
  min(where.zero)
})
plot(proportions(table(x)))
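The same vector of simulated first return times can also be used to estimate individual probabilities. For example (an added illustration, not part of the original code), the estimated probability that you return to 0 after exactly two draws is:

mean(x == 2)  # estimated P(X = 2)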
Sometimes you know the theoretical pmf of a random variable but need to draw samples from that known distribution. We can still use the sample() function to do so; we just provide it a vector of probabilities to use.
In the United States, human blood comes in four types: O, A, B, and AB. Take a sample of thirty blood types with the following probabilities: \(P(O) = 0.45\), \(P(A) = 0.4\), \(P(B) = 0.11\), \(P(AB) = 0.04\).
<- c("O","A","B","AB")
bloodtypes <- c(0.45,.4,.11,.04)
prob_bloodtypes <- sample(bloodtypes, size = 30, prob=prob_bloodtypes, replace=TRUE)
sample_blood 1:10] #quick peek to confirm sample_blood[
## [1] "A" "A" "O" "A" "O" "A" "B" "AB" "O" "O"
The estimated pmf is then:
proportions(table(sample_blood))
## sample_blood
## A AB B O
## 0.40000000 0.03333333 0.13333333 0.43333333
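With only thirty draws the estimate is rough. As a further sketch (not in the original text), increasing the number of draws brings the estimated pmf much closer to the specified probabilities:

big_sample <- sample(bloodtypes, size = 10^5, prob = prob_bloodtypes, replace = TRUE)
proportions(table(big_sample))  # should be close to 0.45, 0.40, 0.11, 0.04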
Suppose the proportion of M&Ms by color is: 14% Yellow, 13% Red, 20% Orange, 12% Brown, 20% Green, and 21% Blue. Answer the following questions using simulation: What is the probability that a randomly chosen M&M is not green? What is the probability that it is red, orange, or yellow?
<- c("Y","R","O","Br","G","Bl")
colors <- c(.14,.13,.2,.12,.2,.21) prob_mnms
not.green <- replicate(10000,{
  mnm_sample <- sample(colors, 50, prob=prob_mnms, replace=TRUE)
  mnm_sample != "G"
})
mean(not.green)
## [1] 0.799344
roy <- replicate(10000,{
  mnm_sample <- sample(colors, 50, prob=prob_mnms, replace=TRUE)
  mnm_sample %in% c("R", "O", "Y")
})
mean(roy)
## [1] 0.470378
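Because the color probabilities are given exactly, the simulated answers can be checked directly (a short addition for comparison):

sum(prob_mnms[colors != "G"])                 # exact P(not green) = 0.80
sum(prob_mnms[colors %in% c("R", "O", "Y")])  # exact P(red, orange, or yellow) = 0.47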