Random Variables

Given a random experiment with an outcome sample space \(S\), a function that assigns one and only one real number to each element in \(S\) is called a random variable.

Example

  1. Consider an experiment consisting of a single roll of a die, where the number of spots on the face-up side of the die is observed.
  • Outcome space is \(S=\{1,2,3,4,5,6\}\)
  • Let the random variable \(X\) indicate the number of spots on the side facing up.
  • The space of the random variable \(X\) is then \(\{1,2,3,4,5,6\}\).
  2. Dr. D has 2 dogs and 2 cats in her household, so \(S = \{cat, dog\}\). Let \(Y\) be a random variable that denotes the type of animal. \(Y\) then maps each element in \(S\) to one and only one real number:

Let \(Y=0\) denote a dog, and \(Y=1\) denote a cat.

When a random variable can take only two values, it is called a binary random variable. The two values chosen for a binary random variable are often 0 and 1.
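To make the mapping concrete, here is a minimal sketch in R (the vector name pets is just for illustration):

pets <- c("dog", "dog", "cat", "cat")   # Dr. D's four animals
Y <- ifelse(pets == "cat", 1, 0)        # Y = 0 for a dog, Y = 1 for a cat
Y
## [1] 0 0 1 1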

Distribution of a Random Variable:

When a probability distribution has been specified on the sample space of an experiment, we can determine a probability distribution for the possible values of each random variable \(X\).

This section focuses on probability distributions for discrete random variables. We say that \(X\) has a discrete distribution if \(X\) can take only a finite number \(k\) of different values \(x_{1},x_{2},\dots,x_{k}\), or at most an infinite sequence of different values \(x_{1},x_{2},\dots\). Random variables that can take any value in an interval are called continuous and will be discussed in a later chapter. Working with discrete random variables requires summation, while working with continuous random variables requires integration.

Discrete variables take integer values and usually represent a count of something, while continuous variables take values in an interval of real numbers and often measure something.

Definition: PMF

A discrete random variable is a variable that takes integer values and is characterized by a probability mass function (pmf). The pmf \(p\) of a discrete random variable \(X\) is given by:

\[ p(x) = P(X=x) \]

The above equation can be read as: the probability that the random variable \(X\) is equal to some value, \(x\). Properties that the pmf satisfies:

  1. \(p(x) \geq 0 \ \forall x\)
  2. \(\sum_{x}p(x) = 1\)

The term probability distribution is a more generic term that describes the probabilities for each different value a random variable can take on. This holds for both discrete and continuous random variables. We will use the term probability distribution for all random variables, but the pmf is specific to discrete random variables, and pdf (chapter 4) is specific to continuous random variables.

Example

Consider a crooked die where the cube is shortened in the one-six direction. This has the effect that 1’s and 6’s each have a probability of 1/4 of being rolled, while the other faces each have a probability of 1/8.

  • Define the random variable

Let \(X\) be the number on the face-up side of the die after it is rolled.

  • Write out the probability distribution (pmf).
  x      1     2     3     4     5     6
  p(x)   1/4   1/8   1/8   1/8   1/8   1/4
  • Is this a valid pmf? Explain.

Yes, because all probabilities are non-negative and they sum to 1.
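We can also check both properties numerically in R; a quick sketch, where the vector p holds the pmf of the crooked die:

p <- c(1/4, 1/8, 1/8, 1/8, 1/8, 1/4)  # pmf of the crooked die
all(p >= 0)   # property 1: every probability is non-negative
## [1] TRUE
sum(p)        # property 2: the probabilities sum to 1
## [1] 1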

You try it

Suppose you roll 2 dice. Let \(X\) be the sum of the two dice. Write out the pmf. Don’t forget you can refer to Example 2.8 in the textbook to visualize the sample space.

  x      2     3     4     5     6     7     8     9     10    11    12
  p(x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
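One way to check this table is to enumerate all 36 equally likely outcomes in R. This is a quick sketch using the outer() function; the object name sums is arbitrary.

sums <- outer(1:6, 1:6, "+")   # 6-by-6 grid of all possible (die 1, die 2) sums
table(sums) / 36               # exact pmf: count of each sum divided by 36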

Using simulation to estimate discrete probability distributions.

In the cases we’ve encountered so far, the sample space and the values of the random variable have been discrete, that is, whole numbers. We will get into continuous random variables in the next chapter.

Example

Suppose you roll 2 dice. Let \(X\) be the sum of the two dice. Use simulation to estimate the probability distribution.

die <- 1:6                                # faces of a fair die
d1  <- sample(die, 1000, replace = TRUE)  # 1000 rolls of the first die
d2  <- sample(die, 1000, replace = TRUE)  # 1000 rolls of the second die
sum.2d6 <- d1 + d2                        # sum of the two dice for each roll

The pmf of \(X\) is:

proportions(table(sum.2d6))
## sum.2d6
##     2     3     4     5     6     7     8     9    10    11    12 
## 0.027 0.052 0.081 0.106 0.140 0.175 0.128 0.107 0.090 0.061 0.033

Plotting the pmf

We can use the function plot to plot the estimate of the pmf using the following code.

plot(proportions(table(sum.2d6)),
     main="Sum of two dice", ylab="Probability")

You try it

  1. Three coins are tossed and the number of heads \(X\) is counted. Write out the theoretical pmf for \(X\) and confirm via simulation.

Let \(X\) be the number of heads observed when three coins are tossed.

  x      0     1     2     3
  p(x)   1/8   3/8   3/8   1/8

nheads <- replicate(10000, {
  coin3 <- sample(c("H", "T"), size=3, replace=TRUE)
  sum(coin3=="H")
})
proportions(table(nheads))
## nheads
##      0      1      2      3 
## 0.1226 0.3765 0.3764 0.1245
  2. Seven balls numbered 1-7 are in an urn. Two balls are drawn from the urn without replacement and the sum \(X\) of the numbers is computed. Estimate the pmf of \(X\) via simulation.
sum2balls <- replicate(10000, {
  balls <- sample(1:7, 2, replace=FALSE)  # draw two balls without replacement
  sum(balls)                              # value returned for each replicate
})

proportions(table(sum2balls))
## sum2balls
##      3      4      5      6      7      8      9     10     11     12     13 
## 0.0439 0.0484 0.0917 0.0987 0.1411 0.1441 0.1409 0.0985 0.1001 0.0465 0.0461

What are the least likely outcomes of \(X\)?

The least likely outcomes are \(X=3\), \(X=4\), \(X=12\), and \(X=13\); each of these sums comes from exactly one of the 21 equally likely pairs of balls, so each has probability \(1/21 \approx 0.048\).
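We can confirm this exactly by listing all 21 equally likely pairs of balls; a short sketch using the combn() function:

pairs <- combn(1:7, 2)               # all 21 unordered pairs of balls
table(colSums(pairs)) / ncol(pairs)  # exact pmf of the sum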

Challenge Example

Suppose you have a bag full of marbles; 50 are red and 50 are blue. You are standing on a number line, and you draw a marble out of the bag. If you get red, you go left one unit. If you get blue, you go right one unit. This is called a random walk. You draw marbles up to 100 times, each time moving left or right one unit. Let \(X\) be the number of marbles drawn from the bag until you return to 0 for the first time. The rv \(X\) is called the first return time since it is the number of steps it takes to return to your starting position.

Estimate the pmf of \(X\).

  1. Create the sample space of red and blue marbles
bag <- c(rep(-1, 50), rep(1, 50))  # -1 = red (step left), 1 = blue (step right)
  2. Draw all 100 marbles out of the bag without replacement to simulate the steps in the walk.
steps <- sample(bag)
steps[1:10]
##  [1]  1  1  1 -1 -1 -1 -1  1  1  1
  3. Calculate the cumulative sum of the steps to see where you are on the number line during the walk.
walk <- cumsum(steps)
walk[1:10]
##  [1]  1  2  3  2  1  0 -1  0  1  2
  4. Identify at which steps along the walk you return to 0.
(where.zero <- which(walk==0))
##  [1]   6   8  12  48  52  62  64  66  68  82  84  86  88 100
  5. Let \(X\) be a random variable denoting the first return time.
min(where.zero)
## [1] 6
  6. Now that we have the random variable defined, replicate all of the above 10,000 times and estimate the pmf.
x <- replicate(10000, {
  steps <- sample(bag)
  walk <- cumsum(steps)
  where.zero <- which(walk==0)
  min(where.zero)
})

plot(proportions(table(x)))
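As a quick sanity check on the simulation: the first return happens at step 2 exactly when the first two draws are one red and one blue marble, which has probability \(2 \cdot \frac{50}{100} \cdot \frac{50}{99} = \frac{50}{99} \approx 0.505\). The estimate below should be close to that value.

mean(x == 2)   # estimated P(X = 2); exact value is 50/99, about 0.505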


Sampling from a theoretical pmf.

Sometimes you know the theoretical pmf of a random variable but need to draw samples from that known distribution. We can still use the sample() function to do so; we just provide it a vector of probabilities through the prob argument.

Example: Blood types

In the United States, human blood comes in four types: O, A, B, and AB. Take a sample of thirty blood types with the following probabilities: \(P(O) = 0.45\), \(P(A) = 0.4\), \(P(B) = 0.11\), \(P(AB) = 0.04\).

bloodtypes <- c("O","A","B","AB")
prob_bloodtypes <- c(0.45,.4,.11,.04)
sample_blood <- sample(bloodtypes, size = 30, prob=prob_bloodtypes, replace=TRUE)
sample_blood[1:10] #quick peek to confirm
##  [1] "A"  "A"  "O"  "A"  "O"  "A"  "B"  "AB" "O"  "O"

The estimated pmf is then:

proportions(table(sample_blood)) 
## sample_blood
##          A         AB          B          O 
## 0.40000000 0.03333333 0.13333333 0.43333333
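With only 30 draws the estimated pmf is rough. Increasing the sample size brings the estimated proportions closer to the theoretical probabilities; a quick sketch (the choice of 10,000 draws is arbitrary):

big_sample <- sample(bloodtypes, size = 10000, prob = prob_bloodtypes, replace = TRUE)
proportions(table(big_sample))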

You try it

Suppose the proportion of M&Ms by color is: 14% yellow, 13% red, 20% orange, 12% brown, 20% green, and 21% blue. Answer the following questions using simulation.

colors <- c("Y","R","O","Br","G","Bl")
prob_mnms <- c(.14,.13,.2,.12,.2,.21)
  1. What is the probability that a randomly selected M&M is not green?
not.green <- replicate(10000,{
  mnm_sample <- sample(colors, 50, prob=prob_mnms, replace=TRUE)
  mnm_sample !="G"
})

mean(not.green)
## [1] 0.799344
  2. What is the probability that a randomly selected M&M is red, orange, or yellow?
roy <- replicate(10000,{
  mnm_sample <- sample(colors, 50, prob=prob_mnms, replace=TRUE)
  mnm_sample %in% c("R", "O", "Y")
})

mean(roy)
## [1] 0.470378
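As a check on both estimates, the theoretical answers follow directly from the given color proportions: \(P(\text{not green}) = 1 - 0.20 = 0.80\) and \(P(\text{red, orange, or yellow}) = 0.13 + 0.20 + 0.14 = 0.47\), both close to the simulated values. The same calculation in R:

1 - prob_mnms[colors == "G"]                  # P(not green) = 0.8
sum(prob_mnms[colors %in% c("R", "O", "Y")])  # P(red, orange, or yellow) = 0.47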