NOTE: We are going out of order from the textbook in Chapter 3.
Probability mass functions provide a global overview of a random variable’s behavior. Often we don’t need to know everything about a variable and would rather summarize it. One feature of a distribution we might be interested in is its central tendency. One measure of central tendency is the expected value, or mean, of the random variable. The terms expected value and mean can be used interchangeably.
For a discrete random variable \(X\) with a pmf \(p\), the expected value of \(X\) is
\[ E[X]=\sum_{x}xp(x) \]
where the sum is taken over all possible values of the random variable \(X\).
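The definition above translates almost directly into R. The following is a minimal sketch, not code from the textbook: the helper name `expected_value` and the fair-die illustration are assumptions made only for this example.

```r
# Hypothetical helper (not from the textbook): expected value of a discrete
# random variable, given its possible values x and pmf values p.
expected_value <- function(x, p) {
  # A valid pmf has one non-negative probability per value, summing to 1
  stopifnot(length(x) == length(p), all(p >= 0), isTRUE(all.equal(sum(p), 1)))
  sum(x * p)
}

# Illustration (assumed example): a fair six-sided die
die_mean <- expected_value(1:6, rep(1/6, 6))
die_mean
## [1] 3.5
```

Note that 3.5 is not a value the die can actually show; the expected value describes the long-run average, not a typical single outcome.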
Two books are assigned for a statistics class: a textbook costing $137 and its corresponding study guide costing $33. The university bookstore determined 20% of enrolled students do not buy either book, 55% buy the textbook only, and 25% buy both books, and these percentages are relatively constant from one term to another.
Let \(X\) be a random variable that denotes how much a single student will spend on their course materials. The pmf is:
Textbook only: $137
Textbook + Study guide: $137 + $33 = $170
Neither: $0
x | 0 | 137 | 170 |
p(x) | .20 | .55 | .25 |
Calculate E(X) and interpret this value in context:
0 * .20 + 137 *.55 + 170*.25
## [1] 117.85
On average, a single student will spend $117.85 on their course materials.
Confirm your results using simulation.
dollars <- c(0, 137, 170)
prob <- c(.2, .55, .25)
spend <- sample(dollars, size=10000, prob=prob, replace=TRUE)
mean(spend)
## [1] 117.5886
A retirement portfolio’s value increases by 18% during a financial boom and by 9% during normal times. It decreases by 12% during a recession. What is the expected return on this portfolio if each scenario is equally likely?
Define a random variable.
Let \(X\) be the change in portfolio value.
Write down the pmf.
x | 18 | 9 | -12 |
p(x) | 1/3 | 1/3 | 1/3 |
Calculate the theoretical expected value. Write your answer in a full sentence in context of the problem.
18*1/3 + 9*1/3 - 12*1/3
## [1] 5
In the long run this portfolio’s value is expected to increase by 5%.
value <- c(18, 9, -12)
portfolio.change <- sample(value, size=10000, replace=TRUE)
mean(portfolio.change)
## [1] 4.8567
Although the mean is a useful descriptive statistic, it only gives us an idea of where the center of the distribution is located. For instance, the following table gives the monthly temperature of New York City and San Francisco:
months | J | F | M | A | M | J | J | A | S | O | N | D |
NYC | 32 | 34 | 42 | 53 | 63 | 72 | 77 | 76 | 68 | 57 | 48 | 37 |
SF | 49 | 52 | 53 | 56 | 58 | 62 | 63 | 64 | 65 | 61 | 55 | 49 |
The mean temperature for San Francisco is about 57 degrees and the mean temperature for New York is about 55 degrees, so their mean yearly temperatures are about the same. Do you notice anything different about the two cities with regard to monthly temperatures?
NYC has higher highs and lower lows than SF. In other words, SF has a narrower range of temperatures than NYC.
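The comparison can be checked directly in R. This is a small sketch using the temperatures from the table above; the vector names `nyc` and `sf` are chosen here for illustration.

```r
# Monthly temperatures copied from the table above (January to December)
nyc <- c(32, 34, 42, 53, 63, 72, 77, 76, 68, 57, 48, 37)
sf  <- c(49, 52, 53, 56, 58, 62, 63, 64, 65, 61, 55, 49)

# The centers are similar
mean(nyc)  # about 54.9
mean(sf)   # 57.25

# The spreads are not: width of the range (max minus min)
nyc_range <- diff(range(nyc))  # 77 - 32 = 45
sf_range  <- diff(range(sf))   # 65 - 49 = 16
c(NYC = nyc_range, SF = sf_range)
```

The range is only a rough measure of spread because it depends on just the two most extreme values.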
To distinguish between two distributions with similar means, it is useful to have a statistic that measures how spread out each distribution is. The variance and standard deviation are such measures.
Suppose \(X\) is a random variable with mean \(\mu=E(X)\). The variance of \(X\), denoted by Var(\(X\)) or \(\sigma^{2}\), is defined as follows:
\[ Var(X)=\sigma^{2}=E[(X-\mu)^{2}]=\sum_{\text{all } k}\left(k-\mu\right)^{2}\,P(X=k) \]
The variance of a distribution provides a measure of the spread or dispersion of the distribution around its mean \(\mu\).
The standard deviation of a random variable \(X\), written \(SD(X)\), is the square root of the variance. We denote the standard deviation by \(\sigma\) and the variance by \(\sigma^{2}\). That is, \(\sigma = \sqrt{\sigma^{2}}\).
par(mfrow=c(1,2))
plot(proportions(table(sample(1:5, size=1000, replace=TRUE))), ylab="probability")
plot(proportions(table(sample(1:10, size=1000, replace=TRUE))), ylab="probability")
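To put numbers on the spread seen in the two plots, here is a minimal sketch that applies the variance definition to the two underlying distributions. It assumes each plot approximates an equally likely draw from `1:5` and from `1:10`, which is what `sample()` does when no `prob` is given; the helper name `pmf_variance` is hypothetical.

```r
# Hypothetical helper (not from the textbook): variance of a discrete
# random variable, given its possible values x and pmf values p.
pmf_variance <- function(x, p) {
  mu <- sum(x * p)       # E(X)
  sum((x - mu)^2 * p)    # E[(X - mu)^2]
}

# Equally likely values on 1:5 and on 1:10
var_1to5  <- pmf_variance(1:5,  rep(1/5, 5))
var_1to10 <- pmf_variance(1:10, rep(1/10, 10))

c(var = var_1to5,  sd = sqrt(var_1to5))
c(var = var_1to10, sd = sqrt(var_1to10))
```

The wider distribution on `1:10` has the larger variance (8.25 versus 2), matching the visual impression from the plots.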
Let’s return to the statistics book example and calculate \(Var(X)\) and \(SD(X)\). Recap: the textbook costs $137 and the study guide costs $33. 20% of students don’t buy either book, 55% buy the textbook only, and 25% buy both books. Confirm your results using simulation.
x <- c(0, 137, 170)
p.x <- c(.2, .55, .25)
mu <- sum(x*p.x)
(x.minus.mu <- x - mu)
## [1] -117.85 19.15 52.15
(var.dollars <- sum(x.minus.mu^2 * p.x))
## [1] 3659.327
sqrt(var.dollars)
## [1] 60.49238
spend <- sample(x, size=10000, prob=p.x, replace=TRUE)
var(spend)
## [1] 3595.947
Bonus: You may have noticed that the variance formula \(\sum_{\text{all } k}\left(k-\mu\right)^{2}\,P(X=k)\) has the same form as a dot product. You can perform this kind of vector multiplication in R using the %*% operator.
x.minus.mu^2 %*% p.x
## [,1]
## [1,] 3659.327
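The output above is a 1 x 1 matrix rather than a plain number. As a small self-contained sketch (the names `as_sum` and `as_dot` are chosen here for illustration), the two approaches can be compared after collapsing the matrix with `drop()`:

```r
# Values and probabilities from the statistics book example
x   <- c(0, 137, 170)
p.x <- c(.2, .55, .25)
mu  <- sum(x * p.x)

as_sum <- sum((x - mu)^2 * p.x)      # element-wise product, then sum
as_dot <- drop((x - mu)^2 %*% p.x)   # dot product; drop() collapses the 1 x 1 matrix

all.equal(as_sum, as_dot)
## [1] TRUE
```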
Return to the retirement portfolio question (Recap: the value increases by 18% during a financial boom and by 9% during normal times, and decreases by 12% during a recession. Each scenario is equally likely). Calculate the variance and standard deviation.
value.chg <- c(18, 9, -12)
p.chg <- 1/3
(mean.chg <- sum(value.chg * p.chg))
## [1] 5
(var.chg <- sum((value.chg-mean.chg)^2 * p.chg))
## [1] 158
(sd.chg <- sqrt(var.chg))
## [1] 12.56981
portfolio.change <- sample(value.chg, size=10000, replace=TRUE)
var(portfolio.change)
## [1] 157.2113
sd(portfolio.change)
## [1] 12.53839