Now that the foundations of random variables, probability distributions, expectation, and variance are under our belt, let’s start to look at some special random variables that occur so commonly, or have such mathematically wonderful properties, that they have specific names. We will look at 6 different types of discrete random variables. For each we will identify the situation it describes, the random variable and its distributional notation, the pmf, the mean and variance, and the relevant R commands.
In R, the common distributions are defined by their root name with 3 different prefixes:

d: to compute \(P(X = x)\), e.g. dbinom, dgeom, dhyper, dnbinom
p: to compute \(P(X \leq x)\), e.g. pbinom, pgeom, phyper, pnbinom
r: to randomly draw N samples from the specified distribution, e.g. rbinom, rgeom, rhyper, rnbinom
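For instance, a minimal sketch of all three prefixes using the Binomial family (the parameter values here are arbitrary choices for illustration):

dbinom(3, size=10, prob=0.5)   # P(X = 3) for X ~ Binomial(10, 0.5)
pbinom(3, size=10, prob=0.5)   # P(X <= 3), the cumulative probability
rbinom(5, size=10, prob=0.5)   # five random draws from Binomial(10, 0.5)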
Situation: The simplest type of experiment is one in which there are only two outcomes (success/failure, live/die, true/false, yes/no, etc.). When running simulations in Chapter 2, you wrote your experiment to get down to a single TRUE/FALSE; you were creating a Bernoulli random variable. This simple yet fundamental random variable serves as the basis for the rest of the distributions in this chapter.
Random variable: Let \(X\) be a random variable that denotes the outcome from a Bernoulli trial with probability of success \(p\). Specifically let \(X=1\) denote a success, and \(X=0\) denote a failure. (What is considered a success is entirely up to context. If you are interested in mortality rate for a certain disease, then “death” would be a success.)
Distributional Notation: \(X \sim Bernoulli(p)\)
pmf: \(P(X=x) = p^{x}(1-p)^{1-x} \qquad x=0,1\)
Mean and variance: \(E(X) = p \qquad Var(X) = p(1-p)\)
R commands: There are no fancy named R commands for this distribution. You can simulate this random variable using sample(c(0,1), size=1, prob=c(1-p, p)) directly, or through a Binomial random variable with \(n=1\).
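As a minimal sketch (the value of p and the number of draws are arbitrary), both approaches generate Bernoulli draws; note that sample needs replace=TRUE to produce more than one draw:

p <- 0.3                                               # an arbitrary success probability
sample(c(0,1), size=1, prob=c(1-p, p))                 # a single Bernoulli(p) draw
sample(c(0,1), size=10, replace=TRUE, prob=c(1-p, p))  # ten draws
rbinom(10, size=1, prob=p)                             # ten draws via the Binomial with n=1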
A beet seed has been planted, and will either germinate or not. The probability of germination is 0.8, and germination is considered a success.
Let X be whether or not a beet seed has germinated. \(X \sim Bernoulli(.8)\)
\[ P(X=x) = .8^{x}.2^{1-x} \]
Let X=1 mean germination and X=0 not germination. \(P(X=1) = .8\)
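A quick simulation check of the mean and variance (a sketch; the object name germ is ours):

germ <- rbinom(10000, size=1, prob=0.8)  # 10,000 simulated seeds
mean(germ)   # should be close to E(X) = p = 0.8
var(germ)    # should be close to Var(X) = p(1-p) = 0.16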
Situation: If \(n\) independent random variables \(X_{1},...,X_{n}\) all have the same Bernoulli distribution with probability of success \(p\), then their sum is equal to the number of the \(X_{i}\)’s which equal 1, and the distribution of the sum is known as a Binomial distribution.
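To see this numerically, here is a minimal sketch (the parameter values are arbitrary) comparing the sum of Bernoulli draws to direct Binomial draws:

n <- 5; p <- 0.3; N <- 10000
bern <- matrix(rbinom(n*N, size=1, prob=p), nrow=n)  # each column is n Bernoulli trials
sums <- colSums(bern)                 # the sum of n Bernoulli(p) variables, N times over
mean(sums)                            # close to n*p = 1.5
mean(rbinom(N, size=n, prob=p))       # direct Binomial(n, p) draws agree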
Random variable: Let \(X\) be a random variable that represents the number of “successes” in a series of \(n\) independent Bernoulli trials, each with probability of success \(p\).
Distributional Notation: \(X \sim Binomial(n, p)\)
pmf: \[ P(X = x)= \binom{n}{x}p^{x}(1-p)^{n-x} \qquad x=0,1,2,...,n \]
Mean and variance: \(E(X) = np \qquad Var(X) = np(1-p)\)
R commands:
dbinom(x, size=n, prob=p): to compute \(P(X = x)\)
pbinom(x, size=n, prob=p): to compute \(P(X \leq x)\)
rbinom(N, size=n, prob=p): to randomly draw N samples from a \(X \sim Binom(n, p)\) distribution.

Our parameters are: \(n = 10, p = .8\). The pmf is:
\[ P(X = x)= \binom{10}{x}(0.8)^{x}(.2)^{10-x} \qquad x=0,1,2,...,10 \]
and is written in distributional notation like:
\[ X \sim Binomial(10, .8)\]
Theoretical
\[ E(X) = n*p = 10*.8 = 8 \\ Var(X) = n*p*(1-p) = 10 * .8 * .2 = 1.6 \]
Simulation
x <- rbinom(10000, 10, .8)
mean(x)
## [1] 8.001
var(x)
## [1] 1.597359
Let \(X\) be the number of heads that appear when 9 coins are tossed, where each coin lands heads with probability 0.6.
\[ X \sim Binomial(9, .6) \]
with pmf
\[ P(X = x)= \binom{9}{x}(0.6)^{x}(.4)^{9-x} \qquad x=0,1,2,...,9 \]
Find \(P(X=3)\)
by hand using the pmf
\[ P(X = 3) = \binom{9}{3}(0.6)^{3}(.4)^{6} \]
choose(9, 3)*(0.6)^(3)*(0.4)^(6)
## [1] 0.07431782
theoretical using R commands
dbinom(3, 9, .6)
## [1] 0.07431782
using simulation
x <- rbinom(10000, 9, .6)
mean(x == 3)
## [1] 0.0732
Ten students are selected at random, and each has a probability of 0.10 of being a math major. What is the probability that at least one student is a math major?
Let \(X\) be the number of Math majors selected.
\[ X \sim Binomial(10, .1) \]
with pmf
\[ P(X = x)= \binom{10}{x}(0.1)^{x}(.9)^{10-x} \qquad x=0,1,2,...,10 \]
Find \(P(X \geq 1)\). Hint: Use the complement
by hand using the pmf
\[ P(X \geq 1 ) = 1 - P(X = 0) = 1 - \binom{10}{0}(.1)^{0}(.9)^{10} \]
1-choose(10, 0)*(0.1)^(0)*(0.9)^(10)
## [1] 0.6513216
theoretical using R commands
1-dbinom(0, 10, .1)
## [1] 0.6513216
simulation
n.math <- rbinom(10000, 10, .1)
mean(n.math >= 1)
## [1] 0.656
What is the expected number of math majors in a random sample of 10 students?
theoretical
\(E(X) = n*p = 10*.1 = 1\)
On average, one in ten students is a math major.
Simulation
mean(n.math)
## [1] 1.0082
What is \(Var(X)\)?
# Theoretical
10*.1*.9
## [1] 0.9
# Simulation
var(n.math)
## [1] 0.9018229
Situation: Given a series of independent Bernoulli trials, we are accustomed to thinking of \(n\) and \(p\) as fixed and the number of successes \(x\) as the random quantity, as in the Binomial distribution. Suppose the problem is turned around, though, and the question is asked: how many trials will be required in order to achieve the first success? Put this way, the number of trials is the random variable and the number of successes is fixed.
Random Variable: Let \(X\) be the number of failures before the first success in a Bernoulli process with probability of success \(p\).
Distributional Notation: \(X \sim Geom(p)\)
pmf: \[ P(X = x)= (1-p)^{x}p \qquad x=0,1,2,... \]
Mean and variance: \(E(X) = \frac{1-p}{p} \qquad Var(X) = \frac{1-p}{p^{2}}\)
R commands:
dgeom(x, prob=p): to compute \(P(X = x)\)
pgeom(x, prob=p): to compute \(P(X \leq x)\)
rgeom(N, prob=p): to randomly draw N samples from a \(X \sim Geom(p)\) distribution.

Professional basketball player Steve Nash was a 90% free throw shooter over his career. Answer the following questions using the formulas and also simulation.
Let \(X\) be the number of free throws before he misses one. \(X \sim Geom(.1)\) with pmf \(P(X = x) = .9^{x}(.1)\)
Theoretical
\[ E(X) = \frac{1-p}{p} = \frac{.9}{.1} = 9 \]
Simulation
num.made.shots <- rgeom(10000, .1)
mean(num.made.shots)
## [1] 8.9175
Find: \(P(X=20)\)
by hand using the pmf
.9^20*.1
## [1] 0.01215767
theoretical using R commands
dgeom(20, .1)
## [1] 0.01215767
using simulation
mean(num.made.shots == 20)
## [1] 0.0118
Complete the following using both theoretical and simulation methods. Suppose 47.1% of women are married, and women are selected at random one at a time.

Let \(X\) be the number of unmarried women selected before the first married woman. \(X \sim Geometric(.471)\)
Theoretical probability using the pmf by hand \(P(X=2) = (1-.471)^2(.471)\)
(1-.471)^2*(.471)
## [1] 0.1318051
Theoretical probability using the pmf from R functions
dgeom(2, .471)
## [1] 0.1318051
Simulation
num.unmarried <- rgeom(10000, .471)
mean(num.unmarried==2)
## [1] 0.1295
Find: \(E(X)\) and \(SD(X)\)
Theoretical
p <- .471
(E_X <- (1-p)/p)
## [1] 1.123142
(SD_X <- sqrt((1-p)/(p^2)))
## [1] 1.544212
Simulation
mean(num.unmarried)
## [1] 1.1262
sd(num.unmarried)
## [1] 1.547745
Suppose the probability that a transistor is defective is \(p = .02\), and transistors are tested one at a time. Let \(X\) be the number of good transistors tested before the first defective one, so \(X \sim Geom(.02)\).

p <- .02
Find: \(P(X=9)\)
use pmf
(1-p)^9*p
## [1] 0.01667496
using R commands
dgeom(9, p)
## [1] 0.01667496
simulation
good.transistors <- rgeom(10000, p)
mean(good.transistors == 9)
## [1] 0.0193
Find: \(P(X \geq 4)\)
\[ P(X \geq 4) \\ = 1 - P(X \leq 3) \\ = 1 - [P(X=0) + P(X=1) + P(X=2) + P(X=3)] \\ = 1 - [(1-p)^0p + (1-p)^1p +(1-p)^2p +(1-p)^3p] \]
use pmf
1 - ((1-p)^0*p + (1-p)^1*p + (1-p)^2*p + (1-p)^3*p)
## [1] 0.9223682
using R commands
1 - pgeom(3, p)
## [1] 0.9223682
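Equivalently, pgeom can return the upper tail directly through its lower.tail argument:

pgeom(3, p, lower.tail=FALSE)   # same value as 1 - pgeom(3, p)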
simulation
mean(good.transistors >=4)
## [1] 0.9268
Situation: A random variable with a negative binomial distribution originates from a context much like the one that yields the geometric distribution. Again, we focus on independent and identical trials, each of which results in one of two outcomes, success or failure. The probability of success, \(p\), stays constant from trial to trial. The geometric case handles the number of trials until the first success occurs. What if we are interested in the number of trials until the second, third, or fourth success occurs?
Random Variable: Let \(X\) denote the number of failures before the \(n\)th success, where the probability of success is \(p\).
Distributional Notation: \(X \sim NegBin(n, p)\)
pmf: \[ P(X = x)=\binom{x+n-1}{x}p^{n}(1-p)^{x} \qquad x=0,1,2,... \]
We can think of the negative binomial distribution as the sum of \(n\) independent geometric random variables. This simplifies the formulas for the mean and variance.
Mean and variance: \(E(X) = \frac{n(1-p)}{p} \qquad Var(X) = \frac{n(1-p)}{p^{2}}\)
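A simulation sketch of this fact (the values n = 3 and p = 0.2 are arbitrary): the sum of \(n\) independent geometric draws should behave like draws from rnbinom.

n <- 3; p <- 0.2; N <- 10000
geom.sums <- colSums(matrix(rgeom(n*N, p), nrow=n))  # sum of n Geom(p) draws, N times
mean(geom.sums)          # close to n*(1-p)/p = 12
mean(rnbinom(N, n, p))   # direct NegBin(n, p) draws agree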
R commands:
dnbinom(x, size=n, prob=p): to compute \(P(X = x)\)
pnbinom(x, size=n, prob=p): to compute \(P(X \leq x)\)
rnbinom(N, size=n, prob=p): to randomly draw N samples from a \(X \sim NegBin(n, p)\) distribution.

A geological study indicates that an exploratory oil well drilled in a particular region should strike oil with probability 0.2. We are interested in when the third oil strike hits.
Let \(X\) be the number of dry wells (no oil was found) before the third oil strike (oil was found).
\[ X \sim NegBinomial(3, .2) \]
p <- .2
n <- 3
Write down the pmf, and then calculate the mean and variance.
pmf: \[ P(X = x) = \binom{x+3-1}{x}(.2)^{3}(.8)^{x} \qquad x=0,1,2,... \]
Theoretical
(E_X <- n*(1-p)/p)
## [1] 12
(Var_X <- n*(1-p)/(p^2))
## [1] 60
Simulation
x <- rnbinom(10000, n, p)
mean(x)
## [1] 11.9084
var(x)
## [1] 61.23253
Find the probability that the third oil strike comes on the fifth well drilled. If the third strike occurs on the fifth well, then exactly two of the first four wells were dry.

Find: \(P(X = 2)\)
by hand using the pmf
\[ \binom{4}{2}(.2)^{3}(.8)^{2} \]
choose(4, 2)*.2^3*.8^2
## [1] 0.03072
theoretical using R commands
dnbinom(2, 3, .2)
## [1] 0.03072
using simulation
x <- rnbinom(100000, 3, .2)
mean(x==2)
## [1] 0.03061
Ten percent of the engines manufactured on an assembly line are defective.
If engines are randomly selected one at a time and tested, what is the probability that the first non-defective engine will be found on the second trial?
\[ X \sim NegBinomial(1, .9) \]
p.good <- .9
But since \(n=1\), this is also a geometric distribution: \(X \sim Geom(.9)\). Why?
\[ P(X = x) = \binom{x+1-1}{x}p^{1}(1-p)^{x} = p(1-p)^{x} \]
Find: \(P(X=1)\)
Theoretical using pmf
p.good*(1-p.good)
## [1] 0.09
Theoretical using R commands
dnbinom(1, 1, .9)
## [1] 0.09
dgeom(1, .9)
## [1] 0.09
Simulation
x.good <- rnbinom(10000, 1, p.good)
mean(x.good==1)
## [1] 0.0868
What is the probability that the third non-defective engine will be found on the fifth trial?
n.good <- 3
p.good <- .9
Let Y be the number of defective engines before the third good engine is found.
\[ Y \sim NegBinomial(3, .9) \]
In the first four tries, there are 2 good and 2 bad engines.
Find: \(P(Y = 2)\)
Theoretical using pmf \[ P(Y = 2) = \binom{2+3-1}{2}(.9)^{3}(.1)^{2} \]
choose(4, 2)*.9^3*.1^2
## [1] 0.04374
Theoretical using R commands
dnbinom(2, 3, .9)
## [1] 0.04374
Using simulation
x.3good <- rnbinom(10000, 3, .9)
mean(x.3good == 2)
## [1] 0.0422
Find the mean and variance of the number of defective engines found before the first non-defective engine.

\[ E(X) = \frac{n*(1-p)}{p} = \frac{1*.1}{.9} = \frac{1}{9} \qquad Var(X) = \frac{n*(1-p)}{p^2} = \frac{1*.1}{.9^2} \]
1*(1-.9)/.9 # E(X)
## [1] 0.1111111
1*(1-.9)/.9^2 # Var(X)
## [1] 0.1234568
Find the mean and variance of the number of failures until the third non-defective engine is found.
3*(1-.9)/.9 # E(Y)
## [1] 0.3333333
3*(1-.9)/.9^2 # Var(Y)
## [1] 0.3703704
Situation: A Poisson process is one where events occur at random times during a fixed time period. The events occur independently from each other, but with a constant average rate over that time period. Examples include
Random Variable: Let \(X\) be the number of events occurring in a Poisson process with rate \(\lambda\) over one unit of time (e.g. per year, per second, per day).
Distributional Notation: \(X \sim Poisson(\lambda)\)
pmf: \[ P(X = x) = e^{-\lambda}\frac{\lambda^{x}}{x!} \qquad x=0,1,2,... \]
Mean and variance: \(E(X) = Var(X) = \lambda\)
R commands:
dpois(x, lambda): to compute \(P(X = x)\)
ppois(x, lambda): to compute \(P(X \leq x)\)
rpois(N, lambda): to randomly draw N samples from a \(X \sim Poisson(\lambda)\) distribution.

The Taurids meteor shower is visible on clear nights in the Fall and can have visible meteor rates around five per hour. What is the probability that a viewer will observe exactly eight meteors in two hours?
Let \(X\) be the number of observed meteors in two hours. At five meteors per hour, the two-hour rate is \(\lambda = 5 \times 2 = 10\), so \(X \sim Pois(10)\), and has pmf
\[ e^{-10}\frac{(10)^{x}}{x!} \]
Find: \(P(X = 8)\)
by hand using the pmf
\[ e^{-10}\frac{(10)^{8}}{8!} \]
exp(-10)*10^8/factorial(8)
## [1] 0.112599
theoretical using R commands
dpois(8, 10)
## [1] 0.112599
using simulation
meteors <- rpois(10000, 10)
mean(meteors == 8)
## [1] 0.1184
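The jump from five meteors per hour to \(\lambda = 10\) for two hours relies on the fact that the sum of independent Poisson counts is again Poisson. A simulation sketch (the object names are ours):

hour1 <- rpois(10000, 5)     # meteors in the first hour
hour2 <- rpois(10000, 5)     # meteors in the second hour
mean((hour1 + hour2) == 8)   # should be close to dpois(8, 10) = 0.1126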
Suppose a typist makes typos at a rate of 3 typos per 10 pages. What is the probability that they will make at most one typo on a five page document?
Let \(X\) be the number of typos on a five page document. At 3 typos per 10 pages, five pages carry a rate of \(\lambda = 3 \times \frac{5}{10} = 1.5\), so \(X \sim Pois(1.5)\), with pmf
\[ e^{-1.5}\frac{1.5^{x}}{x!} \]
Find \(P(X \leq 1) = P(X=0) + P(X=1)\).
\[ P(X \leq 1) = e^{-1.5}\frac{1.5^{0}}{0!} + e^{-1.5}\frac{1.5^{1}}{1!} \\ e^{-1.5} + 1.5e^{-1.5} = 2.5e^{-1.5} \]
2.5*exp(-1.5)
## [1] 0.5578254
theoretical using R functions
ppois(1, 1.5)
## [1] 0.5578254
simulation
typo <- rpois(10000, 1.5)
mean(typo <= 1)
## [1] 0.5531
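As a side check of \(E(X) = Var(X) = \lambda\), the simulated typo counts have nearly equal mean and variance:

mean(typo)   # close to 1.5
var(typo)    # also close to 1.5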
Situation: The hypergeometric distribution describes a series of Bernoulli trials that are dependent, which occurs when we sample without replacement from a finite population.
Random Variable: Let \(X\) denote the number of successes in a sample of size \(k\) drawn without replacement from a pool containing a total of \(m\) successes and \(n\) failures.
Distributional Notation: \(X \sim Hypergeometric(m+n, n, k)\)
pmf:
\[ P(X=x)=\frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}} \]
Mean and variance: \[ E[X]=k\left(\frac{m}{m+n}\right) \qquad \qquad V(X)=k\frac{m}{m+n}\frac{n}{m+n}\frac{m+n-k}{m+n-1} \]
R commands:
dhyper(x, m, n, k): to compute \(P(X = x)\)
phyper(x, m, n, k): to compute \(P(X \leq x)\)
rhyper(N, m, n, k): to randomly draw N samples from a \(X \sim Hypergeometric(m+n, n, k)\) distribution.

Suppose 3 chips are drawn without replacement from a bag containing 5 red chips and 4 chips of another color, and let \(X\) be the number of red chips drawn.

\[ X \sim Hypergeometric(9, 4, 3) \]
n <- 4       # number of failures
m <- 5       # number of successes
k <- 3       # size of sample
total <- m+n
Theoretical
(E_X <- k*(m/total))
## [1] 1.666667
(Var_X <- k*(m/total)*(n/total)*((total-k)/(total-1)))
## [1] 0.5555556
Simulation
n.red.chips <- rhyper(10000, m, n, k)
mean(n.red.chips)
## [1] 1.6831
var(n.red.chips)
## [1] 0.5597304
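We can also simulate the draws directly by sampling without replacement, a sketch assuming the bag of 5 red and 4 other chips described above:

bag <- c(rep(1, 5), rep(0, 4))                        # 1 = red chip, 0 = other
draws <- replicate(10000, sum(sample(bag, size=3)))   # 3 chips drawn without replacement
mean(draws)   # close to E(X) = 1.67
var(draws)    # close to Var(X) = 0.56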
Let \(X\) be the number of tagged fish caught when 7 fish are sampled from a population of 50 fish that contains 10 tagged fish.

\[ X \sim Hypergeometric(50, 40, 7) \]

k <- 7
m <- 10
n <- 40
Find: \(P(X=2)\)
via pmf \[ \frac{\binom{10}{2}\binom{40}{5}}{\binom{50}{7}} \]
choose(10, 2)*choose(40, 5) / choose(50, 7)
## [1] 0.2964463
via R commands
dhyper(2, 10, 40, 7)
## [1] 0.2964463
n.tagged.fish <- rhyper(10000, 10, 40, 7)
mean(n.tagged.fish == 2)
## [1] 0.2981
Now suppose a lot of 50 items contains 3 defective items, and 10 items are sampled without replacement. Let \(X\) be the number of defectives in the sample.

\[ X \sim Hypergeometric(50, 47, 10) \]
with pmf: \[ P(X = x) = \frac{\binom{3}{x}\binom{47}{10-x}}{\binom{50}{10}} \]
Find the probability that the sample contains exactly one defective item, \(P(X = 1)\).
using pmf \[ P(X = 1) = \frac{\binom{3}{1}\binom{47}{10-1}}{\binom{50}{10}} \]
choose(3, 1)*choose(47, 9) / choose(50, 10)
## [1] 0.3979592
using R commands
dhyper(1, 3, 47, 10)
## [1] 0.3979592
simulation
<- rhyper(10000, 3, 47, 10)
n.defective mean(n.defective==1)
## [1] 0.3857
Next, find the probability that the sample contains at most one defective item, \(P(X \leq 1)\).

using R commands
phyper(1, 3, 47, 10)
## [1] 0.9020408
simulation
mean(n.defective<=1)
## [1] 0.9014
Let X be the number of real diamonds in the first three stolen gems. So \(X \sim Hypergeometric(35, 25, 3)\)
If the 4th gem is the 2nd real gem, then we need to find the probability that 1 out of the first 3 gems is real. \(P(X=1)\)
p.1real.in.3.gems <- dhyper(1, 10, 25, 3)
Then we multiply this probability by the probability that the 4th gem is also real, which is 9/32 because there are 9 real gems left out of the 32 total gems left.
p.1real.in.3.gems*(9/32)
## [1] 0.1289152
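As a final check, a simulation sketch of this combined calculation (assuming 10 real diamonds among 35 gems, as in the example):

sims <- replicate(10000, {
  gems <- sample(c(rep(1, 10), rep(0, 25)))   # a random stealing order; 1 = real diamond
  sum(gems[1:3]) == 1 && gems[4] == 1         # second real diamond is the 4th gem stolen
})
mean(sims)   # should be close to 0.1289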