Introduction

  • The goal of statistics is to describe the real world based on limited observations.
  • Observations are influenced by both random and non-random conditions (e.g. the weather, what you ate for breakfast).
  • Probability is a way to mathematically describe random events.

Vocabulary

  • Experiment is a process that produces an observation.
  • Outcome is a possible observation.
  • Sample space is the set of all possible outcomes.
  • Event is a subset of the sample space that describes a certain characteristic of the space.
  • Trial is a single running of an experiment.

Examples

  1. Roll a die and observe the number of dots on the top face.
    • Experiment
    • Six possible outcomes
    • The sample space is the set S={1,2,3,4,5,6}.
    • The event “roll higher than 3” is the set {4,5,6}.
  2. Stop a random person on the street and ask them what month they were born.
    • Twelve months of the year as possible outcomes.
    • Example event E might be that they were born in a summer month, E={June,July,August}
  3. Suppose a traffic light stays red for 90 seconds each cycle. While driving you arrive at this light, and observe the amount of time that you are stopped until the light turns green.
    • Sample space is the interval of real numbers [0,90].
    • The event “you didn’t have to stop” is the set {0}.

You try it:

For each of the following problems, identify the sample space and the events described in set notation.

  1. Observe eye color of a group of students.
    • Sample space: {green, blue, brown, hazel}
    • Event student does not have blue eyes: {green,brown,hazel}
  2. Number of credits a student can take:
    • Sample space: {0,1,2,…,22}
    • Event student takes fewer than 9 credits: {0,1,…,8}
  3. Toss a coin and roll a die.
    • Sample space: {1H,2H,3H,4H,5H,6H,1T,2T,3T,4T,5T,6T}
    • Event that you get tails: {1T,2T,3T,4T,5T,6T}
  4. A soccer team is in the playoffs. The team will play three games and will either win (w) or lose (l) each game (assume ties are not allowed).
    • Sample Space: {www,wll,lwl,llw,lll,lww,wlw,wwl}
    • Event that at least 2 games are won: {www,lww,wlw,wwl}
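
Sample spaces like the one in problem 4 can also be enumerated by computer rather than by hand. The following is a minimal sketch of one way to do this in base R using `expand.grid`; the object names (`games`, `outcomes`, `wins`, `at_least_two`) are illustrative and not part of the original notes.

```r
# All win/loss sequences for three games (w = win, l = loss)
games <- expand.grid(g1 = c("w", "l"), g2 = c("w", "l"), g3 = c("w", "l"),
                     stringsAsFactors = FALSE)
outcomes <- paste0(games$g1, games$g2, games$g3)  # the sample space

# Count the wins in each outcome, then keep outcomes with at least 2 wins
wins <- rowSums(games == "w")
at_least_two <- outcomes[wins >= 2]

length(outcomes)  # number of outcomes in the sample space
at_least_two      # the event "at least 2 games are won"
```

Listing outcomes this way is most useful as a check that no outcome was missed when writing the sample space by hand.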

Set Definitions

Let \(A\) and \(B\) be events in a sample space \(S\). Complete the following definitions and write an example of each using the situations described above.

  • \(A\cap B\) is the set of outcomes that are in both \(A\) and \(B\) at the same time.
  • \(A\cup B\) is the set of outcomes that are in either \(A\) or \(B\) or both.
  • The complement of \(A\) is \(\bar{A}=S\setminus A\) or \(A^{c}\). So, \(A^{c}\) is the set of outcomes in \(S\) not in \(A\).
  • The symbol \(\emptyset\) is the empty set, the set with no outcomes.
  • \(A\) and \(B\) are disjoint or mutually exclusive if and only if \(A\cap B=\emptyset\).
  • \(A\cap B^{c}\) is the set of outcomes that are in \(A\) and NOT in \(B\).

Note that element and outcome can be used interchangeably.

Venn Diagram

The most common kind of picture used to describe sample spaces and the events within them is a Venn diagram. A Venn diagram uses overlapping circles or other shapes to illustrate the logical relationships between two or more sets of items.

Use Venn diagrams to visualize the set definitions above.

Example

Say 3 roommates are deciding on a pet. They use a Venn Diagram to determine which pet might be the best pick for them.

  • Sidney prefers: cat, bird, hamster, spider, goat.
  • Ralph prefers: dog, cat, fish, goat.
  • Gilbert prefers: horse, cat, dog, turtle, snake, goat, fish.

What pet should they choose? A cat or a goat, since those are the only two pets that appear on all three lists.

You try it:

A single card is drawn from a standard deck of cards. (Not sure what that looks like? See here: https://en.wikipedia.org/wiki/Standard_52-card_deck). Let \(A\) be the event that an ace is selected, and let \(B\) be the event that a heart is drawn.

  1. Define \(A\) and \(B\) using set notation.
    • \(A\) = {ace of hearts, ace of diamonds, ace of clubs, ace of spades}
    • \(B\) = {2 of hearts, 3 of hearts, …, ace of hearts}
  2. Write out each of the following events, and describe what it means in the context of a deck of cards.
    • \(A\cup B\) is all hearts and all aces: {2 of hearts, 3 of hearts, …, ace of hearts, ace of diamonds, ace of clubs, ace of spades}
    • \(A\cap B\) is the ace of hearts: {ace of hearts}
    • \(B^{c}\) is everything in the deck that is not a heart.

Set operations in R

We can also rely on R to perform set calculations. The functions union, intersect, and setdiff compute unions, intersections, and set differences, respectively. Each function takes exactly two vectors.
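
Because each of these functions works on two vectors at a time, combining three or more sets means either nesting calls or applying the function pairwise across a list. Here is a small sketch using the roommates' pet preferences from the Venn diagram example, assuming base R's `Reduce`; the variable names are illustrative.

```r
sidney  <- c("cat", "bird", "hamster", "spider", "goat")
ralph   <- c("dog", "cat", "fish", "goat")
gilbert <- c("horse", "cat", "dog", "turtle", "snake", "goat", "fish")

# Nested calls: intersect two sets, then intersect the result with the third
nested <- intersect(intersect(sidney, ralph), gilbert)

# Equivalent: Reduce applies intersect pairwise across the list
folded <- Reduce(intersect, list(sidney, ralph, gilbert))

nested
folded
```

Both approaches should return the pets that appear on all three lists. Nesting is easier to read for three sets, while `Reduce` scales to any number of sets.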

Example

Let’s revisit the deck of cards problem from above: A single card is drawn from a standard deck of cards. Let \(A\) be the event that an ace is selected, and let \(B\) be the event that a heart is drawn.

First create the sample space and event vectors. I recommend that when you do this on your own you print the vector to ensure that what’s being created is what is intended. Trust, but verify your code.

numbers <- rep(c(2:10, "J", "Q", "K", "A"), 4)
suits <- rep(c("H", "C", "D", "S"), each = 13)
deck <- paste0(numbers, suits) # Sample Space
aces <- c("AH", "AC", "AD", "AS") # Event A
hearts <- paste0(c(2:10, "J", "Q", "K", "A"), "H") # Event B

Then we can use R functions to find the following sets.

  • \(A\cup B\)
(aces.or.hearts <- union(aces, hearts)) # union: aces OR hearts
##  [1] "AH"  "AC"  "AD"  "AS"  "2H"  "3H"  "4H"  "5H"  "6H"  "7H"  "8H"  "9H" 
## [13] "10H" "JH"  "QH"  "KH"
  • \(A\cap B\)
(ace.of.hearts <- intersect(aces, hearts))
## [1] "AH"
  • \(B^{c}\)
(no.hearts <- setdiff(deck, hearts))
##  [1] "2C"  "3C"  "4C"  "5C"  "6C"  "7C"  "8C"  "9C"  "10C" "JC"  "QC"  "KC" 
## [13] "AC"  "2D"  "3D"  "4D"  "5D"  "6D"  "7D"  "8D"  "9D"  "10D" "JD"  "QD" 
## [25] "KD"  "AD"  "2S"  "3S"  "4S"  "5S"  "6S"  "7S"  "8S"  "9S"  "10S" "JS" 
## [37] "QS"  "KS"  "AS"
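
The same functions also cover the remaining definition, \(A\cap B^{c}\), and allow a quick size check on the complement. This sketch rebuilds the deck so that it runs on its own; the names `ranks`, `aces.not.hearts`, and `sizes.match` are illustrative.

```r
# Rebuild the deck and events so this block is self-contained
ranks  <- c(2:10, "J", "Q", "K", "A")
deck   <- paste0(rep(ranks, 4), rep(c("H", "C", "D", "S"), each = 13))
aces   <- paste0("A", c("H", "C", "D", "S"))
hearts <- paste0(ranks, "H")

# A intersect B^c: aces that are not hearts
(aces.not.hearts <- setdiff(aces, hearts))

# Size check: an event and its complement together account for the whole deck
(sizes.match <- length(hearts) + length(setdiff(deck, hearts)) == length(deck))
```

Note that `setdiff(aces, hearts)` gives the same set as `intersect(aces, setdiff(deck, hearts))`, without building the complement first.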

You try it:

Suppose that one card is to be selected from a deck of 20 cards that contains 10 red cards numbered from 1 to 10 and 10 blue cards numbered from 1 to 10. Let \(A\) be the event that a card with an even number is selected, let \(B\) be the event that a blue card is selected, and let \(C\) be the event that a card with a number less than 5 is selected.

Define the sample space and each event in R.

S <- c(paste0(1:10, "R"), paste0(1:10, "B"))
A <- c(paste0(seq(2,10, by=2), "R"), paste0(seq(2,10, by=2), "B"))
B <- paste0(1:10, "B")
C <- c(paste0(1:4, "R"), paste0(1:4, "B"))
# print objects
S;A;B;C
##  [1] "1R"  "2R"  "3R"  "4R"  "5R"  "6R"  "7R"  "8R"  "9R"  "10R" "1B"  "2B" 
## [13] "3B"  "4B"  "5B"  "6B"  "7B"  "8B"  "9B"  "10B"
##  [1] "2R"  "4R"  "6R"  "8R"  "10R" "2B"  "4B"  "6B"  "8B"  "10B"
##  [1] "1B"  "2B"  "3B"  "4B"  "5B"  "6B"  "7B"  "8B"  "9B"  "10B"
## [1] "1R" "2R" "3R" "4R" "1B" "2B" "3B" "4B"

Alternatively, this code has the same result.

numbers <- rep(1:10, 2)
colors <- rep(c("R", "B"), each = 10)
deck <- paste0(numbers, colors)

A <- deck[seq(from=2, to=20, by=2)]
B <- deck[11:20]
C <- deck[c(1:4, 11:14)] # numbers less than 5 are 1 through 4

Another alternative method:

sample_space <- c("1R","2R","3R","4R","5R","6R","7R","8R","9R","10R",
                "1B","2B","3B","4B","5B","6B","7B","8B","9B","10B")

A <- c("2R", "4R", "6R", "8R", "10R", "2B", "4B", "6B", "8B", "10B")
B <- c("1B", "2B", "3B", "4B", "5B", "6B", "7B", "8B", "9B", "10B")
C <- c("1R", "1B", "2R", "2B", "3R", "3B", "4R", "4B")
  1. \(A\cap B\cap C\)
(a_and_b   <- intersect(A,B))
## [1] "2B"  "4B"  "6B"  "8B"  "10B"
(a_and_b_and_c <- intersect(a_and_b,C))
## [1] "2B" "4B"
  2. \(B\cup C^{c}\)
C_complement <- setdiff(S,C)
(b_or_Cc <- union(B, C_complement))
##  [1] "1B"  "2B"  "3B"  "4B"  "5B"  "6B"  "7B"  "8B"  "9B"  "10B" "5R"  "6R"
## [13] "7R"  "8R"  "9R"  "10R"
  3. \(A\cap (B\cup C)\)
b_or_c <- union(B, C)
(a_and_b_or_c <- intersect(A, b_or_c))
## [1] "2R"  "4R"  "2B"  "4B"  "6B"  "8B"  "10B"
  4. \(A^{c}\cap B^{c}\cap C^{c}\)
A_complement <- setdiff(S,A)
B_complement <- setdiff(S,B)
(Ac_and_Bc <- intersect(A_complement,B_complement))
## [1] "1R" "3R" "5R" "7R" "9R"
(Ac_and_Bc_and_Cc <- intersect(Ac_and_Bc, C_complement))
## [1] "5R" "7R" "9R"
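
One way to double-check the last answer is De Morgan's law, which says that \(A^{c}\cap B^{c}\cap C^{c}=(A\cup B\cup C)^{c}\). The sketch below redefines the sets so that it is self-contained. The helper `complement` is a hypothetical convenience function written for this example, not a built-in.

```r
S <- c(paste0(1:10, "R"), paste0(1:10, "B"))
A <- c(paste0(seq(2, 10, by = 2), "R"), paste0(seq(2, 10, by = 2), "B"))
B <- paste0(1:10, "B")
C <- c(paste0(1:4, "R"), paste0(1:4, "B"))

# Hypothetical helper: complement of an event relative to the sample space S
complement <- function(E) setdiff(S, E)

# Left side: intersect the three complements
lhs <- Reduce(intersect, list(complement(A), complement(B), complement(C)))
# Right side: complement of the union of all three events
rhs <- complement(Reduce(union, list(A, B, C)))

lhs
setequal(lhs, rhs)  # TRUE when both sides contain the same outcomes
```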

Definition of Probability

The probability of an event describes the proportion of times we expect the event to occur if we repeated the experiment an infinite number of times.

Let \(S\) be a sample space. A valid probability assigns to each event \(A\) a number \(P(A)\) between 0 and 1 (inclusive), so \(0\leq P(A)\leq 1\), and satisfies the following probability axioms:

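  1. The probability of the sample space is 1: \(P(S)=1\).
  2. Probabilities are countably additive. If \(A_{1}\), \(A_{2}\), …, \(A_{n}\) are disjoint then \[ P(A_{1}\cup A_{2}\cup\cdots\cup A_{n})=\sum_{i=1}^{n} P(A_{i}) \] and the same holds for a countably infinite sequence of disjoint events.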
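
The long-run proportion interpretation given at the start of this section can be illustrated by simulation. The following is a rough sketch that assumes a fair six-sided die (fairness is an assumption made here, not something stated above); the seed and the number of rolls are arbitrary choices.

```r
set.seed(1)        # arbitrary seed, so the simulation is reproducible
n_rolls <- 100000  # a large number of trials stands in for "infinitely many"
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Proportion of trials in which the event "roll higher than 3" occurred
prop_high <- mean(rolls > 3)
prop_high          # should be close to 3/6 = 0.5
```

Increasing `n_rolls` should bring the observed proportion closer to the theoretical probability.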

Probability rules

These are some important rules to memorize that come about as a result of the above axioms. Here are a few; there are more in the textbook.

Let \(A\) and \(B\) be events in the sample space \(S\).

  • The probability of \(\emptyset\) is 0.
  • If \(A\) and \(B\) are disjoint then \(P(A\cup B)=P(A)+P(B)\).
  • \(P(A)=1-P(A^{c})\).
  • \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\)
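
These rules can be checked numerically on the card example. The sketch below assumes every card is equally likely to be drawn, so the probability of an event is its size divided by 52. The function `prob` is a hypothetical helper written for this illustration.

```r
ranks  <- c(2:10, "J", "Q", "K", "A")
deck   <- paste0(rep(ranks, 4), rep(c("H", "C", "D", "S"), each = 13))
aces   <- paste0("A", c("H", "C", "D", "S"))
hearts <- paste0(ranks, "H")

# Hypothetical helper: probability of an event under equally likely outcomes
prob <- function(E) length(E) / length(deck)

# Complement rule: P(B) = 1 - P(B^c)
all.equal(prob(hearts), 1 - prob(setdiff(deck, hearts)))

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
all.equal(prob(union(aces, hearts)),
          prob(aces) + prob(hearts) - prob(intersect(aces, hearts)))
```

`all.equal` is used instead of `==` because the comparison involves floating-point arithmetic.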

Example

  1. These rules allow you to manipulate equations to find unknown quantities based on known ones using algebra. Let’s use these to show that \(P(A\cap B) \geq 1-P(A^{c})-P(B^{c})\) for any two events \(A\) and \(B\) defined on a sample space \(S\).

\[ \begin{align} P(A \cap B) & = P(A) + P(B) - P(A \cup B) \\ & = 1 - P(A^{c}) + 1 - P(B^{c}) - P(A \cup B) \\ & = [1 - P(A^{c})- P(B^{c})] + [1 - P(A \cup B)] \end{align} \]

Since \(0 \leq P(A \cup B) \leq 1\), we have \(1 - P(A \cup B) \geq 0\).

Then \(P(A \cap B) = [1 - P(A^{c})- P(B^{c})] +\) [something greater than or equal to 0], so \(P(A \cap B) \geq 1 - P(A^{c})- P(B^{c})\).

  2. Events \(A\) and \(B\) are defined on a sample space \(S\) such that \(P((A\cup B)^{c})=0.5\) and \(P(A\cap B)=0.2\). What is the probability that either \(A\) or \(B\) but not both will occur?

Sometimes Venn diagrams can be helpful for solving problems like this one.
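
One possible way to work the problem with \(P((A\cup B)^{c})=0.5\) and \(P(A\cap B)=0.2\) is sketched here. The event "either \(A\) or \(B\) but not both" is \(A\cup B\) with \(A\cap B\) removed. Since \(A\cap B\) is contained in \(A\cup B\), the probabilities subtract:

\[ \begin{align} P(A \cup B) & = 1 - P((A \cup B)^{c}) = 1 - 0.5 = 0.5 \\ P(\text{exactly one of } A \text{ and } B) & = P(A \cup B) - P(A \cap B) = 0.5 - 0.2 = 0.3 \end{align} \]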

You try it:

Suppose 50 percent of the families in a certain city subscribe to the morning newspaper, 65 percent subscribe to the afternoon newspaper, and 85 percent subscribe to at least one of the two newspapers. Draw a Venn diagram to represent this situation.

  • What percentage of the families subscribe to both newspapers?

\[ P(AM \cap PM) = P(AM) + P(PM) - P(AM \cup PM) \]

.50 + .65 - .85
## [1] 0.3
  • What percentage of the families subscribe to only the afternoon paper?

\(P(PM \cap AM^{c}) = P(PM) - P(AM \cap PM)\) =

.65 - .3
## [1] 0.35
  • What percentage of the families don’t subscribe to any paper?

\(P((AM \cup PM)^{c}) = 1 - P(AM \cup PM)\)

1 - .85
## [1] 0.15

Example

David Diez was interested in exploring the factors that contribute to an email being flagged as spam by Gmail’s system. So they downloaded all their emails for a few months in 2012 and noted certain characteristics, such as whether each email was flagged as spam (0 means no, and 1 means yes) and what size of number it contained (none, small, or big). A two-way table of emails by these two characteristics is shown below.

##      Size of number
## Spam  none small  big  Sum
##   0    400  2659  495 3554
##   1    149   168   50  367
##   Sum  549  2827  545 3921

If you were to randomly select an email from this pool, calculate the following probabilities:

  • It is flagged as spam
367/3921
## [1] 0.09359857
  • It has a big number
545/3921
## [1] 0.1389952
  • It is not flagged as spam and has a small number
2659/3921
## [1] 0.6781433
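
The same calculations can be done by storing the table in R instead of typing each count by hand. This is a sketch with the counts copied from the table above and illustrative object names. As an extra step that is not part of the original exercise, it uses the general addition rule to find the probability that an email is flagged as spam or has a big number.

```r
# Counts from the two-way table (rows: spam 0/1, columns: size of number)
emails <- matrix(c(400, 2659, 495,
                   149,  168,  50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Spam = c("0", "1"),
                                 Number = c("none", "small", "big")))
n <- sum(emails)                     # total number of emails

p_spam <- sum(emails["1", ]) / n     # P(spam)
p_big  <- sum(emails[, "big"]) / n   # P(big number)
p_both <- emails["1", "big"] / n     # P(spam and big number)

# General addition rule: P(spam or big) = P(spam) + P(big) - P(both)
p_spam + p_big - p_both
```

Indexing by row and column names keeps the code readable and avoids picking the wrong cell by position.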

You try it

The following data table describes the sex by species breakdown for 333 observed penguins on islands in the Palmer Archipelago, Antarctica.

##            Sex
## Species     female male Sum
##   Adelie        73   73 146
##   Chinstrap     34   34  68
##   Gentoo        58   61 119
##   Sum          165  168 333

If you were to select a penguin at random from these islands, what is the estimated probability that,

  • the penguin is female
165/333
## [1] 0.4954955
  • the penguin is a Gentoo species
119/333
## [1] 0.3573574
  • the penguin is a male Chinstrap
34/333
## [1] 0.1021021