Introduction

R, RStudio, and R Markdown are very powerful tools, with more features than we could cover in a single semester, so we won’t try. We’re going to take it slowly and introduce new features over time. By the end of the semester your documents will look nearly professional. This is how you will submit assignments, so you will get lots of practice.

  • R: The programming language (the engine)
  • RStudio: The integrated development environment (IDE) that we will use to work with R. (the car)
  • R Markdown: A type of document that combines R code with Markdown formatting to create a literate, reproducible document. (the heated seats, dual climate control and shimmery paint job)

At this point it is expected that you have completed the class_setup assignment, which walks you through downloading and installing R, RStudio, and \(\LaTeX\), getting your class folder set up, and downloading your homework 1 assignment.


Programming with R

Terminology

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions.

We write, or code, instructions in R because it is a common language that both the computer and we can understand.

We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.

The console pane is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed.

You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.

Code appearance

In these notes, code is displayed like this:

2+2
## [1] 4

where the output or result of the code is prefixed with two pound signs (##).


Arithmetic

R is an overgrown calculator.

Type the following in the console and hit Enter to run these commands one at a time.

1+1
4-3
3*7
8/3
2^3
pi^2

There are many built-in functions in R, such as log or exp. Using these functions in R is not much different from using them on your calculator. The function log has two arguments: x, which is required, and base, which has a default value. It computes \(\log_{b}(x)\), where b is the base. The default base is exp(1), i.e. the natural logarithm.

You try it

Below are some examples of built in functions in R. Run them from your console.

exp(2)
log(8)
log(8,base=2)
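
As a quick check on the base argument, the sketch below compares log with an explicit base against the change-of-base identity \(\log_b(x) = \ln(x)/\ln(b)\). The object names are made up for illustration.

```r
# The default base is exp(1), so these two calls compute the same thing
natural_default  <- log(8)
natural_explicit <- log(8, base = exp(1))

# Log base 2, computed directly and through the change-of-base identity
log2_direct <- log(8, base = 2)
log2_change <- log(8) / log(2)

# all.equal() compares numbers up to a small rounding tolerance
all.equal(natural_default, natural_explicit)
all.equal(log2_direct, log2_change)
```

We use all.equal rather than == here because computer arithmetic with decimals can be off by a tiny rounding error.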

Now let’s try a more complicated equation.

2 + 5*(8^3)- 3*log10)
## Error: <text>:1:21: unexpected ')'
## 1: 2 + 5*(8^3)- 3*log10)
##                         ^

Uh oh, we got an Error. Nothing to worry about, errors happen all the time. Put an open parenthesis ( between log and 10, so that it reads log(10), and try again.

> R is waiting on you…

In the console type the following code, then press Enter.

2 + 5*(8^3)- 3*log(10

Notice the console shows a + prompt. This means that you haven’t finished entering a complete command.

This is because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks.

When this happens, and you thought you finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.

Storing results

Most of the time we will want to save our results. To do so we use the assignment operator <-. This lets us save a value into an object, which we can then use later, similar to a variable \(x\) in algebra.

height <- 62
  • Visually identify the height object in your Global Environment (top right panel)
  • Assigned objects aren’t printed automatically. You must either enter the name of the object or wrap the command in parentheses.
(height <- 62)
## [1] 62
height
## [1] 62

Naming conventions: Be creative, yet informative with your variable names. You will be writing code a lot this semester, and you want your code to be unique to you and not look like a carbon copy of your neighbor. While pants and nifty are valid variable names, they may not be the best to describe the results from a die roll.

You try it

Run each of the following commands in your console one at a time. Print out the value of height after each one. What happens?

(height <- height + 2)
(height <- 3 * height)

Integration

To do definite integration, you first define a function:

myfun <- function(x){x+5}

Then pass it to the integrate function.

(myint <- integrate(myfun, lower=0, upper=3))
## 19.5 with absolute error < 2.2e-13

The result of this integration can be accessed using $value

myint$value
## [1] 19.5
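
One way to convince yourself that integrate is doing what you expect is to compare it with the answer you would get by hand. The sketch below assumes the antiderivative \(x^2/2 + 5x\), which we worked out ourselves; the name antiderivative is not a built-in R function.

```r
# The function we want to integrate
myfun <- function(x) { x + 5 }

# Antiderivative worked out by hand (hypothetical helper, not part of R)
antiderivative <- function(x) { x^2 / 2 + 5 * x }

# Exact answer from the Fundamental Theorem of Calculus
exact_value <- antiderivative(3) - antiderivative(0)

# Numerical answer from integrate()
numeric_value <- integrate(myfun, lower = 0, upper = 3)$value

exact_value
all.equal(exact_value, numeric_value)
```

Both approaches give 19.5, up to the small numerical error that integrate reports.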

R Markdown

Credit Allison Horst

How does it work?

Ideally, analysis reports are reproducible documents: if an error is discovered, or if some additional subjects are added to the data, you can just re-compile the report and get the new or corrected results rather than having to reconstruct figures, paste them into a Word document, and hand-edit various detailed results.

This process is known as literate programming.

Basic Components of an R Markdown file

  • YAML header
  • code chunks
  • Markdown
    • bulleted and numbered lists
    • bold, italic
      • Bold: **text**
      • Italic: _text_
    • section headers
      • First level: # text
      • Second level: ## text

You try it.

  1. Create a new R Markdown document. Save this to your class folder as test.Rmd.
  2. Delete all of the R code chunks and write a bit of Markdown (some sections, some italicized text, and an itemized list).
  3. Knit this file when you are done.

Basic \(\LaTeX\)

  • Inline \(\LaTeX\) code is surrounded by dollar signs $, so $x^{2}$ resolves as \(x^{2}\).
  • \(\LaTeX\) code does NOT go inside R code chunks. It’s not R code.
  • To create equations on their own line, surround your equation with double dollar signs $$. For readability, put a blank line before and after your equation in your Markdown document, and put each $$ on its own line. Example:
$$
k_{n+1} = n^2 + k_n^2 - k_{n-1}
$$

resolves as

\[ k_{n+1} = n^2 + k_n^2 - k_{n-1} \]

  • Superscript: ^
  • Subscript: _
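  • Greek letters, lower and upper case: \alpha, \beta, \gamma, \Gamma. Uppercase alpha and beta look like the Latin letters A and B, so \(\LaTeX\) has no \Alpha or \Beta commands; just type A or B.
  • Sums and integrals: \sum (\(\sum\)), \sum_{i=1}^{10} t_i (\(\sum_{i=1}^{10} t_i\)), \int (\(\int\)), \int_0^\infty (\(\int_0^\infty\))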

More help writing math in \(\LaTeX\): https://en.m.wikibooks.org/wiki/LaTeX/Mathematics

Visual Editor

RStudio has a nice visual editor to help you see what your compiled document will look like, and to help you format your work more nicely. See the RStudio visual editor documentation for more information; it also has help for technical writing such as \(\LaTeX\).

Follow the link above and switch to visual editor mode for your new R Markdown test document.

Then type out the Pythagorean theorem, and knit to PDF to make sure it looks right.


Vectors

A vector is an ordered collection of values of the same type. For us they will usually represent data collected on a characteristic of the population. In general, we want to give the vector a name so that we can call it later when needed.

  • Vector of the first 5 primes
primes <- c(2,3,5,7,11)
primes
## [1]  2  3  5  7 11
  • Vector of the numbers 1 through 10
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
  • Sequence of odd numbers from 1 to 10
(odds <- seq(1, 10, by=2))
## [1] 1 3 5 7 9
  • We can use the rep function to repeat a sequence of numbers in varying patterns. Read ?rep for a text explanation of the differences.
rep(c(2,3), times=c(4,3))
## [1] 2 2 2 2 3 3 3
rep(c(2,3), each=2)
## [1] 2 2 3 3
rep(c(2, 3), length.out = 3)
## [1] 2 3 2
  • A vector does not have to be numeric. Put quotes around each word to create a character vector.
rep(c("Bryan", "Darrin"), each = 2)
## [1] "Bryan"  "Bryan"  "Darrin" "Darrin"
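
To see the difference between times and each side by side, here is a small sketch using the same pair of numbers in both calls. The object names are just for illustration.

```r
pair <- c(2, 3)

# times = 2 repeats the whole vector twice
whole_twice <- rep(pair, times = 2)
whole_twice
## [1] 2 3 2 3

# each = 2 repeats every element twice before moving to the next one
each_twice <- rep(pair, each = 2)
each_twice
## [1] 2 2 3 3
```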

Indexing vectors

We can index vectors to pull off values at a particular position on a vector.

Returns first and second number of the vector we called “primes” above

primes[1]
## [1] 2
primes[2]
## [1] 3

Returns first three numbers of a vector

primes[1:3]
## [1] 2 3 5

We can turn a vector into a vector of TRUE and FALSE values by writing a logical statement about it.

primes>6
## [1] FALSE FALSE FALSE  TRUE  TRUE

This is super useful to do things like identify the numbers in primes that are greater than 6

primes[primes>6]
## [1]  7 11

And count the number of primes greater than 6

sum(primes>6)
## [1] 2

It may not seem useful to you now, but these are fundamental to calculating probabilities in the next chapter.
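
As a small preview of why this matters, the sketch below finds the positions of the primes greater than 6 and the proportion of values that meet the condition. The name big is just an illustrative choice.

```r
primes <- c(2, 3, 5, 7, 11)

# Logical vector: TRUE where the condition holds
big <- primes > 6

# which() returns the positions of the TRUE values
which(big)
## [1] 4 5

# mean() of a logical vector is the proportion of TRUE values: 2 out of 5
mean(big)
## [1] 0.4
```

A proportion like this is exactly what we will use to estimate a probability from simulated data.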


Logical Operators

Often we are going to want to compare elements of a vector to a value, or perhaps another vector. The standard comparison operators that will return either TRUE or FALSE are ==, !=, >, <, >=, and <=. Note the double equals sign: a single = is used for assignment, not comparison.

  • Is 4 a prime?
primes[primes == 4]
## numeric(0)
  • Another way to see if a value is contained inside a vector is using the %in% operator.
4 %in% primes
## [1] FALSE
  • This also works to compare each element of a vector. Here it’s returning TRUE or FALSE for each element in odds, depending on whether or not it is also an element of primes.
odds %in% primes
## [1] FALSE  TRUE  TRUE  TRUE FALSE
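
The sketch below shows two more ways to use the comparison operators: any() to get a single TRUE/FALSE answer, and a position-by-position comparison of two vectors of the same length. The object names are only for illustration.

```r
primes <- c(2, 3, 5, 7, 11)
odds <- seq(1, 10, by = 2)

# any() collapses a logical vector to one answer: is at least one element TRUE?
is_four_prime <- any(primes == 4)
is_four_prime
## [1] FALSE

# Comparing two vectors works position by position: 1<2, 3<3, 5<5, 7<7, 9<11
odds_smaller <- odds < primes
odds_smaller
## [1]  TRUE FALSE FALSE FALSE  TRUE
```

Notice that odds < primes compares matching positions, while odds %in% primes asks whether each value appears anywhere in primes.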

TRUE and FALSE as binary indicators

What’s cool about TRUE and FALSE, is that TRUE resolves as 1 and FALSE resolves as 0 when doing arithmetic.

  • How many odd numbers are prime (in our given vectors)
sum(odds %in% primes)
## [1] 3

This can be very useful if we want to count the number of elements in a vector that meet a certain criterion.
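
Here is a minimal sketch of that conversion in action, using a made-up logical vector called flags.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

# Converting explicitly shows the 1s and 0s hiding behind TRUE and FALSE
as_numbers <- as.integer(flags)
as_numbers
## [1] 1 0 1 1

# Arithmetic does the same conversion automatically
two <- TRUE + TRUE
two
## [1] 2

# Adding up the 1s counts the TRUE values
total_true <- sum(flags)
total_true
## [1] 3
```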

AND and OR

Often we will want to find out if two events are true at the same time, or if at least one of them is true. Using parentheses to help keep our statements organized, we can combine multiple logical statements using either AND (&) or OR (|).

  • Is 9 an odd prime?
(9 %in% odds) & (9 %in% primes)
## [1] FALSE

The and & results in a TRUE only if both values are TRUE. Here, 9 is an odd, but not a prime. So (9 %in% odds) is TRUE and (9 %in% primes) is FALSE. The combined statement “TRUE & FALSE” resolves as FALSE.

You try it

Is 9 an odd or a prime?

(9 %in% odds) | (9 %in% primes)
## [1] TRUE

The or “|” results in a TRUE if either value is true. The combined statement “TRUE OR FALSE” is TRUE.
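
The & and | operators also work on whole vectors at once. As a sketch, we can check every number from 1 to 10 against both conditions and then use the result to index. The name nums is an illustrative choice.

```r
primes <- c(2, 3, 5, 7, 11)
odds <- seq(1, 10, by = 2)
nums <- 1:10

# Numbers that are BOTH odd and prime
odd_and_prime <- nums[(nums %in% odds) & (nums %in% primes)]
odd_and_prime
## [1] 3 5 7

# Numbers that are odd OR prime (or both)
odd_or_prime <- nums[(nums %in% odds) | (nums %in% primes)]
odd_or_prime
## [1] 1 2 3 5 7 9
```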


Data Types

All data in R has a data type, and certain functions only work on certain data types. Here are common ones you will see in this class.

  • int integer
  • num number
  • chr character. aka string, aka text
  • logi logical. Can only be TRUE or FALSE

Which data types do you think you could take the mean() of?
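
If you are not sure what type something is, you can ask R. The sketch below uses class() to get the full type name and str() to print the short abbreviation listed above. The example values are arbitrary.

```r
# The L suffix tells R to store a whole number as an integer
int_class <- class(5L)        # "integer"
num_class <- class(5.5)       # "numeric"
chr_class <- class("five")    # "character"
lgl_class <- class(TRUE)      # "logical"

c(int_class, num_class, chr_class, lgl_class)

# str() shows the abbreviated type (here num) along with the values
str(c(1.5, 2.5))
```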


Simulation

This is a quick reference. Chapter 2 goes into more detail.

Draw 10 samples from the numbers 1,2 or 3 with replacement.

sample(c(1,2,3), 10, replace=TRUE)
##  [1] 1 1 3 2 1 1 2 2 3 2

Conduct an experiment multiple times. For each replication, only the value of the last expression inside the braces is returned. E.g., x is not retained, only the value of mean(x).

replicate(5, {
  x <- sample(c(1,2,3), 10, replace=TRUE)
  mean(x)
})
## [1] 2.2 2.0 2.4 1.6 2.2
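
Because these are random draws, your numbers will differ from the ones shown in these notes and will change every time you run the code. If you want results that can be reproduced, one option is to set the random seed first with set.seed(). The seed value 2024 below is an arbitrary choice, and the object names are only for illustration.

```r
# Setting the same seed before each run makes the random draws repeat
set.seed(2024)
first_run <- replicate(5, {
  x <- sample(c(1, 2, 3), 10, replace = TRUE)
  mean(x)
})

set.seed(2024)
second_run <- replicate(5, {
  x <- sample(c(1, 2, 3), 10, replace = TRUE)
  mean(x)
})

# Same seed, same code, same results
identical(first_run, second_run)
## [1] TRUE
```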

Plotting

For starters we are going to stick with the simple plotting method in R, namely the plot function. Short and simple, the arguments are plot(x, y).

x <- -10:10
y <- x^2
plot(x, y)

The default is to just plot the points, but sometimes we may want to connect those points with a line (l is a lower case L). See ?plot for more plotting types.

plot(x, y, type='l')

You try it

Create a plot of \(y = \log(x+1)\) where \(x\) is a sequence of non-negative numbers from \(a\) to \(b\) and YOU get to choose \(a\) and \(b\).

x <- seq(from=0, to=10, by=.01)
y <- log(x + 1)
plot(x, y)

Visualizing Distributions of Random Variables

Discrete random variables

Frequency table

get.numbers <- sample(1:10, 1000, replace=TRUE) # generate fake data
table(get.numbers) # create the table
## get.numbers
##   1   2   3   4   5   6   7   8   9  10 
##  89 104  86 100  93 103 108 102 109 106

Proportions

proportions(table(get.numbers))
## get.numbers
##     1     2     3     4     5     6     7     8     9    10 
## 0.089 0.104 0.086 0.100 0.093 0.103 0.108 0.102 0.109 0.106

Plot the table of proportions.

plot(proportions(table(get.numbers)))
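
As a sanity check, proportions are just the counts divided by the total, so they should add up to 1. The sketch below verifies this on freshly simulated data; the seed and the object names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the example can be repeated
rolls <- sample(1:10, 1000, replace = TRUE)

counts <- table(rolls)
props <- proportions(counts)

# The same proportions computed by hand
by_hand <- counts / sum(counts)

# The proportions add up to 1 and match the by-hand version
sum(props)
all.equal(as.numeric(props), as.numeric(by_hand))
```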

Continuous random variables

We can plot functions directly, or create histograms and density curves from simulated values.

  • Direct plotting of known functions
x <- seq(0,1, by=0.01) # create values in the domain
y <- 4*x^3 # pdf: f(x) = 4x^3 on the interval from 0 to 1
plot(x,y, type = 'l') # the lower case 'l' draws a line

  • simulating distributions
x <- rnorm(1000) # draw 1000 values from a standard normal distribution
hist(x, nclass=30) # create a frequency histogram with about 30 bins (R treats the number as a suggestion)

  • Adding a known distributional curve over a histogram. Need to use prob=TRUE to change the y axis to a density so it’s on the same scale as the curve.
hist(x, nclass=30, prob=TRUE) 
curve(dnorm(x), add=TRUE, col="red") # note: inside curve() the expression must always be written in terms of x
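
A pdf must have a total area of 1 underneath it. As a sketch, we can confirm this for the function \(4x^3\) plotted above using integrate from earlier in these notes. The name pdf_fun is our own choice for illustration.

```r
# The pdf from the direct plotting example, f(x) = 4x^3 for x between 0 and 1
pdf_fun <- function(x) { 4 * x^3 }

# Total area under the curve should be 1
total_area <- integrate(pdf_fun, lower = 0, upper = 1)$value
total_area

# Area under part of the curve is a probability, e.g. P(X < 0.5) = 0.5^4
prob_below_half <- integrate(pdf_fun, lower = 0, upper = 0.5)$value
prob_below_half
```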