This is a short introduction to R programming by SuffolkEcon, the Department of Economics at Suffolk University.
Some fundamentals of R.
At the end we will look at data on income and life expectancy and recreate this plot:
Say you want to make an omelette. It goes something like this:
Programming in R follows the same basic input \(\rightarrow\) action \(\rightarrow\) output framework:
Keep this framework in mind as you work through the tutorial.
The simplest way to use R is as a calculator.
For example, the code below calculates \(1+1\). Click the “Run Code” button in the chunk below:
1+1 # add one and one
and this code calculates \(2^3 + 4*5 + \frac{5}{6}\):
2^3 + 4*5 + 5/6 # two to the third, plus four times five, plus five divided by six
Now you try. Calculate \(2^2 \times 3^3\):
2^2 * 3^3
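A small sketch of why the calculations above come out the way they do: R follows the usual order of operations (exponents, then multiplication and division, then addition and subtraction), and parentheses override it. The numbers are just illustrations.

```r
2 + 3 * 4     # multiplication happens first: 2 + 12 = 14
(2 + 3) * 4   # parentheses force the addition first: 5 * 4 = 20
2^3 * 2       # the exponent happens before the multiplication: 8 * 2 = 16
```

When in doubt, add parentheses. They cost nothing and make the code easier to read.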
Do you remember what you had for breakfast last Tuesday? Probably not. Same goes for coding.
You will look back at code you wrote last week and realize you don’t know what it does, or why you wrote it.
Comments are notes you write in the code to help you (or somebody else) understand what’s going on.
Let’s look at that first chunk we ran:
1 + 1 # add 1 and 1
The code spits out “2”, not “# add 1 and 1.” That’s because “#” starts a comment, and everything after a “#” on that line is ignored by R.
Commenting is important – especially when you are starting out.
What should you comment? Everything.
When should you comment? Always.
Why should you comment? Because you want to be nice to your future self.
OK, we can use R to make calculations.
But what if we want to save the output of one calculation and use it in another calculation?
R has memory. What does it remember? Whatever you tell it to! This is called defining objects.
An object can be many things. Let’s try a simple one.
Suppose we want to calculate \(1+1\) and store it in an object. We have to give it a name. Let’s call it “x”. Run this code:
x = 1 + 1 # add 1 and 1, then assign the output to "x"
Notice two things:
First, = is the assignment operator. It says: “take 1+1 and assign its output to x”.
Second, when you run x = 1 + 1, nothing spits out. That’s because the output is stored in x.
Let’s see what’s inside x by printing it. We should see this code spit out “2”:
x # print "x" to the screen
It worked!
R will now remember x. That means we can use x in another calculation.
For example, let’s calculate \(x+1\):
x + 1 # add 1 to each element of x
We can do math with multiple objects. Let’s define an object x2 equal to 2 and then calculate \(x + x_2\):
x2 = 2 # assign the number 2 to the object "x2"
x + x2 # add x and x2
Finally, you can rewrite R’s memory. For example, let’s redefine x:
x = 2 + 2 # redefine "x" as the output of 2 + 2
x # print the new x to the screen
Now you try. Create an object called y, assign \(2^2\) to it, and then print y:
y = 2^2
y
Next, create an object called z, assign \(y/2\) to it, and then print z:
y = 2^2
z = y/2
z
So far so good. But science is about learning from many data points. So we need objects that can store many data points. The simplest are vectors.
Consider the vector \(\vec{x} = [5,10,15,20]\). To tell R to create this vector, we use the function c(), which combines values into a vector:
x = c(5,10,15,20) # create the vector [5,10,15,20] and assign it to "x"
x # print x to the screen
Once again we can do math with objects. Let’s calculate the mean of x:
mean(x) # calculate the mean of x
Or let’s multiply each element of x by 2:
x*2 # multiply each element of x by 2
We can work with all of x. Or we can work with one or a few elements of it. That’s because every element in \(\vec{x}\) (and x) has a numeric position:
To slice a vector we use x[position].
For example, let’s look at the first element of x:
x[1] # show the first element of x
or let’s take the third element of x (which is 15) and add three:
x[3] + 3 # take the third element of x and add three
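The examples above pull out one element at a time. As a rough sketch of how to grab a few elements at once, you can put a vector of positions inside the square brackets. Positions in R start at 1.

```r
x = c(5,10,15,20) # the same vector as above

x[c(1,3)] # the first and third elements: 5 15
x[2:4]    # the second through fourth elements: 10 15 20
x[-1]     # everything except the first element: 10 15 20
```

The shorthand 2:4 is the same as c(2,3,4), and a negative position means "drop this element".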
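Now you try. Create the vector y and divide each element by 4:
y = c(1,2,3,4)
y/4
Next, create the vector z and multiply its fourth element by the second element of y:
z = c(2,3,4,8)
z[4]*y[2]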
You’ve probably seen an Excel spreadsheet before. Something like this table below. It has columns, each column has a header, and each header has data below it:
In R this is called a dataframe. And a dataframe is just an object that chains vectors together.
Let’s create two vectors \(\vec{x}\) and \(\vec{y}\). Run this chunk to create them (nothing will get spit out, since we assigned the outputs to x and y):
x = c(1,2,3,4,5) # create a vector x
y = c(6,7,8,9,10) # create a vector y
We can combine them with data.frame(). As usual, we have to give our object a name. Let’s call it “df” for “dataframe”:
df = data.frame(x,y) # join x and y in a dataframe
df # print the dataframe to the screen
We see a column called “x” (that’s x), a column called “y” (that’s y), and on the left we see the row numbers. So df is a 5 (row) by 2 (column) data frame.
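You don’t have to count rows and columns by eye. Here is a minimal sketch using a few base R functions to check the shape of a data frame. It rebuilds the same df so the chunk runs on its own.

```r
x = c(1,2,3,4,5)     # create a vector x
y = c(6,7,8,9,10)    # create a vector y
df = data.frame(x,y) # join x and y in a dataframe

nrow(df)  # number of rows: 5
ncol(df)  # number of columns: 2
dim(df)   # rows, then columns: 5 2
names(df) # the column names: "x" "y"
```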
Just like you can subset vectors, you can access specific columns in data frames.
To access a column, we use $:
df$y # show the column named "y"
Let’s calculate the mean of the first column of df:
mean(df$x) # calculate the mean of the column named "x"
Finally, let’s create a new variable called xy that multiplies x and y and add it as a new column:
df$xy = df$x * df$y # multiply the elements of x with elements of y, then assign the output to a new column called "xy"
df # print the dataframe to the screen
Now you try. Create a new variable called xsquared that calculates \(x^2\), then add it to df. Then print the new data frame:
df$xsquared = df$x^2
df
Remember this?
So far we’ve worked on objects – the inputs.
But we’ve also worked on actions, like mean().
In R, actions are functions. What is a function? Something that takes an input, transforms it, and then returns the output.
For example, take mean(). It takes an input (a standalone vector, or a data frame column), and it calculates and returns the sum of the observations divided by the number of observations:
\[ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} \]
vec = c(1, 2, 3, 4, 5) # create a vector called "vec"
mean(vec) # calculate the mean
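To see the input, action, output idea up close, here is a sketch of a hand-written version of the formula above. The name my_mean is made up for this example; it is not part of R, and in practice you would just use mean().

```r
# my_mean is a hypothetical function written only for illustration
my_mean = function(v) {
  sum(v) / length(v) # sum of the observations divided by the number of observations
}

vec = c(1, 2, 3, 4, 5) # create a vector called "vec"
my_mean(vec)           # our version: 3
mean(vec)              # R's built-in version: 3
```

Both take a vector as the input, do the same calculation, and return a single number as the output.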
Now you try. Create a vector x with the numbers 10, 100, and 1000. Then calculate the mean of x.
x = c(10, 100, 1000)
mean(x)
Create a vector z with the numbers 10, 11, 12, 13, 14, 15. Then use median() to calculate the median of z.
z = c(10, 11, 12, 13, 14, 15)
median(z)
You’ve made it this far. Great job!
Don’t worry if you feel shaky about some of the material. We covered a lot of ground. So let’s review.
Remember the framework we started out with:
We learned about objects – the inputs.
Then we learned about functions – the actions.
And maybe you noticed that functions return new objects! Like when we added a column to a data frame.
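A quick sketch of that last point: because a function returns an object, you can store its output under a name and use it in the next step. The names m and centered are just for illustration.

```r
vec = c(1, 2, 3, 4, 5) # the input: a vector

m = mean(vec)          # the action returns an output, which we store in "m"
m + 1                  # use the stored output in another calculation: 4

centered = vec - m     # subtract the mean from each element: -2 -1 0 1 2
centered               # print the new object to the screen
```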
We pull this all together and restate our 1-2-3 process in R as:
And that’s it!
We’re ready to do some meaningful stuff. Let’s recreate this plot:
We’ll do it in two steps:
The Gapminder Project provides data on GDP per capita and life expectancy around the world from 1952 to 2007:
gapminder
gapminder is a data frame. We know that each column in a data frame is a vector. And we know we can calculate a statistic on a vector, like average life expectancy:
mean(gapminder$lifeExp)
But that calculation pools over time and space. It’s the mean for all years (1952-2007) and all countries and continents.
What we would like to do is calculate, for example, average life expectancy within categories.
That is, average life expectancy by year, by continent, by year and by continent, and so on.
dplyr: summarize data the easy way
dplyr is a package. A package is a plug-in that makes R easier to use. dplyr is a plug-in that makes it easier to work with data frames. Moreover, dplyr is part of a suite of packages for data science known as the tidyverse. We can load these packages with library(tidyverse):
library(tidyverse)
Suppose we want to calculate average life expectancy, GDP per capita, and population – by year and by continent.
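Let’s write down the steps, as if we were instructing a person:
1. Take the data, THEN
2. organize it by continent and year, THEN
3. calculate average life expectancy, average GDP per capita, and average population.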
OK. But how can we translate these to R?
Here are the translated instructions in dplyr. Run this chunk (nothing will pop out, because the output is stored in the new object gapminderSummary):
gapminderSummary = gapminder %>% # 1. take the data, THEN
  group_by(continent, year) %>% # 2. organize by continent and year, THEN
  summarise( # 3. calculate:
    avgLifeExp = mean(lifeExp), # average life expectancy
    avgGdpPercap = mean(gdpPercap), # average GDP per capita
    avgPop = mean(pop)) # average population
Let’s view the output:
gapminderSummary
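The same three steps work for other groupings too. As a sketch, here is average life expectancy by continent only, pooling over all years. This assumes the data comes from the gapminder package and that both it and dplyr are installed; the name lifeExpByContinent is made up for this example.

```r
library(dplyr)     # provides %>%, group_by() and summarise()
library(gapminder) # assumption: the gapminder data frame comes from this package

lifeExpByContinent = gapminder %>%      # 1. take the data, THEN
  group_by(continent) %>%               # 2. organize by continent only, THEN
  summarise(avgLifeExp = mean(lifeExp)) # 3. calculate average life expectancy

lifeExpByContinent # one row per continent
```

Swap continent for year inside group_by() to get one row per year instead.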
See that %>%? That is the pipe operator. Notice how it takes the place of “THEN” in our written instructions.
%>% lets you chain together small actions on a data frame:
gapminder (take the data) THEN
group_by(continent, year) (organize by continent and year) THEN
summarise(...) (calculate statistics on columns)
The nice thing about the pipe operator, and dplyr in general, is that you can write code the way you would write a sentence: chaining small thoughts together one-by-one.
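If the pipe still feels mysterious, this small sketch may help: the pipe takes whatever is on its left and hands it to the function on its right as the first input. So the two calls below do the same thing.

```r
library(dplyr) # loading dplyr makes %>% available

vec = c(1, 2, 3, 4, 5) # create a vector called "vec"

mean(vec)      # the usual way: 3
vec %>% mean() # take vec, THEN calculate the mean: 3
```

With one step the pipe doesn’t buy you much. It pays off when you chain several steps, like the group_by() and summarise() steps above.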
ggplot2: visualize data the easy way
ggplot2 is another package from the tidyverse. This one makes it easy to visualize data frames. Like dplyr, we can load it, along with dplyr and the other tidyverse packages, with
library(tidyverse)
The core action in ggplot2 is ggplot(). Every ggplot() follows three basic steps:
ggplot() begins with an empty plot. You just have to feed it the data and the columns you want to visualize. Let’s put average GDP per capita on the x-axis and average life expectancy on the y-axis:
gapminderSummary %>% # take the data, THEN
  ggplot(aes(x=avgGdpPercap, y=avgLifeExp)) # create empty plot
Layers in ggplot2 are called “geoms”. You just pick the geom that corresponds to the plot you want. If we want a scatterplot, then we add (+) the layer geom_point():
gapminderSummary %>% # take the data, THEN
  ggplot(aes(x=avgGdpPercap, y=avgLifeExp)) + # create empty plot, THEN (notice the +)
  geom_point() # add a scatterplot layer
Every new layer in a ggplot is added with +.
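Here is a minimal sketch of stacking more than one layer. It uses a small made-up data frame called toy (built from the same numbers as the x and y vectors earlier) rather than the Gapminder data, and it stores the plot in an object called p. Both names are just for illustration.

```r
library(ggplot2) # provides ggplot(), aes() and the geoms

# a toy data frame, only for illustration
toy = data.frame(x = c(1,2,3,4,5), y = c(6,7,8,9,10))

p = ggplot(toy, aes(x = x, y = y)) + # create empty plot, THEN
  geom_point() +                     # add a scatterplot layer, THEN
  geom_line()                        # add a line layer on top

p # print the plot
```

Each + adds one more layer on top of the ones before it, so you can build a plot up piece by piece.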
gapminderSummary %>% # take the data, THEN
  ggplot(aes(x=avgGdpPercap, y=avgLifeExp)) + # create empty plot, THEN (notice the +)
  geom_point(aes(size = avgPop, color = continent), alpha = 0.6) + # add a scatterplot layer, with points sized by a continent's average population, colored by continent, and 60% opacity, THEN
  guides(size = "none") + # turn off the point size legend
  labs(title = "Live Long and Prosper", # add a title
       subtitle = "Income and Life Expectancy, 1952 to 2007", # add a subtitle
       caption = "Source: Gapminder.org | @traffordDataLab", # add a caption
       x = "GDP per capita (US Dollars)", # x-axis title
       y = "Age (Years)", # y-axis title
       color = "Continent") # legend title
Don’t worry if you don’t understand everything about the code. The bigger picture is:
dplyr makes it easier to manipulate data, because you can write code like you would write sentences, and
ggplot2 makes it easier to visualize data, because any plot follows the same three steps.