This is a short introduction to summarizing and manipulating data in R by SuffolkEcon, the Department of Economics at Suffolk University.
Comments? Bugs? Drop us a line!
To go back to the SuffolkEcon website click here.
dplyr
is a package that makes it easier to work with data in R.
A package is like a plug-in. It ships a bunch of functions to your R so that you can use them whenever you want.
You can install dplyr
by running install.packages("dplyr")
in your R console. Or you can install the entire tidyverse – a family of data science packages that includes dplyr
– by running install.packages("tidyverse")
in your R console.
You can then invoke dplyr
and the other tidyverse
packages with
library(tidyverse)
We will demonstrate dplyr
using data from the Gapminder Project.
The gapminder
data shows life expectancy, income and population for each continent between the years 1952 - 2007:
gapminder %>%
slice_head(n=5)
Each column is a unique variable, and dplyr
provides functions to act on those columns.
Just like verbs act on nouns, functions act on objects, or columns of a data frame.
dplyr
provides functions that act on the R object dataframe
and tibble
(basically a fancier data frame).
A major function in dplyr
is summarise()
, used to summarise data:
data %>% # take your data, THEN
summarise() # do some calculation
There are a few other key functions that you do before or after summarise()
, like:
group_by()
to group observations by a category (or multiple categories)filter()
keep or filter data that meet certain criteriamutate()
create a new columnslice_
slice rows of the data frameselect()
slice columns of the data frameFor instance:
data %>% # take your data, THEN
filter() %>% # do some filtering, THEN
group_by() %>% # group by some category, THEN
mutate() %>% # create a new column, THEN
summarise() # do some calculation
%>%
When you write a sentence, you chain small thoughts together to form an idea.
When you analyze data, you chain small actions together to form an analysis.
Take the gapminder
data. Suppose you wanted to know average life expectancy by continent. You would:
dplyr
lets you write code in the same way. The key is the pipe operator, %>%
, which takes the place of “then”:
gapminder %>% # take the data, THEN
group_by(continent) %>% # organize the data by continent, THEN
summarise(avg_life_expectancy = mean(lifeExp)) # calculate average life expectancy.
So %>%
lets you chain together small actions on a data frame. This allows you to write code the way you would write a sentence: chaining small thoughts together one-by-one.
In general the pipe can be used to make nested code “linear” and thus easier to read. For instance, this:
# create a vector of integers 1 to 10, calculate the square root, then calculate the mean
mean(sqrt(1:10))
## [1] 2.246828
is the same as this:
1:10 %>% # create a vector of integers 1 to 10
sqrt() %>% # calculate the square root
mean() # calculate the mean
## [1] 2.246828
The slice_
operators let you slice (subset) rows:
slice_head(n=5)
: view the first 5 rowsslice_tail(n=5)
: view the last 5 rowsslice_sample(n=5)
: view 5 random rowsslice_min(column, n=5)
: view the 5 smallest values of a columnslice_max(column, n=5)
: view the 5 largest values of a columnThe first 3 rows of gapminder
:
gapminder %>%
slice_head(n=3)
The 6 largest values of gdpPercap
:
gapminder %>%
slice_max(gdpPercap, n=6)
Use select()
to view only certain columns:
gapminder %>%
select(year, gdpPercap) %>% # select year and gdpPercap
slice_max(gdpPercap, n = 1) # view the highest value of gdpPercap
Use select()
to select the columns for year (year
), continent (continent
) and life expectancy (lifeExp
), then use slice_min()
to find the two lowest values of life expectancy:
gapminder %>%
select(year, continent, lifeExp) %>%
slice_min(lifeExp, n = 2)
The gapminder
data has data on life expectancy (lifeExp
), population (pop
) and GDP per capita (gdpPercap
) by country, countinent and year:
gapminder %>%
slice_head(n=5)
We can calculate summary statistics for variables like lifeExp
, including:
mean()
the averagemedian()
the medianmin()
, max()
the min and maxvar()
, sd()
the variance and standard deviationn()
the count (number of observations)To do so we just pass the data to the dplyr
function summarise()
and then inside summarise()
make our calculation.
The general outline is:
data %>% # take your data, THEN
summarise() # do some calculation
Let’s calculate mean life expectancy lifeExp
:
gapminder %>%
summarise(mean(lifeExp))
Let’s give the statistic a name, like avg_life_expectancy
:
gapminder %>%
summarise(avg_life_expectancy = mean(lifeExp))
The median:
gapminder %>%
summarise(median_life_expectancy = median(lifeExp))
sd()
calculates the standard deviation:
gapminder %>%
summarise(sd_income = sd(gdpPercap))
Count the number of observations with n()
:
gapminder %>%
summarise(n_observations = n())
gapminder %>%
summarise(avg_life_expectancy = mean(lifeExp), # average life expectancy
avg_income = mean(gdpPercap), # average GDP per capita
avg_pop = mean(pop)) # average population
Calculate the mean, median and standard deviation of gdpPercap
. You don’t have to name the variables inside summarise()
.
gapminder %>%
summarise(mean(gdpPercap), median(gdpPercap), sd(gdpPercap))
What if we want to summarize a variable by a category or multiple categories?
For example, average life expectancy by continent, average population by year, median GDP per capita by country.
We can do this with group_by()
:
data %>% # take your data, THEN
group_by() %>% # group the data by some variable(s), THEN
summarise() # do some calculation
Average life expectancy by continent:
gapminder %>%
group_by(continent) %>%
summarise(avg_life_expectancy = mean(lifeExp))
Average population by year:
gapminder %>%
group_by(year) %>%
summarise(avg_pop = mean(pop))
Median GDP per capita by country:
gapminder %>%
group_by(country) %>%
summarise(med_income = median(gdpPercap))
Count the number of observations per continent:
gapminder %>%
group_by(continent) %>%
summarise(count = n())
Mean life expectancy by continent and by year:
gapminder %>%
group_by(continent, year) %>%
summarise(avg_life_expectancy = mean(lifeExp))
You can un-group a grouped data frame with ungroup()
.
Continuing with the previous example, this code will slice the largest value of avg_life_expectancy
for each continent/year:
gapminder %>%
group_by(continent, year) %>%
summarise(avg_life_expectancy = mean(lifeExp)) %>%
slice_max(avg_life_expectancy, n = 1)
If we ungroup()
after summarise()
then slice_max(avg_life_expectancy, n = 1)
will return the largest value in the entire data frame: (i.e., across all years and continents):
gapminder %>%
group_by(continent, year) %>%
summarise(avg_life_expectancy = mean(lifeExp)) %>%
ungroup() %>% # ungroup!
slice_max(avg_life_expectancy, n = 1)
gdpPercap
by continent. You don’t have to name the variables inside summarise()
.gapminder %>%
group_by(continent) %>%
summarise(mean(gdpPercap), median(gdpPercap), sd(gdpPercap))
gdpPercap
by continent and year. You don’t have to name the variables inside summarise()
.gapminder %>%
group_by(continent, year) %>%
summarise(mean(gdpPercap), median(gdpPercap), sd(gdpPercap))
filter()
allows you to filter rows of data with Boolean logic:
==
: “equal to”!=
: “not equal to”>
: “greater than”>=
: “greater than or equal to”<
: “less than”<=
: “less than or equal to”&
: “and”|
: “or”Boolean logic is any test that returns true or false:
2 == 3
## [1] FALSE
2 != 3
## [1] TRUE
2 < 3
## [1] TRUE
The general outline for filter()
is:
data %>% # take your data, THEN
filter() # filter the data
Average life expectancy in the year 1982:
gapminder %>%
filter(year == 1982) %>%
summarise(avg_life_expectancy = mean(lifeExp))
Average life expectancy in the years above 1980:
gapminder %>%
filter(year > 1980) %>%
summarise(avg_life_expectancy = mean(lifeExp))
Average life expectancy between the years 1990 and 2000, by continent and year. Notice the &
for “and”:
gapminder %>%
filter(year >= 1990 & year <= 2000) %>%
group_by(continent, year) %>% # by continent and year
summarise(avg_life_expectancy = mean(lifeExp))
Use |
for “or”. Like filtering for the continents Asia or Europe:
gapminder %>%
filter(continent == "Asia" | continent == "Europe") %>%
group_by(continent) %>% # by continent
summarise(avg_life_expectancy = mean(lifeExp))
Average life expectancy for Asia and Europe between 1990 and 2000:
gapminder %>%
filter(continent == "Asia" | continent == "Europe", year >= 1990 & year <= 2000) %>%
group_by(continent) %>%
summarise(avg_life_expectancy = mean(lifeExp))
gapminder %>%
filter(continent == "Europe") %>%
summarise(avg_life_expectancy = mean(lifeExp))
group_by()
after filter()
:gapminder %>%
filter(year >= 1990) %>%
group_by(continent, year) %>%
summarise(avg_life_expectancy = mean(lifeExp))
Create new variables or modify exisiting variables with mutate()
.
The general outline:
data %>% # take your data, THEN
mutate() # do a mutation
For example, create a column with average life expectancy:
gapminder %>%
mutate(avg_life_expectancy = mean(lifeExp))
Combine with group_by()
to get average life expectancy by continent and year:
gapminder %>%
group_by(continent, year) %>%
mutate(avg_life_expectancy = mean(lifeExp))
To save a newly created variable you have to reassign the dataframe.
The general outline:
data = data %>% # take your data, THEN
mutate() # do a mutation
For example, let’s save the variable with average life expectancy by year and continent:
gapminder = gapminder %>% # reassignment
group_by(continent, year) %>%
mutate(avg_life_expectancy = mean(lifeExp))
Now the gapminder
data frame has a new column avg_life_expectancy
:
gapminder = gapminder %>% # reassignment
group_by(continent, year) %>%
mutate(avg_life_expectancy = mean(lifeExp)) %>%
ungroup()
gapminder %>%
select(continent, year, avg_life_expectancy) %>%
slice_head(n=5)
gapminder %>%
group_by(continent, year) %>%
summarise(median_gdp = median(gdpPercap))
ungroup()
) and slice_max()
the five highest values of “deviation_gdp”.gapminder %>%
group_by(continent, year) %>%
summarise(deviation_gdp = gdpPercap - mean(gdpPercap)) %>%
ungroup() %>%
slice_max(deviation_gdp, n = 1)
Ready to visualize data? Do the tutorial on data visualization with ggplot2
dplyr