Data with dplyr

Welcome!

This is a short introduction to summarizing and manipulating data in R by SuffolkEcon, the Department of Economics at Suffolk University.

Comments? Bugs? Drop us a line!

To go back to the SuffolkEcon website click here.

Overview

dplyr is a package that makes it easier to work with data in R.

What is a package?

A package is like a plug-in. It ships a bunch of functions to your R so that you can use them whenever you want.

You can install dplyr by running install.packages("dplyr") in your R console. Or you can install the entire tidyverse – a family of data science packages that includes dplyr – by running install.packages("tidyverse") in your R console.

You can then invoke dplyr and the other tidyverse packages with

library(tidyverse)

Data

We will demonstrate dplyr using data from the Gapminder Project.

The gapminder data shows life expectancy, income and population for each continent between the years 1952 - 2007:

gapminder %>% 
  slice_head(n=5)

Each column is a unique variable, and dplyr provides functions to act on those columns.

Functions are verbs

Just like verbs act on nouns, functions act on objects, or columns of a data frame.

dplyr provides functions that act on the R object dataframe and tibble (basically a fancier data frame).

A major function in dplyr is summarise(), used to summarise data:

data %>% # take your data, THEN
  summarise() # do some calculation

There are a few other key functions that you do before or after summarise(), like:

  • group_by() to group observations by a category (or multiple categories)
  • filter() keep or filter data that meet certain criteria
  • mutate() create a new column
  • slice_ slice rows of the data frame
  • select() slice columns of the data frame

For instance:

data %>% # take your data, THEN
  filter() %>%  # do some filtering, THEN
  group_by() %>% # group by some category, THEN
  mutate() %>% # create a new column, THEN
  summarise() # do some calculation 

The pipe operator: %>%

When you write a sentence, you chain small thoughts together to form an idea.

When you analyze data, you chain small actions together to form an analysis.

Take the gapminder data. Suppose you wanted to know average life expectancy by continent. You would:

  1. Take the data, THEN
  2. Organize by continent, THEN
  3. Calculate average life expectancy.

dplyr lets you write code in the same way. The key is the pipe operator, %>%, which takes the place of “then”:

gapminder %>% # take the data, THEN
  group_by(continent) %>% # organize the data by continent, THEN
  summarise(avg_life_expectancy = mean(lifeExp)) # calculate average life expectancy.

So %>% lets you chain together small actions on a data frame. This allows you to write code the way you would write a sentence: chaining small thoughts together one-by-one.

In general the pipe can be used to make nested code “linear” and thus easier to read. For instance, this:

# create a vector of integers 1 to 10, calculate the square root, then calculate the mean 
mean(sqrt(1:10))
## [1] 2.246828

is the same as this:

1:10 %>% # create a vector of integers 1 to 10
  sqrt() %>% # calculate the square root
  mean() # calculate the mean
## [1] 2.246828

Slicing and selecting data

Slice rows

The slice_ operators let you slice (subset) rows:

  • slice_head(n=5): view the first 5 rows
  • slice_tail(n=5): view the last 5 rows
  • slice_sample(n=5): view 5 random rows
  • slice_min(column, n=5): view the 5 smallest values of a column
  • slice_max(column, n=5): view the 5 largest values of a column

The first 3 rows of gapminder:

gapminder %>% 
  slice_head(n=3)

The 6 largest values of gdpPercap:

gapminder %>% 
  slice_max(gdpPercap, n=6)

Select columns

Use select() to view only certain columns:

gapminder %>% 
  select(year, gdpPercap) %>% # select year and gdpPercap
  slice_max(gdpPercap, n = 1) # view the highest value of gdpPercap

Exercises

Use select() to select the columns for year (year), continent (continent) and life expectancy (lifeExp), then use slice_min() to find the two lowest values of life expectancy:

gapminder %>% 
  select(year, continent, lifeExp) %>% 
  slice_min(lifeExp, n = 2)

Summarizing data

The gapminder data has data on life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) by country, countinent and year:

gapminder %>% 
  slice_head(n=5)

We can calculate summary statistics for variables like lifeExp, including:

  • mean() the average
  • median() the median
  • min(), max() the min and max
  • var(), sd() the variance and standard deviation
  • n() the count (number of observations)

To do so we just pass the data to the dplyr function summarise() and then inside summarise() make our calculation.

The general outline is:

data %>% # take your data, THEN
  summarise() # do some calculation

Examples

Let’s calculate mean life expectancy lifeExp:

gapminder %>% 
  summarise(mean(lifeExp))

Let’s give the statistic a name, like avg_life_expectancy:

gapminder %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

The median:

gapminder %>% 
  summarise(median_life_expectancy = median(lifeExp))

Variation

sd() calculates the standard deviation:

gapminder %>% 
  summarise(sd_income = sd(gdpPercap))

Frequencies

Count the number of observations with n():

gapminder %>% 
  summarise(n_observations = n())

Summarizing multiple variables

gapminder %>% 
  summarise(avg_life_expectancy = mean(lifeExp), # average life expectancy
            avg_income = mean(gdpPercap),  # average GDP per capita
            avg_pop = mean(pop)) # average population 

Exercises

Calculate the mean, median and standard deviation of gdpPercap. You don’t have to name the variables inside summarise().

gapminder %>% 
  summarise(mean(gdpPercap), median(gdpPercap), sd(gdpPercap))

Grouping observations

What if we want to summarize a variable by a category or multiple categories?

For example, average life expectancy by continent, average population by year, median GDP per capita by country.

We can do this with group_by():

data %>% # take your data, THEN
  group_by() %>% # group the data by some variable(s), THEN
  summarise() # do some calculation

Examples

Average life expectancy by continent:

gapminder %>% 
  group_by(continent) %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

Average population by year:

gapminder %>% 
  group_by(year) %>% 
  summarise(avg_pop = mean(pop))

Median GDP per capita by country:

gapminder %>% 
  group_by(country) %>% 
  summarise(med_income = median(gdpPercap))

Count the number of observations per continent:

gapminder %>% 
  group_by(continent) %>% 
  summarise(count = n())

Multiple groupings

Mean life expectancy by continent and by year:

gapminder %>% 
  group_by(continent, year) %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

Ungrouping

You can un-group a grouped data frame with ungroup().

Continuing with the previous example, this code will slice the largest value of avg_life_expectancy for each continent/year:

gapminder %>% 
  group_by(continent, year) %>% 
  summarise(avg_life_expectancy = mean(lifeExp)) %>% 
  slice_max(avg_life_expectancy, n = 1)

If we ungroup() after summarise() then slice_max(avg_life_expectancy, n = 1) will return the largest value in the entire data frame: (i.e., across all years and continents):

gapminder %>% 
  group_by(continent, year) %>% 
  summarise(avg_life_expectancy = mean(lifeExp)) %>% 
  ungroup() %>% # ungroup!
  slice_max(avg_life_expectancy, n = 1)

Exercises

  1. Calculate the mean, median and standard deviation of gdpPercap by continent. You don’t have to name the variables inside summarise().
gapminder %>% 
  group_by(continent) %>% 
  summarise(mean(gdpPercap), median(gdpPercap), sd(gdpPercap))
  1. Calculate the mean, median and standard deviation of gdpPercap by continent and year. You don’t have to name the variables inside summarise().
gapminder %>% 
  group_by(continent, year) %>% 
  summarise(mean(gdpPercap), median(gdpPercap), sd(gdpPercap))

Filtering data

filter() allows you to filter rows of data with Boolean logic:

  • ==: “equal to”
  • !=: “not equal to”
  • >: “greater than”
  • >=: “greater than or equal to”
  • <: “less than”
  • <=: “less than or equal to”
  • &: “and”
  • |: “or”

Boolean logic is any test that returns true or false:

2 == 3
## [1] FALSE
2 != 3
## [1] TRUE
2 < 3
## [1] TRUE

The general outline for filter() is:

data %>% # take your data, THEN
  filter() # filter the data

Filtering one variable

Average life expectancy in the year 1982:

gapminder %>% 
  filter(year == 1982) %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

Average life expectancy in the years above 1980:

gapminder %>% 
  filter(year > 1980) %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

Average life expectancy between the years 1990 and 2000, by continent and year. Notice the & for “and”:

gapminder %>% 
  filter(year >= 1990 & year <= 2000) %>% 
  group_by(continent, year) %>% # by continent and year
  summarise(avg_life_expectancy = mean(lifeExp))

Use | for “or”. Like filtering for the continents Asia or Europe:

gapminder %>% 
  filter(continent == "Asia" | continent == "Europe") %>% 
  group_by(continent) %>% # by continent
  summarise(avg_life_expectancy = mean(lifeExp))

Filtering multiple variables

Average life expectancy for Asia and Europe between 1990 and 2000:

gapminder %>% 
  filter(continent == "Asia" | continent == "Europe", year >= 1990 & year <= 2000) %>% 
  group_by(continent) %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

Exercises

  1. Calculate average life expectancy only for Europe:
gapminder %>% 
  filter(continent == "Europe") %>% 
  summarise(avg_life_expectancy = mean(lifeExp))
  1. Calculate average life expectancy between the years after and including 1990, by continent and year (hint: use group_by() after filter():
gapminder %>% 
  filter(year >= 1990) %>% 
  group_by(continent, year) %>% 
  summarise(avg_life_expectancy = mean(lifeExp))

Mutating data

Create new variables or modify exisiting variables with mutate().

Creating new variables

The general outline:

data %>% # take your data, THEN
  mutate() # do a mutation

For example, create a column with average life expectancy:

gapminder %>% 
  mutate(avg_life_expectancy = mean(lifeExp))

Combine with group_by() to get average life expectancy by continent and year:

gapminder %>% 
  group_by(continent, year) %>% 
  mutate(avg_life_expectancy = mean(lifeExp))

Saving new variables

To save a newly created variable you have to reassign the dataframe.

The general outline:

data = data %>% # take your data, THEN
  mutate() # do a mutation

For example, let’s save the variable with average life expectancy by year and continent:

gapminder = gapminder %>% # reassignment
  group_by(continent, year) %>% 
  mutate(avg_life_expectancy = mean(lifeExp))

Now the gapminder data frame has a new column avg_life_expectancy:

gapminder = gapminder %>% # reassignment
  group_by(continent, year) %>% 
  mutate(avg_life_expectancy = mean(lifeExp)) %>% 
  ungroup()
gapminder %>% 
  select(continent, year, avg_life_expectancy) %>% 
  slice_head(n=5)

Exercises

  1. Group the data by continent and year, then mutate a new variable called “median_gdp” that calculates median GDP per capita. Do not reassign the data:
gapminder %>% 
  group_by(continent, year) %>% 
  summarise(median_gdp = median(gdpPercap))
  1. Group the data by continent and year, then mutate a new variable called “deviation_gdp” that calculates the difference between GDP per capita and average GDP per capita by continent and year. Then ungroup the data (ungroup()) and slice_max() the five highest values of “deviation_gdp”.
gapminder %>% 
  group_by(continent, year) %>% 
  summarise(deviation_gdp = gdpPercap - mean(gdpPercap)) %>% 
  ungroup() %>% 
  slice_max(deviation_gdp, n = 1)

Other resources and next steps

Ready to visualize data? Do the tutorial on data visualization with ggplot2

To go back to the SuffolkEcon website click here.

suffolkecon.github.io