Visualization with ggplot2

Welcome!

This is a short introduction to visualizing data in R by SuffolkEcon, the Department of Economics at Suffolk University.

Comments? Bugs? Drop us a line!

Overview

ggplot2 is a package that makes it easier to visualize data in R.

What is a package?

A package is like a plug-in. It adds a bunch of functions to R that make your life easier.

You can install ggplot2 by running install.packages("ggplot2") in your R console. Or you can install the entire tidyverse – a family of data science packages that includes ggplot2 – by running install.packages("tidyverse") in your R console.

You can then load ggplot2 and the other tidyverse packages with

library(tidyverse)
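
If you are not sure which packages are already installed, a quick check like the one below can help. This is only a minimal sketch: the vector name pkgs and the choice of packages are assumptions made for illustration, not part of the original tutorial.

```r
# Packages this sketch assumes you want (an illustrative choice)
pkgs <- c("ggplot2", "dplyr", "gapminder")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need install.packages()
not_installed <- pkgs[!installed]
print(not_installed)
```

Anything printed here can be passed to install.packages().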

Data

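We will work with data from the Gapminder Project, available in R through the gapminder package (install it with install.packages("gapminder")). The gapminder data shows life expectancy, income (GDP per capita) and population for each country, every five years from 1952 to 2007:

library(gapminder) # provides the gapminder data frame
gapminder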

For some examples we will only use data from the United States:

gapminderUSA = gapminder %>% 
  filter(country == "United States")
gapminderUSA

And for other examples we will use data from North America:

gapminderNorthAmerica = gapminder %>% 
  filter(country == "United States" | country == "Mexico" | country == "Canada")
gapminderNorthAmerica
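
When the list of countries gets longer, chaining | conditions becomes tedious. One common alternative is %in%. The sketch below should give the same North America data; it assumes the gapminder and dplyr packages are installed, and the helper name north_america is made up for this example.

```r
library(dplyr)
library(gapminder) # assumed to be installed; provides the gapminder data

# %in% keeps the rows whose country matches any value in the vector
north_america <- c("United States", "Mexico", "Canada")
gapminderNorthAmerica <- gapminder %>%
  filter(country %in% north_america)

# Count rows per country to confirm the filter did what we expect
gapminderNorthAmerica %>%
  count(country)
```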

How ggplot() works

The main function in ggplot2 is ggplot().

The gg stands for “Grammar of Graphics”. You’ll become more familiar with the system as you use it.

For now, here is the basic framework behind each visualization. We’ll build a scatter plot of GDP per capita over time in the US.

  1. Make an empty canvas. You feed ggplot() a data frame, and then inside aes() (for “aesthetics”) you feed the names of the columns that will go on the \(x\) and \(y\) axes:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap))  # create an empty plot
  2. Add one or more geoms. A “geom” is a plot layer. Each “geom” is named after the type of visualization it creates. To make a scatterplot we use geom_point(), and to connect the points we use geom_line():
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap))  + # create an empty plot, THEN
  geom_point() + # add a scatterplot layer, THEN
  geom_line()  # connect the points with a line
  3. Format. You can use other ggplot2 functions to format the plot:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap))  + # create an empty plot, THEN
  geom_point() + # add a scatterplot layer, THEN
  geom_line() + # connect the points with a line, THEN
  labs(x="Year", y="GDP Per Capita", title="Income in the US", subtitle="1952-2007") + # add labels and titles, THEN
  theme_classic() # give the plot a clean theme
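
The three steps above can also be done one at a time by storing the plot in an object and adding to it. The sketch below assumes the gapminder package is installed; the object names p, p_points and p_final are arbitrary choices for this example.

```r
library(ggplot2)
library(dplyr)
library(gapminder) # assumed to be installed

gapminderUSA <- gapminder %>%
  filter(country == "United States")

# Step 1: the empty canvas, stored in an object
p <- ggplot(gapminderUSA, aes(x = year, y = gdpPercap))

# Step 2: add layers to the stored canvas
p_points <- p + geom_point() + geom_line()

# Step 3: add formatting, then print the finished plot
p_final <- p_points +
  labs(x = "Year", y = "GDP Per Capita", title = "Income in the US") +
  theme_classic()
p_final
```

Nothing is drawn until you print the object, so you can keep adding layers to p without redoing the earlier steps.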

The syntax stays the same across plot types: each type of plot has its own geom, so changing the plot usually just means swapping the geom. This tutorial will go over some of the most common ones.

Distributions

Histograms

geom_histogram() (Reference)

gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  geom_histogram()

The default number of bins is 30. You can manually set the bins inside geom_histogram():

gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  geom_histogram(bins=100) # 100 bins
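
Instead of the number of bins you can also set the width of each bin with the binwidth argument of geom_histogram(). A minimal sketch, assuming the gapminder package is installed (the width of 5 years is just an illustrative choice):

```r
library(ggplot2)
library(gapminder) # assumed to be installed

p <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5) # each bar spans 5 years of life expectancy
p
```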

Density plots

geom_density() (Reference)

gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  geom_density()

Multiple distributions

To plot overlapping distributions you just need to set a fill color by the grouping category, e.g. continent:

gapminder %>% 
  ggplot(aes(x=lifeExp, fill = continent, color = continent)) + # fill=continent means "fill the distributions with a unique color per continent"
  geom_density(alpha=0.65) # alpha = 0.65 makes the fill 65% opaque (35% transparent)
  • fill sets the filled-in color of the histogram/density
  • color sets the border color of the histogram/density
  • alpha sets the transparency so it is easier to visualize overlaps
    • alpha = 1 means no transparency
    • alpha = 0 means full transparency

You can use the same logic to color/fill other geoms.
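
For example, here is a sketch of the same idea applied to a histogram, assuming the gapminder package is installed. One detail to watch: geom_histogram() stacks groups on top of each other by default, so this sketch sets position = "identity" to overlay them instead.

```r
library(ggplot2)
library(gapminder) # assumed to be installed

p <- ggplot(gapminder, aes(x = lifeExp, fill = continent, color = continent)) +
  geom_histogram(
    bins = 30,
    alpha = 0.65,          # partly see-through so overlaps stay visible
    position = "identity"  # overlay the groups instead of stacking them
  )
p
```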

Boxplots

geom_boxplot() (Reference)

The only difference here is that we have to set an x-variable (the category, e.g. continent) and a y-variable (the measurement, e.g. lifeExp):

gapminder %>% 
  ggplot(aes(x=continent, y=lifeExp)) + 
  geom_boxplot() 
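
To see the numbers behind the boxes, you can compute the quartiles yourself. This is a rough sketch with dplyr, assuming the gapminder package is installed; the column names q1, med and q3 are made up for this example.

```r
library(dplyr)
library(gapminder) # assumed to be installed

box_stats <- gapminder %>%
  group_by(continent) %>%
  summarise(
    q1  = quantile(lifeExp, 0.25), # bottom edge of the box
    med = median(lifeExp),         # line inside the box
    q3  = quantile(lifeExp, 0.75)  # top edge of the box
  )
box_stats
```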

Cumulative distributions

stat_ecdf() (Reference)

gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  stat_ecdf()

By continent:

gapminder %>% 
  ggplot(aes(x=lifeExp, color=continent)) + 
  stat_ecdf()
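
To read a single value off a cumulative distribution, base R's ecdf() can help. The sketch below assumes the gapminder package is installed; the name F_life and the cutoff of 60 years are illustrative choices.

```r
library(gapminder) # assumed to be installed

# ecdf() returns a function: F_life(x) is the share of observations <= x
F_life <- ecdf(gapminder$lifeExp)

# Share of country-year observations with life expectancy of 60 or less
F_life(60)

# The same share computed directly
mean(gapminder$lifeExp <= 60)
```

This is the height of the stat_ecdf() curve at x = 60.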

Exercises

  1. Plot a histogram of gdpPercap:
gapminder %>% 
  ggplot(aes(x = gdpPercap)) + 
  geom_histogram()
  2. Plot the cumulative distribution of gdpPercap by continent:
gapminder %>% 
  ggplot(aes(x = gdpPercap, color=continent)) + 
  stat_ecdf()

Scatterplots

geom_point() (Reference)

Basic scatterplot

GDP per capita over time in the United States:

gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() # add a scatterplot layer

Multiple scatterplots

GDP per capita over time across North America:

gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() # add a scatterplot layer

We can color the points so we can distinguish each country:

gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color = country)) + # create an empty plot, THEN
  geom_point() # add a scatterplot layer

Connect the points with geom_line():

gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color = country)) + # create an empty plot, THEN
  geom_point() + # add a scatterplot layer
  geom_line()

Exercise

Make a scatterplot of GDP per capita (x-axis) against life expectancy (y-axis) for each country in North America (gapminderNorthAmerica):

gapminderNorthAmerica %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = country)) + 
  geom_point()

Lines

geom_line() and geom_step() (Reference)

Single line plot

GDP per capita over time with geom_line():

gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_line() # add the line

Multiple lines

If we want to plot more than one line we need to color by some category, like country:

gapminderNorthAmerica %>% 
  ggplot(aes(x = year, y = gdpPercap, color=country)) + 
  geom_line()
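
Color is not the only way to separate the lines. If you want one line per country but all in the same color, the group aesthetic should do the job. A minimal sketch, assuming the gapminder package is installed:

```r
library(ggplot2)
library(dplyr)
library(gapminder) # assumed to be installed

gapminderNorthAmerica <- gapminder %>%
  filter(country %in% c("United States", "Mexico", "Canada"))

# group = country draws a separate line per country without coloring them
p <- ggplot(gapminderNorthAmerica, aes(x = year, y = gdpPercap, group = country)) +
  geom_line()
p
```

The trade-off is that without color there is no legend, so you cannot tell which line belongs to which country.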

Exercise

Plot life expectancy over time for each country in North America (gapminderNorthAmerica):

gapminderNorthAmerica %>% 
  ggplot(aes(x = year, y = lifeExp, color=country)) + 
  geom_line()

Statistical Plots

Summary statistics

stat_summary() (Reference)

Average GDP per capita in each country in North America:

gapminderNorthAmerica %>% 
  ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="point", size=5) 

Add error bars:

gapminderNorthAmerica %>% 
  ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="point", size=5) + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15) 

mean_se calculates the mean plus and minus one standard error, \(\bar{x} \pm \frac{s}{\sqrt{n}}\), where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the number of observations.
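
To check the formula, you can compute the same three numbers by hand and compare them with mean_se(). The sketch below assumes the gapminder package is installed; Canada is just an illustrative choice, and the names x_bar, se and by_hand are made up for this example.

```r
library(ggplot2)
library(gapminder) # assumed to be installed

# GDP per capita for one country
x <- gapminder$gdpPercap[gapminder$country == "Canada"]

x_bar <- mean(x)              # sample mean
se <- sd(x) / sqrt(length(x)) # standard error: s / sqrt(n)

# Center, lower end and upper end of the error bar
by_hand <- c(y = x_bar, ymin = x_bar - se, ymax = x_bar + se)
by_hand

# mean_se() returns the same three values as a one-row data frame
mean_se(x)
```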

width controls the width of the “hats” of the error bars. You can turn them off with width=0. Read more in the geom_errorbar() reference.

You can also turn this plot into a barplot by switching the geom to “bar”:

gapminderNorthAmerica %>% 
  ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="bar") + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15) 

Regression lines

geom_smooth() (Reference)

Here is GDP per capita over time in the United States:

gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() 

By default, geom_smooth() adds a smooth loess curve (for data with fewer than 1,000 observations):

gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth()

To add a regression line just pass method="lm" (lm for “linear model”) into geom_smooth():

gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm")
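
To see the numbers behind a regression line, you can fit the same kind of model with lm(). This is a sketch, assuming the gapminder package is installed; the object name fit is arbitrary.

```r
library(dplyr)
library(gapminder) # assumed to be installed

gapminderUSA <- gapminder %>%
  filter(country == "United States")

# Regress GDP per capita on year: the same y ~ x model as method = "lm"
fit <- lm(gdpPercap ~ year, data = gapminderUSA)

# Intercept and slope; the slope is the average change in GDP per capita per year
coef(fit)
```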

Or add multiple regression lines:

gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm")

Exercises

  1. Use stat_summary() to plot average GDP per capita by continent in the data gapminder. Make it a bar graph. Include standard errors.
gapminder %>% 
  ggplot(aes(x=continent, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="bar") + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15) 
  2. Use geom_point() and geom_smooth() to make a scatter plot of GDP per capita (x-axis) against life expectancy (y-axis) for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, color = country)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Faceting

facet_wrap() and facet_grid() (Reference)

Plotting all countries in North America in one panel:

gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm")

Use facet_wrap() to create a unique panel for each country:

gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm") +
  facet_wrap(~country)
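
By default every panel shares the same axes. If one country dwarfs the others, you may want each panel to get its own y-axis, or to stack the panels in rows with facet_grid(). The sketch below shows both options and assumes the gapminder package is installed; the object names are arbitrary.

```r
library(ggplot2)
library(dplyr)
library(gapminder) # assumed to be installed

gapminderNorthAmerica <- gapminder %>%
  filter(country %in% c("United States", "Mexico", "Canada"))

p <- gapminderNorthAmerica %>%
  ggplot(aes(x = year, y = gdpPercap, color = country)) +
  geom_point() +
  geom_smooth(method = "lm")

# Option 1: one panel per country, each with its own y-axis range
p_free <- p + facet_wrap(~country, scales = "free_y")
p_free

# Option 2: one row per country with facet_grid()
p_rows <- p + facet_grid(rows = vars(country))
p_rows
```

Free scales make each panel easier to read, but comparisons across panels get harder because the axes no longer line up.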

Exercise

Use facet_wrap() to plot a histogram of life expectancy for each continent in gapminder:

gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  facet_wrap(~continent)

Themes

ggplot2 comes with a bunch of built-in themes.

The default is theme_grey():

gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_grey()

theme_bw() gives a black-and-white theme with gridlines:

gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_bw()

theme_minimal() keeps the grid but turns off the axes:

gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram()  + 
  theme_minimal()

theme_classic() keeps the axes but turns off the grid:

gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_classic()

Every single part of a theme is customizable. You can even make your own theme.
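
As a rough illustration, a custom theme can be a small function that starts from a built-in theme and overrides a few elements with theme(). The function name theme_my_own and the specific settings below are hypothetical choices for this sketch, which assumes the gapminder package is installed.

```r
library(ggplot2)
library(gapminder) # assumed to be installed

# A hypothetical custom theme: start from theme_minimal(), then override parts
theme_my_own <- function() {
  theme_minimal(base_size = 14) +                # larger base font
    theme(
      plot.title = element_text(face = "bold"),  # bold plot title
      panel.grid.minor = element_blank()         # remove minor gridlines
    )
}

p <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30) +
  labs(title = "Life expectancy") +
  theme_my_own()
p
```

You add the custom theme to a plot the same way you add theme_classic() or theme_bw().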

Exercise

Make a histogram of GDP per capita in gapminder using theme_minimal() and a separate facet for each continent:

gapminder %>% 
  ggplot(aes(x = gdpPercap)) + 
  geom_histogram()  + 
  facet_wrap(~continent) + 
  theme_minimal()

Other resources

  • ggplot2 homepage
  • Chapter on ggplot2 in R For Data Science
  • R Cookbook for ggplot2
  • ggplot2 book
  • ggplot2 cheatsheet

suffolkecon.github.io