This is a short introduction to visualizing data in R by SuffolkEcon, the Department of Economics at Suffolk University.
Comments? Bugs? Drop us a line!
ggplot2 is a package that makes it easier to visualize data in R.
A package is like a plug-in. It ships a bunch of functions to your R that make your life easier.
You can install ggplot2 by running install.packages("ggplot2") in your R console. Or you can install the entire tidyverse – a family of data science packages that includes ggplot2 – by running install.packages("tidyverse") in your R console.
You can then invoke ggplot2 and the other tidyverse packages with
library(tidyverse)
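We will work with data from the Gapminder Project. One way to get it in R, assuming the gapminder package is installed, is to run library(gapminder). The gapminder data shows life expectancy, income (GDP per capita) and population for each country, every five years from 1952 to 2007:
gapminder
For some examples we will only use data from the United States: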
gapminderUSA = gapminder %>% 
  filter(country == "United States")
gapminderUSA
And for other examples we will use data from North America:
gapminderNorthAmerica = gapminder %>% 
  filter(country == "United States" | country == "Mexico" | country == "Canada")
gapminderNorthAmerica
How ggplot() works
The main function in ggplot2 is ggplot().
The gg stands for “Grammar of Graphics”. You’ll become more familiar with the system as you use it.
For now, here is the basic framework behind each visualization. We’ll build a scatter plot of GDP per capita over time in the US.
First, pipe a dataframe into ggplot(), and inside aes() (for “aesthetics”) give the names of the columns that will go on the \(x\) and \(y\) axes:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap))  # create an empty plot
Next, add layers with +. To draw the points we use geom_point(), and to connect the points we use geom_line():
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap))  + # create an empty plot, THEN
  geom_point() + # add a scatterplot layer, THEN
  geom_line()  # connect the points with a line
Finally, use other ggplot2 options to format the plot:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap))  + # create an empty plot, THEN
  geom_point() + # add a scatterplot layer, THEN
  geom_line() + # connect the points with a line, THEN
  labs(x="Year", y="GDP Per Capita", title="Income in the US", subtitle="1952-2007") + # add labels and titles, THEN
  theme_classic() # give the plot a clean theme
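Because each step is added with +, a plot can also be stored in a variable and extended later. The sketch below illustrates this; the names base_plot and with_layers are made up for this example, and it assumes the data come from the gapminder package.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: the data are loaded from the gapminder package

gapminderUSA <- gapminder %>%
  filter(country == "United States")

# Step 1: store the empty plot (axes only, no layers yet)
base_plot <- ggplot(gapminderUSA, aes(x = year, y = gdpPercap))

# Step 2: add layers to the stored plot
with_layers <- base_plot +
  geom_point() +   # scatterplot layer
  geom_line()      # line connecting the points

# Printing the object draws the plot
with_layers
```

Storing the base plot this way makes it easy to try different geoms on the same data and axes.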
The consistent syntax means that each different type of plot will have its own geom. This tutorial will go over some of the most common ones.
geom_histogram() (Reference)
gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  geom_histogram()
The default number of bins is 30. You can manually set the bins inside geom_histogram():
gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  geom_histogram(bins=100) # 100 bins
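Instead of the number of bins, you can set how wide each bin is with the binwidth argument of geom_histogram(). A minimal sketch, assuming the data come from the gapminder package (the name p_binwidth is made up for this example):

```r
library(ggplot2)
library(gapminder)  # assumption: the data are loaded from the gapminder package

p_binwidth <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)  # each bar spans 5 years of life expectancy

p_binwidth
```

Setting binwidth is often easier to interpret than bins, because the width is in the units of the x-axis.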
To plot overlapping distributions you just need to set a fill color by the grouping category, e.g. continent:
gapminder %>% 
  ggplot(aes(x=lifeExp, fill = continent, color = continent)) + # fill=continent means "fill the distributions with a unique color per continent"
  geom_density(alpha=0.65) # alpha=0.65: 65% opaque, i.e. 35% transparent
fill sets the filled-in color of the histogram/density
color sets the border color of the histogram/density
alpha sets the transparency so it is easier to see overlaps
alpha = 1 means no transparency
alpha = 0 means full transparency
You can use the same logic to color/fill other geoms.
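As one example of applying the same logic to another geom, here is a sketch of overlapping histograms. By default geom_histogram() stacks the groups on top of each other, so this sketch sets position = "identity" to overlay them instead. The name p_overlay is made up for this example.

```r
library(ggplot2)
library(gapminder)  # assumption: the data are loaded from the gapminder package

p_overlay <- ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_histogram(
    bins = 30,             # the default, set explicitly to silence the message
    alpha = 0.5,           # half transparent so overlaps stay visible
    position = "identity"  # overlay the groups instead of stacking them
  )

p_overlay
```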
geom_boxplot() (Reference)
The only difference here is that we have to set an x-variable (the category, e.g. continent) and a y-variable (the measurement, e.g. lifeExp):
gapminder %>% 
  ggplot(aes(x=continent, y=lifeExp)) + 
  geom_boxplot() 
stat_ecdf()(Reference)
gapminder %>% 
  ggplot(aes(x=lifeExp)) + 
  stat_ecdf()
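stat_ecdf() draws the empirical cumulative distribution function: the height of the curve at a value x is the share of observations less than or equal to x. To read a single point off that curve, one option is base R's ecdf(), sketched here (the names life_ecdf and share_60 are made up for this example):

```r
library(gapminder)  # assumption: the data are loaded from the gapminder package

# ecdf() returns a function that maps a value to a cumulative share
life_ecdf <- ecdf(gapminder$lifeExp)

# Share of observations with life expectancy of 60 years or less
share_60 <- life_ecdf(60)
share_60

# The same number computed by hand
mean(gapminder$lifeExp <= 60)
```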
By continent:
gapminder %>% 
  ggplot(aes(x=lifeExp, color=continent)) + 
  stat_ecdf()
Plot a histogram of gdpPercap:
gapminder %>% 
  ggplot(aes(x = gdpPercap)) + 
  geom_histogram()
Plot the ECDF of gdpPercap by continent:
gapminder %>% 
  ggplot(aes(x = gdpPercap, color=continent)) + 
  stat_ecdf()
geom_point()(Reference)
GDP per capita over time in the United States:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() # add a scatterplot layer
GDP per capita over time across North America:
gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() # add a scatterplot layer
We can color the points so we can distinguish each country:
gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color = country)) + # create an empty plot, THEN
  geom_point() # add a scatterplot layer
Connect the points with geom_line():
gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color = country)) + # create an empty plot, THEN
  geom_point() + # add a scatterplot layer
  geom_line()
Make a scatterplot of GDP per capita (x-axis) against life expectancy (y-axis) for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = country)) + 
  geom_point()
geom_line() and geom_step() (Reference)
GDP per capita over time with geom_line():
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_line() # add the line
If we want to plot more than one line we need to color by some category, like country:
gapminderNorthAmerica %>% 
  ggplot(aes(x = year, y = gdpPercap, color=country)) + 
  geom_line()
Plot life expectancy over time for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>% 
  ggplot(aes(x = year, y = lifeExp, color=country)) + 
  geom_line()
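The examples above only use geom_line(). geom_step() takes the same aesthetics but draws a staircase, holding each value flat until the next observation. A minimal sketch (the name p_step is made up for this example):

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: the data are loaded from the gapminder package

gapminderUSA <- gapminder %>%
  filter(country == "United States")

p_step <- gapminderUSA %>%
  ggplot(aes(x = year, y = gdpPercap)) +  # create an empty plot, THEN
  geom_step()                             # draw a step line instead of a straight one

p_step
```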
stat_summary() (Reference)
Average GDP per capita in each country in North America:
gapminderNorthAmerica %>% 
  ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="point", size=5) 
Add error bars:
gapminderNorthAmerica %>% 
  ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="point", size=5) + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15) 
mean_se returns the mean together with an interval of one standard error on either side, \(\bar{x} \pm \frac{s}{\sqrt{n}}\), where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the number of observations.
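To see what the error bars represent, the sketch below computes the same three numbers by hand for Canada and compares them with mean_se(). The names x, x_bar, se and by_hand are made up for this example.

```r
library(ggplot2)
library(gapminder)  # assumption: the data are loaded from the gapminder package

# GDP per capita observations for one country
x <- gapminder$gdpPercap[gapminder$country == "Canada"]

x_bar <- mean(x)                # sample mean
se <- sd(x) / sqrt(length(x))   # standard error: s / sqrt(n)

# Mean, lower end and upper end of the error bar
by_hand <- c(y = x_bar, ymin = x_bar - se, ymax = x_bar + se)
by_hand

# ggplot2's helper returns the same values as a one-row data frame
mean_se(x)
```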
width controls the width of the “hats” of the error bars. You can turn them off with width=0. See the geom_errorbar() documentation for more.
You can also turn this plot into a barplot by switching the geom to "bar":
gapminderNorthAmerica %>% 
  ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="bar") + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15) 
geom_smooth() (Reference)
Here is GDP per capita over time in the United States:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() 
geom_smooth() by default adds a smooth loess curve (for datasets with fewer than 1,000 observations, like this one) with a shaded confidence band:
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth()
To add a regression line just pass method="lm" (lm for “linear model”) into geom_smooth():
gapminderUSA %>% 
  ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm")
Or add multiple regression lines:
gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm")
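With method="lm", geom_smooth() fits a straight line of y on x. If you want the numbers behind the line, one option is to fit the same model yourself with base R's lm(). A minimal sketch for the United States (the name fit is made up for this example):

```r
library(dplyr)
library(gapminder)  # assumption: the data are loaded from the gapminder package

gapminderUSA <- gapminder %>%
  filter(country == "United States")

# Regress GDP per capita on year, the same y ~ x relationship the plot shows
fit <- lm(gdpPercap ~ year, data = gapminderUSA)

# Intercept and slope; the slope is the average change in GDP per capita per year
coef(fit)
```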
Use stat_summary() to plot average GDP per capita by continent in the data gapminder. Make it a bar graph. Include standard errors.
gapminder %>% 
  ggplot(aes(x=continent, y=gdpPercap)) + # create an empty plot, THEN
  stat_summary(fun="mean", geom="bar") + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15) 
Use geom_point() and geom_smooth() to make a scatter plot of GDP per capita (x-axis) against life expectancy (y-axis) for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, color = country)) + 
  geom_point() + 
  geom_smooth(method = "lm")
facet_wrap() and facet_grid() (Reference)
Plotting all countries in North America in one panel:
gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm")
Use facet_wrap() to create a unique panel for each country:
gapminderNorthAmerica %>% 
  ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
  geom_point() + 
  geom_smooth(method="lm") +
  facet_wrap(~country)
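facet_grid() works in a similar way, but lays the panels out in a strict grid of rows and/or columns that you choose. The sketch below puts each country in its own row (the name p_grid is made up for this example):

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: the data are loaded from the gapminder package

gapminderNorthAmerica <- gapminder %>%
  filter(country %in% c("United States", "Mexico", "Canada"))

p_grid <- gapminderNorthAmerica %>%
  ggplot(aes(x = year, y = gdpPercap)) +  # create an empty plot, THEN
  geom_point() +                          # add a scatterplot layer, THEN
  facet_grid(rows = vars(country))        # one row of the grid per country

p_grid
```

Using cols = vars(country) instead would place the countries side by side in columns.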
Use facet_wrap() to plot a histogram of life expectancy for each continent in gapminder:
gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  facet_wrap(~continent)
ggplot2 comes with a bunch of built-in themes.
The default is theme_grey():
gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_grey()
theme_bw() gives a black-and-white theme with gridlines:
gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_bw()
theme_minimal() keeps the grid but turns off the axes:
gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram()  + 
  theme_minimal()
theme_classic() keeps the axes but turns off the grid:
gapminder %>% 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_classic()
Every single part of a theme is customizable. You can even make your own theme.
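As a small illustration, the sketch below wraps a built-in theme and a few theme() tweaks in a reusable function. The function name my_theme and the specific settings are made up for this example; they are just one possible set of choices.

```r
library(ggplot2)
library(gapminder)  # assumption: the data are loaded from the gapminder package

# A hypothetical custom theme: start from a built-in theme, then override parts
my_theme <- function() {
  theme_classic() +
    theme(
      legend.position = "bottom",               # move the legend under the plot
      plot.title = element_text(face = "bold")  # bold title
    )
}

p_themed <- ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_histogram(bins = 30) +
  labs(title = "Life expectancy") +
  my_theme()

p_themed
```

Once defined, my_theme() can be added to any plot in the same way as theme_classic().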
Make a histogram of GDP per capita in gapminder using theme_minimal() and a separate facet for each continent:
gapminder %>% 
  ggplot(aes(x = gdpPercap)) + 
  geom_histogram()  + 
  facet_wrap(~continent) + 
  theme_minimal()