This is a short introduction to visualizing data in R by SuffolkEcon, the Department of Economics at Suffolk University.
ggplot2 is a package that makes it easier to visualize data in R.
A package is like a plug-in. It adds a set of functions to R that make your life easier.
You can install ggplot2 by running install.packages("ggplot2") in your R console. Or you can install the entire tidyverse, a family of data science packages that includes ggplot2, by running install.packages("tidyverse") in your R console.
You can then load ggplot2 and the other tidyverse packages with
library(tidyverse)
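We will work with data from the Gapminder Project. The gapminder data shows life expectancy, income and population for countries on every continent between 1952 and 2007:
library(gapminder) # assumption: the data frame comes from the gapminder package, which must be installed
gapminder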
For some examples we will only use data from the United States:
gapminderUSA = gapminder %>%
  filter(country == "United States")
gapminderUSA
And for other examples we will use data from North America:
gapminderNorthAmerica = gapminder %>%
  filter(country == "United States" | country == "Mexico" | country == "Canada")
gapminderNorthAmerica
How ggplot() works
The main function in ggplot2 is ggplot().
The gg stands for “Grammar of Graphics”. You’ll become more familiar with the system as you use it.
For now, here is the basic framework behind each visualization. We’ll build a scatter plot of GDP per capita over time in the US.
Pass ggplot() a dataframe, and then inside aes() (for “aesthetics”) feed the names of the columns that will be on the \(x\) and \(y\) axes:
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) # create an empty plot
To draw the points we add geom_point(), and to connect the points we use geom_line():
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point() + # add a scatterplot layer, THEN
geom_line() # connect the points with a line
You can use other ggplot2 options to format the plot:
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point() + # add a scatterplot layer, THEN
geom_line() + # connect the points with a line, THEN
labs(x="Year", y="GDP Per Capita", title="Income in the US", subtitle="1952-2007") + # add labels and titles, THEN
theme_classic() # give the plot a clean theme
The consistent syntax means that each different type of plot will have its own geom. This tutorial will go over some of the most common ones.
geom_histogram()
(Reference)
gapminder %>%
ggplot(aes(x=lifeExp)) +
geom_histogram()
The default number of bins is 30. You can manually set the bins inside geom_histogram():
gapminder %>%
ggplot(aes(x=lifeExp)) +
geom_histogram(bins=100) # 100 bins
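As an alternative to choosing the number of bins, geom_histogram() also accepts a binwidth argument that fixes how wide each bar is in the units of the x variable. The sketch below assumes the gapminder package is installed as the source of the data; the object name p_binwidth is only for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

# binwidth = 5 makes every bar span 5 years of life expectancy,
# instead of splitting the range into a fixed number of bins
p_binwidth <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)

p_binwidth  # print the plot
```

Setting the width can be easier to interpret than a bin count when the x variable has natural units, such as years.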
To plot overlaying distributions you just need to set a fill color by the grouping category, e.g. continent:
gapminder %>%
ggplot(aes(x=lifeExp, fill = continent, color = continent)) + # fill=continent means "fill the distributions with a unique color per continent"
geom_density(alpha=0.65) # 65% opaque, so overlapping densities stay visible
fill sets the filled-in color of the histogram/density
color sets the border color of the histogram/density
alpha sets the transparency so it is easier to visualize overlaps
alpha = 1 means no transparency
alpha = 0 means full transparency
You can use the same logic to color/fill other geoms.
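To illustrate applying the same fill, color and alpha logic to another geom, here is a minimal sketch with geom_histogram(). It assumes the gapminder package is installed. The position = "identity" argument is an addition not discussed in the text: without it, histograms for different groups are stacked rather than overlaid.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

# One semi-transparent histogram per continent, drawn on top of each other
p_overlay <- ggplot(gapminder, aes(x = lifeExp, fill = continent, color = continent)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30)

p_overlay  # print the plot
```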
geom_boxplot()
(Reference)
The only difference here is that we have to set an x-variable (the category, e.g. continent) and a y-variable (the measurement, e.g. lifeExp):
gapminder %>%
ggplot(aes(x=continent, y=lifeExp)) +
geom_boxplot()
stat_ecdf()
(Reference)
gapminder %>%
ggplot(aes(x=lifeExp)) +
stat_ecdf()
By continent:
gapminder %>%
ggplot(aes(x=lifeExp, color=continent)) +
stat_ecdf()
Make a histogram of gdpPercap:
gapminder %>%
ggplot(aes(x = gdpPercap)) +
geom_histogram()
Plot the ECDF of gdpPercap by continent:
gapminder %>%
ggplot(aes(x = gdpPercap, color=continent)) +
stat_ecdf()
geom_point()
(Reference)
GDP per capita over time in the United States:
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point() # add a scatterplot layer
GDP per capita over time across North America:
gapminderNorthAmerica %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point() # add a scatterplot layer
We can color the points so we can distinguish each country:
gapminderNorthAmerica %>%
ggplot(aes(x=year, y=gdpPercap, color = country)) + # create an empty plot, THEN
geom_point() # add a scatterplot layer
Connect the points with geom_line():
gapminderNorthAmerica %>%
ggplot(aes(x=year, y=gdpPercap, color = country)) + # create an empty plot, THEN
geom_point() + # add a scatterplot layer
geom_line()
Make a scatterplot of GDP per capita (x-axis) against life expectancy (y-axis) for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point()
geom_line() and geom_step()
(Reference)
GDP per capita over time with geom_line():
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_line() # add the line
If we want to plot more than one line we need to color by some category, like country:
gapminderNorthAmerica %>%
ggplot(aes(x = year, y = gdpPercap, color=country)) +
geom_line()
Plot life expectancy over time for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>%
ggplot(aes(x = year, y = lifeExp, color=country)) +
geom_line()
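This section's heading names geom_step(), but the examples only use geom_line(). As a minimal sketch, geom_step() takes the same aesthetics and draws a staircase: the value stays flat until the next observation and then jumps. The code assumes the gapminder and dplyr packages are installed and rebuilds gapminderUSA so that it runs on its own.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

gapminderUSA <- gapminder %>%
  filter(country == "United States")

# Same mapping as the geom_line() example, drawn as steps
p_step <- gapminderUSA %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_step()

p_step  # print the plot
```

A step plot can suit data recorded at discrete intervals, like the five-year gaps here, because it does not suggest values between observations.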
stat_summary()
(Reference)
Average GDP per capita in each country in North America:
gapminderNorthAmerica %>%
ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
stat_summary(fun="mean", geom="point", size=5)
Add error bars:
gapminderNorthAmerica %>%
ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
stat_summary(fun="mean", geom="point", size=5) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15)
mean_se calculates the sample mean and one standard error on either side of it, \(\bar{x} \pm \frac{s}{\sqrt{n}}\), where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the number of observations.
width controls the width of the “hats” of the errorbars. You can turn them off with width=0. Read more about geom_errorbar() in the ggplot2 documentation.
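To make the standard error formula above concrete, the following sketch computes \(\bar{x} \pm \frac{s}{\sqrt{n}}\) by hand and compares it with what mean_se() returns. The vector x holds made-up numbers used only for illustration.

```r
library(ggplot2)

x <- c(2, 4, 4, 6)  # hypothetical values, not taken from gapminder

x_bar <- mean(x)                # sample mean
se <- sd(x) / sqrt(length(x))   # standard error: s / sqrt(n)

# Lower and upper ends of the error bar, computed by hand
by_hand <- c(y = x_bar, ymin = x_bar - se, ymax = x_bar + se)

# mean_se() returns a one-row data frame with columns y, ymin and ymax
from_ggplot <- mean_se(x)

by_hand
from_ggplot
```

The two results should agree, which is why stat_summary(fun.data = mean_se, geom = "errorbar") draws bars that extend one standard error above and below the mean.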
You can also turn this plot into a barplot by switching the geom to “bar”:
gapminderNorthAmerica %>%
ggplot(aes(x=country, y=gdpPercap)) + # create an empty plot, THEN
stat_summary(fun="mean", geom="bar") +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15)
geom_smooth()
(Reference)
Here is GDP per capita over time in the United States:
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point()
geom_smooth() by default will add a smooth loess curve:
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point() +
geom_smooth()
To add a regression line just pass method="lm" (lm for “linear model”) into geom_smooth():
gapminderUSA %>%
ggplot(aes(x=year, y=gdpPercap)) + # create an empty plot, THEN
geom_point() +
geom_smooth(method="lm")
Or add multiple regression lines:
gapminderNorthAmerica %>%
ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
geom_point() +
geom_smooth(method="lm")
Use stat_summary() to plot average GDP per capita by continent in the data gapminder. Make it a bar graph. Include standard errors.
gapminder %>%
ggplot(aes(x=continent, y=gdpPercap)) + # create an empty plot, THEN
stat_summary(fun="mean", geom="bar") +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.15)
Use geom_point() and geom_smooth() to make a scatter plot of GDP per capita (x-axis) against life expectancy (y-axis) for each country in North America (gapminderNorthAmerica):
gapminderNorthAmerica %>%
ggplot(aes(x=gdpPercap, y=lifeExp, color = country)) +
geom_point() +
geom_smooth(method = "lm")
facet_wrap() and facet_grid()
(Reference)
Plotting all countries in North America in one panel:
gapminderNorthAmerica %>%
ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
geom_point() +
geom_smooth(method="lm")
Use facet_wrap() to create a unique panel for each country:
gapminderNorthAmerica %>%
ggplot(aes(x=year, y=gdpPercap, color=country)) + # create an empty plot, THEN
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~country)
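This section's heading also names facet_grid(), which the examples do not show. As a minimal sketch, facet_grid() lays panels out in explicit rows and/or columns, whereas facet_wrap() wraps them into whatever grid fits. The code assumes the gapminder and dplyr packages are installed and rebuilds gapminderNorthAmerica so that it runs on its own.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumed source of the gapminder data frame

gapminderNorthAmerica <- gapminder %>%
  filter(country %in% c("United States", "Mexico", "Canada"))

# One row of the grid per country, all sharing the same x-axis
p_grid <- gapminderNorthAmerica %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  facet_grid(rows = vars(country))

p_grid  # print the plot
```

Stacking the panels in rows keeps the years aligned vertically, which can make it easier to compare countries at the same point in time.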
Use facet_wrap() to plot a histogram of life expectancy for each continent in gapminder:
gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram() +
facet_wrap(~continent)
ggplot2 comes with a bunch of built-in themes.
The default is theme_grey():
gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram() +
theme_grey()
theme_bw() gives a black-and-white theme with gridlines:
gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram() +
theme_bw()
theme_minimal() keeps the grid but turns off the axes:
gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram() +
theme_minimal()
theme_classic() keeps the axes but turns off the grid:
gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram() +
theme_classic()
Every single part of a theme is customizable. You can even make your own theme.
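As a rough sketch of what customizing a theme can look like, the code below starts from theme_classic() and overrides a few elements with theme(). The function name theme_tutorial and the specific sizes and colors are hypothetical choices made for illustration, and the gapminder package is assumed to be installed.

```r
library(ggplot2)
library(gapminder)  # assumed source of the gapminder data frame

# A hypothetical custom theme: theme_classic() with a few elements overridden
theme_tutorial <- function() {
  theme_classic() +
    theme(
      plot.title = element_text(face = "bold", size = 16),  # bolder, larger title
      axis.title = element_text(size = 12),                 # axis label size
      panel.grid.major.y = element_line(color = "grey85")   # restore horizontal gridlines
    )
}

p_theme <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30) +
  labs(title = "Life expectancy") +
  theme_tutorial()

p_theme  # print the plot
```

Wrapping the overrides in a function means the same look can be added to any plot with a single extra layer.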
Make a histogram of GDP per capita in gapminder using theme_minimal() and a separate facet for each continent:
gapminder %>%
ggplot(aes(x = gdpPercap)) +
geom_histogram() +
facet_wrap(~continent) +
theme_minimal()