This is a short demo of some of the methods in the SuffolkEcon
R package. The full function reference can be found on the package’s website.
The package is mainly meant to provide some useful visualizations for teaching introductory econometrics and optimization. You can make them on-the-fly during lecture or lab, or you can make them beforehand (e.g., for lecture slides/notes).
The package is not meant to substitute for other packages in the R ecosystem like the tidyverse
. In fact, SuffolkEcon
is meant to be used alongside the tidyverse. You can invoke both packages in the usual way:
library(tidyverse)
library(SuffolkEcon)
The package is very much experimental. Comments? Bugs? Drop us a line!
To go back to the SuffolkEcon website click here.
If you are new to R, we recommend you do the following tutorials before this one:
SuffolkEcon
ships with three data sets. Each one can be loaded with data("name of data set")
.
nhanes
Second National Health and Nutrition Examination Survey (NHANES II 1976-1980)
data("nhanes")
slice_head(nhanes, n=5)
penguins
Body measurements for three species of penguins on the Palmer Archipelago in Antarctica
data("penguins")
slice_head(penguins, n=5)
The function SuffolkEcon::what_if
allows you to bootstrap sample statistics and linear regression estimates in one line.
what_if
as in, “what if we ran an experiment many times”, using a data set as a population.
The core arguments to what_if
are experiments
(number of experiments or samples) and sample_size
(the sample size).
The other arguments depend on whether you are bootstrapping a sample statistic (e.g., the sample mean) or a regression coefficient.
In the chunks below you can modify the arguments to get a feel for what you can do by asking what_if
.
The main arguments are variable
and stat
.
Using nhanes
, bootstrap the sample mean height
from 100 samples (“experiments”), each one with 30 observations:
nhanes %>%
what_if(variable = height, stat = 'mean', experiments = 100, sample_size = 30)
Try bootstrapping the sample median weight for 500 experiments. You just have to
variable = weight
stat = 'median'
experiments = 100
nhanes %>%
what_if(variable = weight, stat = 'median', experiments = 500, sample_size = 30)
The main arguments are model
and p.values
, and for multiple regression focus
(as in which coefficient you want bootstrap or “focus” on).
To regress height on weight in R is lm(formula = weight ~ height, data = nhanes)
.
To bootstrap the coefficient on height
you pass the formula to model
:
nhanes %>%
what_if(model = weight ~ height, experiments = 100, sample_size = 30)
Switch on p.values = TRUE
to view the cumulative distribution of p-values. The plot indicates how many coefficients are significant at the 5% level:
nhanes %>%
what_if(model = weight ~ height, experiments = 100, sample_size = 30, p.values = TRUE)
Now you also have to set the focus
to choose which to coefficient to bootstrap.
For instance, in the model weight ~ height + sex
, you can focus on height
:
nhanes %>%
what_if(model = weight ~ height + sex, focus = height, experiments = 100, sample_size = 30)
or sex
:
nhanes %>%
what_if(model = weight ~ height + sex, focus = sex, experiments = 100, sample_size = 30)
If you have a grouping variable in the data set (e.g., sex
in nhanes
) you can use the group
argument to draw balanced samples from each group (in the example below, 15 observations from each sex, Male or Female):
nhanes %>%
what_if(model = weight ~ height + sex, focus = sex, group = sex, experiments = 100, sample_size = 30)
The function plot_distribution
will plot any of R’s built-in probability distributions (the functions that start with “r”).
The ...
(optional arguments) to plot_distribution
are passed to the distribution function (the argument model
). If nothing is passed to ...
then defaults are used for the given distribution function. So if you set model = "rnorm"
and nothing else, the defaults to rnorm()
are kept (mean = 0
and sd = 1
).
For instance, 1000 draws (the default n
) from a normal distribution with mean 4 and standard deviation 2, plotting the PDF and the CDF:
plot_distribution(model = "rnorm", mean = 4, sd = 2, n = 1000, cumulative = TRUE)
Use distribution
and details
to make the plot more informative:
plot_distribution(model = "rnorm", mean = 4, sd = 2, n = 1000, cumulative = TRUE, distribution = "Normal", details = "Mean = 4, SD = 2")
Here is a Poisson distribution with \(\lambda = 5\):
plot_distribution(model = 'rpois', lambda = 5, cumulative = TRUE, distribution = "Poisson", details = "Lambda = 5")
The function plot_probability
works very similarly, except it shades the area underneath a given distribution to visualize a cumulative probability.
For instance, plot \(P(0 \leq X \leq 1)\) for a standard normal distribution:
plot_probability(model = 'rnorm', mean = 0, sd = 1, lower = 0, upper =1, distribution = "Standard Normal", details = "Mean 0 SD 1")
or \(P(8 \leq X \leq \infty)\) for a Chi-squared distribution with 5 degrees of freedom:
plot_probability(model = 'rchisq', df = 5, lower = 8, distribution = "Chi-squared", details = "5 degrees of freedom")
profit = function(x) -10 - 48*x + 15*x^2 - x^3
The function plot_function
is basically a soup-up version of the curve
function that a) uses ggplot
and b) visualizes the optima and roots of a function as well as first-derivatives.
Suppose you have a messy profit function like:
\[ \pi_(Q) = -10 - 48Q + 15Q^2 - Q^3 \]
First define the function using standard R syntax:
profit = function(x) -10 - 48*x + 15*x^2 - x^3
Plot it over the range \(Q \in [0,10]\):
plot_function(profit, lower = 0, upper = 10)
You can see there are two optima, a minimum and a maximum.
The optima are the solutions to the first-order condition:
\[ \frac{d \pi}{d Q} = -48 + 30Q - 3Q^2 = 0 \]
which you can visualize using the derivative = TRUE
and root = TRUE
to show that optima happen where the first-derivative crosses zero:
plot_function(function(x) -10 - 48*x + 15*x^2 - x^3, lower = 0, upper = 10, derivative = TRUE, roots = TRUE)
You can also visualize the maximum point by switching off derivative
and setting optimize = TRUE
and maximum = TRUE
:
plot_function(profit, lower = 0, upper = 10, optimize = TRUE, maximum = TRUE)