The SuffolkEcon Package

Welcome!

This is a short demo of some of the methods in the SuffolkEcon R package. The full function reference can be found on the package’s website.

The package is mainly meant to provide some useful visualizations for teaching introductory econometrics and optimization. You can make them on-the-fly during lecture or lab, or you can make them beforehand (e.g., for lecture slides/notes).

The package is not meant to substitute for other packages in the R ecosystem like the tidyverse. In fact, SuffolkEcon is meant to be used alongside the tidyverse. You can invoke both packages in the usual way:

library(tidyverse)
library(SuffolkEcon)

The package is very much experimental. Comments? Bugs? Drop us a line!

To go back to the SuffolkEcon website click here.

Preliminaries

If you are new to R, we recommend you do the following tutorials before this one:

Data

SuffolkEcon ships with three data sets. Each one can be loaded with data("name of data set").

`nhanes`

Second National Health and Nutrition Examination Survey (NHANES II 1976-1980)

data("nhanes")
slice_head(nhanes, n=5)

`penguins`

Body measurements for three species of penguins on the Palmer Archipelago in Antarctica

data("penguins")
slice_head(penguins, n=5)

`hprice`

Housing prices and attributes from 1976

data("hprice")
slice_head(hprice, n=5)

Bootstrapping

The function SuffolkEcon::what_if allows you to bootstrap sample statistics and linear regression estimates in one line.

what_if as in, “what if we ran an experiment many times”, using a data set as a population.

The core arguments to what_if are experiments (number of experiments or samples) and sample_size (the sample size).

The other arguments depend on whether you are bootstrapping a sample statistic (e.g., the sample mean) or a regression coefficient.

In the chunks below you can modify the arguments to get a feel for what you can do by asking what_if.

Sample statistics

The main arguments are variable and stat.

Using nhanes, bootstrap the sample mean height from 100 samples (“experiments”), each one with 30 observations:

nhanes %>% 
  what_if(variable = height, stat = 'mean', experiments = 100, sample_size = 30)

Try bootstrapping the sample median weight for 500 experiments. You just have to

set variable = weight
set stat = 'median'
set experiments = 100

nhanes %>% 
  what_if(variable = weight, stat = 'median', experiments = 500, sample_size = 30)

Regression coefficients

The main arguments are model and p.values, and for multiple regression focus (as in which coefficient you want bootstrap or “focus” on).

Single linear regression

To regress height on weight in R is lm(formula = weight ~ height, data = nhanes).

To bootstrap the coefficient on height you pass the formula to model:

nhanes %>% 
  what_if(model = weight ~ height, experiments = 100, sample_size = 30)

Switch on p.values = TRUE to view the cumulative distribution of p-values. The plot indicates how many coefficients are significant at the 5% level:

nhanes %>% 
  what_if(model = weight ~ height, experiments = 100, sample_size = 30, p.values = TRUE)

Multiple linear regression

Now you also have to set the focus to choose which to coefficient to bootstrap.

For instance, in the model weight ~ height + sex, you can focus on height:

nhanes %>% 
  what_if(model = weight ~ height + sex, focus = height, experiments = 100, sample_size = 30)

or sex:

nhanes %>% 
  what_if(model = weight ~ height + sex, focus = sex, experiments = 100, sample_size = 30)

Grouped samples

If you have a grouping variable in the data set (e.g., sex in nhanes) you can use the group argument to draw balanced samples from each group (in the example below, 15 observations from each sex, Male or Female):

nhanes %>% 
  what_if(model = weight ~ height + sex, focus = sex, group = sex, experiments = 100, sample_size = 30)

Distributions and Probabilities

Distributions

The function plot_distribution will plot any of R’s built-in probability distributions (the functions that start with “r”).

The ... (optional arguments) to plot_distribution are passed to the distribution function (the argument model). If nothing is passed to ... then defaults are used for the given distribution function. So if you set model = "rnorm" and nothing else, the defaults to rnorm() are kept (mean = 0 and sd = 1).

For instance, 1000 draws (the default n) from a normal distribution with mean 4 and standard deviation 2, plotting the PDF and the CDF:

plot_distribution(model = "rnorm", mean = 4, sd = 2, n = 1000, cumulative = TRUE)

Use distribution and details to make the plot more informative:

plot_distribution(model = "rnorm", mean = 4, sd = 2, n = 1000, cumulative = TRUE, distribution = "Normal", details = "Mean = 4, SD = 2")

Here is a Poisson distribution with \(\lambda = 5\):

plot_distribution(model = 'rpois', lambda = 5, cumulative = TRUE, distribution = "Poisson", details = "Lambda = 5")

Probabilities

The function plot_probability works very similarly, except it shades the area underneath a given distribution to visualize a cumulative probability.

For instance, plot \(P(0 \leq X \leq 1)\) for a standard normal distribution:

plot_probability(model = 'rnorm', mean = 0, sd = 1, lower = 0, upper =1,  distribution = "Standard Normal", details = "Mean 0 SD 1")

or \(P(8 \leq X \leq \infty)\) for a Chi-squared distribution with 5 degrees of freedom:

plot_probability(model = 'rchisq', df = 5, lower = 8, distribution = "Chi-squared", details = "5 degrees of freedom")

Optimization

profit = function(x) -10 - 48*x + 15*x^2 - x^3

The function plot_function is basically a soup-up version of the curve function that a) uses ggplot and b) visualizes the optima and roots of a function as well as first-derivatives.

Suppose you have a messy profit function like:

\[ \pi_(Q) = -10 - 48Q + 15Q^2 - Q^3 \]

First define the function using standard R syntax:

profit = function(x) -10 - 48*x + 15*x^2 - x^3

Plot it over the range \(Q \in [0,10]\):

plot_function(profit, lower = 0, upper = 10)

You can see there are two optima, a minimum and a maximum.

The optima are the solutions to the first-order condition:

\[ \frac{d \pi}{d Q} = -48 + 30Q - 3Q^2 = 0 \]

which you can visualize using the derivative = TRUE and root = TRUE to show that optima happen where the first-derivative crosses zero:

plot_function(function(x) -10 - 48*x + 15*x^2 - x^3, lower = 0, upper = 10, derivative = TRUE, roots = TRUE)

You can also visualize the maximum point by switching off derivative and setting optimize = TRUE and maximum = TRUE:

plot_function(profit, lower = 0, upper = 10, optimize = TRUE, maximum = TRUE)