STA 9750 Lecture #11 In-Class Activity: Predictive Modeling - Inference and Prediction in R

Class #14: Thursday May 14, 2026 (Lecture #11: Statistical Modeling in R)

Slides

Review Exercise

For this final warm-up exercise, you are going to use a combination of the skills learned over the semester to create a map of the countries of the world, colored by population density.

  1. Using the data at https://www.scrapethissite.com/pages/simple/, determine the capital of the most densely populated country in the world.

    To answer this:

    • Scrape the webpage into a four-column data frame with columns: name, capital, population, area
    • Ensure that the population and area columns are properly converted to numeric values
    • Compute population density as population/area
    • Select the country with the highest density and determine its capital
Solution
library(tidyverse)
library(httr2)
library(rvest)

parse_country <- function(country_element){
    name <- country_element |> 
        html_element("h3") |> 
        html_text2()
    
    capital <- country_element |> 
        html_element(".country-capital") |> 
        html_text2()
    
    population <- country_element |> 
        html_element(".country-population") |> 
        html_text2() |> 
        as.numeric()
    
    area <- country_element |> 
        html_element(".country-area") |> 
        html_text2() |> 
        as.numeric()
    
    data.frame(name, capital, population, area)
}

COUNTRIES <- read_html("https://www.scrapethissite.com/pages/simple/") |>
    html_elements(".country") |>
    map(parse_country) |>
    list_rbind() |>
    mutate(density = population / area)
    

COUNTRIES |> slice_max(density, n=1) |> pull(capital)
[1] "Monaco"
  2. When dealing with countries, it is dangerous to use raw names, as these are not entirely standardized. E.g., do we live in the “United States” or the “United States of America”? To address this, we want to use a standard format like the ISO 3-letter country codes.

    The website https://peric.github.io/GetCountries/ provides this information interactively, but is a bit difficult to access programmatically. This page actually loads a JSON object that will provide all of the information we need. Use the Network tools in your web browser’s developer tools to identify the request that loads this data. (You may need to hit the Generate button again after opening the Network tools to see the request being made.)

    Read this data into R and create a data frame with (at least) columns for the 3-letter ISO country code, the continent, the country name, the population, and the area in square kilometers. Make sure the population and area have been properly converted into numeric values.

Solution
library(tidyverse)
library(jsonlite)

COUNTRIES <- read_json("https://secure.geonames.org/countryInfoJSON?username=dperic", 
          simplify=TRUE) |>
    pluck("geonames") |>
    select(iso_code = isoAlpha3, 
           name = countryName, 
           continent = continentName,
           area = areaInSqKm, 
           population = population) |>
    mutate(area = as.numeric(area), 
           population = as.numeric(population)) |> 
    filter(area > 0) # Ignore countries / political entities without land area
  3. Next, we are going to use an API to access data on country boundaries provided by the GeoBoundaries project. Specifically, we want to download shapefiles for all countries of the world. We do not want to use the high-resolution shapefiles, as they are highly detailed and will slow down processing. Thankfully, the GeoBoundaries project already provides simplified representations of each country more suitable for plotting.

    Review the API documentation at https://www.geoboundaries.org/api.html#api and make a call for gbOpen data for ALL countries at ADM0 level (country boundaries). The result will be a JSON object with elements for each country. Parse this JSON object in R and pull out (at least) the boundaryName, boundaryISO, and simplifiedGeometryGeoJSON columns into a data frame.

Solution
library(tidyverse)
library(jsonlite)

API_REFERENCE <- read_json("https://www.geoboundaries.org/api/current/gbOpen/ALL/ADM0/", 
                           simplify = TRUE) |>
    select(country = boundaryName, 
           iso_code = boundaryISO,
           continent = Continent, 
           simple_geojson = simplifiedGeometryGeoJSON)
  4. Join the result of the previous two sections together using an inner_join. How many countries were in each data set? Which countries are lost in the inner join? What is their total population?

    Hint: An anti_join may be useful here.

    In real work, you would want to do some additional manual cleaning to better align these two data sets, but we are not going to do so today to save time. There is at least one major missing country in this merge.

Solution
library(tidyverse)

JOINED <- inner_join(COUNTRIES, API_REFERENCE, join_by(iso_code))

UNJOINED <- anti_join(COUNTRIES, API_REFERENCE, join_by(iso_code))

UNJOINED |> summarize(n_missing = n(), 
                      pop_missing = sum(population))
  n_missing pop_missing
1        19  1364326330
  5. (Optional) In the previous section, we found that one major country was missing. After some poking around, it turns out that the missing country is linked in the ADM1 listing and not the ADM0 listing. This appears to be just a manual error, but the related GitHub issue has not yet been addressed, so we need to handle it ourselves.

    Do this by making a second request for the ADM1 listings and then filtering out any duplicates. Next, join this to the response you had from before, now using a left_join() to avoid dropping results from intermediate tables. Finally, use the coalesce() function to handle redundant columns. Note that coalesce() takes a set of columns and returns the first non-NA value in each row, so it is great for things like this:

    coalesce_df <- tribble(
    ~a, ~b, ~c, 
    1, 2, 3, 
    4, NA, 6, 
    NA, NA, 9)
    coalesce_df |> mutate(fixed_val = coalesce(a, b, c))
    # A tibble: 3 × 4
          a     b     c fixed_val
      <dbl> <dbl> <dbl>     <dbl>
    1     1     2     3         1
    2     4    NA     6         4
    3    NA    NA     9         9
Solution
library(tidyverse)

API_REFERENCE_ADM0 <- "https://www.geoboundaries.org/api/current/gbOpen/ALL/ADM0/" |>
    read_json(simplify = TRUE) |>
    select(country = boundaryName, 
           iso_code = boundaryISO,
           simple_geojson = simplifiedGeometryGeoJSON) |>
    distinct()


API_REFERENCE_ADM1 <- "https://www.geoboundaries.org/api/current/gbOpen/ALL/ADM1/" |>
    read_json(simplify = TRUE) |>
    select(country = boundaryName, 
           iso_code = boundaryISO,
           simple_geojson = simplifiedGeometryGeoJSON) |>
    distinct()

JOINED <- COUNTRIES |> 
   left_join(API_REFERENCE_ADM0, 
             join_by(iso_code)) |> 
   left_join(API_REFERENCE_ADM1, 
             join_by(iso_code)) |>
   mutate(country = coalesce(country.x, country.y), 
          simple_geojson = coalesce(simple_geojson.x, simple_geojson.y)) |>
   select(-country.x, -country.y, -simple_geojson.x, -simple_geojson.y) |>
   filter(!is.na(simple_geojson))
  6. Make a series of web requests to read the geojson files for each of the countries using the URLs obtained from the API. Since you will want to make these requests sequentially, use the map() function to call sf::st_read() repeatedly. You will also want to use unnest() to unpack the resulting list-column.
Solution
library(tidyverse)
library(sf)

JOINED_GEOMETRY <- JOINED |>
    mutate(geometry = map(simple_geojson, st_read, quiet=TRUE)) |>
    unnest(geometry) |>
    ungroup()
  7. Create a choropleth visualization with each country colored according to its population density.

    Note that this will take a little while, as even the simplified shapefiles contain substantial detail. If you want a faster plot, you can use sf::st_simplify() to further reduce the resolution of the boundaries.

Solution
JOINED_GEOMETRY |>
    mutate(density = population / area) |>
    ggplot(aes(fill=density, geometry=geometry)) + 
    geom_sf() + 
    theme_bw() + 
    theme(legend.position="bottom") + 
    scale_fill_distiller(name="Population Density", 
                         palette=4, 
                         direction=1,
                         trans=scales::pseudo_log_trans(), 
                         breaks=c(0, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000)) + 
    guides(fill = guide_colorbar(barwidth=unit(10, "cm")))

Classical Statistical Models in R

Use classical statistical models like lm() and mgcv::gam() to answer the following:

  1. Recall the flights data from the nycflights13 package. Load this package and fit a linear model to predict air time as a function of distance.

    The intercept term of this model can be interpreted as the “fixed cost” of a flight - the unavoidable time taken to travel between the gate and the runway. What is the estimate of this cost based on the model you fit?

Solution
library(tidyverse)
library(nycflights13)
library(broom)
air_time_model <- lm(air_time ~ distance, data=flights) 

air_time_model |> tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   18.5   0.0389         475.       0
2 distance       0.126 0.0000304     4154.       0

From this, we see that the ‘fixed cost’ of air travel is approximately 18.5 minutes.

  2. The same linear model can be used to estimate flight speed. In particular, the distance coefficient (which has units of minutes / mile) can be extracted and transformed into miles per hour. What is the average speed of a plane leaving NYC?
Solution
library(tidyverse)
library(nycflights13)
library(broom)
air_time_model <- lm(air_time ~ distance, data=flights) 

air_time_model |> 
    tidy() |> 
    filter(term == "distance") |> 
    mutate(hours_per_mile = estimate / 60, 
           miles_per_hour = 1/hours_per_mile) |> 
    pull(miles_per_hour) |>
    round(2) |>
    paste("miles per hour")
[1] "475.74 miles per hour"
  3. Modify your analysis above to estimate a different fixed cost for each of the three NYC-area airports. (Maintain a single estimate of speed for all airports.) Which seems to be the most efficient?

    Use broom::tidy() and some regular expressions to create a table that looks like this:

    Airport Estimated Fixed Cost
    EWR NNN.NN
    LGA NNN.NN
    JFK NNN.NN

    Hint: Your estimated coefficients will be easier to interpret if you remove the intercept term from your model by including -1 at the end of the formula.
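To see concretely what the -1 does, it can help to compare design matrices on a toy stand-in for the origin column (a base-R sketch; the data frame d here is made up):

```r
# A made-up three-row stand-in for the `origin` column of `flights`
d <- data.frame(origin = factor(c("EWR", "JFK", "LGA")))

# With an intercept, the first level (EWR) is absorbed as the baseline
colnames(model.matrix(~ origin, d))

# Without the intercept, every airport gets its own indicator column,
# so each coefficient is that airport's fixed cost directly
colnames(model.matrix(~ origin - 1, d))
```

With the intercept, the EWR effect hides inside (Intercept) and the other airports are reported as offsets from it; dropping the intercept makes all three coefficients directly comparable.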

Solution
library(tidyverse)
library(nycflights13)
library(broom)
library(gt)
lm(air_time ~ distance + origin - 1, data=flights) |> 
    tidy() |> 
    filter(str_detect(term, "origin")) |> 
    mutate(term=str_remove(term, "origin")) |> 
    gt() |>
    cols_hide(c(statistic, p.value)) |>
    cols_label(term="Airport", 
               estimate="Estimated Fixed Transit Cost (Minutes)") |>
    cols_merge(c(estimate, std.error), 
               pattern="{1}±{2}") |>
    fmt_number(c(estimate, std.error))
Airport Estimated Fixed Transit Cost (Minutes)
EWR 18.80±0.05
JFK 17.27±0.06
LGA 18.70±0.05
  4. Now modify your model to allow different estimated flight speeds for each airport. Which airport seems to have the highest average flight speed?

    Hint: Use the * operator to allow for interactions between airport and distance. This will estimate different slopes for each group.

    Hint: If you find your coefficients hard to interpret, consider adding -distance to the end of your formula. What does this do? Why does it help?

Solution
library(tidyverse)
library(nycflights13)
library(broom)
library(gt)
lm(air_time ~ distance * origin - 1 - distance,
    data=flights) |>
    tidy() |>
    mutate(airport = str_sub(term, start=-3), 
           effect = if_else(str_detect(term, "distance"), "speed", "fixed")) |>
    select(estimate, airport, effect) |>
    pivot_wider(names_from=effect, values_from=estimate) |>
    mutate(speed = 60 / speed) |>
    gt() |>
    cols_label(airport=md("**Origin Airport**"),
               fixed=md("Gate Taxi Time (Minutes)"),
               speed=md("Estimated Air Speed (mph)")) |>
   fmt_number(c(fixed, speed), decimals = 1)
Origin Airport Gate Taxi Time (Minutes) Estimated Air Speed (mph)
EWR 19.1 476.2
JFK 17.7 476.2
LGA 15.9 462.0
  5. Use the augment() function to compute both the RMSE (Root Mean Squared Error) and the MedAE (Median Absolute Error) of the model you fit in the previous section.
Solution
library(tidyverse)
library(nycflights13)
library(broom)
lm(air_time ~ distance * origin - 1 - distance, data=flights) |>
    augment() |>
    summarize(rmse = sqrt(mean(.resid^2)), 
              medae = median(abs(.resid)))
# A tibble: 1 × 2
   rmse medae
  <dbl> <dbl>
1  12.7  7.07
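Both metrics are simple functions of the residual vector, so it is worth seeing them written out by hand; a minimal base-R sketch (the helper names rmse and medae are made up, not from a package):

```r
# RMSE: root of the mean squared residual; penalizes large misses heavily
rmse <- function(resid) sqrt(mean(resid^2))

# MedAE: the 'typical' absolute miss; robust to a few extreme residuals
medae <- function(resid) median(abs(resid))

r <- c(3, -4, 0)
rmse(r)   # sqrt((9 + 16 + 0) / 3)
medae(r)  # median of 0, 3, 4
```

Because RMSE squares residuals before averaging, a handful of badly-predicted flights can dominate it, which is why the MedAE reported above is so much smaller.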
  6. In which month did the model have the lowest RMSE?

    Hint: Pass the newdata argument to augment to get back additional columns you need.

Solution
lm(air_time ~ distance * origin - 1 - distance, data=flights) |>
    augment(newdata=flights) |>
    group_by(month) |>
    summarize(rmse = sqrt(mean(.resid^2, na.rm=TRUE))) |>
    slice_min(rmse)
# A tibble: 1 × 2
  month  rmse
  <int> <dbl>
1    10  10.6

So our model seems to be most predictive in October.

  7. The yardstick package provides tools for evaluating model performance. The rsq function computes \(R^2\) statistics. It takes two column names as arguments (in addition to the data frame being analyzed): truth, the observed values (\(y\)), and estimate, the model predictions (\(\hat{y}\)). It can be used like this:
library(broom)
library(yardstick)
lm(body_mass ~ flipper_len * species, data=penguins) |> 
    augment() |> 
    rsq(truth=body_mass, estimate=.fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.790

Use it to compute the \(R^2\) of your model on a month-by-month and per-airport basis. At what airport and in what month does this model exhibit the lowest \(R^2\)? What does that tell you? How does this compare to the overall (full-sample) \(R^2\) of the model that glance provides?
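Before computing it, it may help to see the arithmetic behind an \(R^2\) of this form; a base-R sketch on made-up numbers (this illustrates the squared-correlation estimator, not necessarily yardstick’s exact implementation):

```r
# Squared correlation between observed and predicted values
rsq_manual <- function(truth, estimate) cor(truth, estimate)^2

truth    <- c(1, 2, 3, 4, 5)
estimate <- c(1.1, 1.9, 3.2, 3.8, 5.0)  # predictions that track truth closely

rsq_manual(truth, estimate)  # close to 1
```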

Solution
lm(air_time ~ distance * origin - 1 - distance, data=flights) |>
    augment(newdata=flights) |>
    group_by(month, origin) |>
    rsq(truth=air_time,
        estimate=.fitted) |>
    slice_min(.estimate)
# A tibble: 1 × 5
  month origin .metric .estimator .estimate
  <int> <chr>  <chr>   <chr>          <dbl>
1     1 LGA    rsq     standard       0.956

Our model performs (relatively) worst at LGA in January, suggesting that other effects (e.g. weather) play the largest role here, but the effect is quite minimal.

lm(air_time ~ distance * origin - 1 - distance, data=flights) |>
    glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df    logLik     AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>     <dbl>   <dbl>  <dbl>
1     0.995         0.995  12.7 10527222.       0     6 -1297511.  2.60e6 2.60e6
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

So the drop in \(R^2\) is from 99.5% to 95.6%: this isn’t nothing…

Computational Inference

Using the countries data downloaded above, test the null hypothesis that the standard deviation of population density is equal in African and non-African countries.

Before doing so, let’s review the concept of bootstrapping, a variant of the computationally-powered inference you reviewed in this week’s pre-assignment that attempts to relax the requirement that we specify a null distribution precisely.

Refresher

As discussed in this week’s pre-assignment, the core of the computationally-intensive approach to statistical inference is to draw many ‘pseudo-samples’ from a null distribution, to compute the test statistic on those ‘pseudo-samples’, and to compare the distribution of the test statistic against the observed value. Doing so requires us to have some sense of a null distribution: it can be a parametric distribution posited a priori or, as we do here, it can be based on re-sampling the data at hand.

In this case, we don’t want to posit a full model for country density: we only want to see whether our data supports the claim that the standard deviation of population density is the same within Africa and outside of Africa. Because we don’t require a full model to define our null hypothesis, you might be tempted to use a rough ‘working model’ based on a normal distribution. A quick plot of the data, however, suggests that it is quite heavy-tailed (there are some very small, very densely populated areas like Monaco, Hong Kong, and Singapore), so this ‘good enough model’ strategy isn’t likely to be productive.

But we don’t exactly need a model for country density - we really just need the ability to sample from a null model that doesn’t distinguish African from non-African countries. Is there a way we can do that without the full overhead of a ‘proper’ model?

We turn to the technique of bootstrapping. In essence, we use the data we have as a rough model of the (unknown) distribution of interest. How does this work?

First, note that IID sampling + something like density estimation is a universal estimator. No matter the distribution (within reason), if we have enough data the sample density estimate (or histogram or empirical CDF) will converge to the corresponding true distribution. While this seems intuitive, not everything in statistics is like this: e.g., no matter how much data you have, linear regression will never recover the ‘true’ relationship if it is actually non-linear. Here, we’re relying on the fact that things like a density estimate or the ECDF are non-parametric and can fit essentially any shape.

Let \(\hat{\mathbb{P}}_n\) be the ‘estimated CDF’ based on the observed data: we can evaluate \(\hat{\mathbb{P}}_n\) by:

\[\hat{\mathbb{P}}_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{x_i \leq x} \]

That is, to estimate the CDF at any point \(x\), we simply look at the fraction of points that fall at or below \(x\). The law of large numbers tells us that

\[\hat{\mathbb{P}}_n(x) \to \mathbb{P}(x) \text{ for all } x\] or just

\[\hat{\mathbb{P}}_n \to \mathbb{P}\]

where \(\mathbb{P}\) is the true distribution used to generate the data. This suggests to us that, while we might be interested in the variance of our test statistic \(T\) under distribution \(\mathbb{P}\), we can approximate it using \(\hat{\mathbb{P}}_n\). That is, we hope that

\[\text{Variance}_{\hat{\mathbb{P}}_n}(T) \to \text{Variance}_{\mathbb{P}}(T)\]

for any test statistic \(T\) as the sample size \(n \to \infty\). Notably, this should work for almost any distribution \(\mathbb{P}\) and any test statistic \(T\). Similar results hold for quantiles of \(T\) as well:

\[q\text{-Quantile}_{\hat{\mathbb{P}}_n}(T) \to q\text{-Quantile}_{\mathbb{P}}(T) \text{ for all } q \in (0, 1)\]

So this means that we can approximate \(\text{Quantile}_{\mathbb{P}}(T)\) if we can estimate \(\text{Quantile}_{\hat{\mathbb{P}}_n}(T)\). This leaves us with just one question: how do we sample from \(\hat{\mathbb{P}}_n\)?

Note that \(\hat{\mathbb{P}}_n\) is the CDF of a discrete distribution with equal jumps at each observation: that is, it’s uniform over the observed samples. So how can we sample from a uniform distribution? We simply pick one of its values at random. And how do we get a second sample? We just draw another value at random, once again.
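In base R, evaluating \(\hat{\mathbb{P}}_n\) and sampling from it are each one line; as a sanity check, the built-in ecdf() agrees with the ‘fraction at or below \(x\)’ definition (the data vector here is made up):

```r
x <- c(2, 5, 5, 9)      # toy 'observed data'

Fn <- ecdf(x)           # base R's empirical CDF
Fn(5)                   # 0.75: three of the four observations are <= 5
mean(x <= 5)            # the same fraction, computed by hand

# Sampling from the empirical distribution = picking observed values uniformly
boot_sample <- sample(x, size = length(x), replace = TRUE)
```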

Bootstrap sampling is often described as sampling with replacement. This is true, but I don’t like it as an explanation because it seems a bit magical.

The true idea of bootstrapping is that we are just getting \(n\) samples from a (uniform) distribution over the original data. When we take any number of samples from a discrete distribution, there is always a possibility of ties, regardless of how we came up with that distribution. “Bootstrap sampling with replacement” is then just weirdly redundant: regular sampling from any distribution always allows repeats. (You would never specify ‘flipping a coin three times with replacement’!)

To recap: in bootstrapping

  • We build a discrete uniform distribution from the original data
  • We sample from that distribution to estimate quantities of interest

The fact that we get samples with replacement is not because we are bootstrapping per se; it’s just because we are IID sampling from a distribution, and the values of that distribution don’t change based on what we saw previously.
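This claim can be checked directly in base R: with the same seed, sample() applied to the values and plain uniform draws of indices produce the identical ‘bootstrap sample’ (assuming R’s default RNG; the vector x is made up):

```r
x <- c(10, 20, 30, 40)

set.seed(42)
a <- sample(x, replace = TRUE)                # the usual 'bootstrap sample'

set.seed(42)
idx <- sample.int(length(x), replace = TRUE)  # plain IID uniform index draws
b <- x[idx]

identical(a, b)
```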

Before we apply this to statistical hypothesis testing, let’s just use this idea to estimate the quantiles of the standard deviation of the country density. The point estimate is straightforward:

COUNTRIES |> 
   mutate(density = population / area) |> 
   summarize(sd(density))
  sd(density)
1    1476.741

But how can we get a sense of the variance of this quantity? Ideally, we would have many sets of countries and could compute standard deviation within each set and then compute a variance across sets. Resampling will get us there:

COUNTRIES |> 
   mutate(density = population / area) |> 
   mutate(density_resampled = sample(density, replace=TRUE)) |> 
   summarize(sd(density_resampled))
  sd(density_resampled)
1              1372.579

A bit different - but not too much. But one ‘alternate world value’ isn’t enough to get quantiles. We need many! There are several ways to code this in dplyr, but one of the easiest is to use the expand_grid() function to induce replicates:

COUNTRIES |> 
    filter(area > 0) |> 
    mutate(density = population / area) |> 
    expand_grid(replicate=1:100) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate) |> 
    summarize(density_sd_resampled = sd(density_resampled)) |> 
    summarize(density_sd_q95 = quantile(density_sd_resampled, 0.95), 
              density_sd_q05 = quantile(density_sd_resampled, 0.05))
# A tibble: 1 × 2
  density_sd_q95 density_sd_q05
           <dbl>          <dbl>
1          2246.           619.

To understand this better, let’s apply it to a smaller example with 9 penguins:

small_penguins <- penguins |> 
   drop_na() |> 
   group_by(species) |> 
   slice_sample(n=3)

small_penguins
# A tibble: 9 × 8
# Groups:   species [3]
  species   island bill_len bill_dep flipper_len body_mass sex     year
  <fct>     <fct>     <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie    Dream      42.3     21.2         191      4150 male    2007
2 Adelie    Biscoe     37.9     18.6         172      3150 female  2007
3 Adelie    Dream      36.9     18.6         189      3500 female  2008
4 Chinstrap Dream      45.2     16.6         191      3250 female  2009
5 Chinstrap Dream      47.6     18.3         195      3850 female  2008
6 Chinstrap Dream      50       19.5         196      3900 male    2007
7 Gentoo    Biscoe     45.2     14.8         212      5200 female  2009
8 Gentoo    Biscoe     46.5     14.4         217      4900 female  2008
9 Gentoo    Biscoe     45.8     14.6         210      4200 female  2007

If we use expand_grid(), we induce copies of each data set:

small_penguins_eg <- small_penguins |> expand_grid(replicate = 1:3)

small_penguins_eg
# A tibble: 27 × 9
   species  island bill_len bill_dep flipper_len body_mass sex    year replicate
   <fct>    <fct>     <dbl>    <dbl>       <int>     <int> <fct> <int>     <int>
 1 Adelie   Dream      42.3     21.2         191      4150 male   2007         1
 2 Adelie   Dream      42.3     21.2         191      4150 male   2007         2
 3 Adelie   Dream      42.3     21.2         191      4150 male   2007         3
 4 Adelie   Biscoe     37.9     18.6         172      3150 fema…  2007         1
 5 Adelie   Biscoe     37.9     18.6         172      3150 fema…  2007         2
 6 Adelie   Biscoe     37.9     18.6         172      3150 fema…  2007         3
 7 Adelie   Dream      36.9     18.6         189      3500 fema…  2008         1
 8 Adelie   Dream      36.9     18.6         189      3500 fema…  2008         2
 9 Adelie   Dream      36.9     18.6         189      3500 fema…  2008         3
10 Chinstr… Dream      45.2     16.6         191      3250 fema…  2009         1
# ℹ 17 more rows

We then use sample(..., replace=TRUE) to perform bootstrap sampling on the body_mass column:

small_penguins_eg_bs <- small_penguins_eg |> 
   mutate(body_mass_bs = sample(body_mass, replace=TRUE))

And finally, we use the replicate column to get multiple samples we can use to get a sense of variance:

small_penguins_eg_bs |> 
   group_by(replicate) |> 
   summarize(avg_body_mass = mean(body_mass_bs)) |> 
   summarize(sd_avg_body_mass = sd(avg_body_mass))
# A tibble: 1 × 1
  sd_avg_body_mass
             <dbl>
1             247.

With only three bootstrap replicates, this doesn’t give us much precision, so let’s increase the number of replicates to 5000.

small_penguins |> 
   expand_grid(replicate = 1:5000) |>
   mutate(body_mass_bs = sample(body_mass, replace=TRUE)) |>
   group_by(replicate) |> 
   summarize(avg_body_mass = mean(body_mass_bs)) |> 
   summarize(sd_avg_body_mass = sd(avg_body_mass))
# A tibble: 1 × 1
  sd_avg_body_mass
             <dbl>
1             221.

Usually, around 1000 bootstrap samples is enough to account for the randomness from the sample() in the third line.
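The same computation can also be written in base R with replicate() instead of expand_grid(); here on skewed toy data (made up), with the classical standard-error formula for comparison:

```r
set.seed(9750)
x <- rexp(200, rate = 1 / 5)   # skewed toy data standing in for body masses

B <- 1000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

sd(boot_means)                 # bootstrap estimate of the standard error
sd(x) / sqrt(length(x))        # classical formula, for comparison
```

For the mean, the two estimates should agree closely; the bootstrap earns its keep for statistics like the standard deviation, where no simple closed-form standard error is available.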

Returning to our population density SD example from above:

COUNTRIES |> 
    filter(area > 0) |> 
    mutate(density = population / area) |> 
    expand_grid(replicate=1:2000) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate) |> 
    summarize(density_sd_resampled = sd(density_resampled)) |> 
    summarize(density_sd_q95 = quantile(density_sd_resampled, 0.95), 
              density_sd_q05 = quantile(density_sd_resampled, 0.05))
# A tibble: 1 × 2
  density_sd_q95 density_sd_q05
           <dbl>          <dbl>
1          2284.           547.

This gives us a sense of the range of plausible standard deviations - or, more precisely, the range of standard deviations that are consistent with the observations we have seen.

We can plot these as well:

COUNTRIES |> 
    filter(area > 0) |> 
    mutate(density = population / area, 
           true_density_sd = sd(density)) |> 
    expand_grid(replicate=1:200) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate) |> 
    summarize(density_sd_resampled = sd(density_resampled), 
              true_density_sd = first(true_density_sd)) |> 
    ggplot(aes(x=density_sd_resampled)) + 
    geom_density() + 
    theme_bw() + 
    xlab("Standard Deviation of Density") + 
    ylab("Bootstrap Sampling Density") + 
    geom_area(stat="density", 
              aes(fill = after_stat(x) > quantile(x, 0.05) & 
                         after_stat(x) < quantile(x, 0.95))) + 
    scale_fill_manual(name="Central 90%", 
                      values=c("TRUE"="#4682B477",
                               "FALSE"=NA)) + 
    guides(fill="none") + 
    geom_vline(aes(xintercept=true_density_sd), 
               color="red4", 
               linewidth=2, 
               linetype=2) + 
    ggtitle("Sampling Distribution of Standard Deviation of Country Densities", 
            "True value (Red) lies in the middle of the sampling distribution")

We see here that, unsurprisingly, the bootstrap distribution is located around the ‘true’ value, with a second mode that likely corresponds to the times that the highest- and lowest-density countries are randomly excluded. This second mode provides strong evidence that country populations and areas are not normally distributed (if that weren’t already obvious).

In this case, if we, for whatever reason, had a null belief that the ‘true’ standard deviation was near 3000, it would be clear that this is quite incompatible with the observed value (red line) but also with the variance around that observed value (blue area).

Bootstrap Testing in Action

Now it’s your turn! Using the countries data downloaded above, test the null hypothesis that the standard deviation of population density is equal in African and non-African countries.

  1. Create a new (Boolean) variable to determine whether a country is in Africa or not.
Solution
COUNTRIES <- COUNTRIES |>
   mutate(in_africa = (continent == "Africa"), 
          density = population / area)
  2. Compute the standard deviation of density for both African and non-African countries, then compute the difference in standard deviations. This will be the ‘observed’ value we want to compare against the bootstrap samples.
Solution
delta_sd_observed <- COUNTRIES |> 
    group_by(in_africa) |> 
    summarize(sd_density = sd(density)) |> 
    mutate(in_africa = if_else(in_africa, "africa", "nonaf")) |>
    pivot_wider(names_from=in_africa,
                values_from=sd_density) |> 
    mutate(delta_sd = nonaf - africa) |> 
    pull(delta_sd)
  3. Modify the strategy above to compute standard deviations for African and non-African countries in each replicate.

    Hint: When forming replicates, we are assuming that the distribution of African and non-African countries is the same, so we do not need to modify our bootstrap sampling in any significant way. Here, the two-group structure is only used to define our test statistic.
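The whole testing recipe can be sketched in base R on simulated data where the null is true by construction (all names and numbers below are made up for illustration):

```r
set.seed(123)
g_africa <- rlnorm(40)    # stand-in for African densities
g_other  <- rlnorm(60)    # stand-in for non-African densities

observed <- sd(g_other) - sd(g_africa)

# Under the null, group labels carry no information, so we resample
# both groups from the pooled data and recompute the statistic each time
pooled <- c(g_africa, g_other)
delta_boot <- replicate(2000, {
    sd(sample(pooled, length(g_other), replace = TRUE)) -
        sd(sample(pooled, length(g_africa), replace = TRUE))
})

# Two-sided p-value: fraction of null draws at least as extreme as observed
p_value <- mean(abs(delta_boot) >= abs(observed))
```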

Solution
COUNTRIES |> 
    expand_grid(replicate=1:1000) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate, in_africa) |> 
    summarize(density_sd_resampled = sd(density_resampled))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by replicate and in_africa.
ℹ Output is grouped by replicate.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(replicate, in_africa))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate in_africa density_sd_resampled
       <int> <lgl>                    <dbl>
 1         1 FALSE                     878.
 2         1 TRUE                     2597.
 3         2 FALSE                     694.
 4         2 TRUE                      281.
 5         3 FALSE                    1465.
 6         3 TRUE                      401.
 7         4 FALSE                     959.
 8         4 TRUE                      494.
 9         5 FALSE                     937.
10         5 TRUE                      248.
# ℹ 1,990 more rows
  4. Now compute the difference in SDs for each replicate and construct a density plot, shading the middle 90% of the distribution of differences.
Solution
COUNTRIES |> 
    expand_grid(replicate=1:1000) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate, in_africa) |> 
    summarize(sd_density_resampled = sd(density_resampled)) |>
    mutate(in_africa = if_else(in_africa, "africa", "nonaf")) |>
    pivot_wider(names_from=in_africa,
                values_from=sd_density_resampled) |> 
    mutate(delta_sd_resampled = nonaf - africa) |>
    ggplot(aes(x=delta_sd_resampled)) + 
    geom_density() + 
    theme_bw() + 
    xlab("Difference in Standard Deviations of Population Density") + 
    ylab("Bootstrap Sampling Density") + 
    geom_area(stat="density", 
              aes(fill = after_stat(x) > quantile(x, 0.05) & 
                         after_stat(x) < quantile(x, 0.95))) + 
    scale_fill_manual(name="Central 90%", 
                      values=c("TRUE"="#4682B477",
                               "FALSE"=NA)) + 
    guides(fill="none")

  1. Add a vertical line for your observed value (computed above). If this falls outside the shaded area, we can reject the null of no difference between African and non-African countries.
TipSolution
COUNTRIES |> 
    expand_grid(replicate=1:1000) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate, in_africa) |> 
    summarize(sd_density_resampled = sd(density_resampled)) |>
    mutate(in_africa = if_else(in_africa, "africa", "nonaf")) |>
    pivot_wider(names_from=in_africa,
                values_from=sd_density_resampled) |> 
    mutate(delta_sd_resampled = nonaf - africa) |>
    ggplot(aes(x=delta_sd_resampled)) + 
    geom_density() + 
    theme_bw() + 
    xlab("Difference in Standard Deviations of Population Density") + 
    ylab("Bootstrap Sampling Density") + 
    geom_area(stat="density", 
              aes(fill = after_stat(x) > quantile(x, 0.05) & 
                         after_stat(x) < quantile(x, 0.95))) + 
    scale_fill_manual(name="Central 90%", 
                      values=c("TRUE"="#4682B477",
                               "FALSE"=NA)) + 
    guides(fill="none") + 
    geom_vline(xintercept=delta_sd_observed, 
               color="red4", 
               linewidth=2)

From here, we see that while there is a wider spread among non-African countries than there is among African countries, it is not inconsistent with ‘pure randomness’ that we could get if we randomly swapped the labels on this data. In particular, we see that we can get differences in standard deviations of \(\pm 2000\), so the observed difference of 1529 is not that big in context.3
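A sketch of the permutation variant mentioned in footnote 3, shuffling the Africa labels within each replicate instead of resampling densities (this assumes the same COUNTRIES data frame, with density and in_africa columns, used above):

```r
# Permutation null: shuffle the in_africa labels within each replicate,
# keeping the group sizes fixed, then recompute the difference in SDs
COUNTRIES |> 
    expand_grid(replicate = 1:1000) |> 
    group_by(replicate) |> 
    mutate(in_africa = sample(in_africa, replace = FALSE)) |> 
    group_by(replicate, in_africa) |> 
    summarize(sd_density_permuted = sd(density), .groups = "drop") |>
    mutate(in_africa = if_else(in_africa, "africa", "nonaf")) |>
    pivot_wider(names_from = in_africa,
                values_from = sd_density_permuted) |> 
    mutate(delta_sd_permuted = nonaf - africa)
```

The resulting delta_sd_permuted column plays the same role as delta_sd_resampled above, and the same density plot machinery applies.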

TipAll In One Go

If you want to avoid having delta_sd_observed as its own variable, this example shows how it can be handled as its own column to get the same results:

COUNTRIES |> 
    expand_grid(replicate=1:1000) |> 
    mutate(density_resampled = sample(density, replace=TRUE)) |> 
    group_by(replicate, in_africa) |> 
    summarize(sd_density_resampled = sd(density_resampled), 
              sd_density = sd(density), 
              .groups="drop") |>
    mutate(in_africa = if_else(in_africa, "africa", "nonafr")) |>
    pivot_wider(names_from=in_africa, 
                values_from=c(sd_density, sd_density_resampled)) |>
    mutate(delta_sd_resampled = sd_density_resampled_nonafr - sd_density_resampled_africa, 
           delta_sd = sd_density_nonafr - sd_density_africa ) |>
    ggplot(aes(x=delta_sd_resampled)) + 
    geom_density() + 
    theme_bw() + 
    xlab("Difference in Standard Deviations of Population Density") + 
    ylab("Bootstrap Sampling Density") + 
    geom_area(stat="density", 
              aes(fill = after_stat(x) > quantile(x, 0.05) & 
                         after_stat(x) < quantile(x, 0.95))) + 
    scale_fill_manual(name="Central 90%", 
                      values=c("TRUE"="#4682B477",
                               "FALSE"=NA)) + 
    guides(fill="none") + 
    geom_vline(aes(xintercept=delta_sd), 
               color="red4", 
               linewidth=2) + 
   ggtitle("Comparison of Population Density Variance Among African and Non-African Countries", 
           "Insignificant Evidence that Non-African Countries Exhibit Higher Density Variance")

I like to code these types of analyses ‘by hand’ and think that it is a useful exercise to clarify the logic of statistical testing; sometimes, particularly if you want to stay closer to a ‘standard’ test with standard test statistics, it may be easier to rely on a pre-packaged implementation. The infer package makes this type of computational inference quite easy and supports parametric, bootstrap, and permutation-based approaches to inference. See this vignette for more.
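For a taste of infer, its pipelines follow a specify/hypothesize/generate/calculate structure. The sketch below uses a difference in means, one of infer’s built-in statistics, rather than our difference in SDs, and again assumes the COUNTRIES data from above:

```r
library(infer)

# Declarative permutation test (sketch): null distribution for the
# difference in mean density between non-African and African countries
COUNTRIES |> 
    mutate(in_africa = if_else(in_africa, "africa", "nonaf")) |>
    specify(density ~ in_africa) |> 
    hypothesize(null = "independence") |> 
    generate(reps = 1000, type = "permute") |> 
    calculate(stat = "diff in means", order = c("nonaf", "africa"))
```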

Tidymodels

The tidymodels package provides an elegant interface for applying many “black box” machine learning models in a standardized manner. This is especially useful when you want to compare many different models to see which performs the best on a prediction task.4

In these notes, we’re going to use tidymodels for two tasks that may be useful as you finish your final projects:

  1. Building a model that takes in several features and uses them to predict an output of interest
  2. Using both feature importance scores and a leave-one-covariate-out (LOCO) strategy to determine which features are most important.

Tidymodels Basics

tidymodels provides a unified framework for building predictive models. In this context, we are solely focused on out-of-sample prediction and aren’t paying attention to things like coefficients or statistical significance. This single-minded focus makes it possible for tidymodels to “abstract” over different types of machine-learning models, using them only as prediction machines.5

We will demonstrate the use of the tidymodels ecosystem6 to build a predictive model on a data set of NYS public school test results. I have prepared two files containing relevant data:

  • School-Level Covariates (schools_covariates.csv)
  • School-Level Test Results (schools_test_results.csv)

The first file, schools_covariates.csv, has data on 4754 schools, covering information about class size, the number of teachers and support staff, the percentage of students who qualify for discounted or free lunches, and funding levels.7 The latter file, schools_test_results.csv, has 144921 test results, indexed by the school, the student subgroup (e.g., all male students or all non-white students), and the test subject. The variable we seek to predict is PERCENT_PROFICIENT, the fraction of students who achieved a ‘proficient’ score on that test.8

Pursuant to the Every Student Succeeds Act (ESSA) of 2015 and its predecessor, the No Child Left Behind Act (NCLB) of 2001, all public schools in New York State are required to administer standardized examinations annually to their students. Students’ results on these tests are used for a variety of purposes, including assessing teacher and school performance, providing additional recognition to high-performing schools, and identifying struggling students or subgroups. The New York State Education Department (NYSED) makes aggregate test results available through its annual “School Report Card” data releases. For this exercise, we will predict the percentage of students at each school who score in the “Proficient” range (i.e., at or above grade level) on a wide battery of standardized tests during the 2024 testing period.

New York administers two primary sets of standardized exams:9

  • Grade 3-8 Exams:
    • English Language Arts (“ELA”) given every year
    • Mathematics given every year
    • Science, typically given only in grade 5 and grade 8
  • Regents Exams for high school students

The Regents exams cover a broader variety of topics and, historically, have been a graduation requirement in NYS. For each exam, a “proficiency” level is determined: e.g., if an exam gives scores from “1” to “4”, levels “3” and “4” may be considered “proficient”. Your task is to predict the fraction of students who achieve proficiency.

The organization of public schools in New York State is, like everything in New York State, rather confusing. As of the time of writing, NYSED reports:

  • 730 Districts
  • 4,394 Public Schools
  • 373 Charter Schools
  • 215,701 Public School Teachers
  • 2,421,491 Public School Students

in the state. These districts range from very small (e.g., Newcomb CSD, with 47 students) to very large (e.g., NYC Public Schools, with just under a million students).

Outside of NYC, public schools are organized into local school districts, which typically have (at least) one elementary school, one middle school, and one high school.10 These districts are further organized into Boards of Cooperative Educational Services (“BOCES”), which provide shared resources to groups of smaller districts. Local school districts are principally funded by property taxes, with more affluent districts having a larger tax base and providing higher per-student funding levels. Local school districts are run by an elected school board, following statewide standards. To address differences in funding, NYSED provides supplemental funding to districts based on their assessed “Needs/Resource Capacity”, though work is ongoing to update and refine the current funding formula.

Within NYC, public schools are run by the NYC Department of Education (NYC DOE), under a commissioner and school board appointed by the mayor and borough presidents. NYCDOE is subdivided into 32 geographic school districts and two citywide districts providing specialized services for students with disabilities or students in alternative education programs.

Let’s read these files into R, join them together, and remove some identifiers that aren’t relevant to us:

library(tidyverse)
COVARIATES <- read_csv("schools_covariates.csv")
RESULTS <- read_csv("schools_test_results.csv")

DATA <- inner_join(RESULTS, 
                   COVARIATES, 
                   join_by(SCHOOL == SCHOOL)) |> 
    select(-ASSESSMENT_ID)

In the interest of speed, we’re only going to use a few features and a subset of the entire data set, though all of the code shown here can be scaled up to use all of the rows and columns in this data set:

DATA <- DATA |> 
    mutate(
        # Lunch discount as a proxy for family income
        LUNCH_DISCOUNT = PERCENT_FREE_LUNCH + PERCENT_REDUCED_LUNCH, 
        # Student-teacher-ratio
        STUDENT_TEACHER_RATIO = N_PUPILS / NUMBER_OF_TEACHERS, 
        # Does more funding actually help?
        FUNDING_PER_STUDENT = LOCAL_FUNDING_PER_PUPIL + 
            FEDERAL_FUNDING_PER_PUPIL) |>
    select(
        # Outcome of interest
        PERCENT_PROFICIENT, 
        # Population of interest: which test and which group of students?
        SUBGROUP_NAME, 
        ASSESSMENT_NAME, 
        # Three constructed features
        LUNCH_DISCOUNT, 
        STUDENT_TEACHER_RATIO, 
        FUNDING_PER_STUDENT) |> 
    slice_sample(n=10000)
CautionSimplified Modeling

The model demonstrated here ignores much of the interesting structure (schools within districts within regions, etc) of this data that a more effective model would take advantage of. The simple model we show here is intended to demonstrate the use of tidymodels, not to be the last word in modeling this type of data.

Prediction Models

For this demonstration, we are going to build a random forest (RF) predictor. Random forests are a powerful and flexible machine learning tool that can be fit efficiently and don’t require too much tuning.

tidymodels requires us to specify a few separate components of our model fitting pipeline:

  • A recipe describing how the prediction task is specified and any data pre-processing that is performed. This is the stage in which we specify which variables are to be used as predictors and which is our outcome. We also perform steps like data standardization, imputing missing values, handling categorical features, removing ‘low information features’, etc. here.
  • A model specification indicating what type of model we want to use. At this stage, in addition to specifying the type of model, we also have the option to specify any ‘knobs’ (hyperparameters) that control model behavior and, for models with multiple implementations, specify which package we want to use.
  • The range of tuning parameters (if any) to consider
  • Data used to tune and evaluate the model

Let us build these each in turn. We first build a recipe() for our task, a set of steps that define the roles of each variable (predictor or response) and any pre-processing, if necessary. In this case, we don’t need much pre-processing as we’re using a model family that’s quite robust.

library(tidymodels)

schools_proficiency_task <- recipe(PERCENT_PROFICIENT ~ ., data=DATA)

Here, we simply specify that we want to predict one column (PERCENT_PROFICIENT) using all other features (.). At this point, we have specified our goal, but we still need to specify the manner (model) by which we hope to achieve that goal.
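We don’t need them for a random forest, but each of the pre-processing tasks mentioned above corresponds to a recipes step_*() function. A sketch of a more heavily processed recipe (step choices illustrative; not used in what follows):

```r
# Illustrative only: a recipe performing common pre-processing steps
recipe(PERCENT_PROFICIENT ~ ., data = DATA) |> 
    step_impute_median(all_numeric_predictors()) |>  # fill in missing numeric values
    step_dummy(all_nominal_predictors()) |>          # encode categorical features
    step_zv(all_predictors()) |>                     # drop zero-variance ('low information') columns
    step_normalize(all_numeric_predictors())         # center and scale
```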

For this task, we’re going to specify a random forest model:

rf_model <- rand_forest() |> 
    set_mode("regression") |> 
    set_engine("ranger")

Here, we have specified a basic random forest and set two properties:

  • the mode specifies what type of random forest we are fitting. Here we are predicting a continuous response rather than a categorical or discrete one, so our task is ‘regression’ in the jargon of machine learning.
  • the engine specifies which R package we want to use. Here, we’re using the ranger package which provides a high-performance and high-quality implementation of the random forest framework.
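If you are unsure which implementations exist for a given model type, parsnip can list them for you (a quick check, not required for fitting):

```r
# List the engines (and modes) registered for random forest models;
# "ranger" should appear among them
show_engines("rand_forest")
```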

At this point, we can define a workflow() object that will put these two together:

schools_rf <- workflow() |> 
    add_recipe(schools_proficiency_task) |> 
    add_model(rf_model)

Finally, we can use this workflow to fit our model to data:

rf_fitted <- schools_rf |> fit(data=DATA)

At this point, we have a model that could be used to make predictions:

augment(rf_fitted, new_data=DATA) |> head(n=5)
# A tibble: 5 × 7
  .pred PERCENT_PROFICIENT SUBGROUP_NAME          ASSESSMENT_NAME LUNCH_DISCOUNT
  <dbl>              <dbl> <chr>                  <chr>                    <dbl>
1  92.8                 96 Female                 Regents NF Glo…              2
2  89.0                 87 Male                   Regents US His…             38
3  39.0                 47 Male                   Science8                    41
4  56.6                 50 Not Economically Disa… MATH3                       80
5  82.9                 85 Economically Disadvan… Regents Phy Se…             16
# ℹ 2 more variables: STUDENT_TEACHER_RATIO <dbl>, FUNDING_PER_STUDENT <dbl>

Indeed, we could even compute a set of accuracy measures for this problem:11

metrics <- metric_set(
    # Root Mean Squared Error
    rmse, 
    # R-Squared
    rsq, 
    # Mean Absolute Error
    mae, 
    # Mean Absolute Percentage Error, 
    mape)

augment(rf_fitted, new_data=DATA) |> 
    metrics(truth=PERCENT_PROFICIENT, 
            estimate=.pred)
# A tibble: 4 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      10.2  
2 rsq     standard       0.880
3 mae     standard       7.96 
4 mape    standard     Inf    

So that last row (“percent error”) goes a bit janky because MAPE divides by the true value, and we have 105 tests on which 0% of students were at grade level. We can exclude these from our analysis and use other tidyverse techniques to see how our model performs on different tests:

library(gt)
rf_fitted |> 
    augment(new_data=DATA) |> 
    filter(PERCENT_PROFICIENT > 0) |> 
    group_by(ASSESSMENT_NAME) |> 
    metrics(truth=PERCENT_PROFICIENT, 
            estimate=.pred) |> 
    select(-.estimator) |> 
    pivot_wider(names_from=.metric, values_from=.estimate) |>
    mutate(mape = mape/100) |> 
    gt(id="tbl_exams_1", rowname_col = "ASSESSMENT_NAME") |>
    cols_label(rmse="Root Mean Squared Error", 
               rsq="R^2", 
               mae="Mean Absolute Error", 
               mape="Mean Absolute Percent Error") |>
    fmt_percent(c(rsq, mape)) |> 
    fmt_number(c(rmse, mae))
Root Mean Squared Error R^2 Mean Absolute Error Mean Absolute Percent Error
Combined6Math 11.63 82.96% 9.28 25.65%
Combined7Math 9.95 81.62% 7.86 22.30%
Combined8Math 9.78 87.85% 7.99 19.98%
CombinedScience 8.47 93.57% 6.92 25.39%
ELA3 10.95 81.46% 8.71 31.21%
ELA4 9.91 86.10% 8.10 26.12%
ELA5 9.76 85.53% 7.77 29.77%
ELA6 9.62 86.11% 7.57 25.83%
ELA7 9.69 86.63% 7.44 21.48%
ELA8 8.08 88.49% 6.40 15.61%
MATH3 11.24 84.82% 9.04 25.46%
MATH4 11.58 81.36% 9.30 23.95%
MATH5 11.53 83.30% 9.10 33.97%
MATH6 11.07 84.45% 8.86 27.02%
MATH7 11.29 81.90% 8.81 23.02%
MATH8 10.99 85.79% 8.80 38.37%
Regents Algebra I 10.94 87.53% 8.72 20.58%
Regents Common Core Algebra I 9.56 90.64% 7.98 38.58%
Regents Common Core Algebra II 8.68 91.86% 6.51 24.32%
Regents Common Core English Language Art 7.43 85.20% 5.62 8.30%
Regents Common Core Geometry 10.79 91.44% 8.49 51.51%
Regents Living Environment 9.90 88.62% 7.73 17.83%
Regents NF Global History 7.97 85.41% 6.22 9.72%
Regents Phy Set/Chemistry 9.05 91.17% 6.98 34.82%
Regents Phy Set/Earth Sci 9.28 89.44% 7.19 20.20%
Regents Phy Set/Physics 9.48 87.18% 7.86 13.96%
Regents US History&Gov't (Framework) 7.92 86.79% 5.88 10.48%
RegentsMath7 9.12 44.14% 7.63 7.85%
RegentsMath8 9.86 88.53% 6.63 19.68%
RegentsScience8 8.37 86.42% 5.97 10.93%
Science5 8.26 86.88% 6.45 29.09%
Science8 8.26 86.13% 6.46 39.53%

This is good, as far as it goes, but we have violated several ML best practices. In fact, the new_data argument to augment() should have been a bit of a warning sign. Here, we are evaluating the model on the data used to train it.

A better practice is to hold some of our data aside and use it for model evaluation. This can be done in several ways, but here we’re going to use a technique known as cross-validation wherein we split the data into several pieces and fit the model several times, holding out one piece for evaluation each time.
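As noted, this can be done in several ways; the simplest is a single train/test split, which rsample also supports. A sketch (the 80/20 proportion is illustrative), reusing the schools_rf workflow and metrics set defined above:

```r
# Hold out 20% of the rows for evaluation (sketch; proportions illustrative)
schools_split <- initial_split(DATA, prop = 0.8)

rf_holdout_fit <- schools_rf |> fit(data = training(schools_split))

augment(rf_holdout_fit, new_data = testing(schools_split)) |> 
    metrics(truth = PERCENT_PROFICIENT, estimate = .pred)
```

Cross-validation repeats this idea several times so that every observation is held out exactly once, at the cost of fitting the model several times.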

This is conceptually straightforward, but the book-keeping can be a bit of a pain. Thankfully, tidymodels is available to help us:

schools_cv <- vfold_cv(DATA, v=5)

schools_rf_cv <- schools_rf |> fit_resamples(schools_cv)

Note that here we use fit_resamples() instead of a plain fit() to account for this repeated fitting and evaluation structure. In this case, we also see that several metrics were automatically computed for us:

schools_rf_cv |> select(id, .metrics) |> unnest(.metrics) |> head(n=5)
# A tibble: 5 × 5
  id    .metric .estimator .estimate .config        
  <chr> <chr>   <chr>          <dbl> <chr>          
1 Fold1 rmse    standard      18.7   pre0_mod0_post0
2 Fold1 rsq     standard       0.480 pre0_mod0_post0
3 Fold2 rmse    standard      18.6   pre0_mod0_post0
4 Fold2 rsq     standard       0.522 pre0_mod0_post0
5 Fold3 rmse    standard      18.9   pre0_mod0_post0

If we want to use all four metrics from before, we can pass the metrics argument to fit_resamples(). Resampling also gives us a sense of error bars on our metrics, which we can compute by hand or with the help of collect_metrics():

schools_rf |> fit_resamples(schools_cv, metrics = metrics) |> collect_metrics()
# A tibble: 4 × 6
  .metric .estimator    mean     n   std_err .config        
  <chr>   <chr>        <dbl> <int>     <dbl> <chr>          
1 mae     standard    14.7       5   0.146   pre0_mod0_post0
2 mape    standard   Inf         5 NaN       pre0_mod0_post0
3 rmse    standard    18.6       5   0.180   pre0_mod0_post0
4 rsq     standard     0.504     5   0.00852 pre0_mod0_post0

If we want to get group-wise statistics as before, we still need to do a bit of work by hand.

schools_rf |> 
    fit_resamples(schools_cv, 
                  metrics = metrics, 
                  control = control_resamples(save_pred = TRUE)) |>
    collect_predictions() |> 
    arrange(.row) |> 
    inner_join(DATA |> mutate(.row=row_number()), 
               join_by(.row, 
                       PERCENT_PROFICIENT)) |> 
    filter(PERCENT_PROFICIENT > 0) |> 
    group_by(ASSESSMENT_NAME, id) |>
    metrics(truth=PERCENT_PROFICIENT, 
            estimate=.pred) |> 
    group_by(ASSESSMENT_NAME, 
             .metric) |> 
    summarize(mean_metric = mean(.estimate), 
              sd_metric = sd(.estimate), 
              se_metric = sd_metric / sqrt(n())) |>
    ungroup() |>
    select(-sd_metric) |> 
    pivot_wider(values_from=c(mean_metric, se_metric), 
                names_from=.metric) |> 
    mutate(mean_metric_mape = mean_metric_mape / 100, 
           se_metric_mape = se_metric_mape / 100) |>
    gt(id="tbl_exams_2", 
       rowname_col = "ASSESSMENT_NAME") |> 
    cols_label(mean_metric_rmse="Root Mean Squared Error", 
               mean_metric_rsq="R^2",
               mean_metric_mae="Mean Absolute Error", 
               mean_metric_mape="Mean Absolute Percent Error") |>
    fmt_percent(c(mean_metric_rsq, se_metric_rsq, 
                  mean_metric_mape, se_metric_mape)) |>
    fmt_number(c(mean_metric_rmse, se_metric_rmse, 
                  mean_metric_mae, se_metric_mae)) |>
    cols_merge(c(mean_metric_rmse, se_metric_rmse), pattern="{1}±{2}") |>
    cols_merge(c(mean_metric_rsq, se_metric_rsq), pattern="{1}±{2}") |>
    cols_merge(c(mean_metric_mae, se_metric_mae), pattern="{1}±{2}") |>
    cols_merge(c(mean_metric_mape, se_metric_mape), pattern="{1}±{2}") |> 
    sub_missing() |>
    tab_header("Prediction Error Metrics by Exam", 
               "CV Prediction Error of a Random Forest Model Predicting Student Exam Proficiency Using Funding, SES, and Student-Teacher Ratios")
Prediction Error Metrics by Exam
CV Prediction Error of a Random Forest Model Predicting Student Exam Proficiency Using Funding, SES, and Student-Teacher Ratios
Mean Absolute Error Mean Absolute Percent Error Root Mean Squared Error R^2
Combined6Math 14.98±0.89 41.32%±8.45% 18.17±0.91 41.79%±9.02%
Combined7Math 12.95±0.55 38.92%±8.47% 15.73±0.49 39.41%±6.75%
Combined8Math 15.29±1.34 38.14%±4.81% 18.14±1.69 48.79%±7.02%
CombinedScience 12.17±0.98 44.98%±6.25% 14.91±1.29 67.33%±5.25%
ELA3 14.48±0.53 53.94%±4.81% 17.92±0.66 30.72%±2.35%
ELA4 14.70±0.33 47.83%±2.80% 17.85±0.53 33.51%±3.19%
ELA5 13.89±0.32 54.20%±6.13% 17.33±0.56 32.62%±5.44%
ELA6 13.97±0.76 49.58%±4.69% 17.24±0.75 35.19%±2.68%
ELA7 13.52±0.35 38.96%±1.70% 17.14±0.56 32.97%±1.95%
ELA8 12.15±0.46 29.97%±2.51% 15.22±0.57 39.35%±3.69%
MATH3 16.04±0.41 45.41%±6.06% 19.65±0.45 25.42%±2.13%
MATH4 15.15±0.48 39.17%±1.77% 18.63±0.61 32.85%±3.09%
MATH5 15.29±0.35 58.56%±4.91% 19.26±0.43 32.14%±3.09%
MATH6 15.30±0.23 47.25%±5.26% 18.78±0.45 35.12%±2.85%
MATH7 15.47±0.55 40.77%±4.30% 19.35±0.48 28.15%±2.98%
MATH8 17.66±1.21 81.95%±12.04% 21.90±1.76 16.36%±6.04%
Regents Algebra I 17.82±0.31 42.82%±4.10% 22.08±0.40 37.00%±2.90%
Regents Common Core Algebra I 16.59±0.31 78.45%±7.91% 19.89±0.39 38.37%±5.87%
Regents Common Core Algebra II 13.01±0.90 47.31%±8.48% 16.75±1.25 65.66%±4.81%
Regents Common Core English Language Art 10.94±0.34 16.01%±0.70% 13.91±0.40 49.61%±4.33%
Regents Common Core Geometry 16.75±0.95 97.12%±9.08% 20.58±1.12 57.41%±4.29%
Regents Living Environment 15.71±0.34 36.54%±3.14% 19.90±0.49 39.31%±4.13%
Regents NF Global History 12.11±0.42 18.79%±1.20% 15.20±0.41 46.96%±4.48%
Regents Phy Set/Chemistry 14.55±0.65 65.14%±19.61% 18.31±0.65 51.85%±4.85%
Regents Phy Set/Earth Sci 14.71±0.60 41.38%±8.72% 18.90±0.83 43.17%±5.02%
Regents Phy Set/Physics 15.90±0.92 27.59%±3.12% 19.19±0.85 29.19%±7.40%
Regents US History&Gov't (Framework) 11.47±0.57 20.57%±2.15% 15.21±0.87 46.85%±4.49%
RegentsMath7 15.00±7.37 15.45%±7.80% 15.15±7.52 —±—
RegentsMath8 14.27±1.48 39.91%±10.37% 19.49±2.61 34.44%±14.83%
RegentsScience8 12.48±0.99 22.79%±2.86% 17.04±1.65 43.66%±11.72%
Science5 12.09±0.45 54.95%±7.73% 15.40±0.49 34.57%±3.49%
Science8 12.67±0.49 76.05%±14.26% 16.05±0.60 18.70%±2.43%

Not surprisingly, our performance here is worse than we saw above, because we are now getting ‘out-of-sample’ results.

This is rather messy, but it gives us a good sense of how well our model can do on different exams. In general, we see that our predictions are off by roughly 15 to 20 percentage points (RMSE) using these three features, so our model isn’t great yet.
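One route to a better model is the fourth pipeline component listed earlier, tuning. A sketch using tune_grid() (which hyperparameters to tune, and the grid size, are illustrative choices):

```r
# Mark hyperparameters for tuning (illustrative choices)
rf_tunable <- rand_forest(mtry = tune(), min_n = tune()) |> 
    set_mode("regression") |> 
    set_engine("ranger")

# Evaluate a grid of candidate settings on the same CV folds
rf_tune_results <- workflow() |> 
    add_recipe(schools_proficiency_task) |> 
    add_model(rf_tunable) |> 
    tune_grid(resamples = schools_cv, grid = 10)

# Inspect the best settings by cross-validated RMSE
show_best(rf_tune_results, metric = "rmse")
```

From here, select_best() and finalize_workflow() can be used to refit the winning configuration on the full data.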

LOCO Inference

With our model in hand, we now want to ask which of our three big features is most important. In general, when assessing feature importance we have a choice between methods intrinsic to the model (e.g., regression coefficients for OLS) or methods extrinsic to the model. Extrinsic methods are necessarily less specific, but useful because they can be compared across different models.

Here, we’re going to demonstrate the use of an extrinsic method known as Leave-One-Covariate-Out (LOCO) feature importance. In brief, LOCO refits the model without a given feature and compares its predictive performance to that of the full model. If performance degrades significantly, the feature was important; if it doesn’t change much, either the model didn’t really use the feature or the feature’s value could be replaced by other correlated features. Either way, a small change (low LOCO score) indicates that the feature is not that important.

The following code can be used to compute the LOCO scores here:

calc_loco <- function(workflow_obj, data, metrics = NULL, 
                      loco_groups = NULL, v = 5) {
    # Step 1: Extract LOCO Features
    # 
    # We want loco-groups to be a list of features (or groups of features) we can
    # iterate over. This does a bit of pre-processing to automatically produce
    # the list if not provided or to convert a vector to a list of singletons
    # if a character vector is provided
    if (is.null(loco_groups)) {
        # If the loco_groups aren't provided, let's just pull out each predictor
        # from the recipe component of the workflow object and use those as our
        # features to LOCO
        loco_groups <- workflow_obj |> 
            extract_recipe(estimated=FALSE) |> 
            summary() |> 
            filter(role == "predictor") |> 
            pull(variable) |> 
            set_names() |> 
            as.list()
    } else if (is.character(loco_groups)) {
        loco_groups <- loco_groups |> 
            set_names() |> 
            as.list()
    }
    
    # Step 2 - Set up resampling. If given an rset (resampling set) object, just
    # use that. Otherwise, automatically perform v-fold CV (v from above) internally
    folds <- if (inherits(data, "data.frame") && !inherits(data, "rset")) {
        vfold_cv(data, v = v)
    } else {
        data
    }
    
    # Step 3. Implementing LOCO. 
    #
    # This function will take a list of variables (if provided) and temporarily
    # remove them from the predictors (marking them as 'LOCO' role instead) 
    # and re-fit the model. tidymodels will automatically ignore LOCO-marked 
    # variables, giving us the predictions we need to get our LOCO scores. 
    # 
    # This function only takes variables to drop - everything else is captured
    # dynamically from the enclosing environment
    run_loco <- function(drop_vars) {
        if(is.null(drop_vars)) {
            loco_wflow <- workflow_obj
        } else {
            wflow_recipe <- workflow_obj |> 
                extract_recipe(estimated = FALSE) |> 
                update_role(all_of(drop_vars),
                            new_role="LOCO")
            loco_wflow <- workflow_obj |> update_recipe(wflow_recipe)
        }
        
        # Fit model to resampled data and get results
        loco_wflow |>
            fit_resamples(resamples = folds, 
                          metrics = metrics) |> 
            collect_metrics() |> 
            select(-.estimator, -.config) |> 
            mutate(direction = map_chr(.metric, \(m) get(m) |> attr("direction")))
    }
    
    # Step 4. Showtime. First fit the full model, then iterate through loco_groups
    # performing each LOCO
    full_results <- run_loco(NULL) |> 
        mutate(LOCO_feature="FULL_MODEL") |>
        select(-n, -std_err, -LOCO_feature)
    loco_results <- imap(loco_groups, \(x, y) run_loco(x) |> mutate(LOCO_feature=y)) |>
        list_rbind() |> 
        select(-n, -std_err)
    
    inner_join(loco_results, 
               full_results, 
               join_by(.metric, direction), 
               suffix=c(".LOCO", ".FULL")) |> 
        mutate(loco_value = mean.FULL - mean.LOCO, 
               loco_value = if_else(direction == "minimize", -1, 1) * loco_value) |> 
    rename(loco_score = loco_value, 
           metric=.metric, 
           full_model=mean.FULL, 
           loco_model=mean.LOCO)
}

We can apply this to get LOCO scores for this model:

schools_loco <- calc_loco(schools_rf, schools_cv)

which gives us:

Dropped Feature Root Mean Squared Error R^2
SUBGROUP_NAME 1.09 (18.58 -> 19.67) 0.06 (0.5 -> 0.44)
ASSESSMENT_NAME 3.88 (18.58 -> 22.46) 0.22 (0.5 -> 0.28)
LUNCH_DISCOUNT 4.22 (18.58 -> 22.8) 0.25 (0.5 -> 0.26)
STUDENT_TEACHER_RATIO 0.79 (18.58 -> 19.37) 0.04 (0.5 -> 0.46)
FUNDING_PER_STUDENT 0.29 (18.58 -> 18.87) 0.02 (0.5 -> 0.49)

From this, we see that knowing the subject and grade (via ASSESSMENT_NAME) is the most important information in predicting performance, followed very closely by family socioeconomic status (LUNCH_DISCOUNT). By contrast, student demographics (via SUBGROUP_NAME) or school properties (FUNDING_PER_STUDENT and STUDENT_TEACHER_RATIO) seem much less important.

While powerful, this LOCO framework has a few limitations to be aware of:

  1. It can only measure importance vis-à-vis the type of model being fit (here, a random forest) and is not an abstract ‘model-free’ measure of importance.12 Theoretically, some other model could take better advantage of a feature that random forests can’t really make much use of in this data.
  2. LOCO scores are only meaningful insofar as the model itself is predictive. Knowing how useful a feature is to a useless model is not actually valuable.
  3. Fitting LOCO can be quite computationally intensive: if we have \(p\) features and use 5-fold CV, we have to fit \(5(p+1)\) models. If these are slow, this can start to add up. To improve performance, we can either parallelize at the model fitting level (set_engine("ranger", num.threads=4)) or at the cross-validation level by registering a parallelization system and letting tidymodels pick it up automatically.
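To sketch the cross-validation-level route: recent versions of tidymodels pick up a future-based parallel plan automatically (older versions of tune instead relied on foreach backends such as doParallel::registerDoParallel()); the worker count below is illustrative:

```r
library(future)

# Register a parallel backend; tidymodels detects and uses it automatically
plan(multisession, workers = 4)

schools_loco_parallel <- calc_loco(schools_rf, schools_cv)

plan(sequential)  # return to single-process execution when done
```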

Going Deeper with tidymodels

To learn more about tidymodels, review the introductory case study at https://www.tidymodels.org/start/.

Footnotes

  1. More precisely, a uniform law of large numbers called the Glivenko-Cantelli theorem.↩︎

  2. Even sampling from a continuous distribution theoretically allows us to get repeated values; in practice, these are ‘probability zero’ types of events and never happen. But it’s not actually impossible - just never going to happen!↩︎

  3. We also could have permuted the labels of Africa / non-Africa here to get a similar type of test. In finite samples, this test is likely to be just a bit more reliable (since we don’t rely on the ‘enough data to estimate a true distribution’ argument that powers bootstrapping), but the differences aren’t huge. In practice, this comes down to shuffling rather than resampling, which can also be done with the sample function, but now with sample(..., replace=FALSE). Permutation testing is a bit simpler and better when possible, but not everything is easily done with a permutation test.↩︎

  4. As someone who has dedicated a large part of my career to the subtle differences between these models, this sort of ‘easy swappability’ rubs me the wrong way, but it is immensely useful in applied work. If you do want to learn what these models actually do - and when one should be preferred over the other - I’d encourage you to take STA 9890 and STA 9891, which cover these types of “ML” models in more detail.↩︎

  5. This type of framework uses models as ‘black-boxes’ because the user is not required to, nor able to, interrogate the parameters of the model. Indeed, for some of the most powerful and useful models we will consider, the model does not have parameters in the classical sense.↩︎

  6. Like the tidyverse packages, we will typically load the tidymodels packages through a helper package tidymodels that loads several packages for us. Running library(tidymodels) is functionally equivalent to something like:

    library(dials)
    library(broom)
    library(hardhat)
    library(parsnip)
    library(recipes)
    library(rsample)
    library(tune)
    library(workflows)
    library(yardstick)
    ↩︎
  7. There are 52 school covariates in total, but we won’t use them all in this exercise.↩︎

  8. While these are percentages, they are scaled up by 100, so a value of “50” indicates that half (50%) of students were proficient.↩︎

  9. We are ignoring alternative assessments like the NYSAA for this exercise.↩︎

  10. Very small districts may consolidate into two (or even one) school serving a wider variety of grades or arrange for their students to be educated within a neighboring district.↩︎

  11. It is not clear to me why we have to specify the estimate and truth columns explicitly here. It seems like the kind of thing that should be inferred from the recipe, but it’s the way it is (for now).↩︎

  12. Scores like LOCO are sometimes called ‘model-free’, but I prefer to call them ‘model-agnostic’ or ‘model-flexible’. They can be applied to any single model at a time, but their results still depend on the model choice.↩︎
