STA 9750 Mini-Project #04: Going for the Gold

\[\newcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}}\]

Due Dates

  • Released to Students: 2026-04-23
  • Initial Submission: 2026-05-15 11:59pm ET on GitHub and Brightspace
  • Peer Feedback:
    • Peer Feedback Assigned: 2026-05-18 on GitHub
    • Peer Feedback Due: 2026-05-24 11:59pm ET on GitHub

Estimated Time to Complete: 13-15 Hours

Estimated Time for Peer Feedback: 1 Hour


Introduction

Welcome to Mini-Project #04!

This project will bring together all of the skills you have developed in this course:

  • Accessing data from the web using HTTP requests (httr2)
  • Extracting data from HTML and cleaning it (rvest, stringr)
  • Joining together data from different tables (multi-table dplyr)
  • “Pivoting” and preparing data for analysis (single-table dplyr)
  • Exploring and visualizing data (ggplot2)
  • Statistical analysis and inference (infer or stats)
  • Communicating your findings using web-based reproducible research tools (Quarto)

You will use these skills to deliver a very early prediction of Team USA’s medal haul from the upcoming 2028 Los Angeles Summer Olympics.

Note that, compared to previous mini-projects, the scope of this project is a bit smaller: in light of this, and the more advanced skills you have spent the past 3 months developing, this mini-project should be the least demanding of the course. (It may still prove the most difficult, simply because you are responsible for more steps of the data analysis than in any prior project.) At this point in the course, you should be spending the majority of your out-of-class hours on your Course Project.

The Final Mini-Project

This mini-project completes our whirlwind tour of several different forms of data-driven writing:

  • MP01: Op-Ed Writing
  • MP02: Market Research
  • MP03: Targeted Advertising, Political Action
  • MP04: Popular Media, Fundraising Pitch

Within these projects, you have developed and showcased a diverse array of skills:

  • MP01: Data Summarization, Exploratory Analysis, Table Formatting
  • MP02: Combining Data, Data Visualization
  • MP03: Data Import, API Usage, File Parsing
  • MP04: Web Scraping, Data Cleaning, Forecasting

The skills you have developed are powerful and flexible and can be used across a variety of application areas:

  • MP01: Official Statistics, Demography
  • MP02: Survey Data, Social Sciences
  • MP03: Census Data, Migration Flows
  • MP04: Sports Analytics

There are, of course, many other ways that data can be used to generate and communicate insights and many other topics where your skills can be applied, but hopefully this “hit parade” has given you a sense of just how widely your new skills can take you. In each of these domains, you have seen how sophisticated and thoughtful analysis - not simply computing a single mean or regression coefficient - has allowed deeper understanding of the complexities of real world problems. You have learned to think beyond the simple binary “correct/incorrect” of an analysis and to see that even superficially simple questions - “is this true?”, “what is the average?”, “is there a correlation?” - can lead down deep and rewarding rabbit holes of complexity. But those rabbit holes are where your real power as a data analyst can be found.

The tools of quantitative analysis and communication you have developed in this course can be used in essentially infinite contexts (we have only scratched the surface), and I’m excited to see what you do in the remainder of this course, in your remaining time at Baruch, and in your future careers.

Student Responsibilities

Recall our basic analytic workflow and table of student responsibilities:

  • Data Ingest and Cleaning: Given a data source, read it into R and transform it to a reasonably useful and standardized (‘tidy’) format.
  • Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
  • Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups
  • Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics
  • Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.

In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:

Students’ Responsibilities in Mini-Project Analyses
Ingest and Cleaning Combination and Alignment Descriptive Statistical Analysis Visualization
Mini-Project #01
Mini-Project #02 ½
Mini-Project #03 ½
Mini-Project #04

In this mini-project, you are in charge of the whole pipeline, from data acquisition to statistical inference. The rubric below evaluates your work on all aspects of this project.

Rubric

STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course staff (GTAs and the instructor). The following basic rubric will be used for all mini-projects:

Course Element Excellent (9-10) Great (7-8) Good (5-6) Adequate (3-4) Needs Improvement (1-2)
Written Communication Report is very well-written and flows naturally. Motivation for key steps is clearly explained to reader without excessive detail. Key findings are highlighted and appropriately given sufficient context, including reference to related work where appropriate. Report has no grammatical or writing issues.1 Writing is accessible and flows naturally. Key findings are highlighted and clearly explained, but lack suitable motivation and context. Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted or unclearly explained. Writing is intelligible, but has some grammatical errors. Key findings are difficult to discern. Report exhibits significant weakness in written communication. Key points are nearly impossible to identify.
Project Skeleton Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are especially insightful and creative. Code completes all instructor-provided tasks satisfactorily. Responses to open-ended tasks are insightful, creative, and do not have any minor flaws. Response to one instructor provided task is skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are solid and without serious flaws. Responses to two instructor provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are acceptable, but have at least one serious flaw. Response to three or more instructor provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are seriously lacking.
Tables & Document Presentation Tables go beyond standard publication-quality formatting, using advanced features like color formatting, interactivity, or embedded visualization. Tables are well-formatted, with publication-quality selection of data to present, formatting of table contents (e.g., significant figures) and column names. Tables are well-formatted, but still have room for improvement in one of these categories: subsetting and selection of data to present, formatting of table contents (e.g., significant figures), column names. Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style. Document is difficult to read due to distracting formatting choices. Unfiltered ‘data dump’ instead of curated table. Document is illegible at points.
Data Visualization Figures go beyond standard publication-quality formatting, using advanced features like animation, interactivity, or advanced plot types implemented in ggplot2 extension packages. Figures are ‘publication-quality,’ with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, and captions, etc. Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in one-to-two ways. Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in three or more distinct ways. Figures are suitable to support claims made, but are ‘exploratory-quality,’ reflecting zero-to-minimal effort to customize and ‘polish’ beyond ggplot2 defaults.
Exploratory Data Analysis Deep and ‘story-telling’ EDA identifying non-obvious patterns that are then used to drive further analysis in support of the project. All patterns and irregularities are noted and well characterized, demonstrating mastery and deep understanding of all data sets used. Meaningful ‘story-telling’ EDA identifying non-obvious patterns in the data. Major and minor patterns and irregularities are noted and well characterized at a level sufficient to achieve the goals of the analysis. EDA demonstrates clear understanding of all data sets used. Extensive EDA that thoroughly explores the data, but lacks narrative and does not deliver a meaningful ‘story’ to the reader. Obvious patterns or irregularities noted and well characterized, but more subtle structure may be overlooked or not fully discussed. EDA demonstrates competence and basic understanding of the data sets used. Solid EDA that identifies major structure to the data, but does not fully explore all relevant structure. Obvious patterns or irregularities ignored or missed. EDA demonstrates familiarity with high-level structure of the data sets used. Minimal EDA, covering only standard summary statistics, and providing limited insight into data patterns or irregularities. EDA fails to demonstrate familiarity with even the most basic properties of the data sets being analyzed.

Code Quality Code is (near) flawless. Intent is clear throughout and all code is efficient, clear, and fully idiomatic. Code passes all styler and lintr type analyses without issue. Comments give context and structure of the analysis, not simply defining functions used in a particular line. Intent is clear throughout, but code can be minorly improved in certain sections. Code has well-chosen variable names and basic comments. Intent is generally clear, though some sections may be messy and code may have serious clarity or efficiency issues. Code executes properly, but is difficult to read. Intent is generally clear and code is messy or inefficient. Code fails to execute properly.

Data Preparation Data import is fully-automated and efficient, taking care to only download from web-sources if not available locally. All data cleaning steps are fully-automated and robustly implemented, yielding a clean data set that can be widely used. Data is imported and prepared effectively, in an automated fashion with minimal hard-coding of URLs and file paths. Data cleaning is fully-automated and sufficient to address all issues relevant to the analysis at hand. Data is imported and prepared effectively, though source and destination file names are hard-coded. Data cleaning is rather manual and hard-codes most transformations. Data is imported in a manner likely to have errors. Data cleaning is insufficient and fails to address clear problems. Data is hard-coded and not imported from an external source.
Analysis and Findings Analysis demonstrates uncommon insight and quality, providing unexpected and subtle insights. Analysis is clear and convincing, leaving essentially no doubts about correctness. Analysis clearly appears to be correct and passes the “sniff test” for all findings, but a detailed review notes some questions remain unanswered. Analysis is not clearly flawed at any point and is likely to be within the right order of magnitude for all findings. Analysis is clearly incorrect in at least one major finding, reporting clearly implausible results that are likely off by an order of magnitude or more.

Note that the “Excellent” category for most elements applies only to truly exceptional “above-and-beyond” work. Most student submissions will likely fall in the “Good” to “Great” range.

For this mini-project, students are responsible for all elements of the analysis and will be evaluated on all rubric elements.

For this mini-project, no more than 7 total points of extra credit can be awarded. Opportunities for extra credit exist for students who go above and beyond the instructor-provided scaffolding. Specific opportunities for extra credit can be found below.

Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to

  1. further refine their skills;
  2. learn additional techniques that can be used in the final course project; and
  3. develop a more impressive professional portfolio.

Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!

Submission Instructions

After completing the analysis, write up your findings, showing all of your code, using a dynamic quarto document and post it to your course repository. The qmd file should be named mp04.qmd (lower case!) so that the rendered document can be found at docs/mp04.html in your repository and will be served at the URL:2

https://YOUR_GITHUB_ID.github.io/STA9750-2026-SPRING/mp04.html

You can use the helper function mp_start available in the Course Helper Functions to create a file with the appropriate name and some meta-data already included. Do so by running the following command at the R Console:

source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_start(N=04)

After completing this mini-project, upload your rendered output and necessary ancillary files to GitHub to make sure your site works. The mp_submission_ready function in the Course Helper Functions can perform some of these checks automatically. You can run this function by running the following commands at the R Console:

source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_ready(N=04)

Once you confirm this website works (substituting YOUR_GITHUB_ID for the actual GitHub username you provided to the professor in MP#00, of course), open a GitHub issue on the instructor’s repository to submit your completed work.

The easiest way to do so is by use of the mp_submission_create function in the Course Helper Functions, which can be used by running the following command at the R Console:

source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_create(N=04)

Alternatively, if you wish to submit manually, open a new issue at

https://github.com/michaelweylandt/STA9750-2026-SPRING/issues/new .

Title the issue STA 9750 YOUR_GITHUB_ID MiniProject #04 and fill in the following text for the issue:

Hi @michaelweylandt!

I've uploaded my work for MiniProject #**04** - check it out!

<https://YOUR_GITHUB_ID.github.io/STA9750-2026-SPRING/mp04.html>

At various points before and after the submission deadline, the instructor will run some automated checks to ensure your submission has all necessary components. Please respond to any issues raised in a timely fashion as failing to address them may lead to a lower set of scores when graded.

Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.

NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.

NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.

Mini-Project #04: Going for the Gold

With the 2026 Winter Olympics complete, the world is now preparing for the 2028 Summer Olympics, to be held in Los Angeles, California.3 At the same time, sports betting (and indeed, gambling on essentially everything) seems to be an ever-more common societal phenomenon.4 As such, serious money is likely to be wagered on Team USA’s performance in the 2028 Olympics. Your task in this final mini-project is to build a (relatively simple) model to predict the total number of gold medals Team USA is likely to win in 2028.

Your model will need to consider at least three factors. For extra credit, you will be able to consider other factors to build a (hopefully!) more accurate forecast; see below for more. Your model needs to incorporate three primary effects:

  1. The baseline performance of Team USA. As a large and rich nation that spans several different climates (sun and snow), the United States has long performed well in both the Winter and Summer Olympics.

  2. The Host Nation effect. Most analyses suggest that the host nation overperforms relative to its performance in the Olympiad immediately before or afterwards. The reasons for this are not entirely clear: they seem to be some combination of:

    1. the psychological benefits of having more supporters in the stands (the “home turf” effect);
    2. the benefit of receiving automatic qualification to every event. The host nation is basically guaranteed to have at least one athlete qualify in every event. (The specifics vary by sport.) Given that these “automatic qualifiers” have to compete against athletes who qualify based on their strong performance in non-Olympic international competition, the impact of this factor is likely small, but there are lots of small events and every extra entry increases the probability of a ‘fluke’ victory.
    3. the pre-competition enthusiasm for the games leading to more national support and interest in Olympic competition.

    (Note that some authors dispute whether a host nation benefit exists at all.)

  3. The new sport effect. The host nation is able to introduce up to six new sports to the Olympics. Given that these have to be sports that are not already included in the “standard” Olympic schedule, these sports are often (relatively) niche events that are particularly popular in the host nation. As such, the host nation is often expected to perform well in these sports.

    In the 2020 Olympics (actually held in 2021), Japan introduced several sports, including Karate, in which Japanese athletes won five of twelve possible gold medals, and baseball/softball, in which the Japanese won both gold medals. The 2024 Paris Olympics introduced the (seemingly short-lived) Olympic Breaking competition to mixed reviews. The 2028 Olympics are (re-)introducing baseball/softball, cricket, flag football, lacrosse, and squash. While the US is not known as a cricket powerhouse, it is expected to be quite strong in the other sports.
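
To preview how the three effects above might eventually fit together, here is a hedged sketch with entirely invented placeholder numbers, under an additivity-plus-independence assumption that your own analysis is free to revise:

```r
# All numbers below are invented placeholders for illustration only;
# your analysis will produce its own estimates and confidence intervals
baseline_est <- 40; baseline_half <- 4  # Team USA baseline golds (95% CI half-width)
host_est     <- 5;  host_half     <- 3  # hypothetical host-nation bump
newsport_est <- 3;  newsport_half <- 2  # hypothetical new-sport bump

# Under an additive model, the point forecast is the sum of the effects
point <- baseline_est + host_est + newsport_est

# If the three estimates were independent, their CI half-widths would
# combine in quadrature rather than by simple addition
half <- sqrt(baseline_half^2 + host_half^2 + newsport_half^2)

round(c(lower = point - half, point = point, upper = point + half), 1)
```

With the invented numbers above, this yields a point forecast of 48 golds with an interval of roughly 42.6 to 53.4; the interesting statistical work lies in estimating each effect and its uncertainty from data.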

To determine the medal impact of these three factors, you will perform a series of statistical analyses intended to estimate confidence intervals for each effect. Before we can do so, however, it’s time to acquire some data.

Data Acquisition

The Olympedia website contains detailed records of all modern Olympic games. You will use web-scraping techniques to extract the following information:

  • Modern Olympic Games: for both modern5 summer and winter Olympics, extract the following from https://www.olympedia.org/editions:

    • The year in which the Olympics were held

    • the Olympiad ID for that games (e.g., Chamonix 1924 was the 1st Winter Olympiad while Paris 1924 was the 8th Summer Olympiad)

      Note that Olympiads are traditionally written using Roman numerals. The following R snippet may be useful in translating these back to more standard (decimal) integers.

      olympiad_ids <- c("I", "III", "IV", "V", "VI", "XII", "XXV")
      roman_to_integer <- function(x){
          # Convert Roman-numeral strings to integers, keeping the
          # original numerals as names
          setNames(as.integer(as.roman(x)), x)
      }
      roman_to_integer(olympiad_ids)
        I III  IV   V  VI XII XXV 
        1   3   4   5   6  12  25 
    • The Olympedia internal ID for that games; Olympedia refers to these as “editions”.

    • The host nation for that games. Do this by parsing the URL associated with the flag in the Country column. E.g., the Greek flag used to identify the 1896 Olympics is loaded from the URL https://olympedia-flags.s3.eu-central-1.amazonaws.com/GRE.png; by parsing the three letter country code (GRE), we can identify this as Greece.

      Note that, since you want to extract some information from the URLs in the table, not just the text of the table, you will need to use more than just rvest::html_table().

  • For each games, identify the table of medal disciplines (sports) at that games. E.g., at the 1896 Athina Olympics, we can see that 10 medal disciplines were represented. For each of these, extract a link to the relevant page, e.g., https://www.olympedia.org/editions/1/sports/ATH for Athletics.

  • For each medal discipline + edition combination, extract the medal table.
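
For the host-nation step, the three-letter code can be pulled out of the flag URL with a regular expression. A minimal sketch using stringr (the variable names here are illustrative, not required):

```r
library(stringr)

# Example flag URL of the form Olympedia uses (from the text above)
flag_url <- "https://olympedia-flags.s3.eu-central-1.amazonaws.com/GRE.png"

# The country code is the capture group immediately before the .png extension
country_code <- str_extract(flag_url, "([A-Z]{3})\\.png$", group = 1)
country_code
#> [1] "GRE"
```

Note that the group argument to str_extract() requires stringr 1.5.0 or later; on older versions, str_match() provides the same capture-group access.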

NB: Olympedia is a free resource, maintained through the efforts of a small group of dedicated volunteers, and we do not want to abuse it. One way to avoid abusing their servers is to download each page only once: we can save a copy of the page locally the first time we access it and use that copy instead of accessing the URL repeatedly. The following code will do this for you:

library(rvest)

slow_download <- purrr::slowly(download.file, purrr::rate_delay(pause=5))

read_olympedia <- function(url, ...){
    if(stringr::str_starts(url, "/")) url <- stringr::str_remove(url, "/")
    cache_dir <- fs::path("data", "mp04", "olympedia_cache")
    if(!fs::dir_exists(cache_dir)) fs::dir_create(cache_dir)

    cache_path <- fs::path(cache_dir, 
                           url |> as.character() |> stringr::str_replace_all("/", "_"))
    if(!fs::file_exists(cache_path)){
        src_url <- paste0("https://www.olympedia.org/", url)
        slow_download(src_url, cache_path, mode="wb")
    }
    rvest::read_html(cache_path)
}

Use this code to avoid taxing Olympedia’s servers. If you access the site too often or too quickly, you will be automatically blocked for a short period. If this happens to you, take a short break and come back to it later. To use this function, simply pass the path part of the URL and the hostname will be appended automatically; e.g., to read the contents of https://www.olympedia.org/editions into R, you would simply run read_olympedia("editions"). Note that this function is designed to be slow the first time you access a site (pause=5 indicating that it won’t hit Olympedia more than once every 5 seconds) but it will be faster each subsequent time since the HTML is now cached on your computer.

Task 1: Acquire Olympedia Data

Using the provided function above, download all necessary Olympedia data. Once you have acquired all of the relevant pages and parsed them, combine your results into a single tidy table. With this data in hand, you should have everything you need to create a table like this:

Olympic Medal Results by Discipline
Games Discipline Competitor
Medal Count
Gold Silver Bronze Total
Summer Games of the I Olympiad (1896 Athina) (Details) Shooting (Details) United States 2 1 0 3
Summer Games of the II Olympiad (1900 Paris) (Details) Golf (Details) United States 2 0 0 2
Summer Games of the II Olympiad (1900 Paris) (Details) Swimming (Details) Denmark 0 0 1 1
Summer Games of the III Olympiad (1904 St. Louis) (Details) Artistic Gymnastics (Details) Germany 1 1 2 4
Summer Games of the III Olympiad (1904 St. Louis) (Details) Golf (Details) Canada 1 0 0 1
Data from Olympedia. 5 randomly selected rows - not the full data set.

Note that you do not need to actually make a table like this - just check that you have everything you would need to make this table.

Hint: See the source code for this Mini-Project on GitHub to see how I formatted this table. You do not need to use the same column names I did, but you are always encouraged to use syntactic (valid R) names to make your code easier to read and write.

Some general tips:

  • This is a difficult scraping exercise, but don’t go overboard. I did this whole thing in about 80 lines of clear well-formatted code (not including read_olympedia above). If you find yourself needing more than 200 lines of code to read everything, reconsider your strategy.

  • Make sure to DRY your code. Write a function to parse a page and extract the relevant data; since the pages are formatted consistently, this will be far easier than copy-and-pasting the same code over and over.

    If your function takes in a URL and parses the results, but is not vectorized, remember that you can use the functional idiom:

    library(tidyverse)
    ... |>
      mutate(results = map(url, read_results_page))

    To apply read_results_page to each URL in sequence. If the result of read_results_page is a tidy data.frame, the results column will be a “list column”, which is not something we have used in class. Use tidyr::unnest() to convert it to a more standard tabular format. So the whole idiom would be something like:

    library(tidyverse)
    ... |>
      mutate(results = map(url, read_results_page)) |>
      unnest(results)
  • If you want to pull out a specific column of a table, consider using a CSS selector like tr > td:nth-of-type(3): this will pull the third td within each tr. Because each tr is a table row and each td is a table datum or cell, this gives the third cell of each row, i.e., the third column.

  • For some pages, you might want to get the first table after a section header with a specific text. This cannot be done with simple CSS selectors, but an xpath selector will suffice here. E.g.,

    ... |> 
       html_element(xpath = "//h1[normalize-space() = 'Section Header']/following-sibling::table[3]") |>
       html_table()

    This will look for an h1 header element whose text reads Section Header; once it finds it, it will get the third table that follows and convert it to a table using our trusty html_table() function. Modify this example as needed to work with Olympedia.
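
Both selector styles can be rehearsed on a toy document before pointing them at Olympedia. A small sketch (the HTML below is invented purely for illustration):

```r
library(rvest)

# Invented page: a header followed by two tables
page <- minimal_html('
  <h1>Section Header</h1>
  <table><tr><td>skip</td></tr></table>
  <table>
    <tr><td>a</td><td>b</td><td>c</td></tr>
    <tr><td>d</td><td>e</td><td>f</td></tr>
  </table>
')

# XPath finds the second table after the matching header;
# the CSS selector then pulls the third cell of each of its rows
third_col <- page |>
  html_element(xpath = "//h1[normalize-space() = 'Section Header']/following-sibling::table[2]") |>
  html_elements("tr > td:nth-of-type(3)") |>
  html_text()

third_col
#> [1] "c" "f"
```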

Use Olympedia Only

You must use Olympedia as your data source for this mini-project. While Olympic results are found on many websites, including Wikipedia, the point of this Mini-Project is to practice scraping a new website you have not previously used. (We discuss scraping tables from Wikipedia in class.)

Submissions that acquire data from alternate sources will be penalized heavily.

Task 2: Standardize and Clean Countries

Note that Olympedia often uses a three-letter code for each country, but these do not always correspond to the three-letter ISO country codes, e.g., GRC vs. GRE for Greece. You will need to identify and fix these mismatches (see below). The replace_values() function from (recent versions of) the dplyr package will be helpful to do these replacements and the anti_join() operator will be useful in finding where they are needed.

Furthermore, at least one code, URS for the Soviet Union (USSR), is no longer applicable. For simplicity, we are going to replace URS with RUS (Russia) to ensure continuity of our analysis, though this choice is not geopolitically un-fraught.

Given the spatial structure of this data, we will want to augment it with sf data for geospatial visualization. The countries() function from the necountries package returns an sf data frame of 199 countries that contains their two- and three-letter country codes, their names, and a geometry column that can be used to visualize them on a map.

Join your Olympedia data to this object. (This is where you will start running into pain from the mismatch between Olympedia’s three-letter codes and the official ISO codes.) Design your join to identify country codes not present in the countries() data frame and, where possible, perform ‘spot fixes’ to standardize these onto countries present in the data set. (E.g., the URS to RUS standardization described above.)
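
The find-and-spot-fix workflow might look like the following sketch, using toy tables with invented codes; case_match() is shown here as one option if replace_values() is not available in your dplyr version:

```r
library(dplyr)

# Toy stand-ins for the scraped medal table and the countries() data
medals <- tibble(code = c("GRE", "URS", "USA"), gold = c(2, 10, 40))
iso    <- tibble(iso3 = c("GRC", "RUS", "USA"),
                 name = c("Greece", "Russia", "United States"))

# anti_join() reveals Olympedia codes with no ISO match
unmatched <- anti_join(medals, iso, by = join_by(code == iso3))

# Spot-fix the known mismatches, then join cleanly
medals_fixed <- medals |>
  mutate(code = case_match(code,
                           "GRE" ~ "GRC",
                           "URS" ~ "RUS",
                           .default = code)) |>
  left_join(iso, by = join_by(code == iso3))
```

Here unmatched flags GRE and URS as needing fixes; after the recode, every row of medals_fixed gains a country name and a clean join key.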

Once your data is standardized, create at least one geospatial data visualization of this data to highlight an interesting fact.

Data Integration and Exploration

Now that you have acquired your data, it is time to explore it and further familiarize yourself with it. For this mini-project, you are on your own for EDA. Explore your data, find and note any irregularities (using them to improve your data import if appropriate) and to get baseline facts relevant to your analysis.

Task 3: Exploratory Data Analysis

Perform EDA on your data set. Document your results in (at least) 12 findings. These should include at least:

  • one inline value (scalar statistic);
  • one data visualization; and
  • one table

for each of the three effects we plan to study: US baseline performance, host country effect, new sport effect. You can choose the topic and presentation of your final three EDA findings however you wish.

Statistical Analysis

While Exploratory Data Analysis (EDA) can inspire questions, it remains useful to perform formal statistical inference to assess whether observed differences are larger than can be explained by pure randomness. The infer package can be used to neatly integrate \(t\)-tests and binomial proportion tests into a tidyverse workflow.

We will use the t_test() and prop_test() functions from this package to complete our analyses. These can be used as demonstrated below.

  • Example 01: Suppose we want to test whether Gentoo penguins are heavier than Adelie penguins on average. Since we have a numerical response and two different groups, this is a two-sample \(t\) test:

    library(infer)
    library(tidyverse)
    penguins_ok <- penguins |>
        drop_na() |>
        # Things will work better if we ensure species is a character vector 
        # and not a factor vector
        # 
        # You probably don't need to copy this step since your import is unlikely
        # to 'accidentally' make a factor
        mutate(species = as.character(species)) 
    
    penguins_ok |>
        filter(species != "Chinstrap") |>
        t_test(body_mass ~ species, 
               order = c("Adelie", "Gentoo"))
    # A tibble: 1 × 7
      statistic  t_df  p_value alternative estimate lower_ci upper_ci
          <dbl> <dbl>    <dbl> <chr>          <dbl>    <dbl>    <dbl>
    1     -23.3  242. 1.22e-63 two.sided     -1386.   -1504.   -1269.

    From this, we see that Adelie penguins are about 1386 grams lighter on average than Gentoos and the \(p\)-value is far less than 0.01, so the difference is unlikely to be solely due to sampling variability. Furthermore, the lower_ci and upper_ci columns give us a confidence interval for the difference in means (95% by default, though this can be adjusted by changing the conf_level argument to t_test()).

    Here, we perform the test by specifying the response (quantity of interest) on the left-hand side of the ~ and the explanatory variable (that which might meaningfully predict a difference between groups) on the right-hand side of the ~. If we swap the order (or don’t provide it), the sign of the estimated difference may differ, but the actual \(p\)-value won’t change for a two-sided test (as done here).

    We can extend this analysis to perform it separately for each of three years using a bit of nested data trickery. The mechanics of this are more advanced, but you should be able to extend this to your own analysis by just changing variable names:

    penguins_ok |>
        filter(species != "Chinstrap") |>
        group_by(year) |>
        nest() |>
        mutate(p_value_adelie_gentoo = map(data, \(d) t_test(d,
                                                             body_mass ~ species,
                                                             order = c("Adelie", "Gentoo")))) |>
        unnest(everything())
    # A tibble: 265 × 15
    # Groups:   year [3]
        year species island  bill_len bill_dep flipper_len body_mass sex   statistic
       <int> <chr>   <fct>      <dbl>    <dbl>       <int>     <int> <fct>     <dbl>
     1  2007 Adelie  Torger…     39.1     18.7         181      3750 male      -11.6
     2  2007 Adelie  Torger…     39.5     17.4         186      3800 fema…     -11.6
     3  2007 Adelie  Torger…     40.3     18           195      3250 fema…     -11.6
     4  2007 Adelie  Torger…     36.7     19.3         193      3450 fema…     -11.6
     5  2007 Adelie  Torger…     39.3     20.6         190      3650 male      -11.6
     6  2007 Adelie  Torger…     38.9     17.8         181      3625 fema…     -11.6
     7  2007 Adelie  Torger…     39.2     19.6         195      4675 male      -11.6
     8  2007 Adelie  Torger…     41.1     17.6         182      3200 fema…     -11.6
     9  2007 Adelie  Torger…     38.6     21.2         191      3800 male      -11.6
    10  2007 Adelie  Torger…     34.6     21.1         198      4400 male      -11.6
    # ℹ 255 more rows
    # ℹ 6 more variables: t_df <dbl>, p_value <dbl>, alternative <chr>,
    #   estimate <dbl>, lower_ci <dbl>, upper_ci <dbl>

    Here, we see that our result has columns for the \(t\) statistic, the \(p\)-value, the estimated difference, and many other useful inferential quantities.

  • Example 02: In other contexts, we may have a categorical response. In this case, we should not perform a \(t\)-test on means; instead, we want to use a binomial proportion test via the prop_test() function. In the simplest case, we might want to examine the fraction of sampled penguins which are female:

    penguins_ok |> 
      mutate(is_female = (sex == "female")) |> 
      prop_test(is_female ~ NULL)
    No `p` argument was hypothesized, so the test will assume a null hypothesis `p = .5`.
    # A tibble: 1 × 6
      statistic chisq_df p_value alternative lower_ci upper_ci
          <dbl>    <int>   <dbl> <chr>          <dbl>    <dbl>
    1    0.0120        1   0.913 two.sided      0.441    0.550

    Note that the statistic here is the test statistic, not the sample proportion. Examining the confidence interval, we see quite clearly that the fraction of female penguins is basically 50% (as we would expect). Here, the ~ NULL of the specification indicates that we aren’t looking at any potential covariates.

    We can extend this to the two-sample setting to see if the frequency differs between two groups. For instance, if we want to see whether Adelie penguins are more likely to be female than Gentoos:

    library(infer)
    library(tidyverse)
    penguins_ok |>
       filter(species != "Chinstrap") |>
       mutate(is_female = (sex == "female")) |> 
       prop_test(is_female ~ species, 
                 order = c("Adelie", "Gentoo"))
    # A tibble: 1 × 6
      statistic chisq_df p_value alternative lower_ci upper_ci
          <dbl>    <dbl>   <dbl> <chr>          <dbl>    <dbl>
    1   0.00650        1   0.936 two.sided     -0.116    0.141

    As expected, we see that there is basically no evidence that the fraction of female penguins differs by species. Note the key difference between the t_test() and the prop_test(): in t_test() the response is a numeric variable and we are testing the mean of that distribution; in a prop_test() the response is a Boolean (TRUE/FALSE) value and we are testing the probability of observing a TRUE.6

    We can also use this with a derived quantity. For instance, if we want to see whether Gentoos are more likely to weigh over 4000 grams than non-Gentoos:

    penguins |>
        mutate(is_gentoo = (species == "Gentoo"),
               over_4k = body_mass > 4000) |>
        prop_test(over_4k ~ is_gentoo, 
                  order = c("TRUE", "FALSE"))
    # A tibble: 1 × 6
      statistic chisq_df  p_value alternative lower_ci upper_ci
          <dbl>    <dbl>    <dbl> <chr>          <dbl>    <dbl>
    1      181.        1 3.50e-41 two.sided      0.699    0.828

    Here, we see that there is strong statistical evidence (in the form of a tiny \(p\)-value) that Gentoo penguins are more likely than other species to weigh over 4000 grams. While this interface does not report a point estimate of the difference in proportions, the lower_ci and upper_ci columns still give us a confidence interval for it.

    We can also modify this to be a one-sided test:

    penguins |>
        mutate(is_gentoo = (species == "Gentoo"),
               over_4k = body_mass > 4000) |>
        prop_test(over_4k ~ is_gentoo, 
                  alternative = "greater",
                  order = c("TRUE", "FALSE"))
    # A tibble: 1 × 6
      statistic chisq_df  p_value alternative lower_ci upper_ci
          <dbl>    <dbl>    <dbl> <chr>          <dbl>    <dbl>
    1      181.        1 1.75e-41 greater        0.709        1

    Finally, recall that these values can be pulled out of the test result and used in the body of your text via the pull() function from the dplyr package.
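    As a minimal sketch of this pattern (reusing the Example 01 test above), pulling single values out of a t_test() result looks like:

```r
library(infer)
library(tidyverse)

# Re-run the two-sample t-test from Example 01
ttest_result <- penguins |>
    drop_na() |>
    filter(species != "Chinstrap") |>
    mutate(species = as.character(species)) |>
    t_test(body_mass ~ species, order = c("Adelie", "Gentoo"))

# Extract single values for use in inline text
p_val    <- ttest_result |> pull(p_value)
estimate <- ttest_result |> pull(estimate)
```

    In a Quarto document, these can then be referenced inline, e.g. with `r round(estimate)`.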

Task 4: Statistical Inference

Using the tools of the infer package, construct confidence intervals for the following quantities of interest:

  1. The average number of gold medals won by the US at a summer Olympics.
  2. The percent increase in gold medals won by the host country vs their performance in the prior same-season Olympics. (E.g., make sure to compare Russia’s performance in the 2014 Sochi Olympics to the 2010 Vancouver Olympics, not the 2012 London Olympics.) Here, we use the percent increase over the prior Olympics to account for the fact that some countries have a higher ‘baseline’ medal expectation than others.
  3. The probability that the host country wins a gold medal in a new sport (defined as a sport that did not appear in the previous Olympics).

Repeat your analysis for silver medals, bronze medals, and total medals.

For each of these, you may wish to focus on a relatively more recent subset of Olympics (say 1960 and later) to ensure your data is relevant to the 2028 Olympics. This is an instance of a common tension in the analysis of data observed at different points in time: you want to focus on recent data points as they are more similar to the future, but you don’t want to focus in too much and have insufficient data.
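As a sketch of the kind of infer pipeline Task 4 calls for, consider quantity 1. Here us_golds and gold_medals are hypothetical names, and the medal counts are placeholder values, not real data; substitute the table you actually scraped:

```r
library(infer)
library(tidyverse)

# Placeholder data: one row per Summer Olympics (values are NOT real)
us_golds <- tibble(
    year        = seq(1960, 2024, by = 4),
    gold_medals = c(34, 36, 45, 33, 34, 40, 36, 36, 37,
                    44, 36, 36, 46, 46, 39, 40, 42)
)

set.seed(9750)
gold_ci <- us_golds |>
    specify(response = gold_medals) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = "mean") |>
    get_confidence_interval(level = 0.95)

gold_ci
```

The same specify()/generate()/calculate()/get_confidence_interval() skeleton adapts to the other quantities by changing the response variable and the stat argument.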

Task 5: Combining Confidence Intervals to Forecast US Medal Counts

Combining confidence intervals is a bit of an art, especially when the intervals are not independent (as they are not here). We are going to make a working assumption that the intervals are independent, which will make it a bit easier to combine things.

We can forecast Team USA’s 2028 Olympic medal count using the following equation:

\[ \text{Medals-2028} \approx \text{USA Baseline} \times \text{Host Nation Factor} + 22 \times \text{Probability Host Nation Wins} \]

where we use the fact that there will be 22 more medal events in 2028 than there were in 2024.

To get a range of predictions for the expression above, we will use a simple Monte Carlo sample. That is, rather than doing the calculations exactly, we will simply simulate them many times over and use the results of that simulation to get an interval on the derived quantity.

For example, if I have two parameters \(\alpha, \beta\) with 95% confidence intervals \([2, 5]\) and \([-1, 3]\) respectively, I can get a confidence interval for their product as follows:7

alpha_ci_lower <- 2
alpha_ci_upper <- 5
beta_ci_lower <- -1
beta_ci_upper <- 3

alpha_ci_mean <- (alpha_ci_lower + alpha_ci_upper) / 2
beta_ci_mean  <- (beta_ci_lower + beta_ci_upper) / 2

alpha_ci_sd <- (alpha_ci_upper - alpha_ci_mean) / qnorm(1 - 0.05/2)
beta_ci_sd  <- (beta_ci_upper - beta_ci_mean) / qnorm(1 - 0.05/2)

As a sanity check, we can recover the confidence interval for \(\alpha\):

qnorm(c(0.025, 0.975), mean=alpha_ci_mean, sd=alpha_ci_sd)
[1] 2 5

and similarly for \(\beta\). To get a CI for \(\alpha\beta\), we then can simply proceed as:

alpha_samples <- rnorm(1e7, mean=alpha_ci_mean, sd=alpha_ci_sd)
beta_samples  <- rnorm(1e7, mean=beta_ci_mean, sd=beta_ci_sd)

alphabeta_samples <- alpha_samples * beta_samples

alphabeta_interval <- quantile(alphabeta_samples, c(0.025, 0.975))

which gives an interval of [-3.51, 11.36]. It’s not obvious how we could have computed this analytically.

Adapt this approach to get 95% confidence intervals on Team USA’s medal counts in LA 2028.
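Concretely, the recipe above extends to the three-factor model as follows. The interval endpoints here are placeholders, not results; substitute the intervals you computed in Task 4. Note also that a normal approximation can produce probabilities outside [0, 1], which you may wish to truncate with pmax()/pmin():

```r
set.seed(2028)
n_sims <- 1e6

# Turn a 95% CI into draws from the implied normal distribution
ci_to_samples <- function(lower, upper, n = n_sims) {
    mu    <- (lower + upper) / 2
    sigma <- (upper - mu) / qnorm(1 - 0.05 / 2)
    rnorm(n, mean = mu, sd = sigma)
}

# Placeholder intervals -- replace with your Task 4 results
baseline_samples <- ci_to_samples(35, 45)    # USA baseline gold count
host_samples     <- ci_to_samples(1.1, 1.4)  # host nation factor
p_host_samples   <- ci_to_samples(0.4, 0.8)  # P(host wins a new-sport gold)

forecast_samples <- baseline_samples * host_samples + 22 * p_host_samples
quantile(forecast_samples, c(0.025, 0.975))
```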

Final Deliverable: Team USA Fundraising Fact Sheet

With all of your analyses in hand, it is time to start promoting the LA 2028 Olympics. Specifically, write a short brief for the LA 2028 Organizing Committee forecasting Team USA’s performance that the fundraising team can use to start soliciting donations from potential sponsors. In particular, you want to help your fundraising team speak confidently about how well you expect Team USA to perform so that sponsors feel comfortable giving to the cause. (Corporations always want to pay more to be associated with winners.)

Task 6: Write Your Final Deliverable

Your forecast report should be no more than two pages in length and should describe your expected Team USA performance, using all of the work you have done above. Give your ‘three-factor’ model, interpret the terms individually, and then put them together for your final prediction. Be sure to provide suitable background for each factor, to write at a level that could be easily interpreted by a marketing executive at a major corporation, and to include well-formatted graphics and tables as appropriate.

AI Usage Statement

At the end of your report, you must include a description of the extent to which you used Generative AI tools to complete the mini-project. This should be a one paragraph section clearly delineated using a collapsible Quarto “Callout Note”. Failure to include an AI disclosure will result in an automatic 25% penalty.

E.g.,

No Generative AI tools were used to complete this mini-project.

or

GitHub Co-Pilot Pro was used via RStudio integration while completing this project. No other generative AI tools were used.

or

ChatGPT was used to help write the code in this project, but all non-code text was generated without the use of any Generative AI tools. Additionally, ChatGPT was used to provide additional background information on the topic and to brainstorm ideas for the final open-ended prompt.

Recall that Generative AI may not be used to write or edit any non-code text in this course.

These blocks should be created using the following syntax:


::: {.callout-note title="AI Usage Statement" collapse="true"}

Your text goes here. 

:::

Make sure to use this specific type of callout (.callout-note), title, and collapse="true" setting.

Please contact the instructor if you have any questions about appropriate AI usage in this course.

Extra Credit Opportunities

There are optional Extra Credit Opportunities where extra points can be awarded for specific additional tasks in this mini-project. The amount of the extra credit is typically not proportional to the work required to complete these tasks, but I provide these for students who want to dive deeper into this project and develop additional data analysis skills not covered in the main part of this mini-project.

For this mini-project, no more than 7 total points of extra credit may be awarded. Even with extra credit, your grade on this mini-project cannot exceed 80 points total.

Extra Credit Opportunity #01: Relaxing Distributional Assumptions

The prop_test() and t_test() are classical normal-theory based statistical tests. This makes them simple to implement, but at the cost of additional assumptions. Assumptions are not necessarily a bad thing - and most statistical procedures are robust to at least mild violations of assumptions - but for two points of extra credit, we can explore ways to relax these assumptions.

Review the infer documentation and implement bootstrap based alternatives to prop_test() and t_test() in your analysis pipeline. Compare your parametric and bootstrap based forecasts. Chapters 7-10 of the Statistical Inference via Data Science book will give additional useful background.
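As a starting point (not a complete solution), a bootstrap analogue of the two-sample t_test() from Example 01, built from infer’s specify()/generate()/calculate() verbs, might look like:

```r
library(infer)
library(tidyverse)

penguins_ok <- penguins |>
    drop_na() |>
    mutate(species = as.character(species))

set.seed(9750)
boot_ci <- penguins_ok |>
    filter(species != "Chinstrap") |>
    specify(body_mass ~ species) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = "diff in means", order = c("Adelie", "Gentoo")) |>
    get_confidence_interval(level = 0.95)

boot_ci
```

Unlike the normal-theory interval, this percentile interval makes no distributional assumption about body mass within each species.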

Extra Credit Opportunity #02: Model Improvement

For up to two points of extra credit, add an additional factor to your forecast system and compare your forecasts with and without that extra term. Be sure to do the full work-up (some EDA, a statistical test to get a confidence interval, inclusion in the forecast model, propagation through to a final confidence interval). Make sure to justify your additional factor and to explain why you think it will make your model more accurate.

Extra Credit Opportunity #03: Model Validation

Finally, for up to three points, evaluate the performance of your forecast model retrospectively. E.g., if you had used the same model for Japan in 2020, how well would it have predicted medal outcomes for those Olympics? There are many ways you can go with this, but at a minimum, you should

  1. Make at least 10 additional sets of predictions based on historical data
  2. Compare the accuracy of those predictions with the realized values for each medal category
  3. Use that error estimate to put some sort of margin of error on your 2028 medal predictions.
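One minimal way to connect steps 2 and 3, with entirely hypothetical predicted and realized values standing in for your backtest output:

```r
library(tidyverse)

# Hypothetical backtest results -- substitute your model's actual output
backtest <- tibble(
    olympics  = c("Beijing 2008", "London 2012", "Rio 2016", "Tokyo 2020"),
    predicted = c(40, 44, 42, 38),   # placeholder forecasts
    actual    = c(38, 45, 44, 40)    # placeholder realized gold counts
)

# Root-mean-square error gives a rough +/- margin for the 2028 forecast
rmse <- backtest |>
    mutate(error = actual - predicted) |>
    summarize(rmse = sqrt(mean(error^2))) |>
    pull(rmse)

rmse
```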

This work ©2026 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.

Footnotes

  1. This is the level of “ChatGPT-level” prose: without obvious flaws, but lacking the style and elegance associated with truly high-quality writing.↩︎

  2. Throughout this section, replace YOUR_GITHUB_ID with your GitHub ID from Mini-Project #00. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

  3. The degree to which LA is itself preparing for the 2028 Olympics is rather more complicated.↩︎

  4. For the legal and societal preconditions of this trend, see this article from the Boston College Law Review.↩︎

  5. For this assignment, we are of course only looking at the modern Olympics that resumed in 1896. We will not look at the ancient (pre-fall of the Roman Empire) Olympics. Note that, while the Summer Olympics resumed in 1896, the first modern Winter Olympics were only held starting in 1924 and the Winter Olympics have only been “off-cycle” from the Summer Olympics starting with the 1994 Lillehammer games. Additionally, note that certain Olympics were not held during WWI and WWII. Finally, we will only focus on “mainline” Summer and Winter Olympics and will ignore the 1956 Equestrian Olympics and similar events.↩︎

  6. Of course, if you have taken STA 9715, you know that there is (essentially) only one univariate Boolean distribution (the Bernoulli) and that the probability of observing a success is equal to the mean of the distribution (\(X \sim \text{Bernoulli} \implies \E[X] = \P(X = 1)\), and all that). This might lead you to ask whether there really is a meaningful difference between a \(t\)-test and a proportion test. These are good questions to ask your STA 9719 professor.↩︎

  7. You might ask whether we want a confidence interval or a prediction interval here. You are right to do so, but we are omitting this distinction as it’s a far less odious problem than the other statistical sins we are already committing here.↩︎