STA 9750 Mini-Project #04: TBD

\[\newcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}}\]

Due Dates

  • Released to Students: 2026-04-23
  • Initial Submission: 2026-05-15 11:59pm ET on GitHub and Brightspace
  • Peer Feedback:
    • Peer Feedback Assigned: 2026-05-18 on GitHub
    • Peer Feedback Due: 2026-05-24 11:59pm ET on GitHub

Estimated Time to Complete: 13-15 Hours

Estimated Time for Peer Feedback: 1 Hour


Introduction

Welcome to Mini-Project #04! TBD

This project will bring together all of the skills you have developed in this course:

  • Accessing data from the web using HTTP requests (httr2)
  • Extracting data from HTML and cleaning it (rvest, stringr)
  • Joining together data from different tables (multi-table dplyr)
  • “Pivoting” and preparing data for analysis (single-table dplyr)
  • Exploring and visualizing data (ggplot2)
  • Statistical analysis and inference (infer or stats)
  • Communicating your findings using web-based reproducible research tools (Quarto)
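
To give a very rough sense of how these pieces fit together, here is a minimal, purely illustrative sketch of the first stages of such a pipeline. The URL below is a hypothetical placeholder, not part of this assignment, and your actual data source and cleaning steps will differ:

    library(httr2)   # HTTP requests
    library(rvest)   # HTML extraction
    library(dplyr)   # data cleaning and manipulation

    # Hypothetical placeholder URL; substitute the real source for your analysis
    resp <- request("https://example.com/some-data-page") |>
        req_perform()

    # Parse the HTML response and pull out any tables on the page
    page_tables <- resp |>
        resp_body_html() |>
        html_elements("table") |>
        html_table()

    # Inspect the first extracted table before cleaning, joining, and visualizing
    glimpse(page_tables[[1]])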

TBD

Note that, compared to previous mini-projects, the scope of this project is a bit smaller: in light of this, and the more advanced skills you have spent the past three months developing, this mini-project should be the least time-consuming of the course. (It is still likely to be the most difficult, simply because you are responsible for more steps of the data analysis than in any prior project.) At this point in the course, you should be spending the majority of your out-of-class hours on your Course Project.

The Final Mini-Project

This mini-project completes our whirlwind tour of several different forms of data-driven writing:

TBD

You have applied your skills to a wide range of topics, including TBD.

There are, of course, many other ways that data can be used to generate and communicate insights and many other topics where your skills can be applied, but hopefully this “hit parade” has given you a sense of just how widely your new skills can take you. In each of these domains, you have seen how sophisticated and thoughtful analysis - not simply computing a single mean or regression coefficient - has allowed deeper understanding of the complexities of real world problems. You have learned to think beyond the simple binary “correct/incorrect” of an analysis and to see that even superficially simple questions - “is this true?”, “what is the average?”, “is there a correlation?” - can lead down deep and rewarding rabbit holes of complexity. But those rabbit holes are where your real power as a data analyst can be found.

The tools of quantitative analysis and communication you have developed in this course can be used in essentially infinite contexts (we have only scratched the surface), and I’m excited to see what you do in the remainder of this course, in your remaining time at Baruch, and in your future careers.

Student Responsibilities

Recall our basic analytic workflow and table of student responsibilities:

  • Data Ingest and Cleaning: Given a data source, read it into R and transform it to a reasonably useful and standardized (‘tidy’) format.
  • Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
  • Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
  • Data Visualization: Generate data visualizations to spur insights not attainable from point statistics.
  • Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.

In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you become responsible for more of these stages. Specifically, you are responsible for the following stages of each mini-project:

Students’ Responsibilities in Mini-Project Analyses
[Table: for each mini-project, checkmarks indicate which of the four stages (Ingest and Cleaning; Combination and Alignment; Descriptive Statistical Analysis; Visualization) students complete themselves. Mini-Projects #02 and #03 carry partial (½) responsibility for at least one stage; by Mini-Project #04, students are responsible for all four stages.]

In this mini-project, you are in charge of the whole pipeline, from data acquisition to statistical inference. The rubric below evaluates your work on all aspects of this project.

Rubric

STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. The following basic rubric will be used for all mini-projects:

Written Communication
  • Excellent (9-10): Report is very well-written and flows naturally. Motivation for key steps is clearly explained to the reader without excessive detail. Key findings are highlighted and appropriately given sufficient context, including reference to related work where appropriate. Report has no grammatical or writing issues.1
  • Great (7-8): Writing is accessible and flows naturally. Key findings are highlighted and clearly explained, but lack suitable motivation and context. Report has no grammatical or writing issues.
  • Good (5-6): Key findings are present but insufficiently highlighted or unclearly explained. Writing is intelligible, but has some grammatical errors.
  • Adequate (3-4): Key findings are difficult to discern.
  • Needs Improvement (1-2): Report exhibits significant weakness in written communication. Key points are nearly impossible to identify.

Project Skeleton
  • Excellent (9-10): Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are especially insightful and creative.
  • Great (7-8): Code completes all instructor-provided tasks satisfactorily. Responses to open-ended tasks are insightful, creative, and do not have any minor flaws.
  • Good (5-6): Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are solid and without serious flaws.
  • Adequate (3-4): Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are acceptable, but have at least one serious flaw.
  • Needs Improvement (1-2): Responses to three or more instructor-provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are seriously lacking.

Tables & Document Presentation
  • Excellent (9-10): Tables go beyond standard publication-quality formatting, using advanced features like color formatting, interactivity, or embedded visualization.
  • Great (7-8): Tables are well-formatted, with publication-quality selection of data to present, formatting of table contents (e.g., significant figures), and column names.
  • Good (5-6): Tables are well-formatted, but still have room for improvement in one of these categories: subsetting and selection of data to present, formatting of table contents (e.g., significant figures), column names.
  • Adequate (3-4): Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style.
  • Needs Improvement (1-2): Document is difficult to read due to distracting formatting choices. Unfiltered ‘data dump’ instead of curated table. Document is illegible at points.

Data Visualization
  • Excellent (9-10): Figures go beyond standard publication-quality formatting, using advanced features like animation, interactivity, or advanced plot types implemented in ggplot2 extension packages.
  • Great (7-8): Figures are ‘publication-quality,’ with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, captions, etc.
  • Good (5-6): Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in one to two ways.
  • Adequate (3-4): Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in three or more distinct ways.
  • Needs Improvement (1-2): Figures are suitable to support claims made, but are ‘exploratory-quality,’ reflecting zero-to-minimal effort to customize and ‘polish’ beyond ggplot2 defaults.

Exploratory Data Analysis
  • Excellent (9-10): Deep and ‘story-telling’ EDA identifying non-obvious patterns that are then used to drive further analysis in support of the project. All patterns and irregularities are noted and well characterized, demonstrating mastery and deep understanding of all data sets used.
  • Great (7-8): Meaningful ‘story-telling’ EDA identifying non-obvious patterns in the data. Major and minor patterns and irregularities are noted and well characterized at a level sufficient to achieve the goals of the analysis. EDA demonstrates clear understanding of all data sets used.
  • Good (5-6): Extensive EDA that thoroughly explores the data, but lacks narrative and does not deliver a meaningful ‘story’ to the reader. Obvious patterns or irregularities are noted and well characterized, but more subtle structure may be overlooked or not fully discussed. EDA demonstrates competence and basic understanding of the data sets used.
  • Adequate (3-4): Solid EDA that identifies major structure in the data, but does not fully explore all relevant structure. Obvious patterns or irregularities ignored or missed. EDA demonstrates familiarity with the high-level structure of the data sets used.
  • Needs Improvement (1-2): Minimal EDA, covering only standard summary statistics and providing limited insight into data patterns or irregularities. EDA fails to demonstrate familiarity with even the most basic properties of the data sets being analyzed.

Code Quality
  • Excellent (9-10): Code is (near) flawless. Intent is clear throughout and all code is efficient, clear, and fully idiomatic. Code passes all styler and lintr type analyses without issue.
  • Great (7-8): Comments give context and structure of the analysis, not simply defining functions used in a particular line. Intent is clear throughout, but code can be minorly improved in certain sections.
  • Good (5-6): Code has well-chosen variable names and basic comments. Intent is generally clear, though some sections may be messy and code may have serious clarity or efficiency issues.
  • Adequate (3-4): Code executes properly, but is difficult to read. Intent is generally clear and code is messy or inefficient.
  • Needs Improvement (1-2): Code fails to execute properly.

Data Preparation
  • Excellent (9-10): Data import is fully automated and efficient, taking care to only download from web sources if not available locally. All data cleaning steps are fully automated and robustly implemented, yielding a clean data set that can be widely used.
  • Great (7-8): Data is imported and prepared effectively, in an automated fashion with minimal hard-coding of URLs and file paths. Data cleaning is fully automated and sufficient to address all issues relevant to the analysis at hand.
  • Good (5-6): Data is imported and prepared effectively, though source and destination file names are hard-coded. Data cleaning is rather manual and hard-codes most transformations.
  • Adequate (3-4): Data is imported in a manner likely to have errors. Data cleaning is insufficient and fails to address clear problems.
  • Needs Improvement (1-2): Data is hard-coded and not imported from an external source.

Analysis and Findings
  • Excellent (9-10): Analysis demonstrates uncommon insight and quality, providing unexpected and subtle insights.
  • Great (7-8): Analysis is clear and convincing, leaving essentially no doubts about correctness.
  • Good (5-6): Analysis clearly appears to be correct and passes the “sniff test” for all findings, but a detailed review notes some questions remain unanswered.
  • Adequate (3-4): Analysis is not clearly flawed at any point and is likely to be within the right order of magnitude for all findings.
  • Needs Improvement (1-2): Analysis is clearly incorrect in at least one major finding, reporting clearly implausible results that are likely off by an order of magnitude or more.
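
The Code Quality row above refers to the styler and lintr packages. If you want to run similar checks on your own submission, a minimal sketch is below; recent versions of both packages handle .qmd files, but consult each package’s documentation if you run into trouble:

    # Optional pre-submission checks (install the packages first if needed):
    # install.packages(c("styler", "lintr"))
    styler::style_file("mp04.qmd")   # re-formats code chunks in place
    lintr::lint("mp04.qmd")          # reports potential issues without editing the file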

Note that the “Excellent” category for most elements applies only to truly exceptional “above-and-beyond” work. Most student submissions will likely fall in the “Good” to “Great” range.

For this mini-project, students are responsible for all elements of the analysis and will be evaluated on all rubric elements.

For this mini-project, no more than 4 total points of extra credit can be awarded. Opportunities for extra credit exist for students who go above and beyond the instructor-provided scaffolding. Specific opportunities for extra credit can be found below.

Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to

  1. further refine their skills;
  2. learn additional techniques that can be used in the final course project; and
  3. develop a more impressive professional portfolio.

Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!

Submission Instructions

After completing the analysis, write up your findings, showing all of your code, in a dynamic Quarto document and post it to your course repository. The qmd file should be named mp04.qmd (lower case!) so that the rendered document can be found at docs/mp04.html in your repository and will be served at the URL:2

https://YOUR_GITHUB_ID.github.io/STA9750-2026-SPRING/mp04.html

You can use the helper function mp_start available in the Course Helper Functions to create a file with the appropriate name and some meta-data already included. Do so by running the following command at the R Console:

source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_start(N=04)

After completing this mini-project, upload your rendered output and necessary ancillary files to GitHub to make sure your site works. The mp_submission_ready function in the Course Helper Functions can perform some of these checks automatically. You can run this function by running the following commands at the R Console:

source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_ready(N=04)

Once you confirm this website works (substituting your actual GitHub username, as provided to the professor in MP#00, for YOUR_GITHUB_ID), open a GitHub issue on the instructor’s repository to submit your completed work.

The easiest way to do so is by use of the mp_submission_create function in the Course Helper Functions, which can be used by running the following command at the R Console:

source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_create(N=04)

Alternatively, if you wish to submit manually, open a new issue at

https://github.com/michaelweylandt/STA9750-2026-SPRING/issues/new .

Title the issue STA 9750 YOUR_GITHUB_ID MiniProject #04 and fill in the following text for the issue:

Hi @michaelweylandt!

I've uploaded my work for MiniProject #**04** - check it out!

<https://<GITHUB_ID>.github.io/STA9750-2026-SPRING/mp04.html>

At various points before and after the submission deadline, the instructor will run some automated checks to ensure your submission has all necessary components. Please respond to any issues raised in a timely fashion, as failing to address them may lead to lower scores when your work is graded.

Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.

NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.

NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.

Mini-Project #04: TBD

Data Acquisition

TBD

Data Integration and Exploration

Statistical Analysis

While Exploratory Data Analysis (EDA) can inspire questions, it remains useful to perform formal statistical inference to assess whether observed differences are larger than can be explained by pure randomness. The infer package can be used to neatly integrate \(t\)-tests and binomial proportion tests into a tidyverse workflow.

We will use the t_test and prop_test functions from this package to complete our analyses. These can be used as demonstrated below.

  • Example 01: Suppose we want to test whether Gentoo penguins are heavier than Adelie penguins on average. Since we have a numerical response and two different groups, this is a two-sample \(t\) test:

    library(infer)
    library(tidyverse)
    penguins_ok <- penguins |>
        drop_na() |>
        # Things will work better if we ensure species is a character vector 
        # and not a factor vector
        # 
        # You probably don't need to copy this step since your import is unlikely
        # to 'accidentally' make a factor
        mutate(species = as.character(species)) 
    
    penguins_ok |>
        filter(species != "Chinstrap") |>
        t_test(body_mass ~ species, 
               order = c("Adelie", "Gentoo"))
    # A tibble: 1 × 7
      statistic  t_df  p_value alternative estimate lower_ci upper_ci
          <dbl> <dbl>    <dbl> <chr>          <dbl>    <dbl>    <dbl>
    1     -23.3  242. 1.22e-63 two.sided     -1386.   -1504.   -1269.

    From this, we see that Adelie penguins are roughly 1,386 grams lighter on average than Gentoos (the estimate column) and the \(p\)-value is far less than 0.01, so the difference is unlikely to be solely due to sampling variability.

    Here, we perform the test by specifying the response (quantity of interest) on the left hand side of the ~ and the explanatory variable (that which might meaningfully predict a difference between groups) on the right hand side of the ~. If we swap the order (or don’t provide it), the sign of the estimated difference may differ, but the actual \(p\)-value won’t change for a two-sided test (as done here).
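
    Although not required, you can verify this yourself by reversing the order argument; only the sign of the estimate (and confidence interval) should change relative to the output above:

    penguins_ok |>
        filter(species != "Chinstrap") |>
        # Same test as above with the group order reversed: the estimate is now
        # Gentoo minus Adelie, but the two-sided p-value is unchanged
        t_test(body_mass ~ species,
               order = c("Gentoo", "Adelie"))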

    We can extend this analysis to perform it separately for each of three years using a bit of nested data trickery. The mechanics of this are more advanced, but you should be able to extend this to your own analysis by just changing variable names:

    penguins_ok |>
        filter(species != "Chinstrap") |>
        group_by(year) |>
        nest() |>
        mutate(p_value_adelie_gentoo = map(data, \(d) t_test(d,
                                                             body_mass ~ species,
                                                             order = c("Adelie", "Gentoo")))) |>
        unnest(everything())
    # A tibble: 265 × 15
    # Groups:   year [3]
        year species island  bill_len bill_dep flipper_len body_mass sex   statistic
       <int> <chr>   <fct>      <dbl>    <dbl>       <int>     <int> <fct>     <dbl>
     1  2007 Adelie  Torger…     39.1     18.7         181      3750 male      -11.6
     2  2007 Adelie  Torger…     39.5     17.4         186      3800 fema…     -11.6
     3  2007 Adelie  Torger…     40.3     18           195      3250 fema…     -11.6
     4  2007 Adelie  Torger…     36.7     19.3         193      3450 fema…     -11.6
     5  2007 Adelie  Torger…     39.3     20.6         190      3650 male      -11.6
     6  2007 Adelie  Torger…     38.9     17.8         181      3625 fema…     -11.6
     7  2007 Adelie  Torger…     39.2     19.6         195      4675 male      -11.6
     8  2007 Adelie  Torger…     41.1     17.6         182      3200 fema…     -11.6
     9  2007 Adelie  Torger…     38.6     21.2         191      3800 male      -11.6
    10  2007 Adelie  Torger…     34.6     21.1         198      4400 male      -11.6
    # ℹ 255 more rows
    # ℹ 6 more variables: t_df <dbl>, p_value <dbl>, alternative <chr>,
    #   estimate <dbl>, lower_ci <dbl>, upper_ci <dbl>

    Here, we see that our result has columns for the \(t\) statistic, the \(p\)-value, the estimated difference, and many other useful inferential quantities.
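
    If you prefer a more compact summary (one row per year rather than repeating the test results across all penguins), a minimal optional variant of the code above simply drops the nested data column before unnesting:

    penguins_ok |>
        filter(species != "Chinstrap") |>
        group_by(year) |>
        nest() |>
        # Run the same two-sample t-test within each year
        mutate(test = map(data, \(d) t_test(d,
                                            body_mass ~ species,
                                            order = c("Adelie", "Gentoo")))) |>
        # Drop the raw data before unnesting to get one summary row per year
        select(-data) |>
        unnest(test)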

  • Example 02: In other contexts, we may have a categorical response. In this case, we should not perform a \(t\)-test on means, but instead want to use a binomial proportion test via the prop_test function. For instance, if we want to see whether Adelie penguins are more likely to be female than Gentoos:

    library(infer)
    library(tidyverse)
    penguins_ok |>
       filter(species != "Chinstrap") |>
       prop_test(sex ~ species, 
                 order = c("Adelie", "Gentoo"))
    # A tibble: 1 × 6
      statistic chisq_df p_value alternative lower_ci upper_ci
          <dbl>    <dbl>   <dbl> <chr>          <dbl>    <dbl>
    1   0.00650        1   0.936 two.sided     -0.116    0.141

    Here, we see that there is basically no evidence that the fraction of female penguins differs by species. Note the key difference between t_test and prop_test: in t_test the response is a numeric variable and we are testing the mean of its distribution; in prop_test the response is a binary variable, such as a Boolean (TRUE/FALSE) value or a two-level factor like sex, and we are testing the probability of observing a ‘success’ (e.g., a TRUE).3

    We can also use this with a derived quantity. For instance, if we want to see if Gentoos are more likely to be over 4000 grams than non-Gentoos:

    penguins |>
        mutate(is_gentoo = (species == "Gentoo"),
               over_4k = body_mass > 4000) |>
        prop_test(over_4k ~ is_gentoo, 
                  order = c("TRUE", "FALSE"))
    # A tibble: 1 × 6
      statistic chisq_df  p_value alternative lower_ci upper_ci
          <dbl>    <dbl>    <dbl> <chr>          <dbl>    <dbl>
    1      181.        1 3.50e-41 two.sided      0.699    0.828

    Here, we see that there is strong statistical evidence (in the form of a tiny \(p\)-value) that being a Gentoo penguin increases the probability of weighing over 4000 grams.

    We can also modify this to be a one-sided test:

    penguins |>
        mutate(is_gentoo = (species == "Gentoo"),
               over_4k = body_mass > 4000) |>
        prop_test(over_4k ~ is_gentoo, 
                  alternative = "greater",
                  order = c("TRUE", "FALSE"))
    # A tibble: 1 × 6
      statistic chisq_df  p_value alternative lower_ci upper_ci
          <dbl>    <dbl>    <dbl> <chr>          <dbl>    <dbl>
    1      181.        1 1.75e-41 greater        0.709        1

    Finally, recall that these values can be pulled out of the test result and used in the body of your text using the pull function from the dplyr package.
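
    For example, a minimal sketch (reusing penguins_ok from above; the object names here are arbitrary) of extracting values for use in inline R code:

    gentoo_vs_adelie <- penguins_ok |>
        filter(species != "Chinstrap") |>
        t_test(body_mass ~ species,
               order = c("Adelie", "Gentoo"))

    # Store single values for reuse in the text of your Quarto document,
    # e.g., `r round(p_val, 4)` or `r abs(round(est_diff))` grams
    p_val    <- gentoo_vs_adelie |> pull(p_value)
    est_diff <- gentoo_vs_adelie |> pull(estimate)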

Task 4: Statistical Inference

TBD

Final Deliverable: TBD

TBD

AI Usage Statement

At the end of your report, you must include a description of the extent to which you used Generative AI tools to complete the mini-project. This should be a one paragraph section clearly delineated using a collapsible Quarto “Callout Note”.

E.g.,

No Generative AI tools were used to complete this mini-project.

or

GitHub Co-Pilot Pro was used via RStudio integration while completing this project. No other generative AI tools were used.

or

ChatGPT was used to help write the code in this project, but all non-code text was generated without the use of any Generative AI tools. Additionally, ChatGPT was used to provide additional background information on the topic and to brainstorm ideas for the final open-ended prompt.

Recall that Generative AI may not be used to write or edit any non-code text in this course.

These blocks can be created using the following syntax:


::: {.callout-note title="AI Usage Statement" collapse="true"}

Your text goes here. 

:::

Please contact the instructor if you have any questions about appropriate AI usage in this course.

Extra Credit Opportunities

There are optional Extra Credit Opportunities where extra points can be awarded for specific additional tasks in this mini-project. The amount of the extra credit is typically not proportional to the work required to complete these tasks, but I provide these for students who want to dive deeper into this project and develop additional data analysis skills not covered in the main part of this mini-project.

For this mini-project, no more than 4 total points of extra credit may be awarded. Even with extra credit, your grade on this mini-project cannot exceed 80 points total.

TBD


This work ©2026 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.

Footnotes

  1. This is “ChatGPT-level” prose: without obvious flaws, but lacking the style and elegance associated with true quality writing.↩︎

  2. Throughout this section, replace YOUR_GITHUB_ID with your GitHub ID from Mini-Project #00. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

  3. Of course, if you have taken STA 9715, you know that there is only one univariate Boolean distribution (the Bernoulli) and that the probability of observing a success is equal to the mean of the distribution. (\(X \sim \text{Bernoulli} \implies \E[X] = \P(X = 1)\) and all that). This might lead you to ask whether there really is a meaningful difference between a \(t\)-test and a proportion test. These are good questions to ask your STA 9719 professor.↩︎