#' Acquire IPEDS Data for MP#01
#'
#' This function will acquire and standardize all data for MP#01
#' from IPEDS (https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx)
#'
#' We're starting in 2010 as the data seems to be reasonably complete
#' after that point.
acquire_ipeds_data <- function(start_year=2010, end_year=2024){
library(tidyverse)
library(glue)
data_dir <- file.path("data", "mp01")
if(!dir.exists(data_dir)){
dir.create(data_dir, showWarnings=FALSE, recursive=TRUE)
}
YEARS <- seq(start_year, end_year)
EFA_ALL <- map(YEARS, function(yy){
if(yy <= 2022){
ef_url <- glue("https://nces.ed.gov/ipeds/datacenter/data/EF{yy}A.zip")
} else {
ef_url <- glue("https://nces.ed.gov/ipeds/data-generator?year={yy}&tableName=EF{yy}A&HasRV=0&type=csv")
}
ef_file <- file.path(data_dir, glue("ef{yy}a.csv.zip"))
if(!file.exists(ef_file)){
message(glue("Downloading Enrollment Data for {yy} from {ef_url}"))
download.file(ef_url, destfile = ef_file, quiet=TRUE)
}
read_csv(ef_file,
show_col_types=FALSE) |>
mutate(year = yy,
# American Indian or Alaskan Native
enrollment_m_aian = EFAIANM,
enrollment_f_aian = EFAIANW,
# Asian
enrollment_m_asia = EFASIAM,
enrollment_f_asia = EFASIAW,
# Black or African-American,
enrollment_m_bkaa = EFBKAAM,
enrollment_f_bkaa = EFBKAAW,
# Hispanic
enrollment_m_hisp = EFHISPM,
enrollment_f_hisp = EFHISPW,
# Native Hawaiian or Other Pacific Islander
enrollment_m_nhpi = EFNHPIM,
enrollment_f_nhpi = EFNHPIW,
# White
enrollment_m_whit = EFWHITM,
enrollment_f_whit = EFWHITW,
# Two or More Races
enrollment_m_2mor = EF2MORM,
enrollment_f_2mor = EF2MORW,
# Unknown / Undisclosed Race
enrollment_m_unkn = EFUNKNM,
enrollment_f_unkn = EFUNKNW,
# US Non-Resident
enrollment_m_nral = EFNRALM,
enrollment_f_nral = EFNRALW,
) |> filter(
(EFALEVEL %in% c(2, 12)) | (LINE %in% c(1, 15))
# Per 2024 Data Dictionary,
# - EFALEVEL 2 = undergrad
# - EFALEVEL 12 = grad
# - Line 1 = first year first time full-time undergrad
# - Line 15 = first year first time part-time undergrad
) |> mutate(level = case_when(
EFALEVEL == 2 ~ "all undergrad",
EFALEVEL == 12 ~ "all graduate",
LINE %in% c(1, 15) ~ "first year undergrad"
)
) |>
select(institution_id = UNITID,
year,
level,
starts_with("enrollment_")) |>
group_by(institution_id,
year,
level) |>
summarize(across(starts_with("enrollment_"), sum),
.groups = "drop")
}) |> bind_rows()
DESC_ALL <- map(YEARS, function(yy){
if(yy <= 2022){
hd_url <- glue("https://nces.ed.gov/ipeds/datacenter/data/HD{yy}.zip")
} else {
hd_url <- glue("https://nces.ed.gov/ipeds/data-generator?year={yy}&tableName=HD{yy}&HasRV=0&type=csv")
}
hd_file <- file.path(data_dir, glue("hd{yy}.csv.zip"))
if(!file.exists(hd_file)){
message(glue("Downloading Institutional Descriptions for {yy} from {hd_url}"))
download.file(hd_url, destfile = hd_file, quiet=TRUE)
}
suppressWarnings(
read_csv(hd_file,
show_col_types=FALSE,
locale=locale(encoding=if_else(yy==2024, "utf-8", "windows-1252"))) |>
mutate(year = yy,
INSTNM) |>
select(institution_id = UNITID,
institution_name = INSTNM,
state = STABBR,
year)
)
}) |> bind_rows()
inner_join(EFA_ALL,
DESC_ALL,
join_by(institution_id == institution_id,
year == year))
}
IPEDS <- acquire_ipeds_data()

STA 9750 Mini-Project #01: Assessing the Impact of SFFA on Campus Diversity One-Year Later
Due Dates
- Released to Students: 2026-02-20
- Initial Submission: 2026-03-13 11:59pm ET on GitHub and Brightspace
- Peer Feedback:
  - Peer Feedback Assigned: 2026-03-16 on GitHub
  - Peer Feedback Due: 2026-03-22 11:59pm ET on GitHub
Estimated Time to Complete: 13-15 Hours
Estimated Time for Peer Feedback: 1 Hour
Welcome to STA 9750 Mini Projects!
In the STA 9750 Mini-Projects, you will perform basic data analyses intended to model best practices for your course final project. (Note, however, that these are mini-projects; your final course project is expected to be far more extensive than any single MP.)
Introduction
On June 29, 2023, The US Supreme Court handed down its decision in the closely watched case of Students for Fair Admissions v. Harvard (“SFFA”). In SFFA, the court found that the admissions programs at Harvard and at the University of North Carolina violated the Equal Protection Clause of the Fourteenth Amendment to the US Constitution and engaged in impermissible race-conscious admissions practices. While the court had explicitly disallowed race-conscious practices in other contexts, Harvard argued that the admissions process served a compelling governmental interest - educational benefits from a diverse student body - that had been recognized by the court as recently as 2016.
While the SFFA case touched on many aspects of law, one important factual question was the practical impact of the programs in question: were they essential to maintaining a diverse student body or were they merely a slight “thumb on the scale” that had only a minor impact?
Now that the first set of post-SFFA admissions data has been released,1 you are going to attempt to analyze the extent to which post-SFFA admissions differ from pre-SFFA practice. You will use admissions data distributed via IPEDS, the Integrated Postsecondary Education Data System, managed by the National Center for Education Statistics within the Federal Department of Education.
After some initial analysis, you will write a brief (750 word) Op-Ed from the perspective of a college president describing the impact of SFFA on the demographics of admitted students at your college and at colleges across the country.
In this mini-project, you will:
- Practice use of dplyr for analysis of tabular data
- Practice use of quarto and Reproducible Research Tools for Effective Communication of Data Analysis Results
- Begin your professional data science portfolio.
Student Responsibilities
For purposes of MPs, we are dividing the basic data analytic workflow into several major stages:
- Data Ingest and Cleaning: Given a data source, read it into R and transform it to a reasonably useful and standardized ('tidy') format.
- Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
- Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
- Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics
- Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:
| | Ingest and Cleaning | Combination and Alignment | Descriptive Statistical Analysis | Visualization |
|---|---|---|---|---|
| Mini-Project #01 | ✓ | | | |
| Mini-Project #02 | ✓ | ✓ | ½ | |
| Mini-Project #03 | ½ | ✓ | ✓ | ✓ |
| Mini-Project #04 | ✓ | ✓ | ✓ | ✓ |
In early stages of the course, such as this MP, I will ‘scaffold’ much of the analysis for you, leaving only those stages we have discussed in class for you to fill in. As the course progresses, the mini-projects will be more self-directed and results less standardized.
Rubric
STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course staff. The following basic rubric will be used for all mini-projects:
| Course Element | Excellent (9-10) | Great (7-8) | Good (5-6) | Adequate (3-4) | Needs Improvement (1-2) |
|---|---|---|---|---|---|
| Written Communication | Report is very well-written and flows naturally. Motivation for key steps is clearly explained to reader without excessive detail. Key findings are highlighted and appropriately given sufficient context, including reference to related work where appropriate. | Report has no grammatical or writing issues.2 Writing is accessible and flows naturally. Key findings are highlighted and clearly explained, but lack suitable motivation and context. | Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted or unclearly explained. | Writing is intelligible, but has some grammatical errors. Key findings are difficult to discern. | Report exhibits significant weakness in written communication. Key points are nearly impossible to identify. |
| Project Skeleton | Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are especially insightful and creative. | Code completes all instructor-provided tasks satisfactorily. Responses to open-ended tasks are insightful, creative, and do not have any minor flaws. | Response to one instructor provided task is skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are solid and without serious flaws. | Responses to two instructor provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are acceptable, but have at least one serious flaw. | Response to three or more instructor provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are seriously lacking. |
| Tables & Document Presentation | Tables go beyond standard publication-quality formatting, using advanced features like color formatting, interactivity, or embedded visualization. | Tables are well-formatted, with publication-quality selection of data to present, formatting of table contents (e.g., significant figures) and column names. | Tables are well-formatted, but still have room for improvement in one of these categories: subsetting and selection of data to present, formatting of table contents (e.g., significant figures), column names. | Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style. Document is difficult to read due to distracting formatting choices. | Unfiltered ‘data dump’ instead of curated table. Document is illegible at points. |
| Data Visualization | Figures go beyond standard publication-quality formatting, using advanced features like animation, interactivity, or advanced plot types implemented in ggplot2 extension packages. | Figures are ‘publication-quality,’ with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, and captions, etc. | Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in one-to-two ways. | Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in three or more distinct ways. | Figures are suitable to support claims made, but are ‘exploratory-quality,’ reflecting zero-to-minimal effort to customize and ‘polish’ beyond ggplot2 defaults. |
| Exploratory Data Analysis | Deep and ‘story-telling’ EDA identifying non-obvious patterns that are then used to drive further analysis in support of the project. All patterns and irregularities are noted and well characterized, demonstrating mastery and deep understanding of all data sets used. | Meaningful ‘story-telling’ EDA identifying non-obvious patterns in the data. Major and minor patterns and irregularities are noted and well characterized at a level sufficient to achieve the goals of the analysis. EDA demonstrates clear understanding of all data sets used. | Extensive EDA that thoroughly explores the data, but lacks narrative and does not deliver a meaningful ‘story’ to the reader. Obvious patterns or irregularities noted and well characterized, but more subtle structure may be overlooked or not fully discussed. EDA demonstrates competence and basic understanding of the data sets used. | Solid EDA that identifies major structure to the data, but does not fully explore all relevant structure. Obvious patterns or irregularities ignored or missed. EDA demonstrates familiarity with high-level structure of the data sets used. | Minimal EDA, covering only standard summary statistics, and providing limited insight into data patterns or irregularities. EDA fails to demonstrate familiarity with even the most basic properties of the data sets being analyzed. |
| Code Quality | Code is (near) flawless. Intent is clear throughout and all code is efficient, clear, and fully idiomatic. Code passes all styler and lintr type analyses without issue. | Comments give context and structure of the analysis, not simply defining functions used in a particular line. Intent is clear throughout, but code can be minorly improved in certain sections. | Code has well-chosen variable names and basic comments. Intent is generally clear, though some sections may be messy and code may have serious clarity or efficiency issues. | Code executes properly, but is difficult to read. Intent is generally clear and code is messy or inefficient. | Code fails to execute properly. |
| Data Preparation | Data import is fully-automated and efficient, taking care to only download from web-sources if not available locally. All data cleaning steps are fully-automated and robustly implemented, yielding a clean data set that can be widely used. | Data is imported and prepared effectively, in an automated fashion with minimal hard-coding of URLs and file paths. Data cleaning is fully-automated and sufficient to address all issues relevant to the analysis at hand. | Data is imported and prepared effectively, though source and destination file names are hard-coded. Data cleaning is rather manual and hard-codes most transformations. | Data is imported in a manner likely to have errors. Data cleaning is insufficient and fails to address clear problems. | Data is hard-coded and not imported from an external source. |
| Analysis and Findings | Analysis demonstrates uncommon insight and quality, providing unexpected and subtle insights. | Analysis is clear and convincing, leaving essentially no doubts about correctness. | Analysis clearly appears to be correct and passes the “sniff test” for all findings, but a detailed review notes some questions remain unanswered. | Analysis is not clearly flawed at any point and is likely to be within the right order of magnitude for all findings. | Analysis is clearly incorrect in at least one major finding, reporting clearly implausible results that are likely off by an order of magnitude or more. |
Note that the “Excellent” category for most elements applies only to truly exceptional “above-and-beyond” work. Most student submissions will likely fall in the “Good” to “Great” range.
At this early point, you are not responsible for all elements of this rubric. In particular, all submissions will receive an automatic 10/10 for Data Visualization as this is outside the scope of this mini-project. Furthermore, because I am providing code to download the data, load it into R, and prepare it for analysis, all reports submitted using my code will receive an automatic 10/10 for the ‘Data Preparation’ element of the rubric. Finally, reports completing all tasks described under Data Integration and Exploration below should receive a 10/10 for the ‘Exploratory Data Analysis’ rubric element.
Taken together, you are only really responsible for these portions of the rubric:
- Written Communication
- Project Skeleton
- Tables & Document Presentation
- Code Quality
- Analysis and Findings
Reports completing all key steps outlined below essentially start with 30 free points.
Note that you are evaluated on writing and communication in these Mini-Projects. You are required to write a report in the prescribed style, culminating in an Op-Ed. A submission that performs the instructor-specified tasks, but does not provide appropriate context and commentary, will score very poorly on the relevant rubric elements.
In particular, if a submission does not include a clearly delineated Op-Ed and only answers the instructor prompts in narrative text, peer evaluators should judge it to have “Good” quality Written Communication (at best) as key findings are not conveyed appropriately.
Quarto’s code folding functionality is useful for “hiding” code so that it doesn’t break the flow of your writing.
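For example, the following chunk options (standard Quarto options; code-fold can also be enabled document-wide in your YAML header) collapse a chunk's code behind an expandable "Show the code" toggle while still displaying its output. This is a minimal sketch; any chunk contents work:

#| code-fold: true
#| code-summary: "Show the code"

# The long data acquisition function is a natural candidate for folding
IPEDS <- acquire_ipeds_data()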
You can also make use of Quarto’s contents shortcode to present code and findings in an order other than how the code should be executed. This is particularly useful if you want to include a figure or table in an “Executive Summary” at the top of your submission.
For this mini-project, no more than 6 total points of extra credit can be awarded. Opportunities for extra credit exist for students who go above and beyond the instructor-provided scaffolding. Specific opportunities for extra credit can be found below.
Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to
- further refine their skills;
- learn additional techniques that can be used in the final course project; and
- develop a more impressive professional portfolio.
Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!
Submission Instructions
After completing the analysis, write up your findings, showing all of your code, using a dynamic quarto document and post it to your course repository. The qmd file should be named mp01.qmd (lower case!) so the rendered document can be found at docs/mp01.html in your repository and will be served at the URL:3
https://YOUR_GITHUB_ID.github.io/STA9750-2026-SPRING/mp01.html
You can use the helper function mp_start available in the Course Helper Functions to create a file with the appropriate name and some meta-data already included. Do so by running the following command at the R Console:
source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_start(N=01)
After completing this mini-project, upload your rendered output and necessary ancillary files to GitHub to make sure your site works. The mp_submission_ready function in the Course Helper Functions can perform some of these checks automatically. You can run this function by running the following commands at the R Console:
source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_ready(N=01)
Once you confirm this website works (substituting YOUR_GITHUB_ID for the actual GitHub username provided to the professor in MP#00 of course), open a GitHub issue on the instructor’s repository to submit your completed work.
The easiest way to do so is by use of the mp_submission_create function in the Course Helper Functions, which can be used by running the following command at the R Console:
source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_create(N=01)
Alternatively, if you wish to submit manually, open a new issue at
https://github.com/michaelweylandt/STA9750-2026-SPRING/issues/new.
Title the issue STA 9750 YOUR_GITHUB_ID MiniProject #01 and fill in the following text for the issue:
Hi @michaelweylandt!
I've uploaded my work for MiniProject #**01** - check it out!
<https://<GITHUB_ID>.github.io/STA9750-2026-SPRING/mp01.html>
At various points before and after the submission deadline, the instructor will run some automated checks to ensure your submission has all necessary components. Please respond to any issues raised in a timely fashion as failing to address them may lead to a lower set of scores when graded.
Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.
NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.
NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.
Mini-Project #01: Assessing the Impact of SFFA on Campus Diversity One-Year Later
Data Acquisition
The following code can be used to acquire data from IPEDS. Specifically, once run, this code will download the fall enrollment (“EF”) and institution description (“HD”) files for the previous 15 years. To be efficient, this function will save a copy of the downloaded data in a folder called data/mp01 and use that copy to avoid re-downloading a file if it is already present on your computer. This will make your code faster to run and will avoid putting unnecessary stress on the IPEDS servers.
This creates a data frame called IPEDS that has 219316 rows and 23 columns in your local environment. This data will be used for the remainder of this mini-project.
Using the code above, acquire the IPEDS data. Copy the code into your Quarto document and make sure it runs successfully.
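As a quick, optional sanity check (a sketch; the expected dimensions are exactly those stated above), confirm the result after the code runs:

# Should report 219316 rows and 23 columns, matching the description above
dim(IPEDS)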
Don't git add Data Files
Make sure that git is set to ignore data files, such as the one created above. Check the git pane in RStudio and make sure that the files in data/mp01 do not appear. (If you set up your .gitignore file correctly in MP#00, it should already be ignored.) If it is appearing, you may need to edit your .gitignore file.
Removing a large data file from git is possible, but difficult. Don’t get into a bad state!
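If the files do appear in the git pane, one way to update your .gitignore without editing it by hand is sketched below; it assumes you have the usethis package installed:

# Append the data directory to .gitignore in the project root
usethis::use_git_ignore("data/")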
Data Cleaning and Preparation
IPEDS provides many different pieces of information, so I have provided you a subset of interesting variables. These include:
- A unique institutional ID code (institution_id)
- The name of the institution (institution_name)
- The state in which the institution's principal campus is located
- The year of reporting. Note that, for enrollment data, these are the enrollment as of the Fall semester, so, e.g., Year 2010 is for the first semester of the 2010-2011 academic year.
- The "level" of students being considered:
  - "all graduate" refers to all graduate students on campus (both full-time and part-time)
  - "all undergraduate" refers to all undergraduate students on campus (both full-time and part-time)
  - "first year undergrad" refers to "stereotypical" first year college students. These are students (either full-time or part-time) who are in their first year of study during their first undergraduate enrollment (i.e., excluding students who took time off and re-enrolled, students who transferred from another institution, and students who previously studied at a different institution earning an associate's degree or similar credential)
- Number of enrolled students in various demographic groups: these 18 variables are formatted as enrollment_X_YYYY, where
  - X is m or f for male or female students
  - YYYY is a racial/ethnic group identifier:
    - aian: American Indian or Alaskan Native
    - asia: Asian-American
    - bkaa: Black or African-American
    - hisp: Hispanic
    - nhpi: Native Hawaiian or Pacific Islander
    - whit: White
    - 2mor: Two or More Races
    - unkn: Unknown / Unreported
    - nral: Not a US Resident before enrollment ("Non-Resident Alien")

  Note that these are somewhat non-standard identifiers and do not match other official sources. (E.g., the US Census considers Hispanic ancestry a separate axis so it is possible to have Non-Hispanic White, Hispanic Black, etc.) We use these as given since they are what IPEDS provides.

  Together these give 18 enrollment categories that, when summed, give the total enrollment at an institution; a short sketch of this computation follows this list.
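Here is that computation as a minimal sketch; total_enrollment is a name introduced here for illustration, not a column in the provided data:

# Sum the 18 demographic enrollment columns row-by-row
IPEDS <- IPEDS |>
    mutate(total_enrollment = rowSums(across(starts_with("enrollment_")),
                                      na.rm = TRUE))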
Before proceeding, we will create two additional variables for our analyses:
- is_cuny: A Boolean (True/False) variable indicating whether an institution is part of the CUNY system.
- is_calpublic: A variable indicating whether an institution is a public college or university of the state of California
Create a new column is_cuny in your IPEDS data. For our purposes, we will assume that CUNY schools all have “CUNY” in their institution name.
The str_detect function can be used to test whether a string (like a name) contains a substring:4 e.g.,

# True because "CUNY" (2nd arg) is in the longer string (1st arg)
str_detect("CUNY Bernard M. Baruch College", "CUNY")
[1] TRUE

# False because "Hunter" is nowhere in the first argument
str_detect("CUNY Bernard M. Baruch College", "Hunter")
[1] FALSE

As with other functions in R, str_detect is vectorized, making it easy to use inside of other functions.

names <- c("City College-Miami", "CUNY City College", "CUNY Hunter College")
str_detect(names, "CUNY")
[1] FALSE  TRUE  TRUE
Use this function, in conjunction with a mutate function, to create a new column called is_cuny inside the IPEDS data. Make sure to assign your mutated data frame so that you can use it later.
Your code should look something like:

IPEDS <- IPEDS |>
    mutate(is_cuny = ...)

where the ... is some code using the string "CUNY", the existing column names in IPEDS and the str_detect function.
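To see the idiom without giving the answer away, here is the same pattern applied to a small hypothetical data frame (the school names below are invented for illustration):

toy <- tibble(name = c("CUNY Hunter College",
                       "Columbia University",
                       "CUNY Queens College"))
toy |> mutate(has_cuny = str_detect(name, "CUNY"))
# has_cuny comes back TRUE, FALSE, TRUE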
The public colleges and universities of the state of California are an interesting test case. Pursuant to California Proposition 209 (1996), public institutions of higher education in the state were specifically not allowed to implement any sort of Affirmative Action program, so the ending of Affirmative Action would theoretically have no impact on their admissions practices. We will investigate this possibility below. For now, we need to create an additional variable which identifies the California public institutions in our data set.
Unlike the CUNY system, which combines community colleges, senior (four-year) colleges, and specialized graduate institutions into a single system, California has three separate public systems:
- the University of California System
- the California State System and
- the California Community Colleges system.
Because of this, we will need to do a bit more work to create our is_calpublic variable. Because the Community Colleges are open-enrollment, we can exclude them from our analysis and focus exclusively on the UC system and the Cal State system.
Add a new variable is_calpublic to the IPEDS data that is TRUE for institutions that are part of the University of California System or the California State system.
To do this, you will need to use str_detect (see the previous task) and one of the Boolean logical operators (!, &, |).
Once you have created your variable, use a combination of dplyr functions to get a list of the institutions identified by your new variable. You may wish to compare this against the Wikipedia articles linked above to ensure your results are accurate.
Note: If you know what a regular expression is, you know that you can use only a single str_detect call here and avoid the use of a Boolean operator. Do not do this - you must use some Boolean logic to get full credit for this task. If you do not know what a regular expression is, you can disregard this note.
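As a toy illustration of the Boolean combination (made-up strings; adapt the idea, not these patterns):

items <- c("red apple", "blueberry jam", "carrot cake")
str_detect(items, "apple") | str_detect(items, "berry")
# [1]  TRUE  TRUE FALSE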
Initial Data Exploration
Before moving to our final analysis, we will do a bit of Exploratory Data Analysis (EDA). EDA serves many purposes in data science (quality control, hypothesis generation, outlier identification, etc.), but perhaps the most important is simply knowing what information can be found in a novel data set. Now that our data is imported and cleaned, it's almost time to start our EDA. Before we do, however, we should pause briefly to consider how we want to display data.
Displaying Data in Tables
While we could continue investigating our data using R's basic print-outs, this is a good time to introduce the gt package, which can be used to create complex tables natively in R.
What follows here is a very brief introduction to the gt package. You do not need to copy this into your submission - it is provided only as background. You will use the gt package to format your answers to the next few tasks.
For this introduction, I am going to use the penguins data, but the same functions can be applied to the IPEDS data that you are analyzing.
Let’s first look at what a “basic” or “raw” display of a data frame gives us:
penguins

    species    island    bill_len bill_dep flipper_len body_mass    sex year
1   Adelie     Torgersen     39.1     18.7         181      3750   male 2007
2   Adelie     Torgersen     39.5     17.4         186      3800 female 2007
3   Adelie     Torgersen     40.3     18.0         195      3250 female 2007
4   Adelie     Torgersen       NA       NA          NA        NA   <NA> 2007
5   Adelie     Torgersen     36.7     19.3         193      3450 female 2007
6   Adelie     Torgersen     39.3     20.6         190      3650   male 2007
7   Adelie     Torgersen     38.9     17.8         181      3625 female 2007
8   Adelie     Torgersen     39.2     19.6         195      4675   male 2007
9   Adelie     Torgersen     34.1     18.1         193      3475   <NA> 2007
10  Adelie     Torgersen     42.0     20.2         190      4250   <NA> 2007

[print truncated here for readability: the raw console output continues, row by row, through all 344 penguins]
This has several problems:
- We are showing way too much data. A reader will not be able to easily find meaningful trends or patterns in a big data set like this.

  As a general rule, you should rarely have more than 10-15 rows in a table; even then, you will still want to guide your reader to the point of the table.

- The column names are rather ugly. Some, like species, are not too bad, though it would still be better if they were capitalized. Others, like bill_len, are pretty terrible: bill_len is not an English word, the underscore exists only to separate two words "in code" (recall R's restrictions on variable names), and the unit isn't clear. In this case, a column name like Bill Length (mm) would be far preferable.

- The row numbers are essentially pointless and just take up space, adding no value. Any content that is not adding value is simply distracting the reader from the content that has value.

- The "point" of the table is unclear. What is a reader supposed to get from this? As a data analyst - doing work on behalf of a reader who may not be a data analyst - you have a responsibility to clearly convey the "story" of your findings and this does not do so.

  I may want to use this data to show that Gentoo penguins are, on average, heavier than the other two species in this data set, but this is far from clear.

- It's just a bit ugly.
Good table design requires us to take on the mindset of the reader. Tools like gt can help pretty things up, but you still have to think about what you want to display. Well-formatted garbage is still garbage.
To start improving this table, let’s do the calculations for our reader instead of expecting them to do it all manually:
library(tidyverse)
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass))

# A tibble: 3 × 3
  species   n_species avg_body_mass
  <fct>         <int>         <dbl>
1 Gentoo          124         5076.
2 Chinstrap        68         3733.
3 Adelie          152         3701.
We’re definitely not done - but here the “point” of the table is clear, at least if we also put some text surrounding it.
To improve this further, we can also pass this smaller summary data frame to the gt function from the package of the same name:
library(gt)
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt()

| species | n_species | avg_body_mass |
|---|---|---|
| Gentoo | 124 | 5076.016 |
| Chinstrap | 68 | 3733.088 |
| Adelie | 152 | 3700.662 |
Note here that gt recognizes we are rendering an HTML page and produces a “real” HTML table here. If you were to copy and paste the table above into “table” software, e.g. Google Sheets or Microsoft Excel, it would be properly and automatically handled. For us, the table is the end-point, but it’s a nice courtesy to your reader who may want to use your results in their own presentations.
The gt package provides many functions for tweaking and improving the appearance of a table. You will almost always want to, at a minimum, use these for:
- Ordering and (re-)naming columns
- Adding titles and footers
- Formatting values
Let’s go through these one at a time. First, we want to rename and reorder the columns. This can be done in pure dplyr with the select and rename functions, but we’ll show the gt way here:
library(gt)
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species="Species",
               avg_body_mass = "Avg. Body Mass (g)",
               n_species = "Number of Penguins in Sample")

| Species | Avg. Body Mass (g) | Number of Penguins in Sample |
|---|---|---|
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
Here, we used the cols_move_to_end function to move the n_species column to the end (no surprise!). In other contexts, we might want to use the cols_move_to_start function to move a column to the leftmost side of a table or cols_move to put a column in the middle of the table.
The cols_label function essentially serves as a renaming operation: the left side of each equals sign is the old column name in the table and the right side gives the new name. (Note, a bit confusingly, that this is the reverse of dplyr::rename.) While we can just pass a basic string here, we can also use the md function to pass Markdown which lets us do some custom formatting:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*"))

| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
|---|---|---|
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
Here, we use boldface and italics for certain text via standard Markdown syntax.
Next, we can add a table title and subtitle to make the content and point of this table clear to our reader:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study")

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
While we can and should describe our analysis in more detail in the main text, I like this pattern of having the super-simple one-liner present directly in the table. This also makes it convenient to clip the table (or a screenshot thereof) for use in other documents and presentations.
Next, we should always note the source of the data used to get our results. In this case, the original penguins data comes from this article, so we can cite that in our work. Note the use of Markdown (md()) to let us include a link to the original source within our table:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study") |>
    tab_source_note(md("Data originally published by K.B. Gorman, T.D. Williams,
                        and W. R. Fraser in 'Ecological Sexual Dimorphism and
                        Environmental Variability within a Community of Antarctic
                        Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081.
                        <https://doi.org/10.1371/journal.pone.0090081>. Later
                        popularized via the `R` package
                        [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/)"))

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
| Data originally published by K.B. Gorman, T.D. Williams, and W. R. Fraser in 'Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081. Later popularized via the R package palmerpenguins | | |
A small, but lovely, quality of life feature here is the fact that md will automatically re-align the text to fit the dimensions of the rendered table. This lets us put new lines within our citation text so that our code doesn’t exceed the 80 characters-per-line guideline.

Finally, we want to make sure the number values are formatted appropriately. For these numbers, plain formatting really isn’t much of a problem, but for very large or small numbers, we might want to use scientific notation; for dates, we might want to control the formatting; etc. This is done with the fmt_* family of functions.

Each fmt_ function takes one or more column names and applies a formatting transformation to that column. The specifics of the formatting can be controlled with additional optional arguments. For example, if we want to round the average weight to the nearest gram, we would use the fmt_number function with the argument decimals=0:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study") |>
    tab_source_note(md("Data originally published by K.B. Gorman, T.D. Williams,
                        and W. R. Fraser in 'Ecological Sexual Dimorphism and
                        Environmental Variability within a Community of Antarctic
                        Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081.
                        <https://doi.org/10.1371/journal.pone.0090081>. Later
                        popularized via the `R` package
                        [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/)")) |>
    fmt_number(avg_body_mass, decimals=0)

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
| Gentoo | 5,076 | 124 |
| Chinstrap | 3,733 | 68 |
| Adelie | 3,701 | 152 |
| Data originally published by K.B. Gorman, T.D. Williams, and W. R. Fraser in 'Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081. Later popularized via the R package palmerpenguins | | |
In this case, it may be more appropriate to display the body mass in kilograms and we can do so semi-automatically with the fmt_number_si formatter:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass "),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study") |>
    tab_source_note(md("Data originally published by K.B. Gorman, T.D. Williams,
                        and W. R. Fraser in 'Ecological Sexual Dimorphism and
                        Environmental Variability within a Community of Antarctic
                        Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081.
                        <https://doi.org/10.1371/journal.pone.0090081>. Later
                        popularized via the `R` package
                        [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/)")) |>
    fmt_number_si(avg_body_mass,
                  decimals=2,
                  unit = "g")

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass | *Number of Penguins in Sample* |
| Gentoo | 5.08 kg | 124 |
| Chinstrap | 3.73 kg | 68 |
| Adelie | 3.70 kg | 152 |
| Data originally published by K.B. Gorman, T.D. Williams, and W. R. Fraser in 'Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081. Later popularized via the R package palmerpenguins | | |
Note that, because fmt_number_si automatically includes the unit and transforms it to the most natural scale (here kg), we can remove the unit from the column name. gt has many more advanced options that can be used for further customization; refer to the package documentation for more details or ask on the course discussion board.
Exploratory Analysis
When faced with a new data set, it is tempting to look only at the first few rows to get a sense of the data: `R` does this by default. In practice, I recommend viewing a random selection of rows instead. This won't guarantee you find any issues, but it increases the probability of catching problems that only appear beyond the first few rows of a data set. The `slice_sample` function can be used for this.
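For example, a one-line check of this sort, using the `penguins` data from the tables above:

```r
library(dplyr)

# View 10 randomly-selected rows rather than the first 10
penguins |> slice_sample(n = 10)
```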
While EDA can be an extensive activity on its own, at an absolute minimum, I recommend you always do at least two basic checks:
-   Ensure that you know what `R` thinks your data is. You might see a value like `"2025-12-01"` and think that `R` is reading it as a date value, but `R` might instead be interpreting it as a string value.5 You have several options for this in `R`, but when working with a data frame, I'd recommend the `glimpse` function: e.g.,

    ```r
    glimpse(penguins)
    ```

    ```
    Rows: 344
    Columns: 8
    $ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
    $ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
    $ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
    $ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
    $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
    $ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
    $ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
    $ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
    ```

    After getting the basic dimensions of the data frame, `glimpse()` will print a one-line summary of each column giving its name, type, and the first few values. In this case, since I prepared the data for you, all columns are of the correct type (mostly numeric, with a few character columns for institution name and state name, and the two Boolean columns you created earlier), but this is a quick and easy check. If there are issues with your data types, it's better to catch them early than to have silent and hard-to-identify errors further down the line. ("Fail fast" is great advice in any programming exercise.)
-   Take a quick look at some basic (univariate) summary statistics for each column. There are several functions for this in base `R`: e.g.,

    ```r
    summary(penguins)
    ```

    ```
         species          island      bill_len        bill_dep
     Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
     Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
     Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                     Mean   :43.92   Mean   :17.15
                                     3rd Qu.:48.50   3rd Qu.:18.70
                                     Max.   :59.60   Max.   :21.50
                                     NA's   :2       NA's   :2
      flipper_len      body_mass        sex           year
     Min.   :172.0   Min.   :2700   female:165   Min.   :2007
     1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007
     Median :197.0   Median :4050   NA's  : 11   Median :2008
     Mean   :200.9   Mean   :4202                Mean   :2008
     3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009
     Max.   :231.0   Max.   :6300                Max.   :2009
     NA's   :2       NA's   :2
    ```

    But I actually like the `skim` function from the `skimr` package:

    ```r
    skim(penguins)
    ```

    **Data summary**

    | | |
    |---|---|
    | Name | penguins |
    | Number of rows | 344 |
    | Number of columns | 8 |
    | Column type frequency: factor | 3 |
    | Column type frequency: numeric | 5 |
    | Group variables | None |

    **Variable type: factor**

    | skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
    |---|---|---|---|---|---|
    | species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
    | island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
    | sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |

    **Variable type: numeric**

    | skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
    |---|---|---|---|---|---|---|---|---|---|---|
    | bill_len | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
    | bill_dep | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
    | flipper_len | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
    | body_mass | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
    | year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |

    You'll see that this gives a nice overview of data structure and types, identifies grouping structure (if any is present), and gives summaries appropriate to the type of each column (here `factor`s, i.e., categorical variables, and `numeric`). I like that this summary gives means, standard deviations, minima (p0), maxima (p100), medians (p50), and a cute little histogram of each variable. Note that this type of summary only reveals univariate structure: if there are interesting multi-dimensional outliers or weird correlation patterns, they won't appear here.
Perform the two checks described above on the IPEDS data you loaded and processed above. Include the code to perform the checks, the output of the checks, and describe what you see.
In general, you shouldn't include these checks in polished analysis reports, but we're making an exception here to practice a good habit. I won't ask you to include this analysis in future mini-projects or your course project reports, but these checks are easy and very valuable, so I recommend you make them part of all analyses.
We are now ready to begin some EDA. Analysts organize their EDA in a variety of ways, but one of my favorites is to pose a set of interesting questions and then answer them. When the answers don't match my intuition, I know I've found somewhere I want to dig deeper. For the first few mini-projects, I will provide these exploratory questions. Later in the course, particularly after we have discussed the role of plotting and graphics in EDA, you will have opportunities to organize your EDA in other ways.
Using dplyr tools, answer the following questions:
-   How many distinct institutions appear in this data set?

-   How many graduate students were enrolled at Baruch in 2024?

    Hint: Use the `str_detect` function discussed above inside a `filter` command to identify Baruch. (See the sketch after this list.)

-   How many total students were enrolled at Baruch in 2024?

    Hint: Make sure to avoid double-counting first-year undergraduates.

-   Which institution had the highest number of enrolled female students in 2019?

    Report at least both the institution and the total number of female students.

-   Which institution with over 1000 total students admitted the highest proportion of Native Hawaiian or Pacific Islander (`nhpi`) first-year undergraduates in 2024?

    Report at least both the institution and the fraction of relevant students.
As you go through these questions, you may find it useful to create new variables in the IPEDS data to avoid repeated lengthy calculations, e.g., a new variable for the total number of enrolled students.
Each of these questions can be answered with one or two scalar values. Use Quarto’s inline code functionality to place the values in a sentence; that is, you should answer in complete sentences, written as normal text with inline code for computed values.
Where appropriate, the `scales` package can be used to format numbers attractively.
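For example, here is a small sketch using the `comma` and `percent` helpers from `scales` (the specific values, accuracy settings, and the `total_enrollment` name below are my own illustrations):

```r
library(scales)

comma(1234567)                   # "1,234,567"
percent(0.3567, accuracy = 0.1)  # "35.7%"
```

In a Quarto document, these pair naturally with inline code, e.g., writing `` `r comma(total_enrollment)` `` in your prose, where `total_enrollment` is a value you computed earlier.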
The next set of questions has slightly more complex answers and should be answered with tables, formatted using the techniques described above. Longer term, you will prefer plots to tables, as plots are a bit easier to interpret (humans being visual creatures), but these are also the types of questions you might use to create plots.
Using dplyr tools, answer the following questions:
-   Which 5 states had the highest number of graduate students across all institutions located in that state?

-   In 2024, how many first-year undergraduate students were enrolled at CUNY colleges and which colleges did they attend? Report both absolute enrollment numbers and percent of total first-year undergraduates.

    Hint: The `fmt_percent` and `fmt_number` functions will be useful here.

-   How has Baruch's total undergraduate enrollment changed over the study period? Report both enrollment numbers and percent change year-over-year.

    Hint: The `lag` function will be helpful here. (See the sketch after this list.)

-   At what 5 institutions did the fraction of white students decrease the most over the period from 2010 to 2020?

    Hint: You may want to pre-filter by total enrollment to make sure you are not only reporting very small institutions in your analysis, as these are more prone to large fluctuations.

-   In which 3 states did the fraction of female undergraduates increase the most over the period from 2010 to 2024?
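To illustrate the `lag` hint, here is a minimal sketch of a year-over-year percent change calculation on a made-up series (the numbers are purely illustrative):

```r
library(tidyverse)

tibble(year = 2021:2024,
       enrollment = c(100, 110, 99, 105)) |>
  arrange(year) |>  # lag() depends on row order, so sort first
  mutate(pct_change = (enrollment - lag(enrollment)) / lag(enrollment))
```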
Each of these questions can be answered with a table of just a few rows. Use the gt package, as introduced above, to present your results in an attractive ‘publication-quality’ format, not just a “raw” R output.
Final Deliverable: College Newspaper Op-Ed
At this point, you have acquired your data, cleaned and prepared it, and performed your EDA. Now, you are ready to get to work on the final deliverable of your analysis. Everything that comes before this is important, but typically less visible to your final customer.
Write a brief (no more than 750 words) Op-Ed from the perspective of a college president to be published in the campus newspaper. You can be the president of any institution you want, so long as it has a meaningful undergraduate population (i.e., not a stand-alone graduate school like the CUNY School of Law), so have some fun with choosing your persona.
Your Op-Ed should include (at a minimum) the following information:
-   The definition of diversity you are using

-   Some brief statistics about the size and make-up of the student body at your institution

-   A year-over-year comparison of the entering first-year first-time undergraduate class

-   A year-over-year comparison of the demographics of the entire student body

-   A discussion of long-term diversity trends at your institution

-   A comparison of changes at your institution to one or more of the California public institutions.

    (Think of these as a 'critical value' in a statistical test for a change: California's public universities were already barred from considering race in admissions, so SFFA should not have required them to change their policies at all. If the change at your institution is smaller than the change at a California school, your change is likely just noise.)
You may optionally include additional tables or even visualizations, but these do not replace the requirement to write an Op-Ed. Your Op-Ed should stand "alone" and not be mixed in with your code. Place the code necessary to perform the Op-Ed calculations in a separate section and use inline code chunks to include the results of your analysis in the text of your Op-Ed. Op-Eds that hard-code calculated values will be penalized.
For purposes of this exercise, you can measure “diversity” as the fraction of non-White/non-Asian-American students in your undergraduate class. If you want to use a more sophisticated metric, see the first extra credit opportunity below. It is up to you whether you want to consider gender diversity or not in your analysis: if you don’t want to consider gender diversity, simply sum corresponding m and f columns within each racial group.
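As a starting point only (not the required approach), here is a minimal sketch of this diversity measure; it assumes the `enrollment_*` column naming and `level` codes produced by the data-acquisition code at the top of this project, and you should adapt it as needed:

```r
library(tidyverse)

diversity_by_year <- IPEDS |>
  filter(level == "all undergrad") |>
  mutate(
    # Total enrollment across all race/sex columns
    total_enrollment = rowSums(across(starts_with("enrollment_")),
                               na.rm = TRUE),
    # White and Asian-American enrollment (both m and f columns)
    white_asian = rowSums(across(matches("enrollment_._(whit|asia)")),
                          na.rm = TRUE),
    diversity_frac = 1 - white_asian / total_enrollment) |>
  select(institution_id, year, diversity_frac)
```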
AI Usage Statement
At the end of your report, you must include a description of the extent to which you used Generative AI tools to complete the mini-project. This should be a one-paragraph section clearly delineated using a collapsible Quarto "Callout Note". Failure to include an AI disclosure will result in an automatic 25% penalty.
E.g.,
No Generative AI tools were used to complete this mini-project.
or
GitHub Co-Pilot Pro was used via RStudio integration while completing this project. No other generative AI tools were used.
or
ChatGPT was used to help write the code in this project, but all non-code text was generated without the use of any Generative AI tools. Additionally, ChatGPT was used to provide additional background information on the topic and to brainstorm ideas for the final open-ended prompt.
Recall that Generative AI may not be used to write or edit any non-code text in this course.
These blocks should be created using the following syntax:
::: {.callout-note title="AI Usage Statement" collapse="true"}
Your text goes here.
:::
Make sure to use this specific type of callout (.callout-note), title, and collapse="true" setting.
Please contact the instructor if you have any questions about appropriate AI usage in this course.
Extra Credit Opportunities
There are optional Extra Credit Opportunities where extra points can be awarded for specific additional tasks in this mini-project. The amount of the extra credit is typically not proportional to the work required to complete these tasks, but I provide these for students who want to dive deeper into this project and develop additional data analysis skills not covered in the main part of this mini-project.
For this mini-project, no more than 6 total points of extra credit may be awarded. Even with extra credit, your grade on this mini-project cannot exceed 80 points total.
Entropy Analysis (Up to 2 Points)
Diversity of a population is a difficult quantity to measure. While simple statistics (“percent female” or “percent underrepresented minority”) are often used, they suffer from various challenges in an increasingly diverse world. (For example, NYC does not have a racial majority, so what does it mean to be an underrepresented minority in the context of city politics?) History and social context can guide the choice of diversity metric, but for this extra credit opportunity, you can use a statistical measure of diversity known as entropy.
Op-Eds that use the concept of entropy may be awarded up to two points of extra credit. Specifically, entropy must be used to answer all of the questions specified in Task 7 above.
Entropy of a distribution measures how hard it is to predict. Consider two scenarios:
- An urn filled with 99% red balls and 1% green balls;
- An urn filled with an equal mix of red and green balls.
If you are asked repeatedly to guess the color of the next ball drawn from an urn, knowing only the baseline mixture, your predictions for the first urn will be correct 99% of the time (assuming you use the obvious “always guess red” strategy), while no strategy can be correct more than 50% of the time for the second urn.
Entropy formalizes this concept as follows: given a (discrete) random variable \(X\) taking values in a set \(\mathcal{X}\), with each value \(x \in \mathcal{X}\) occurring with probability \(p(x)\), the entropy of \(X\) is given by:
\[H(X) = -\sum_{x \in \mathcal{X}: p(x) > 0} p(x) \log p(x)\]
The sum is taken over all outcomes \(x\) with non-zero probability. For our two urns above:
\[\begin{align*} H(\text{Urn 1}) &= -\left(0.99 \cdot \log(0.99) + 0.01 \cdot \log(0.01)\right) \approx 0.056 \\ H(\text{Urn 2}) &= -\left(0.50 \cdot \log(0.50) + 0.50 \cdot \log(0.50)\right) \approx 0.693 \end{align*}\]
This tells us that Urn 2 is quite a bit more random than Urn 1.6 Entropy is particularly helpful in our context as it applies naturally to categorical quantities like race, where numerical measures of randomness like variance don't naturally fit.
This definition can be extended straightforwardly to multi-category quantities. Consider these demographics of Bronx and Queens counties in the 2020 census:
**Demographics of Two NYC Boroughs**
*Data from 2020 Census*

| County / Borough | Asian | Black | Hispanic | White | All Other |
|---|---|---|---|---|---|
| Bronx | 4.60% | 28.48% | 54.76% | 8.88% | 3.36% |
| Queens | 27.30% | 15.85% | 22.76% | 22.84% | 6.25% |

Asian, Black, and White percentages correspond to census estimates of Asian (only, non-Hispanic), etc. The Hispanic percentage corresponds to the census estimate of Hispanic (any race), and All Other was calculated so that the values sum to 100%. Percentages from the relevant Queens and Bronx Wikipedia articles.
If you repeat the entropy calculation here (with five terms in the sum), you find that the Bronx has an entropy of approximately 1.16 while Queens has an entropy of approximately 1.49, indicating that Queens has the more diverse population, even though the Bronx has a higher proportion of Black and Hispanic residents.
Hint: When computing entropy, it is useful to take \(0 * \log(0) = 0\) so that impossible outcomes are automatically discarded. As such, code like this is useful to avoid NA issues:
```r
bronx_demos <- c(0.046, 0.2848, 0.5476, 0.0888, 0.0336)
-sum(bronx_demos * log(bronx_demos + 1e-10))
```

```
[1] 1.158139
```
The small ‘nugget’ term (1e-10) has no real impact on most probabilities, but prevents issues arising from \(\log(0)\) being undefined.
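If you want to reuse this calculation across many groups, it may help to wrap it in a small function. The sketch below is one way to do so (the `entropy` helper name is my own choice, not from any package); the usage example reuses the `penguins` data from earlier:

```r
library(tidyverse)

# Entropy of a vector of counts or probabilities; the nugget term
# avoids log(0), as discussed above.
entropy <- function(p){
  p <- p / sum(p)  # normalize so raw counts can be passed directly
  -sum(p * log(p + 1e-10))
}

# Illustrative usage: entropy of the species mix on each island
penguins |>
  count(island, species) |>
  group_by(island) |>
  summarize(species_entropy = entropy(n))
```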
Advanced dplyr Programming (Up to 3 Points)
For up to three extra credit points, add the following question to your Task 6 EDA: In 2024, which 3 institutions of at least 1000 undergraduates had student populations that were ‘most representative’ of the US undergraduate population as a whole?
Extra credit will be determined on the basis of how accurately and how efficiently the KL divergence is calculated, in addition to how well it is explained and presented in table format.
The Kullback–Leibler divergence or KL divergence can be used to measure the difference between two different probability distributions. For this problem, you can use the enrollment counts at each institution to define a probability distribution (e.g., \(\mathbb{P}(\text{White Male}) = 20\%\), \(\mathbb{P}(\text{White Female}) = 22\%\), etc.) for that campus and compare it to the national undergraduate population.
The following example may help: suppose I have categorized my personal library into three categories of books: i) fiction; ii) non-fiction; and iii) textbooks. Furthermore, my books are unorganized and spread randomly across 5 shelves as follows:
| Shelf | Fiction | Nonfiction | Textbook |
|---|---|---|---|
| 1 | 15 | 20 | 5 |
| 2 | 20 | 10 | 10 |
| 3 | 40 | 20 | 20 |
| 4 | 10 | 10 | 30 |
| 5 | 15 | 20 | 20 |
Comparing these shelves is a bit tricky since there aren't the same number of books on each shelf, and a purely 'numbers-based' comparison might treat Shelf 2 as more representative than Shelf 3 simply because it has fewer books. The KL divergence instead thinks of these as probabilities (e.g., Shelf 2 is 50% fiction) and can be used to compare them to the overall proportions of the whole collection.
Adding up the columns, I have 100 fiction books, 80 non-fiction books, and 85 textbooks, for 265 books in total, so my collection is about 38% fiction, 30% non-fiction, and 32% textbooks.
We use these probabilities to compare each row against the overall population using the KL divergence formula:
\[\mathcal{D}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\]
where the sum is taken over all categories \(x\) and \(P, Q\) are two probability distributions. Here \(P\) is the "baseline" (correct) distribution and \(Q\) is the approximation; in our example, the collection as a whole plays the role of \(P\) and each shelf plays the role of \(Q\). So the KL divergence between the collection and Shelf 1, which is 37.5% fiction, 50% non-fiction, and 12.5% textbooks, is given by
\[ 0.3773585 \log \frac{0.3773585}{0.375} + 0.3018868 \log \frac{0.3018868}{0.5} + 0.3207547 \log \frac{0.3207547}{0.125} \approx 0.1523145 \]
We can repeat this analysis across all 5 shelves to get the following KL table
| Shelf | KL Divergence from Entire Collection |
|---|---|
| 1 | 0.1523 |
| 2 | 0.0307 |
| 3 | 0.0307 |
| 4 | 0.1630 |
| 5 | 0.0261 |
From this table, we see that Shelf 5 is "most representative" of my whole collection. Note also that Shelves 2 and 3 have the same KL divergence because their compositions are (proportionally) the same.
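To make the computation concrete, here is a minimal sketch that reproduces the shelf table above with `tidyverse` tools (this is my own layout for the toy example, not the required approach for the IPEDS analysis):

```r
library(tidyverse)

shelves <- tribble(
  ~shelf, ~fiction, ~nonfiction, ~textbook,
       1,       15,          20,         5,
       2,       20,          10,        10,
       3,       40,          20,        20,
       4,       10,          10,        30,
       5,       15,          20,        20
)

long <- shelves |>
  pivot_longer(-shelf, names_to = "category", values_to = "n")

# Baseline distribution P: the whole collection
baseline <- long |>
  group_by(category) |>
  summarize(p = sum(n)) |>
  mutate(p = p / sum(p))

# KL divergence of each shelf (Q) from the collection (P)
long |>
  group_by(shelf) |>
  mutate(q = n / sum(n)) |>
  inner_join(baseline, by = "category") |>
  summarize(kl = sum(p * log(p / q)))
```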
Hint: Note that this is a significantly more complex analysis than we have performed so far and requires some ideas that won’t be covered until Week 6 and some column selection functionality that will not be covered in lecture. Make sure to access the source code for this assignment and use it as a guide for completing this extra credit task.
Data Visualization (1 point)
Inclusion of a well-formatted visual element to accompany your op-ed will get one extra credit point.
This work ©2026 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license. 
Footnotes
Following the decision in June 2023, universities were first applying post-SFFA admissions policies for the undergraduate class entering in Fall 2024. Due to reporting lags, these data were released in late 2025 and are currently the latest available. As data from more admissions cycles are released in IPEDS, the impact (or not) of SFFA will become clearer.↩︎
This is "ChatGPT-level" prose: without obvious flaws, but lacking the style and elegance associated with truly high-quality writing.↩︎
Throughout this section, replace `YOUR_GITHUB_ID` with your GitHub ID from Mini-Project #00. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

`str_detect` can actually be used to perform significantly more complex string analysis than simple "does it contain this subset of letters" but we won't cover that sort of string processing for a few more weeks.↩︎

In one famous (and slightly tragic) example, Microsoft Excel silently misinterpreted the names of various genes as numeric values and changed them from (what it thought was) scientific notation to (what it thought was) standard numeric formatting. This wound up ruining several important scientific studies. Always check your data types! The original study identifying this problem can be found here and a popular news summary is here.↩︎
The units of entropy are somewhat tricky to understand. For us, it suffices to know that a larger entropy means more randomness.↩︎