#' Acquire IPEDS Data for MP#01
#'
#' This function will acquire and standardize all data for MP#01
#' from IPEDS (https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx)
#'
#' We're starting in 2010 as the data seems to be reasonably complete
#' after that point.
acquire_ipeds_data <- function(start_year=2010, end_year=2024){
library(tidyverse)
library(glue)
data_dir <- file.path("data", "mp01")
if(!dir.exists(data_dir)){
dir.create(data_dir, showWarnings=FALSE, recursive=TRUE)
}
YEARS <- seq(start_year, end_year)
EFA_ALL <- map(YEARS, function(yy){
if(yy <= 2022){
ef_url <- glue("https://nces.ed.gov/ipeds/datacenter/data/EF{yy}A.zip")
} else {
ef_url <- glue("https://nces.ed.gov/ipeds/data-generator?year={yy}&tableName=EF{yy}A&HasRV=0&type=csv")
}
ef_file <- file.path(data_dir, glue("ef{yy}a.csv.zip"))
if(!file.exists(ef_file)){
message(glue("Downloading Enrollment Data for {yy} from {ef_url}"))
download.file(ef_url, destfile = ef_file, quiet=TRUE)
}
read_csv(ef_file,
show_col_types=FALSE) |>
mutate(year = yy,
# American Indian or Alaskan Native
enrollment_m_aian = EFAIANM,
enrollment_f_aian = EFAIANW,
# Asian
enrollment_m_asia = EFASIAM,
enrollment_f_asia = EFASIAW,
# Black or African-American,
enrollment_m_bkaa = EFBKAAM,
enrollment_f_bkaa = EFBKAAW,
# Hispanic
enrollment_m_hisp = EFHISPM,
enrollment_f_hisp = EFHISPW,
# Native Hawaiian or Other Pacific Islander
enrollment_m_nhpi = EFNHPIM,
enrollment_f_nhpi = EFNHPIW,
# White
enrollment_m_whit = EFWHITM,
enrollment_f_whit = EFWHITW,
# Two or More Races
enrollment_m_2mor = EF2MORM,
enrollment_f_2mor = EF2MORW,
# Unknown / Undisclosed Race
enrollment_m_unkn = EFUNKNM,
enrollment_f_unkn = EFUNKNW,
# US Non-Resident
enrollment_m_nral = EFNRALM,
enrollment_f_nral = EFNRALW,
) |> filter(
(EFALEVEL %in% c(2, 12)) | (LINE %in% c(1, 15))
# Per 2024 Data Dictionary,
# - EFALEVEL 2 = undergrad
# - EFALEVEL 12 = grad
# - Line 1 = first year first time full-time undergrad
# - Line 15 = first year first time part-time undergrad
) |> mutate(level = case_when(
EFALEVEL == 2 ~ "all undergrad",
EFALEVEL == 12 ~ "all graduate",
LINE %in% c(1, 15) ~ "first year undergrad"
)
) |>
select(institution_id = UNITID,
year,
level,
starts_with("enrollment_")) |>
group_by(institution_id,
year,
level) |>
summarize(across(starts_with("enrollment_"), sum),
.groups = "drop")
}) |> bind_rows()
DESC_ALL <- map(YEARS, function(yy){
if(yy <= 2022){
hd_url <- glue("https://nces.ed.gov/ipeds/datacenter/data/HD{yy}.zip")
} else {
hd_url <- glue("https://nces.ed.gov/ipeds/data-generator?year={yy}&tableName=HD{yy}&HasRV=0&type=csv")
}
hd_file <- file.path(data_dir, glue("hd{yy}.csv.zip"))
if(!file.exists(hd_file)){
message(glue("Downloading Institutional Descriptions for {yy} from {hd_url}"))
download.file(hd_url, destfile = hd_file, quiet=TRUE)
}
suppressWarnings(
read_csv(hd_file,
show_col_types=FALSE,
locale=locale(encoding=if_else(yy==2024, "utf-8", "windows-1252"))) |>
mutate(year = yy,
INSTNM) |>
select(institution_id = UNITID,
institution_name = INSTNM,
state = STABBR,
year)
)
}) |> bind_rows()
inner_join(EFA_ALL,
DESC_ALL,
join_by(institution_id == institution_id,
year == year))
}
IPEDS <- acquire_ipeds_data()

STA 9750 Mini-Project #01: Assessing the Impact of SFFA on Campus Diversity One-Year Later
Due Dates
- Released to Students: 2026-02-20
- Initial Submission: 2026-03-13 11:59pm ET on GitHub and Brightspace
- Peer Feedback:
  - Peer Feedback Assigned: 2026-03-16 on GitHub
  - Peer Feedback Due: 2026-03-22 11:59pm ET on GitHub
Estimated Time to Complete: 13-15 Hours
Estimated Time for Peer Feedback: 1 Hour
Welcome to STA 9750 Mini Projects!
In the STA 9750 Mini-Projects, you will perform basic data analyses intended to model best practices for your course final project. (Note, however, that these are mini-projects; your final course project is expected to be far more extensive than any single MP.)
Introduction
On June 29, 2023, The US Supreme Court handed down its decision in the closely watched case of Students for Fair Admissions v. Harvard (“SFFA”). In SFFA, the court found that the admissions programs at Harvard and at the University of North Carolina violated the Equal Protection Clause of the Fourteenth Amendment to the US Constitution and engaged in impermissible race-conscious admissions practices. While the court had explicitly disallowed race-conscious practices in other contexts, Harvard argued that the admissions process served a compelling governmental interest - educational benefits from a diverse student body - that had been recognized by the court as recently as 2016.
While the SFFA case touched on many aspects of law, one important factual question was the practical impact of the programs in question: were they essential to maintaining a diverse student body or were they merely a slight “thumb on the scale” that had only a minor impact?
Now that the first set of post-SFFA admissions data has been released,1 you are going to attempt to analyze the extent to which post-SFFA admissions differ from pre-SFFA practice. You will use admissions data distributed via IPEDS, the Integrated Postsecondary Education Data System, managed by the National Center for Education Statistics within the Federal Department of Education.
After some initial analysis, you will write a brief (750 word) Op-Ed from the perspective of a college president describing the impact of SFFA on the demographics of admitted students at your college and at colleges across the country.
In this mini-project, you will:
- Practice use of dplyr for analysis of tabular data
- Practice use of quarto and Reproducible Research Tools for Effective Communication of Data Analysis Results
- Begin your professional data science portfolio.
Student Responsibilities
For purposes of MPs, we are dividing the basic data analytic workflow into several major stages:
- Data Ingest and Cleaning: Given a data source, read it into R and transform it to a reasonably useful and standardized ('tidy') format.
- Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
- Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
- Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics
- Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:
| | Ingest and Cleaning | Combination and Alignment | Descriptive Statistical Analysis | Visualization |
|---|---|---|---|---|
| Mini-Project #01 | ✓ | | | |
| Mini-Project #02 | ✓ | ✓ | ½ | |
| Mini-Project #03 | ½ | ✓ | ✓ | ✓ |
| Mini-Project #04 | ✓ | ✓ | ✓ | ✓ |
In early stages of the course, such as this MP, I will ‘scaffold’ much of the analysis for you, leaving only those stages we have discussed in class for you to fill in. As the course progresses, the mini-projects will be more self-directed and results less standardized.
Rubric
STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course staff. The following basic rubric will be used for all mini-projects:
| Course Element | Excellent (9-10) | Great (7-8) | Good (5-6) | Adequate (3-4) | Needs Improvement (1-2) |
|---|---|---|---|---|---|
| Written Communication | Report is very well-written and flows naturally. Motivation for key steps is clearly explained to reader without excessive detail. Key findings are highlighted and appropriately given sufficient context, including reference to related work where appropriate. | Report has no grammatical or writing issues.2 Writing is accessible and flows naturally. Key findings are highlighted and clearly explained, but lack suitable motivation and context. | Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted or unclearly explained. | Writing is intelligible, but has some grammatical errors. Key findings are difficult to discern. | Report exhibits significant weakness in written communication. Key points are nearly impossible to identify. |
| Project Skeleton | Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are especially insightful and creative. | Code completes all instructor-provided tasks satisfactorily. Responses to open-ended tasks are insightful, creative, and do not have any minor flaws. | Response to one instructor provided task is skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are solid and without serious flaws. | Responses to two instructor provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are acceptable, but have at least one serious flaw. | Response to three or more instructor provided tasks are skipped, incorrect, or otherwise incomplete. Responses to open-ended tasks are seriously lacking. |
| Tables & Document Presentation | Tables go beyond standard publication-quality formatting, using advanced features like color formatting, interactivity, or embedded visualization. | Tables are well-formatted, with publication-quality selection of data to present, formatting of table contents (e.g., significant figures) and column names. | Tables are well-formatted, but still have room for improvement in one of these categories: subsetting and selection of data to present, formatting of table contents (e.g., significant figures), column names. | Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style. Document is difficult to read due to distracting formatting choices. | Unfiltered ‘data dump’ instead of curated table. Document is illegible at points. |
| Data Visualization | Figures go beyond standard publication-quality formatting, using advanced features like animation, interactivity, or advanced plot types implemented in ggplot2 extension packages. | Figures are ‘publication-quality,’ with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, and captions, etc. | Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in one-to-two ways. | Figures are above ‘exploratory-quality’ and reflect a moderate degree of polish, but do not reach full ‘publication-quality’ in three or more distinct ways. | Figures are suitable to support claims made, but are ‘exploratory-quality,’ reflecting zero-to-minimal effort to customize and ‘polish’ beyond ggplot2 defaults. |
| Exploratory Data Analysis | Deep and ‘story-telling’ EDA identifying non-obvious patterns that are then used to drive further analysis in support of the project. All patterns and irregularities are noted and well characterized, demonstrating mastery and deep understanding of all data sets used. | Meaningful ‘story-telling’ EDA identifying non-obvious patterns in the data. Major and minor patterns and irregularities are noted and well characterized at a level sufficient to achieve the goals of the analysis. EDA demonstrates clear understanding of all data sets used. | Extensive EDA that thoroughly explores the data, but lacks narrative and does not deliver a meaningful ‘story’ to the reader. Obvious patterns or irregularities noted and well characterized, but more subtle structure may be overlooked or not fully discussed. EDA demonstrates competence and basic understanding of the data sets used. | Solid EDA that identifies major structure to the data, but does not fully explore all relevant structure. Obvious patterns or irregularities ignored or missed. EDA demonstrates familiarity with high-level structure of the data sets used. | Minimal EDA, covering only standard summary statistics, and providing limited insight into data patterns or irregularities. EDA fails to demonstrate familiarity with even the most basic properties of the data sets being analyzed. |
| Code Quality | Code is (near) flawless. Intent is clear throughout and all code is efficient, clear, and fully idiomatic. Code passes all styler and lintr type analyses without issue. | Comments give context and structure of the analysis, not simply defining functions used in a particular line. Intent is clear throughout, but code can be minorly improved in certain sections. | Code has well-chosen variable names and basic comments. Intent is generally clear, though some sections may be messy and code may have serious clarity or efficiency issues. | Code executes properly, but is difficult to read. Intent is generally clear and code is messy or inefficient. | Code fails to execute properly. |
| Data Preparation | Data import is fully-automated and efficient, taking care to only download from web-sources if not available locally. All data cleaning steps are fully-automated and robustly implemented, yielding a clean data set that can be widely used. | Data is imported and prepared effectively, in an automated fashion with minimal hard-coding of URLs and file paths. Data cleaning is fully-automated and sufficient to address all issues relevant to the analysis at hand. | Data is imported and prepared effectively, though source and destination file names are hard-coded. Data cleaning is rather manual and hard-codes most transformations. | Data is imported in a manner likely to have errors. Data cleaning is insufficient and fails to address clear problems. | Data is hard-coded and not imported from an external source. |
| Analysis and Findings | Analysis demonstrates uncommon insight and quality, providing unexpected and subtle insights. | Analysis is clear and convincing, leaving essentially no doubts about correctness. | Analysis clearly appears to be correct and passes the “sniff test” for all findings, but a detailed review notes some questions remain unanswered. | Analysis is not clearly flawed at any point and is likely to be within the right order of magnitude for all findings. | Analysis is clearly incorrect in at least one major finding, reporting clearly implausible results that are likely off by an order of magnitude or more. |
Note that the “Excellent” category for most elements applies only to truly exceptional “above-and-beyond” work. Most student submissions will likely fall in the “Good” to “Great” range.
At this early point, you are not responsible for all elements of this rubric. In particular, all submissions will receive an automatic 10/10 for Data Visualization as this is outside the scope of this mini-project. Furthermore, because I am providing code to download the data, load it into R, and prepare it for analysis, all reports submitted using my code will receive an automatic 10/10 for the ‘Data Preparation’ element of the rubric. Finally, reports completing all tasks described under Data Integration and Exploration below should receive a 10/10 for the ‘Exploratory Data Analysis’ rubric element.
Taken together, you are only really responsible for these portions of the rubric:
- Written Communication
- Project Skeleton
- Tables & Document Presentation
- Code Quality
- Analysis and Findings
Reports completing all key steps outlined below essentially start with 30 free points.
Note that you are evaluated on writing and communication in these Mini-Projects. You are required to write a report in the prescribed style, culminating in an Op-Ed. A submission that performs the instructor-specified tasks, but does not provide appropriate context and commentary, will score very poorly on the relevant rubric elements.
In particular, if a submission does not include a clearly delineated Op-Ed and only answers the instructor prompts in narrative text, peer evaluators should judge it to have “Good” quality Written Communication (at best) as key findings are not conveyed appropriately.
Quarto’s code folding functionality is useful for “hiding” code so that it doesn’t break the flow of your writing.
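For example, the following chunk options (standard Quarto options; code-fold can also be enabled document-wide in your YAML header) collapse a chunk's code behind an expandable "Show the code" toggle while still displaying its output. This is a minimal sketch; any chunk contents work:

#| code-fold: true
#| code-summary: "Show the code"

# The long data acquisition function is a natural candidate for folding
IPEDS <- acquire_ipeds_data()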
You can also make use of Quarto’s contents shortcode to present code and findings in an order other than how the code should be executed. This is particularly useful if you want to include a figure or table in an “Executive Summary” at the top of your submission.
For this mini-project, no more than 6 total points of extra credit can be awarded. Opportunities for extra credit exist for students who go above and beyond the instructor-provided scaffolding. Specific opportunities for extra credit can be found below.
Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to
- further refine their skills;
- learn additional techniques that can be used in the final course project; and
- develop a more impressive professional portfolio.
Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!
Submission Instructions
After completing the analysis, write up your findings, showing all of your code, using a dynamic quarto document and post it to your course repository. The qmd file should be named mp01.qmd (lower case!) so the rendered document can be found at docs/mp01.html in your repository and will be served at the URL:3
https://YOUR_GITHUB_ID.github.io/STA9750-2026-SPRING/mp01.html
You can use the helper function mp_start available in the Course Helper Functions to create a file with the appropriate name and some meta-data already included. Do so by running the following command at the R Console:
source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_start(N=01)
After completing this mini-project, upload your rendered output and necessary ancillary files to GitHub to make sure your site works. The mp_submission_ready function in the Course Helper Functions can perform some of these checks automatically. You can run this function by running the following commands at the R Console:
source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_ready(N=01)
Once you confirm this website works (substituting YOUR_GITHUB_ID for the actual GitHub username provided to the professor in MP#00 of course), open a GitHub issue on the instructor’s repository to submit your completed work.
The easiest way to do so is by use of the mp_submission_create function in the Course Helper Functions, which can be used by running the following command at the R Console:
source("https://michael-weylandt.com/STA9750/load_helpers.R"); mp_submission_create(N=01)
Alternatively, if you wish to submit manually, open a new issue at
https://github.com/michaelweylandt/STA9750-2026-SPRING/issues/new.
Title the issue STA 9750 YOUR_GITHUB_ID MiniProject #01 and fill in the following text for the issue:
Hi @michaelweylandt!
I've uploaded my work for MiniProject #**01** - check it out!
<https://<GITHUB_ID>.github.io/STA9750-2026-SPRING/mp01.html>
At various points before and after the submission deadline, the instructor will run some automated checks to ensure your submission has all necessary components. Please respond to any issues raised in a timely fashion as failing to address them may lead to a lower set of scores when graded.
Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.
NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.
NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.
Mini-Project #01: Assessing the Impact of SFFA on Campus Diversity One-Year Later
Data Acquisition
The following code can be used to acquire data from IPEDS. Specifically, once run, this code will download the fall enrollment (“EF”) and institution description (“HD”) files for the previous 15 years. To be efficient, this function will save a copy of the downloaded data in a folder called data/mp01 and use that copy to avoid re-downloading a file if it is already present on your computer. This will make your code faster to run and will avoid putting unnecessary stress on the IPEDS servers.
This creates a data frame called IPEDS that has 219316 rows and 23 columns in your local environment. This data will be used for the remainder of this mini-project.
Using the code above, acquire the IPEDS data. Copy the code into your Quarto document and make sure it runs successfully.
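As a quick, optional sanity check (a sketch; the expected dimensions are exactly those stated above), confirm the result after the code runs:

# Should report 219316 rows and 23 columns, matching the description above
dim(IPEDS)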
Don't git add Data Files
Make sure that git is set to ignore data files, such as the one created above. Check the git pane in RStudio and make sure that the files in data/mp01 do not appear. (If you set up your .gitignore file correctly in MP#00, it should already be ignored.) If it is appearing, you may need to edit your .gitignore file.
Removing a large data file from git is possible, but difficult. Don’t get into a bad state!
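If the files do appear in the git pane, one way to update your .gitignore without editing it by hand is sketched below; it assumes you have the usethis package installed:

# Append the data directory to .gitignore in the project root
usethis::use_git_ignore("data/")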
Data Cleaning and Preparation
IPEDS provides many different pieces of information, so I have provided you a subset of interesting variables. These include:
- A unique institutional ID code (institution_id)
- The name of the institution (institution_name)
- The state in which the institution's principal campus is located
- The year of reporting. Note that, for enrollment data, these are the enrollment as of the Fall semester, so, e.g., Year 2010 is for the first semester of the 2010-2011 academic year.
- The "level" of students being considered:
  - "all graduate" refers to all graduate students on campus (both full-time and part-time)
  - "all undergraduate" refers to all undergraduate students on campus (both full-time and part-time)
  - "first year undergrad" refers to "stereotypical" first year college students. These are students (either full-time or part-time) who are in their first year of study during their first undergraduate enrollment (i.e., excluding students who took time off and re-enrolled, students who transferred from another institution, and students who previously studied at a different institution earning an associate's degree or similar credential)
- Number of enrolled students in various demographic groups: these 18 variables are formatted as enrollment_X_YYYY, where
  - X is m or f for male or female students
  - YYYY is a racial/ethnic group identifier:
    - aian: American Indian or Alaskan Native
    - asia: Asian-American
    - bkaa: Black or African-American
    - hisp: Hispanic
    - nhpi: Native Hawaiian or Pacific Islander
    - whit: White
    - 2mor: Two or More Races
    - unkn: Unknown / Unreported
    - nral: Not a US Resident before enrollment ("Non-Resident Alien")

  Note that these are somewhat non-standard identifiers and do not match other official sources. (E.g., the US Census considers Hispanic ancestry a separate axis so it is possible to have Non-Hispanic White, Hispanic Black, etc.) We use these as given since they are what IPEDS provides.

  Together these give 18 enrollment categories that, when summed, give the total enrollment at an institution; a short sketch of this computation follows this list.
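Here is that computation as a minimal sketch; total_enrollment is a name introduced here for illustration, not a column in the provided data:

# Sum the 18 demographic enrollment columns row-by-row
IPEDS <- IPEDS |>
    mutate(total_enrollment = rowSums(across(starts_with("enrollment_")),
                                      na.rm = TRUE))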
Before proceeding, we will create two additional variables for our analyses:
- is_cuny: A Boolean (True/False) variable indicating whether an institution is part of the CUNY system.
- is_calpublic: A variable indicating whether an institution is a public college or university of the state of California
Create a new column is_cuny in your IPEDS data. For our purposes, we will assume that CUNY schools all have “CUNY” in their institution name.
The str_detect function can be used to test whether a string (like a name) contains a substring:4 e.g.,

# True because "CUNY" (2nd arg) is in the longer string (1st arg)
str_detect("CUNY Bernard M. Baruch College", "CUNY")
[1] TRUE

# False because "Hunter" is nowhere in the first argument
str_detect("CUNY Bernard M. Baruch College", "Hunter")
[1] FALSE

As with other functions in R, str_detect is vectorized, making it easy to use inside of other functions.

names <- c("City College-Miami", "CUNY City College", "CUNY Hunter College")
str_detect(names, "CUNY")
[1] FALSE  TRUE  TRUE
Use this function, in conjunction with a mutate function, to create a new column called is_cuny inside the IPEDS data. Make sure to assign your mutated data frame so that you can use it later.
Your code should look something like:

IPEDS <- IPEDS |>
    mutate(is_cuny = ...)

where the ... is some code using the string "CUNY", the existing column names in IPEDS and the str_detect function.
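To see the idiom without giving the answer away, here is the same pattern applied to a small hypothetical data frame (the school names below are invented for illustration):

toy <- tibble(name = c("CUNY Hunter College",
                       "Columbia University",
                       "CUNY Queens College"))
toy |> mutate(has_cuny = str_detect(name, "CUNY"))
# has_cuny comes back TRUE, FALSE, TRUE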
The public colleges and universities of the state of California are an interesting test case. Pursuant to California Proposition 209 (1996), public institutions of higher education in the state were specifically not allowed to implement any sort of Affirmative Action program, so the ending of Affirmative Action would theoretically have no impact on their admissions practices. We will investigate this possibility below. For now, we need to create an additional variable which identifies the California public institutions in our data set.
Unlike the CUNY system, which combines community colleges, senior (four-year) colleges, and specialized graduate institutions into a single system, California has three separate public systems:
- the University of California System
- the California State System and
- the California Community Colleges system.
Because of this, we will need to do a bit more work to create our is_calpublic variable. Because the Community Colleges are open-enrollment, we can exclude them from our analysis and focus exclusively on the UC system and the Cal State system.
Add a new variable is_calpublic to the IPEDS data that is TRUE for institutions that are part of the University of California System or the California State system.
To do this, you will need to use str_detect (see the previous task) and one of the Boolean logical operators (!, &, |).
Once you have created your variable, use a combination of dplyr functions to get a list of the institutions identified by your new variable. You may wish to compare this against the Wikipedia articles linked above to ensure your results are accurate.
Note: If you know what a regular expression is, you know that you can use only a single str_detect call here and avoid the use of a Boolean operator. Do not do this - you must use some Boolean logic to get full credit for this task. If you do not know what a regular expression is, you can disregard this note.
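As a toy illustration of the Boolean combination (made-up strings; adapt the idea, not these patterns):

items <- c("red apple", "blueberry jam", "carrot cake")
str_detect(items, "apple") | str_detect(items, "berry")
# [1]  TRUE  TRUE FALSE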
Initial Data Exploration
Before moving to our final analysis, we will do a bit of Exploratory Data Analysis (EDA). EDA serves many purposes in data science (quality control, hypothesis generation, outlier identification, etc.), but perhaps the most important is simply knowing what information can be found in a novel data set. Now that our data is imported and cleaned, it's almost time to start our EDA. Before we do, however, we should pause briefly to consider how we want to display data.
Displaying Data in Tables
While we could continue investigating our data using R's basic print-outs, this is a good time to introduce the gt package, which can be used to create complex tables natively in R.
What follows here is a very brief introduction to the gt package. You do not need to copy this into your submission - it is provided only as background. You will use the gt package to format your answers to the next few tasks.
For this introduction, I am going to use the penguins data, but the same functions can be applied to the IPEDS data that you are analyzing.
Let’s first look at what a “basic” or “raw” display of a data frame gives us:
penguins

    species    island    bill_len bill_dep flipper_len body_mass    sex year
1   Adelie     Torgersen     39.1     18.7         181      3750   male 2007
2   Adelie     Torgersen     39.5     17.4         186      3800 female 2007
3   Adelie     Torgersen     40.3     18.0         195      3250 female 2007
4   Adelie     Torgersen       NA       NA          NA        NA   <NA> 2007
5   Adelie     Torgersen     36.7     19.3         193      3450 female 2007
6   Adelie     Torgersen     39.3     20.6         190      3650   male 2007
7   Adelie     Torgersen     38.9     17.8         181      3625 female 2007
8   Adelie     Torgersen     39.2     19.6         195      4675   male 2007
9   Adelie     Torgersen     34.1     18.1         193      3475   <NA> 2007
10  Adelie     Torgersen     42.0     20.2         190      4250   <NA> 2007

[print truncated here for readability: the raw console output continues, row by row, through all 344 penguins]
This has several problems:
- We are showing way too much data. A reader will not be able to easily find meaningful trends or patterns in a big data set like this.

  As a general rule, you should rarely have more than 10-15 rows in a table; even then, you will still want to guide your reader to the point of the table.

- The column names are rather ugly. Some, like species, are not too bad, though it would still be better if they were capitalized. Others, like bill_len, are pretty terrible: bill_len is not an English word, the underscore exists only to separate two words "in code" (recall R's restrictions on variable names), and the unit isn't clear. In this case, a column name like Bill Length (mm) would be far preferable.

- The row numbers are essentially pointless and just take up space, adding no value. Any content that is not adding value is simply distracting the reader from the content that has value.

- The "point" of the table is unclear. What is a reader supposed to get from this? As a data analyst - doing work on behalf of a reader who may not be a data analyst - you have a responsibility to clearly convey the "story" of your findings and this does not do so.

  I may want to use this data to show that Gentoo penguins are, on average, heavier than the other two species in this data set, but this is far from clear.

- It's just a bit ugly.
Good table design requires us to take on the mindset of the reader. Tools like gt can help pretty things up, but you still have to think about what you want to display. Well-formatted garbage is still garbage.
To start improving this table, let’s do the calculations for our reader instead of expecting them to do it all manually:
library(tidyverse)
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass))

# A tibble: 3 × 3
  species   n_species avg_body_mass
  <fct>         <int>         <dbl>
1 Gentoo          124         5076.
2 Chinstrap        68         3733.
3 Adelie          152         3701.
We’re definitely not done - but here the “point” of the table is clear, at least if we also put some text surrounding it.
To improve this further, we can also pass this smaller summary data frame to the gt function from the package of the same name:
library(gt)
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt()

| species | n_species | avg_body_mass |
|---|---|---|
| Gentoo | 124 | 5076.016 |
| Chinstrap | 68 | 3733.088 |
| Adelie | 152 | 3700.662 |
Note here that gt recognizes we are rendering an HTML page and produces a “real” HTML table here. If you were to copy and paste the table above into “table” software, e.g. Google Sheets or Microsoft Excel, it would be properly and automatically handled. For us, the table is the end-point, but it’s a nice courtesy to your reader who may want to use your results in their own presentations.
The gt package provides many functions for tweaking and improving the appearance of a table. You will almost always want to, at a minimum, use these for:
- Ordering and (re-)naming columns
- Adding titles and footers
- Formatting values
Let’s go through these one at a time. First, we want to rename and reorder the columns. This can be done in pure dplyr with the select and rename functions, but we’ll show the gt way here:
library(gt)
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species="Species",
               avg_body_mass = "Avg. Body Mass (g)",
               n_species = "Number of Penguins in Sample")

| Species | Avg. Body Mass (g) | Number of Penguins in Sample |
|---|---|---|
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
Here, we used the cols_move_to_end function to move the n_species column to the end (no surprise!). In other contexts, we might want to use the cols_move_to_start function to move a column to the leftmost side of a table or cols_move to put a column in the middle of the table.
The cols_label function essentially serves as a renaming operation: the left side of each equals sign is the old column name in the table and the right side gives the new name. (Note, a bit confusingly, that this is the reverse of dplyr::rename.) While we can just pass a basic string here, we can also use the md function to pass Markdown which lets us do some custom formatting:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*"))

| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
|---|---|---|
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
Here, we use boldface and italics for certain text via standard Markdown syntax.
Next, we can add a table title and subtitle to make the content and point of this table clear to our reader:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study")

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
While we can and should describe our analysis in more detail in the main text, I like this pattern of having the super-simple one-liner present directly in the table. This also makes it convenient to clip the table (or a screenshot thereof) for use in other documents and presentations.
Next, we should always note the source of the data used to get our results. In this case, the original penguins data comes from this article, so we can cite that in our work. Note the use of Markdown (md()) to let us include a link to the original source within our table:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study") |>
    tab_source_note(md("Data originally published by K.B. Gorman, T.D. Williams,
                        and W. R. Fraser in 'Ecological Sexual Dimorphism and
                        Environmental Variability within a Community of Antarctic
                        Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081.
                        <https://doi.org/10.1371/journal.pone.0090081>. Later
                        popularized via the `R` package
                        [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/)"))

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
| Gentoo | 5076.016 | 124 |
| Chinstrap | 3733.088 | 68 |
| Adelie | 3700.662 | 152 |
| Data originally published by K.B. Gorman, T.D. Williams, and W. R. Fraser in 'Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081. Later popularized via the R package palmerpenguins | | |
A small, but lovely, quality of life feature here is the fact that md will automatically re-align the text to fit the dimensions of the rendered table. This lets us put new lines within our citation text so that our code doesn’t exceed the 80 characters-per-line guideline.

Finally, we want to make sure the number values are formatted appropriately. For these numbers, plain formatting really isn’t much of a problem, but for very large or small numbers, we might want to use scientific notation; for dates, we might want to control the formatting; etc. This is done with the fmt_* family of functions.

Each fmt_ function takes one or more column names and applies a formatting transformation to that column. The specifics of the formatting can be controlled with additional optional arguments. For example, if we want to round the average weight to the nearest gram, we would use the fmt_number function with the argument decimals=0:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass (*g*)"),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study") |>
    tab_source_note(md("Data originally published by K.B. Gorman, T.D. Williams,
                        and W. R. Fraser in 'Ecological Sexual Dimorphism and
                        Environmental Variability within a Community of Antarctic
                        Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081.
                        <https://doi.org/10.1371/journal.pone.0090081>. Later
                        popularized via the `R` package
                        [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/)")) |>
    fmt_number(avg_body_mass, decimals=0)

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass (*g*) | *Number of Penguins in Sample* |
| Gentoo | 5,076 | 124 |
| Chinstrap | 3,733 | 68 |
| Adelie | 3,701 | 152 |
| Data originally published by K.B. Gorman, T.D. Williams, and W. R. Fraser in 'Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081. Later popularized via the R package palmerpenguins | | |
In this case, it may be more appropriate to display the body mass in kilograms and we can do so semi-automatically with the fmt_number_si formatter:
penguins |>
    group_by(species) |>
    summarize(n_species = n(),
              avg_body_mass = mean(body_mass, na.rm=TRUE)) |>
    arrange(desc(avg_body_mass)) |>
    gt() |>
    cols_move_to_end(n_species) |>
    cols_label(species=md("**Species**"),
               avg_body_mass = md("Avg. Body Mass "),
               n_species = md("*Number of Penguins in Sample*")) |>
    tab_header(title="Average Body Mass of Three Penguin Species",
               subtitle="Gentoo penguins are the largest in the study") |>
    tab_source_note(md("Data originally published by K.B. Gorman, T.D. Williams,
                        and W. R. Fraser in 'Ecological Sexual Dimorphism and
                        Environmental Variability within a Community of Antarctic
                        Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081.
                        <https://doi.org/10.1371/journal.pone.0090081>. Later
                        popularized via the `R` package
                        [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/)")) |>
    fmt_number_si(avg_body_mass,
                  decimals=2,
                  unit = "g")

| Average Body Mass of Three Penguin Species | | |
|---|---|---|
| Gentoo penguins are the largest in the study | | |
| **Species** | Avg. Body Mass | *Number of Penguins in Sample* |
| Gentoo | 5.08 kg | 124 |
| Chinstrap | 3.73 kg | 68 |
| Adelie | 3.70 kg | 152 |
| Data originally published by K.B. Gorman, T.D. Williams, and W. R. Fraser in 'Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*).' *PLoS One* 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081. Later popularized via the R package palmerpenguins | | |
Note that, because fmt_number_si automatically includes the unit and transforms it to the most natural scale (here kg), we can remove the unit from the column name. gt has many more advanced options that can be used for further customization; refer to the package documentation for more details or ask on the course discussion board.
Exploratory Analysis
When faced with a new data set, it is tempting to look only at the first few rows to get a sense of the data: `R` does this by default. In practice, I recommend viewing a random selection of rows instead. This won't guarantee you find any issues, but it increases the probability of catching problems that only appear beyond the first few rows of a data set. The `slice_sample` function can be used for this.
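For example, a one-line check of this sort, using the `penguins` data from the tables above:

```r
library(dplyr)

# View 10 randomly-selected rows rather than the first 10
penguins |> slice_sample(n = 10)
```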
While EDA can be an extensive activity on its own, at an absolute minimum, I recommend you always do at least two basic checks:
-   Ensure that you know what `R` thinks your data is. You might see a value like `"2025-12-01"` and think that `R` is reading it as a date value, but `R` might instead be interpreting it as a string value.5 You have several options for this in `R`, but when working with a data frame, I'd recommend the `glimpse` function: e.g.,

    ```r
    glimpse(penguins)
    ```

    ```
    Rows: 344
    Columns: 8
    $ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
    $ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
    $ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
    $ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
    $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
    $ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
    $ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
    $ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
    ```

    After getting the basic dimensions of the data frame, `glimpse()` will print a one-line summary of each column giving its name, type, and the first few values. In this case, since I prepared the data for you, all columns are of the correct type (mostly numeric, with a few character columns for institution name and state name, and the two Boolean columns you created earlier), but this is a quick and easy check. If there are issues with your data types, it's better to catch them early than to have silent and hard-to-identify errors further down the line. ("Fail fast" is great advice in any programming exercise.)
-   Take a quick look at some basic (univariate) summary statistics for each column. There are several functions for this in base `R`: e.g.,

    ```r
    summary(penguins)
    ```

    ```
         species          island      bill_len        bill_dep
     Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
     Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
     Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                     Mean   :43.92   Mean   :17.15
                                     3rd Qu.:48.50   3rd Qu.:18.70
                                     Max.   :59.60   Max.   :21.50
                                     NA's   :2       NA's   :2
      flipper_len      body_mass        sex           year
     Min.   :172.0   Min.   :2700   female:165   Min.   :2007
     1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007
     Median :197.0   Median :4050   NA's  : 11   Median :2008
     Mean   :200.9   Mean   :4202                Mean   :2008
     3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009
     Max.   :231.0   Max.   :6300                Max.   :2009
     NA's   :2       NA's   :2
    ```

    But I actually like the `skim` function from the `skimr` package:

    ```r
    skim(penguins)
    ```

    **Data summary**

    | | |
    |---|---|
    | Name | penguins |
    | Number of rows | 344 |
    | Number of columns | 8 |
    | Column type frequency: factor | 3 |
    | Column type frequency: numeric | 5 |
    | Group variables | None |

    **Variable type: factor**

    | skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
    |---|---|---|---|---|---|
    | species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
    | island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
    | sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |

    **Variable type: numeric**

    | skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
    |---|---|---|---|---|---|---|---|---|---|---|
    | bill_len | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
    | bill_dep | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
    | flipper_len | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
    | body_mass | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
    | year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |

    You'll see that this gives a nice overview of data structure and types, identifies grouping structure (if any is present), and gives summaries appropriate to the type of each column (here `factor`s, i.e., categorical variables, and `numeric`). I like that this summary gives means, standard deviations, minima (p0), maxima (p100), medians (p50), and a cute little histogram of each variable. Note that this type of summary only reveals univariate structure: if there are interesting multi-dimensional outliers or weird correlation patterns, they won't appear here.
Perform the two checks described above on the IPEDS data you loaded and processed above. Include the code to perform the checks, the output of the checks, and describe what you see.
In general, you shouldn't include these checks in polished analysis reports, but we're making an exception here to practice a good habit. I won't ask you to include this analysis in future mini-projects or your course project reports, but these checks are easy and very valuable, so I recommend you make them part of all analyses.
We are now ready to begin some EDA. Analysts organize their EDA in a variety of ways, but one of my favorites is to pose a set of interesting questions and then answer them. When the answers don't match my intuition, I know I've found somewhere I want to dig deeper. For the first few mini-projects, I will provide these exploratory questions. Later in the course, particularly after we have discussed the role of plotting and graphics in EDA, you will have opportunities to organize your EDA in other ways.
Using dplyr tools, answer the following questions:
-   How many distinct institutions appear in this data set?

-   How many graduate students were enrolled at Baruch in 2024?

    Hint: Use the `str_detect` function discussed above inside a `filter` command to identify Baruch. (See the sketch after this list.)

-   How many total students were enrolled at Baruch in 2024?

    Hint: Make sure to avoid double-counting first-year undergraduates.

-   Which institution had the highest number of enrolled female students in 2019?

    Report at least both the institution and the total number of female students.

-   Which institution with over 1000 total students admitted the highest proportion of Native Hawaiian or Pacific Islander (`nhpi`) first-year undergraduates in 2024?

    Report at least both the institution and the fraction of relevant students.
As you go through these questions, you may find it useful to create new variables in the IPEDS data to avoid repeated lengthy calculations, e.g., a new variable for the total number of enrolled students.
Each of these questions can be answered with one or two scalar values. Use Quarto’s inline code functionality to place the values in a sentence; that is, you should answer in complete sentences, written as normal text with inline code for computed values.
Where appropriate, the `scales` package can be used to format numbers attractively.
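For example, here is a small sketch using the `comma` and `percent` helpers from `scales` (the specific values, accuracy settings, and the `total_enrollment` name below are my own illustrations):

```r
library(scales)

comma(1234567)                   # "1,234,567"
percent(0.3567, accuracy = 0.1)  # "35.7%"
```

In a Quarto document, these pair naturally with inline code, e.g., writing `` `r comma(total_enrollment)` `` in your prose, where `total_enrollment` is a value you computed earlier.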
The next set of questions has slightly more complex answers and should be answered with tables, formatted using the techniques described above. Longer term, you will prefer plots to tables, as plots are a bit easier to interpret (humans being visual creatures), but these are also the types of questions you might use to create plots.
Using dplyr tools, answer the following questions:
-   Which 5 states had the highest number of graduate students across all institutions located in that state?

-   In 2024, how many first-year undergraduate students were enrolled at CUNY colleges and which colleges did they attend? Report both absolute enrollment numbers and percent of total first-year undergraduates.

    Hint: The `fmt_percent` and `fmt_number` functions will be useful here.

-   How has Baruch's total undergraduate enrollment changed over the study period? Report both enrollment numbers and percent change year-over-year.

    Hint: The `lag` function will be helpful here. (See the sketch after this list.)

-   At what 5 institutions did the fraction of white students decrease the most over the period from 2010 to 2020?

    Hint: You may want to pre-filter by total enrollment to make sure you are not only reporting very small institutions in your analysis, as these are more prone to large fluctuations.

-   In which 3 states did the fraction of female undergraduates increase the most over the period from 2010 to 2024?
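To illustrate the `lag` hint, here is a minimal sketch of a year-over-year percent change calculation on a made-up series (the numbers are purely illustrative):

```r
library(tidyverse)

tibble(year = 2021:2024,
       enrollment = c(100, 110, 99, 105)) |>
  arrange(year) |>  # lag() depends on row order, so sort first
  mutate(pct_change = (enrollment - lag(enrollment)) / lag(enrollment))
```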
Each of these questions can be answered with a table of just a few rows. Use the gt package, as introduced above, to present your results in an attractive ‘publication-quality’ format, not just a “raw” R output.
Final Deliverable: College Newspaper Op-Ed
At this point, you have acquired your data, cleaned and prepared it, and performed your EDA. Now, you are ready to get to work on the final deliverable of your analysis. Everything that comes before this is important, but typically less visible to your final customer.
Write a brief (no more than 750 words) Op-Ed from the perspective of a college president to be published in the campus newspaper. You can be the president of any institution you want, so long as it has a meaningful undergraduate population (i.e., not a stand-alone graduate school like the CUNY School of Law), so have some fun with choosing your persona.
Your Op-Ed should include (at a minimum) the following information:
-   The definition of diversity you are using

-   Some brief statistics about the size and make-up of the student body at your institution

-   A year-over-year comparison of the entering first-year first-time undergraduate class

-   A year-over-year comparison of the demographics of the entire student body

-   A discussion of long-term diversity trends at your institution

-   A comparison of changes at your institution to one or more of the California public institutions.

    (Think of these as a 'critical value' in a statistical test for a change: California's public universities were already barred from considering race in admissions, so SFFA should not have required them to change their policies at all. If the change at your institution is smaller than the change at a California school, your change is likely just noise.)
You may optionally include additional tables or even visualizations, but these do not replace the requirement to write an Op-Ed. Your Op-Ed should stand "alone" and not be mixed in with your code. Place the code necessary to perform the Op-Ed calculations in a separate section and use inline code chunks to include the results of your analysis in the text of your Op-Ed. Op-Eds that hard-code calculated values will be penalized.
For purposes of this exercise, you can measure “diversity” as the fraction of non-White/non-Asian-American students in your undergraduate class. If you want to use a more sophisticated metric, see the first extra credit opportunity below. It is up to you whether you want to consider gender diversity or not in your analysis: if you don’t want to consider gender diversity, simply sum corresponding m and f columns within each racial group.
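As a starting point only (not the required approach), here is a minimal sketch of this diversity measure; it assumes the `enrollment_*` column naming and `level` codes produced by the data-acquisition code at the top of this project, and you should adapt it as needed:

```r
library(tidyverse)

diversity_by_year <- IPEDS |>
  filter(level == "all undergrad") |>
  mutate(
    # Total enrollment across all race/sex columns
    total_enrollment = rowSums(across(starts_with("enrollment_")),
                               na.rm = TRUE),
    # White and Asian-American enrollment (both m and f columns)
    white_asian = rowSums(across(matches("enrollment_._(whit|asia)")),
                          na.rm = TRUE),
    diversity_frac = 1 - white_asian / total_enrollment) |>
  select(institution_id, year, diversity_frac)
```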
AI Usage Statement
At the end of your report, you must include a description of the extent to which you used Generative AI tools to complete the mini-project. This should be a one-paragraph section clearly delineated using a collapsible Quarto "Callout Note". Failure to include an AI disclosure will result in an automatic 25% penalty.
E.g.,
No Generative AI tools were used to complete this mini-project.
or
GitHub Co-Pilot Pro was used via RStudio integration while completing this project. No other generative AI tools were used.
or
ChatGPT was used to help write the code in this project, but all non-code text was generated without the use of any Generative AI tools. Additionally, ChatGPT was used to provide additional background information on the topic and to brainstorm ideas for the final open-ended prompt.
Recall that Generative AI may not be used to write or edit any non-code text in this course.
These blocks should be created using the following syntax:
::: {.callout-note title="AI Usage Statement" collapse="true"}
Your text goes here.
:::
Make sure to use this specific type of callout (.callout-note), title, and collapse="true" setting.
Please contact the instructor if you have any questions about appropriate AI usage in this course.
Extra Credit Opportunities
There are optional Extra Credit Opportunities where extra points can be awarded for specific additional tasks in this mini-project. The amount of the extra credit is typically not proportional to the work required to complete these tasks, but I provide these for students who want to dive deeper into this project and develop additional data analysis skills not covered in the main part of this mini-project.
For this mini-project, no more than 6 total points of extra credit may be awarded. Even with extra credit, your grade on this mini-project cannot exceed 80 points total.
Entropy Analysis (Up to 2 Points)
Diversity of a population is a difficult quantity to measure. While simple statistics (“percent female” or “percent underrepresented minority”) are often used, they suffer from various challenges in an increasingly diverse world. (For example, NYC does not have a racial majority, so what does it mean to be an underrepresented minority in the context of city politics?) History and social context can guide the choice of diversity metric, but for this extra credit opportunity, you can use a statistical measure of diversity known as entropy.
Op-Eds that use the concept of entropy may be awarded up to two points of extra credit. Specifically, entropy must be used to answer all of the questions specified in Task 7 above.
Entropy of a distribution measures how hard it is to predict. Consider two scenarios:
- An urn filled with 99% red balls and 1% green balls;
- An urn filled with an equal mix of red and green balls.
If you are asked repeatedly to guess the color of the next ball drawn from an urn, knowing only the baseline mixture, your predictions for the first urn will be correct 99% of the time (assuming you use the obvious “always guess red” strategy), while no strategy can be correct more than 50% of the time for the second urn.
Entropy formalizes this concept as follows: given a (discrete) random variable \(X\) taking values in a set \(\mathcal{X}\), with each value \(x \in \mathcal{X}\) occurring with probability \(p(x)\), the entropy of \(X\) is given by:
\[H(X) = -\sum_{x \in \mathcal{X}: p(x) > 0} p(x) \log p(x)\]
The sum is taken over all outcomes \(x\) with non-zero probability. For our two urns above:
\[\begin{align*} H(\text{Urn 1}) &= -\left(0.99 \cdot \log(0.99) + 0.01 \cdot \log(0.01)\right) \approx 0.056 \\ H(\text{Urn 2}) &= -\left(0.50 \cdot \log(0.50) + 0.50 \cdot \log(0.50)\right) \approx 0.693 \end{align*}\]
This tells us that Urn 2 is quite a bit more random than Urn 1.6 Entropy is particularly helpful in our context as it applies naturally to categorical quantities like race, where numerical measures of randomness like variance don't naturally fit.
This definition can be extended straightforwardly to multi-category quantities. Consider these demographics of Bronx and Queens counties in the 2020 census:
**Demographics of Two NYC Boroughs**
*Data from 2020 Census*

| County / Borough | Asian | Black | Hispanic | White | All Other |
|---|---|---|---|---|---|
| Bronx | 4.60% | 28.48% | 54.76% | 8.88% | 3.36% |
| Queens | 27.30% | 15.85% | 22.76% | 22.84% | 6.25% |

Asian, Black, and White percentages correspond to census estimates of Asian (only, non-Hispanic), etc. The Hispanic percentage corresponds to the census estimate of Hispanic (any race), and All Other was calculated so that the values sum to 100%. Percentages from the relevant Queens and Bronx Wikipedia articles.
If you repeat the entropy calculation here (with five terms in the sum), you find that the Bronx has an entropy of approximately 1.16 while Queens has an entropy of approximately 1.49, indicating that Queens has the more diverse population, even though the Bronx has a higher proportion of Black and Hispanic residents.
Hint: When computing entropy, it is useful to take \(0 * \log(0) = 0\) so that impossible outcomes are automatically discarded. As such, code like this is useful to avoid NA issues:
```r
bronx_demos <- c(0.046, 0.2848, 0.5476, 0.0888, 0.0336)
-sum(bronx_demos * log(bronx_demos + 1e-10))
```

```
[1] 1.158139
```
The small ‘nugget’ term (1e-10) has no real impact on most probabilities, but prevents issues arising from \(\log(0)\) being undefined.
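If you want to reuse this calculation across many groups, it may help to wrap it in a small function. The sketch below is one way to do so (the `entropy` helper name is my own choice, not from any package); the usage example reuses the `penguins` data from earlier:

```r
library(tidyverse)

# Entropy of a vector of counts or probabilities; the nugget term
# avoids log(0), as discussed above.
entropy <- function(p){
  p <- p / sum(p)  # normalize so raw counts can be passed directly
  -sum(p * log(p + 1e-10))
}

# Illustrative usage: entropy of the species mix on each island
penguins |>
  count(island, species) |>
  group_by(island) |>
  summarize(species_entropy = entropy(n))
```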
Advanced dplyr Programming (Up to 3 Points)
For up to three extra credit points, add the following question to your Task 6 EDA: In 2024, which 3 institutions of at least 1000 undergraduates had student populations that were ‘most representative’ of the US undergraduate population as a whole?
Extra credit will be determined on the basis of how accurately and how efficiently the KL divergence is calculated, in addition to how well it is explained and presented in table format.
The Kullback–Leibler divergence or KL divergence can be used to measure the difference between two different probability distributions. For this problem, you can use the enrollment counts at each institution to define a probability distribution (e.g., \(\mathbb{P}(\text{White Male}) = 20\%\), \(\mathbb{P}(\text{White Female}) = 22\%\), etc.) for that campus and compare it to the national undergraduate population.
The following example may help: suppose I have categorized my personal library into three categories of books: i) fiction; ii) non-fiction; and iii) textbooks. Furthermore, my books are unorganized and spread randomly across 5 shelves as follows:
| Shelf | Fiction | Nonfiction | Textbook |
|---|---|---|---|
| 1 | 15 | 20 | 5 |
| 2 | 20 | 10 | 10 |
| 3 | 40 | 20 | 20 |
| 4 | 10 | 10 | 30 |
| 5 | 15 | 20 | 20 |
Comparing these shelves is a bit tricky since there aren't the same number of books on each shelf, and a purely 'numbers-based' comparison might treat Shelf 2 as more representative than Shelf 3 simply because it has fewer books. The KL divergence instead thinks of these as probabilities (e.g., Shelf 2 is 50% fiction) and can be used to compare them to the overall proportions of the whole collection.
Adding up the columns, I have 100 fiction books, 80 non-fiction books, and 85 textbooks, for 265 books in total, so my collection is about 38% fiction, 30% non-fiction, and 32% textbooks.
We use these probabilities to compare each row against the overall population using the KL divergence formula:
\[\mathcal{D}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\]
where the sum is taken over all categories \(x\) and \(P, Q\) are two probability distributions. Here \(P\) is the "baseline" (correct) distribution and \(Q\) is the approximation; in our example, the collection as a whole plays the role of \(P\) and each shelf plays the role of \(Q\). So the KL divergence between the collection and Shelf 1, which is 37.5% fiction, 50% non-fiction, and 12.5% textbooks, is given by
\[ 0.3773585 \log \frac{0.3773585}{0.375} + 0.3018868 \log \frac{0.3018868}{0.5} + 0.3207547 \log \frac{0.3207547}{0.125} \approx 0.1523145 \]
We can repeat this analysis across all 5 shelves to get the following KL table
| Shelf | KL Divergence from Entire Collection |
|---|---|
| 1 | 0.1523 |
| 2 | 0.0307 |
| 3 | 0.0307 |
| 4 | 0.1630 |
| 5 | 0.0261 |
From this table, we see that Shelf 5 is "most representative" of my whole collection. Note also that Shelves 2 and 3 have the same KL divergence because their compositions are (proportionally) the same.
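To make the computation concrete, here is a minimal sketch that reproduces the shelf table above with `tidyverse` tools (this is my own layout for the toy example, not the required approach for the IPEDS analysis):

```r
library(tidyverse)

shelves <- tribble(
  ~shelf, ~fiction, ~nonfiction, ~textbook,
       1,       15,          20,         5,
       2,       20,          10,        10,
       3,       40,          20,        20,
       4,       10,          10,        30,
       5,       15,          20,        20
)

long <- shelves |>
  pivot_longer(-shelf, names_to = "category", values_to = "n")

# Baseline distribution P: the whole collection
baseline <- long |>
  group_by(category) |>
  summarize(p = sum(n)) |>
  mutate(p = p / sum(p))

# KL divergence of each shelf (Q) from the collection (P)
long |>
  group_by(shelf) |>
  mutate(q = n / sum(n)) |>
  inner_join(baseline, by = "category") |>
  summarize(kl = sum(p * log(p / q)))
```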
Hint: Note that this is a significantly more complex analysis than we have performed so far and requires some ideas that won’t be covered until Week 6 and some column selection functionality that will not be covered in lecture. Make sure to access the source code for this assignment and use it as a guide for completing this extra credit task.
Data Visualization (1 point)
Inclusion of a well-formatted visual element to accompany your op-ed will get one extra credit point.
This work ©2026 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license. 
Footnotes
Following the decision in June 2023, universities were first applying post-SFFA admissions policies for the undergraduate class entering in Fall 2024. Due to reporting lags, these data were released in late 2025 and are currently the latest available. As data from more admissions cycles are released in IPEDS, the impact (or not) of SFFA will become clearer.↩︎
This is "ChatGPT-level" prose: without obvious flaws, but lacking the style and elegance associated with truly high-quality writing.↩︎
Throughout this section, replace `YOUR_GITHUB_ID` with your GitHub ID from Mini-Project #00. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

`str_detect` can actually be used to perform significantly more complex string analysis than simple "does it contain this subset of letters" but we won't cover that sort of string processing for a few more weeks.↩︎

In one famous (and slightly tragic) example, Microsoft Excel silently misinterpreted the names of various genes as numeric values and changed them from (what it thought was) scientific notation to (what it thought was) standard numeric formatting. This wound up ruining several important scientific studies. Always check your data types! The original study identifying this problem can be found here and a popular news summary is here.↩︎
The units of entropy are somewhat tricky to understand. For us, it suffices to know that a larger entropy means more randomness.↩︎