STA/OPR 9750 Mini-Project #03: Do Proportional Electoral College Allocations Yield a More Representative Presidency?

Due Dates

  • Released to Students: 2024-10-24
  • Initial Submission: 2024-11-13 11:45pm ET on GitHub and Brightspace
  • Peer Feedback:
    • Peer Feedback Assigned: 2024-11-14 on GitHub
    • Peer Feedback Due: 2024-11-20 11:45pm ET on GitHub

Introduction

Welcome to Mini-Project #03! In this project, you will write a political fact-check, that most iconic form of our current journalistic era. Specifically, you will investigate the claim that the US Electoral College systematically biases election results away from the vox populi. As you dive into the world of political data, we’ll also learn a bit more about the mechanics of US federal elections.

In this Mini-Project, you will:

  • Integrate data from disparate governmental and academic sources
  • Learn to work with spatial data formats
  • Create many plots
  • Use spatial and animated visualizations to make your argument

Note that - as with all these mini-projects - there isn’t a single “right” answer to the questions posed herein. You may have different views about the relative importance of federalism, direct democratic structures, adherence to the formal structures of the US Constitution, etc. than your classmates. Please make sure to make your argument respectfully and, when we reach the peer-evaluation stage, read and comment respectfully. All grading will be done solely on the quality of the code, the writing, the visualizations, and the argument - not on the political implications of what you may or may not find.

Also note that this mini-project is intended to be markedly less demanding than Mini-Project #02. At this point in the course, you should be diving into your Course Project, which should consume the majority of your out-of-class time dedicated to this course for the remainder of the semester.

Background

The US Constitution sets the basic rules of electing the President in Section 1 of Article II, which we quote here in part:

Each State shall appoint, in such Manner as the Legislature thereof may direct, a Number of Electors, equal to the whole Number of Senators and Representatives to which the State may be entitled in the Congress: but no Senator or Representative, or Person holding an Office of Trust or Profit under the United States, shall be appointed an Elector.

The Electors shall meet in their respective States, and vote by Ballot for two Persons, of whom one at least shall not be an Inhabitant of the same State with themselves. And they shall make a List of all the Persons voted for, and of the Number of Votes for each; which List they shall sign and certify, and transmit sealed to the Seat of the Government of the United States, directed to the President of the Senate. The President of the Senate shall, in the Presence of the Senate and House of Representatives, open all the Certificates, and the Votes shall then be counted. The Person having the greatest Number of Votes shall be the President, if such Number be a Majority of the whole Number of Electors appointed; and if there be more than one who have such Majority, and have an equal Number of Votes, then the House of Representatives shall immediately chuse by Ballot one of them for President; and if no Person have a Majority, then from the five highest on the List the said House shall in like Manner chuse the President. But in chusing the President, the Votes shall be taken by States, the Representation from each State having one Vote; A quorum for this Purpose shall consist of a Member or Members from two thirds of the States, and a Majority of all the States shall be necessary to a Choice. In every Case, after the Choice of the President, the Person having the greatest Number of Votes of the Electors shall be the Vice President. But if there should remain two or more who have equal Votes, the Senate shall chuse from them by Ballot the Vice President.

Though the details have varied over time due to amendment, statute, and technology, the basic outline of this allocation scheme remains unchanged:

  • Each state gets \(R + 2\) electoral college votes, where \(R\) is the number of Representatives that state has in the US House of Representatives. In this mini-project, you can use the number of districts in a state to determine the number of congressional representatives (one per district).
  • States can allocate those votes however they wish
  • The president is the candidate who receives a majority of electoral college votes
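For concreteness, the arithmetic of the first rule can be sketched in a couple of lines of R; the district counts here are invented for illustration, not real apportionment figures.

```r
# ECVs per state = R (number of House districts) + 2 (senators).
# District counts below are invented for illustration only.
districts <- c(StateA = 1, StateB = 4, StateC = 38)
ecvs <- districts + 2
ecvs  # StateA 3, StateB 6, StateC 40
```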

Notably, the Constitution sets essentially no rules on how the \(R + 2\) electoral college votes (ECVs) for a particular state are allocated. At different points in history, different states have elected to use each of the following:

  • Direct allocation of ECVs by state legislature (no vote)
  • Allocation of all ECVs to winner of state-wide popular vote
  • Allocation of all ECVs to winner of nation-wide popular vote
  • Allocation of \(R\) ECVs to popular vote winner by congressional district + allocation of remaining \(2\) ECVs to the state-wide popular vote winner

Currently, only Maine and Nebraska use the final option; the other 48 states and the District of Columbia award all \(R+2\) ECVs to the winner of their state-wide popular vote. We emphasize here that “statewide winner-take-all” is a choice made by the individual states, not dictated by the US Constitution, and that states have the power to change it should they wish.1

To my knowledge, no US state uses true proportionate state-wide representation, though I believe such an ECV-allocation scheme would be consistent with the US Constitution. For example, if a state with 5 ECVs had 60,000 votes cast for Candidate A and 40,000 for Candidate B, it could award 3 ECVs to A and 2 to B, regardless of the spatial distribution of those votes within the state.
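To make “proportional” precise, you have to pick a rounding rule, since vote shares rarely divide the ECVs evenly. Here is a minimal sketch using the largest-remainder (Hamilton) method; this choice of method is my assumption, not something the Constitution (or this assignment) prescribes.

```r
# Sketch: one way to round a state-wide proportional ECV allocation.
# The largest-remainder (Hamilton) method used here is an assumption;
# no particular rounding rule is mandated.
allocate_proportional <- function(votes, n_ecv) {
  quota    <- votes / sum(votes) * n_ecv  # fractional ECV entitlement
  ecv      <- floor(quota)                # award whole ECVs first
  leftover <- n_ecv - sum(ecv)            # ECVs still unassigned
  # give the remaining ECVs to the candidates with the largest remainders
  extra <- order(quota - ecv, decreasing = TRUE)[seq_len(leftover)]
  ecv[extra] <- ecv[extra] + 1
  ecv
}

# The example from the text: 60,000 for A, 40,000 for B, 5 ECVs
allocate_proportional(c(A = 60000, B = 40000), 5)  # A gets 3, B gets 2
```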

Mini-Project Objectives

In this project, you will use historical congressional election data to see how the outcome of US presidential elections would have changed under different allocation rules. Like any retrodiction2 task, this analysis has limitations. Notably, if the “rules” had been different, politicians may have run different campaigns and received different vote counts. Still, it is my hope that this is an interesting and informative exercise.

As noted above, your final submission should take the form of a “Fact Check”:

  • Take a statement from a well-known politician or political commentator describing (claimed) bias of the electoral college system
  • Analyze presidential election results under different allocations for the presence or absence of bias (however you define it - see below)
  • Summarize your retrodictive findings
  • Award a “truthfulness” score to the claim you evaluated. (You may use the scale of an existing political fact-check operation or create your own.)

Student Responsibilities

Recall our basic analytic workflow and table of student responsibilities:

  • Data Ingest and Cleaning: Given a single data source, read it into R and transform it to a reasonably useful standardized format.
  • Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
  • Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
  • Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
  • Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
Students’ Responsibilities in Mini-Project Analyses
Ingest and Cleaning Combination and Alignment Descriptive Statistical Analysis Visualization
Mini-Project #01
Mini-Project #02 ½
Mini-Project #03 ½
Mini-Project #04

In this mini-project, you will be working with relatively “clean” electoral data and your main focus should be on the analysis and visualization supporting your fact check. As an analysis of political data, I expect your final submission to have quite a few “red state/blue state” maps.3 Data cleaning and import will play a larger role in Mini-Project #04.

In this project, I am no longer providing code to download and read the necessary data files. The data files I have selected for this mini-project are relatively easy to work with and should not provide a significant challenge, particularly after our in-class discussion of Data Import. See the modified rubric below which now includes a grade for data import.

Rubric

STA/OPR 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. Specifically, variants of the following rubric will be used for the mini-projects:

Mini-Project Grading Rubric

| Course Element | Excellent (9-10) | Great (7-8) | Good (5-6) | Adequate (3-4) | Needs Improvement (1-2) | Extra Credit |
|---|---|---|---|---|---|---|
| Written Communication | Report is well-written and flows naturally. Motivation for key steps is clearly explained to reader without excessive detail. Key findings are highlighted and appropriately given context. | Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context. | Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted. | Writing is intelligible, but has some grammatical errors. Key findings are obscured. | Report exhibits significant weakness in written communication. Key points are difficult to discern. | Report includes extra context beyond instructor-provided information. |
| Project Skeleton | Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are particularly insightful and creative. | Code completes all instructor-provided tasks satisfactorily. | Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete. | Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Responses to three or more instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Report exhibits particularly creative insights drawn from thorough student-initiated analyses. |
| Formatting & Display | Tables and figures are full ‘publication-quality’. Report includes at least one animated visualization designed to effectively communicate findings. | Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Figures are ‘publication-quality’, with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, and captions, etc. | Tables are well-formatted, but still have room for improvement. Figures are above ‘exploratory-quality’, but do not reach full ‘publication-quality’. | Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style. Figures are suitable to support claims made, but are ‘exploratory-quality’, reflecting minimal effort to customize and ‘polish’ beyond ggplot2 defaults. | Unfiltered ‘data dump’ instead of curated table. Baseline figures that do not fully support claims made. | Report includes interactive (not just animated) visual elements. |
| Code Quality | Code is (near) flawless. Code passes all styler and lintr type analyses without issue. | Comments give context of the analysis, not simply defining functions used in a particular line. | Code has well-chosen variable names and basic comments. | Code executes properly, but is difficult to read. | Code fails to execute properly. | Code takes advantage of advanced Quarto features to improve presentation of results. |
| Data Preparation | Data import is fully-automated and efficient, taking care to only download from web sources if not available locally. | Data is imported and prepared effectively, in an automated fashion with minimal hard-coding of URLs and file paths. | Data is imported and prepared effectively, though source and destination file names are hard-coded. | Data is imported in a manner likely to have errors. | Data is hard-coded and not imported from an external source. | Report uses additional data sources in a way that creates novel insights. |

Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.

Because students are encouraged to use STA/OPR 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!

Submission Instructions

After completing the analysis, write up your findings, showing all of your code, in a dynamic Quarto document and post it to your course repository. The qmd file should be named mp03.qmd so the rendered document can be found at docs/mp03.html in your repository and served at the URL:

https://<GITHUB_ID>.github.io/STA9750-2024-FALL/mp03.html

Once you confirm this website works (substituting your actual GitHub username, as provided to the professor in MP#00, for <GITHUB_ID>), open a new issue at

https://github.com/<GITHUB_USERNAME>/STA9750-2024-FALL/issues/new .

Title the issue STA/OPR 9750 <GITHUB_USERNAME> MiniProject #03 and fill in the following text for the issue:

Hi @michaelweylandt!


https://<GITHUB_USERNAME>.github.io/STA9750-2024-FALL/mp03.html

Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.

Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.

NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.

NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.

Set-Up and Initial Exploration

Data I: US House Election Votes from 1976 to 2022

The MIT Election Data Science Lab collects votes from all biennial congressional races in all 50 states here. Download this data as a CSV file using your web browser. Note that you will need to provide your contact info and agree to cite this data set in your final report.4 Make sure to include this citation!

Additionally, download statewide presidential vote counts from 1976 to 2020 here. As before, it will likely be easiest to download this data by hand using your web browser.

Data II: Congressional Boundary Files 1976 to 2012

Jeffrey B. Lewis, Brandon DeVine, Lincoln Pritcher, and Kenneth C. Martis have created shapefiles for all US congressional districts from 1789 to 2012; they generously make these available here.

Task 1: Download Congressional Shapefiles 1976-2012

Download congressional shapefiles from Lewis et al. for all US Congresses5 from 1976 to 2012.

Your download code should:

  1. Be fully automated (no “hand-downloading”);
  2. Download files with a systematic and interpretable naming convention; and
  3. Only download files as needed out of courtesy for the data provider’s web server. That is, if you already have a copy of the file, do not re-download it repeatedly.

As with the other Mini-Projects, make sure you do not store these data files in git. It will be sufficient to include the qmd file with the download code.

Note that the shapefiles are distributed as zip archives, each containing several files in a directory structure. We will be interested in the shp file within each zip.
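To make the caching requirement concrete, here is one possible sketch. The congress-number arithmetic follows footnote 5; the districtsNNN.zip URL pattern is an assumption about the Lewis et al. site that you should verify against the actual download links before relying on it.

```r
# Sketch of an automated, cached download (the URL pattern is an assumption!).

# A House elected in year Y serves as congress number (Y - 1786) / 2;
# e.g., the 2022 election seated the 118th Congress (see footnote 5).
year_to_congress <- function(year) (year - 1786) / 2

get_district_shapefile <- function(year, dir = "data/shapefiles") {
  congress <- year_to_congress(year)
  # systematic, interpretable local file name
  fname <- file.path(dir, sprintf("districts%03d.zip", congress))
  if (!file.exists(fname)) {  # only hit the server if we lack a local copy
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)
    url <- sprintf("https://cdmaps.polisci.ucla.edu/shp/districts%03d.zip",
                   congress)  # hypothetical pattern -- verify before use!
    download.file(url, destfile = fname, mode = "wb")
  }
  fname
}

year_to_congress(1976)  # 95: the 1976 election seated the 95th Congress
```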

Data III: Congressional Boundary Files 2014 to Present

To get district boundaries for more recent congressional elections, we can turn to the US Census Bureau. Unfortunately, these data - while authoritative and highly detailed - are not in quite the same format as our previous congressional boundary files. We can review the US Census Bureau shapefiles online. To download them automatically, I recommend exploring the FTP Archive link near the bottom of the page. In Census jargon, the CD directory will have shapefiles for Congressional Districts for each year.6

Task 2: Download Congressional Shapefiles 2014-2022

Download congressional shapefiles from the US Census Bureau for all US Congresses from 2014 to 2022.

Your download code should:

  1. Be fully automated (no “hand-downloading”);
  2. Download files with a systematic and interpretable naming convention; and
  3. Only download files as needed out of courtesy for the data provider’s web server. That is, if you already have a copy of the file, do not re-download it repeatedly.

As with the other Mini-Projects, make sure you do not store these data files in git. It will be sufficient to include the qmd file with the download code.

Initial Exploration of Vote Count Data

Task 3: Exploration of Vote Count Data

Answer the following using the vote count data files from the MIT Election Data Science Lab. You may answer each with a table or plot as you feel is appropriate.

  1. Which states have gained and lost the most seats in the US House of Representatives between 1976 and 2022?

  2. New York State has a unique “fusion” voting system where one candidate can appear on multiple “lines” on the ballot and their vote counts are totaled. For instance, in 2022, Jerrold Nadler appeared on both the Democrat and Working Families party lines for NYS’ 12th Congressional District. He received 200,890 votes total (184,872 as a Democrat and 16,018 as WFP), easily defeating Michael Zumbluskas, who received 44,173 votes across three party lines (Republican, Conservative, and Parent).

    Are there any elections in our data where the election would have had a different outcome if the “fusion” system were not used and candidates only received the votes they received from their “major party line” (Democrat or Republican) rather than their total number of votes across all lines?

  3. Do presidential candidates tend to run ahead of or run behind congressional candidates in the same state? That is, does a Democratic candidate for president tend to get more votes in a given state than all Democratic congressional candidates in the same state?

    Does this trend differ over time? Does it differ across states or across parties? Are any presidents particularly more or less popular than their co-partisans?

Importing and Plotting Shape File Data

As mentioned above, the shapefiles you downloaded are distributed in zip archives containing several files; we only need the shp file within each archive. In this section, we’ll practice extracting the shp file, reading it, and using it to create a plot. The key library we need is the sf (“simple features”) library. It provides the read_sf() function, which we can use to read shapefiles into R. I demonstrate how this works below:

library(ggplot2)
library(sf)

if(!file.exists("nyc_borough_boundaries.zip")){
    download.file("https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=Shapefile", 
              destfile="nyc_borough_boundaries.zip")
}

##-
td <- tempdir(); 
zip_contents <- unzip("nyc_borough_boundaries.zip", 
                      exdir = td)
    
fname_shp <- zip_contents[grepl("shp$", zip_contents)]
nyc_sf <- read_sf(fname_shp)
nyc_sf
Simple feature collection with 5 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS84(DD)
# A tibble: 5 × 5
  boro_code boro_name      shape_area shape_leng                        geometry
      <dbl> <chr>               <dbl>      <dbl>              <MULTIPOLYGON [°]>
1         3 Brooklyn      1934142776.    728147. (((-73.86327 40.58388, -73.863…
2         5 Staten Island 1623618684.    325910. (((-74.05051 40.56642, -74.050…
3         1 Manhattan      636646082.    360038. (((-74.01093 40.68449, -74.011…
4         2 Bronx         1187174772.    463181. (((-73.89681 40.79581, -73.896…
5         4 Queens        3041418004.    888197. (((-73.82645 40.59053, -73.826…

At least one student reported difficulty running the above code on a Windows machine. The default download method on Windows uses the operating system’s built-in download code, which seems to randomly corrupt certain zip files. Adding method="curl" to the download.file call seems to have helped.

Similarly, using http instead of https in the URL sometimes avoids issues, particularly on Windows machines with aggressive anti-virus settings.

These tweaks may not work for all of you, since the root cause of these errors is a subtle interplay between the operating system, specific security software (and settings within that software), and R’s download functionality, but they may help to resolve mysterious errors.

Task 4: Automate Zip File Extraction

Adapt the code after the ##- symbol above into a function read_shp_from_zip() which takes in a file name, pulls out the .shp file contained therein, and reads it into R using read_sf().

Note: If your platform supports it, you can also use a combination of unzip(..., list=TRUE) and unzip(..., files=...) to extract only one file from the zip archive instead of unpacking the whole thing. This is a bit more efficient, but not necessary here as all the files involved are pretty small.

The result of this is a particular sort of data frame. The most important column for us is the geometry column which is of type MULTIPOLYGON. This is, essentially, a list of GPS coordinates which outline a spatial region. Here, each row corresponds to a Borough of NYC. We can pass the geometry column to ggplot2 to make a map:

ggplot(nyc_sf, 
       aes(geometry=geometry)) + 
    geom_sf()

Here, we use the sf geom to get the shape outlines. The sf geom plays well with the fill aesthetic.

ggplot(nyc_sf, 
       aes(geometry=geometry, 
           fill = shape_area)) + 
    geom_sf()

This type of plot is called a Choropleth Map and it is commonly used to depict election results.

Task 5: Choropleth Visualization of the 2000 Presidential Election Electoral College Results

Using the data you downloaded earlier, create a choropleth visualization of the electoral college results for the 2000 presidential election (Bush vs. Gore), coloring each state by the party that won the most votes in that state. Your result should look something like this:

Taken from Wikipedia

It is not required, but to make the very best plot, you may want to look up:

  1. How to “inset” Alaska and Hawaii instead of plotting their true map locations.
  2. How to add labels to a choropleth in ggplot2
  3. How to label the small states in the North-East

but these steps are not required as they are a bit advanced.

Task 6: Advanced Choropleth Visualization of Electoral College Results

Modify your previous code to make a faceted version showing election results over time.

You may want to set facet_wrap or facet_grid to use a single column and adjust the figure size for the best reading experience.

There are some subtle issues you might need to be aware of when working with multiple complex shapefiles:

  1. bind_rows will struggle to combine shapefiles if they are not using the same Coordinate Reference System (CRS). You might want to use st_transform to set all CRSs to be the same. EPSG:4326, a.k.a. WGS 84, is a good choice.

  2. Plotting complex shapefiles may be slow due to the intricate coastlines and lots of fiddly line segments on state borders. You may want to use st_simplify to make smoother (and more quickly plotted) edge sets. I found st_simplify(dTolerance=0.01) to work decently well, but you may find different values work better.

    For technical reasons, you may need to set sf_use_s2(FALSE) before using st_simplify.

  3. Joining two sf objects is tricky. If you don’t need the geometry from both tables, it is easier to drop the geometry column from one (e.g., with st_drop_geometry()), which simplifies its structure to a regular data frame. This will allow regular (non-spatial) joins to be used.

The following example may be useful for you:

## Animated Choropleth using gganimate

## Add some time "structure" to our data for 
## demonstration purposes only
library(dplyr)   # for mutate() and bind_rows()
nyc_sf_repeats <- bind_rows(
    nyc_sf |> mutate(value = rnorm(5), 
                     frame = 1), 
    nyc_sf |> mutate(value = rnorm(5), 
                     frame = 2), 
    nyc_sf |> mutate(value = rnorm(5), 
                     frame = 3), 
    nyc_sf |> mutate(value = rnorm(5), 
                     frame = 4), 
    nyc_sf |> mutate(value = rnorm(5), 
                     frame = 5))

library(gganimate)
ggplot(nyc_sf_repeats, 
       aes(geometry=geometry, 
           fill = value)) + 
    geom_sf() + 
    transition_time(frame)

Now that we have finished exploring our data and building some tools for plots, we are ready to dig into our main question.

Comparing the Effects of ECV Allocation Rules

Go through the historical voting data and assign each state’s ECVs according to various strategies:

  1. State-Wide Winner-Take-All
  2. District-Wide Winner-Take-All + State-Wide “At Large” Votes
  3. State-Wide Proportional
  4. National Proportional

Based on these allocation strategies, compare the winning presidential candidate with the actual historical winner.

What patterns do you see? Are the results generally consistent or are one or more methods systematically more favorable to one party?

For the district-level winner-take-all scheme, you may assume that the presidential candidate of the same party as the winning congressional candidate carries that district.
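As a concrete starting point, here is a minimal base-R sketch comparing two of the four schemes on invented vote totals. The states, ECV counts, and votes are all made up, and the proportional rounding is deliberately naive.

```r
# Minimal base-R sketch comparing two allocation schemes on invented data.
# States, ECV counts, and vote totals below are all made up.
votes <- data.frame(
  state = c("X", "X", "Y", "Y"),
  party = c("DEM", "REP", "DEM", "REP"),
  n     = c(60000, 40000, 30000, 70000)
)
ecv <- c(X = 5, Y = 10)

# Scheme 1: state-wide winner-take-all
winners    <- sapply(split(votes, votes$state),
                     function(st) st$party[which.max(st$n)])
wta_totals <- tapply(ecv[names(winners)], winners, sum)

# Scheme 3: state-wide proportional (naive round(); a real analysis should
# guard against rounding drift, e.g., with a largest-remainder rule)
prop_totals <- colSums(t(sapply(split(votes, votes$state), function(st) {
  setNames(round(st$n / sum(st$n) * ecv[st$state[1]]), st$party)
})))

wta_totals   # DEM 5, REP 10
prop_totals  # DEM 6, REP 9
```

Even this toy example shows how the two schemes can diverge: the proportional rule peels an ECV away from each state's loser-take-nothing column.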

Task 7: Evaluating Fairness of ECV Allocation Schemes

Write a fact check evaluating the fairness of the different ECV electoral allocation schemes.

To do so, you should first determine which allocation scheme you consider “fairest”. You should then see which schemes give different results, if they ever do. To make your fact check more compelling, select one election where the ECV scheme had the largest impact–if one exists–and explain how the results would have been different under a different ECV scheme.

As you perform your analysis, you may assume that the District of Columbia has three ECVs, which are allocated to the Democratic candidate under all schemes except possibly national popular vote.7

Throughout all of this, note that we are not varying the \(R+2\) ECV allocation scheme specified by the constitution. Our concern here is only what individual states can do to address “fairness” in presidential elections. If we allow the possibility of constitutional amendment, the possibilities are endless. The \(R+2\) rule has several interesting effects; some are well-known, such as the Senate’s equal treatment of small and large states, while others are less well-known, including the fact that congressional representation is based on population, not counts of voters.8

Extra Credit Opportunities

Extra Credit Opportunity (Up to 5 points)

For extra credit, extend your analysis to 2024 electoral results. You will have to find a reliable source of 2024 state- or district-wide vote counts. If the 2024 election is close, this may not be easy to do between the election and the date this mini-project is due.

Extra Credit Opportunity (Up to 8 points)

For extra credit, create an animated plot instead of a facet plot in Task 6.

This is hard, due to what might be a bug in gganimate’s treatment of changing numbers of sf geometries. If you want to pursue this path, I recommend downloading the state shapefiles from TIGER (the Census page)9 and using those instead of the congressional district shapefiles. Unlike the congressional district files, the state boundaries should be unchanging over time.


This work ©2024 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.

Footnotes

  1. I am not aware of “official” reasons from any state on why they select “winner-take-all” allocation. States clearly compete for attention in presidential elections and it seems reasonable to assume that competitive states select “winner-take-all” allocation to attract presidential candidates who will make promises to that state’s voters. By contrast, states whose legislature is dominated by a single party, e.g., New York, may be motivated to award all their votes to the more popular party in that state, denying any ECVs to the other candidate, even if a sizeable minority votes for them. If you find a history of how states select their ECV allocation strategies, I would be interested in reading it.↩︎

  2. Making predictions about a counter-factual past.↩︎

  3. Historically, the “Republicans Red / Democrats Blue” convention was not particularly strong in American journalism. It became standardized during coverage of the 2000 Presidential Election and subsequent Florida recount battles and has not materially changed since. For purposes of this mini-project, we will apply “Republican Red / Democrat Blue” consistently.↩︎

  4. While it may be possible to automate the browser to automatically fill in this pop-up as part of the download process, that’s beyond the scope of this assignment.↩︎

  5. It may be useful to recall that each two-year cycle is called “a congress” for district mapping purposes. The 2022 US Election, selecting Representatives to serve 2023-2025, corresponds to the 118th Congress. The upcoming (November 2024) election will select members for the 119th Congress.↩︎

  6. The other shapefiles in this FTP archive may be useful for your final projects.↩︎

  7. The District of Columbia is very Democratic.↩︎

  8. This latter effect is admittedly quite small if we assume political affiliation is unrelated to probability of voting. The relationship between voting likelihood and political leanings is an important one for campaign strategists and actively debated by academics.↩︎

  9. E.g., https://www2.census.gov/geo/tiger/TIGER2018/STATE/↩︎