STA/OPR 9750 Mini-Project #01: Fiscal Characteristics of Major US Public Transit Systems

Due Dates

  • Released to Students: 2024-09-12
  • Initial Submission: 2024-09-25 11:45pm ET on GitHub and Brightspace
  • Peer Feedback:
    • Peer Feedback Assigned: 2024-09-26 on GitHub
    • Peer Feedback Due: 2024-10-02 11:45pm ET on GitHub

Welcome to STA/OPR 9750 Mini Projects!

In the STA/OPR 9750 Mini-Projects, you will perform basic data analyses intended to model best practices for your course final project. (Note, however, that these are mini-projects; your final course project is expected to be far more extensive than any single MP.)

For purposes of MPs, we are dividing the basic data analytic workflow into several major stages:

  • Data Ingest and Cleaning: Given a single data source, read it into R and transform it to a reasonably useful standardized format.
  • Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
  • Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
  • Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
  • Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.

In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:

Students’ Responsibilities in Mini-Project Analyses
(The table indicates, for each of Mini-Projects #01 through #04, which of the four stages (Ingest and Cleaning, Combination and Alignment, Descriptive Statistical Analysis, and Visualization) students are responsible for; Mini-Projects #02 and #03 carry partial (½) responsibilities.)

In early stages of the course, such as this MP, I will ‘scaffold’ much of the analysis for you, leaving only those stages we have discussed in class for you to fill in. As the course progresses, the mini-projects will be more self-directed and results less standardized.

Rubric

STA/OPR 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. Specifically, variants of the following rubric will be used for the mini-projects:

Mini-Project Grading Rubric
  • Written Communication
    • Excellent (9-10): Report is well-written and flows naturally. Motivation for key steps is clearly explained to the reader without excessive detail. Key findings are highlighted and appropriately given context.
    • Great (7-8): Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context.
    • Good (5-6): Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted.
    • Adequate (3-4): Writing is intelligible, but has some grammatical errors. Key findings are obscured.
    • Needs Improvement (1-2): Report exhibits significant weakness in written communication. Key points are difficult to discern.
    • Extra Credit: Report includes extra context beyond instructor-provided information.
  • Project Skeleton
    • Excellent (9-10): Code completes all instructor-provided tasks correctly.
    • Great (7-8): Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete.
    • Good (5-6): Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete.
    • Adequate (3-4): Responses to three instructor-provided tasks are skipped, incorrect, or otherwise incomplete.
    • Needs Improvement (1-2): Less than half of the instructor-provided tasks were successfully completed.
    • Extra Credit: Report exhibits particularly creative insights beyond instructor specifications.
  • Formatting & Display
    • Excellent (9-10): Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Each table has a suitable caption.
    • Great (7-8): Column names and digits are well-chosen, but formatting could be improved.
    • Good (5-6): Bad column names (opaque variable names or other undefined acronyms).
    • Adequate (3-4): Unfiltered ‘data dump’ instead of a curated table.
    • Needs Improvement (1-2): No tables.
    • Extra Credit: Report includes one or more high-quality graphics (created using R).
  • Code Quality
    • Excellent (9-10): Code is (near) flawless. Code passes all styler and lintr type analyses without issue.
    • Great (7-8): Comments give context of the analysis, not simply defining the functions used in a particular line.
    • Good (5-6): Code has well-chosen variable names and basic comments.
    • Adequate (3-4): Code executes properly, but is difficult to read.
    • Needs Improvement (1-2): Code fails to execute properly.
    • Extra Credit: Code takes advantage of advanced Quarto features to improve presentation of results.
  • Data Preparation
    • All grade levels: Automatic (10/10); data preparation is out of scope for this mini-project.
    • Extra Credit: Report modifies instructor-provided import code to use additional columns or data sources in a way that creates novel insights.

Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.

Because students are encouraged to use STA/OPR 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!

Submission Instructions

After completing the analysis, write up your findings, showing all of your code, in a dynamic Quarto document and post it to your course repository. The source file should be named mp01.qmd so that the rendered document appears at docs/mp01.html in your repository and is served at the URL:

https://<GITHUB_ID>.github.io/STA9750-2024-FALL/mp01.html

Once you confirm this website works (replacing <GITHUB_ID> with the actual GitHub username you provided to the professor in MP#00, of course), open a new issue at

https://github.com/<GITHUB_USERNAME>/STA9750-2024-FALL/issues/new .

Title the issue STA/OPR 9750 <GITHUB_USERNAME> MiniProject #01 and fill in the following text for the issue:

Hi @michaelweylandt!

I've uploaded my work for MiniProject #01 - check it out!

https://<GITHUB_USERNAME>.github.io/STA9750-2024-FALL/mp01.html

Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.

Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.

NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.

NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.

Mini-Project #01: Fiscal Characteristics of Major US Public Transit Systems

For MP#01, we are taking inspiration from the popular CityNerd YouTube channel. In particular, we’re taking CityNerd’s presentation on farebox recovery, i.e., the fraction of revenues raised from fares rather than taxes, as our starting point.

We will use data from the National Transit Database (NTD) as our primary source. In particular, since we want to analyze farebox revenues, total number of trips, total number of vehicle miles traveled, and total revenues and expenses by source, we will need to analyze several different tables: the 2022 fare revenue report, the 2022 operating expenses report, and the complete monthly ridership report.

Because this data is reported on a lag, we will use the 2022 version of all reports. Our data may have some post-pandemic irregularities, but that’s ok. We aren’t looking to make any long-term forecasts in this project.

The following code will download, clean, and join the tables. You don’t need to edit it in this project, but you may want to bookmark it as a useful example for later projects where you are responsible for downloading and cleaning the data.

if(!require("tidyverse")) install.packages("tidyverse")

# Let's start with Fare Revenue
library(tidyverse)
if(!file.exists("2022_fare_revenue.xlsx")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "2022_fare_revenue.xlsx" in your project
    # directory.
    download.file("http://www.transit.dot.gov/sites/fta.dot.gov/files/2024-04/2022%20Fare%20Revenue.xlsx", 
                  destfile="2022_fare_revenue.xlsx", 
                  quiet=FALSE, 
                  method="wget")
}
FARES <- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
    select(-`State/Parent NTD ID`, 
           -`Reporter Type`,
           -`Reporting Module`,
           -`TOS`,
           -`Passenger Paid Fares`,
           -`Organization Paid Fares`) |>
    filter(`Expense Type` == "Funds Earned During Period") |>
    select(-`Expense Type`) |>
    group_by(`NTD ID`,       # Sum over different `TOS` for the same `Mode`
             `Agency Name`,  # These are direct operated and sub-contracted 
             `Mode`) |>      # of the same transit modality
                             # Not a big effect in most munis (significant DO
                             # tends to get rid of sub-contractors), but we'll sum
                             # to unify different passenger experiences
    summarize(`Total Fares` = sum(`Total Fares`)) |>
    ungroup()

# Next, expenses
if(!file.exists("2022_expenses.csv")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "2022_expenses.csv" in your project
    # directory.
    download.file("https://data.transportation.gov/api/views/dkxx-zjd6/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true", 
                  destfile="2022_expenses.csv", 
                  quiet=FALSE, 
                  method="wget")
}
EXPENSES <- readr::read_csv("2022_expenses.csv") |>
    select(`NTD ID`, 
           `Agency`,
           `Total`, 
           `Mode`) |>
    mutate(`NTD ID` = as.integer(`NTD ID`)) |>
    rename(Expenses = Total) |>
    group_by(`NTD ID`, `Mode`) |>
    summarize(Expenses = sum(Expenses)) |>
    ungroup()

FINANCIALS <- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`))

Finally, let’s extract monthly transit numbers:

# Monthly Transit Numbers
library(tidyverse)
if(!file.exists("ridership.xlsx")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "ridership.xlsx" in your project
    # directory.
    download.file("https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/July%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240903.xlsx", 
                  destfile="ridership.xlsx", 
                  quiet=FALSE, 
                  method="wget")
}
TRIPS <- readxl::read_xlsx("ridership.xlsx", sheet="UPT") |>
            filter(`Mode/Type of Service Status` == "Active") |>
            select(-`Legacy NTD ID`, 
                   -`Reporter Type`, 
                   -`Mode/Type of Service Status`, 
                   -`UACE CD`, 
                   -`TOS`) |>
            pivot_longer(-c(`NTD ID`:`3 Mode`), 
                            names_to="month", 
                            values_to="UPT") |>
            drop_na() |>
            mutate(month=my(month)) # Parse _m_onth _y_ear date specs
MILES <- readxl::read_xlsx("ridership.xlsx", sheet="VRM") |>
            filter(`Mode/Type of Service Status` == "Active") |>
            select(-`Legacy NTD ID`, 
                   -`Reporter Type`, 
                   -`Mode/Type of Service Status`, 
                   -`UACE CD`, 
                   -`TOS`) |>
            pivot_longer(-c(`NTD ID`:`3 Mode`), 
                            names_to="month", 
                            values_to="VRM") |>
            drop_na() |>
            group_by(`NTD ID`, `Agency`, `UZA Name`, 
                     `Mode`, `3 Mode`, month) |>
            summarize(VRM = sum(VRM)) |>
            ungroup() |>
            mutate(month=my(month)) # Parse _m_onth _y_ear date specs

USAGE <- inner_join(TRIPS, MILES) |>
    mutate(`NTD ID` = as.integer(`NTD ID`))

This creates a table as follows:

if(!require("DT")) install.packages("DT")
library(DT)

sample_n(USAGE, 1000) |> 
    mutate(month=as.character(month)) |> 
    DT::datatable()

Instructor’s Note: You might want to explore the functions of the DT package to create more attractive tables. Even more advanced is the gt package.
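For instance, a lightly customized version of the snippet above might look like the following sketch; the caption, page length, and column labels here are illustrative choices, not requirements:

if(!require("DT")) install.packages("DT")
library(DT)

# A sketch of a slightly more polished interactive table: relabel the UPT
# and VRM columns for display and show 10 rows per page. The labels and
# caption are illustrative choices only.
sample_n(USAGE, 1000) |>
    mutate(month=as.character(month)) |>
    DT::datatable(colnames=c("Unlinked Passenger Trips" = "UPT",
                             "Vehicle Revenue Miles"    = "VRM"),
                  options=list(pageLength=10),
                  caption="Sample of monthly transit usage records")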

This is useful, but not exactly what we want. Here, the UPT column refers to Unlinked Passenger Trips, a measure of ridership in which each boarding counts as a separate trip (so a journey involving a transfer is counted more than once), and VRM refers to Vehicle Revenue Miles, roughly how far the transit provider’s vehicles traveled in revenue service. Some of the other column names are less helpful, so let’s rename them using the rename function.

Task 1 - Creating Syntactic Names

Rename the UZA Name column of the USAGE table to metro_area.

Because they have no spaces in them, these names will be easier to manipulate in code. Recall that non-syntactic names (names with spaces, punctuation, or strange characters) have to be quoted in backticks or quotes (depending on the context), while most tidyverse functions take syntactic names without quotes.
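A minimal sketch of this rename, assuming the USAGE table constructed above:

# Rename `UZA Name` to the syntactic name metro_area; the pattern is
# rename(new_name = `old name`). Other columns can be renamed the same way.
USAGE <- USAGE |>
    rename(metro_area = `UZA Name`)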

The Mode column is also helpful, but it uses a set of codes that aren’t interpretable. To make life easier for ourselves, let’s use a case_when statement to transform this into something we can make sense of.

Task 2: Recoding the Mode column

First, find the unique Mode codes in our data using the distinct function. Next, examine the NTD website and find the interpretations of these codes. Complete the following snippet to recode the Mode column.

## This code needs to be modified
USAGE <- USAGE |>
    mutate(Mode=case_when(
        Mode == "HR" ~ "Heavy Rail", 
        ...
        ...
        TRUE ~ "Unknown"))
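To see exactly which codes need recoding, a quick sketch using distinct:

# List each distinct Mode code present in the data; each of these codes
# will need a corresponding branch in the case_when() above.
USAGE |>
    distinct(Mode)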

Now that your data is clean, you may want to create an attractive summary table of your cleaned up USAGE table using the following snippet:

if(!require("DT")) install.packages("DT")
library(DT)

sample_n(USAGE, 1000) |> 
    mutate(month=as.character(month)) |> 
    DT::datatable()

To make your table cleaner, you might want to modify this code to drop the NTD ID and 3 Mode columns and/or rename the UPT and VRM columns; one possible variant is sketched after the note below.

Note: The use of sample_n here is just to make a sufficiently small sample to view it in a table. For your actual analysis, you should use the entire data set.
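As a sketch, one such variant is shown here; the display labels are your choice, and this pipeline only changes the displayed table, not the USAGE table itself:

# A sketch of a cleaner display table: drop the identifier columns and give
# the usage columns readable names. Because the result is not assigned back
# to USAGE, this affects only the displayed table.
sample_n(USAGE, 1000) |>
    select(-`NTD ID`, -`3 Mode`) |>
    rename(`Unlinked Passenger Trips` = UPT,
           `Vehicle Revenue Miles` = VRM) |>
    mutate(month=as.character(month)) |>
    DT::datatable()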

Task 3: Answering Instructor Specified Questions with dplyr

Using functions we have studied in class, including filter, group_by, summarize, and arrange, answer the following questions in your analysis (a sketch of the general pattern appears after the list):

  1. What transit agency had the most total VRM in our data set?

  2. What transit mode had the most total VRM in our data set?

  3. How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?

  4. What mode of transport had the longest average trip in May 2024?

    Note: This question can’t be answered with vehicle miles. To get average passenger trip length, we need passenger miles.

  5. How much did NYC subway ridership fall between April 2019 and April 2020?
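All of these questions follow a similar recipe. A sketch of that recipe for question 1 is shown below; the remaining questions use different groupings and filters (for example, filtering on month for the May 2024 questions):

# A sketch of the general group_by / summarize / arrange pattern, applied
# to question 1 (total VRM by agency). The other questions follow the same
# recipe with different grouping variables and filters.
USAGE |>
    group_by(Agency) |>
    summarize(total_VRM = sum(VRM)) |>
    arrange(desc(total_VRM)) |>
    slice(1)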

Task 4: Explore and Analyze

Find three more interesting transit facts in this data other than those above.

We are now ready to combine these usage statistics with the revenue and expense data. Because our fare data covers 2022 as a whole, we need to roll our usage table up to 2022 annual totals.

Task 5: Table Summarization

Create a new table from USAGE that has annual total (sum) UPT and VRM for 2022. This will require use of the group_by, summarize, and filter functions. You will also want to use the year function to extract the year from the month column. A sketch of this summarization appears below.

The resulting table should have the following columns:

  • NTD ID
  • Agency
  • metro_area
  • Mode
  • UPT
  • VRM

Make sure to ungroup your table after creating it.

Name this table USAGE_2022_ANNUAL.
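A minimal sketch of this summarization is given below; it assumes UZA Name was renamed to metro_area in Task 1 and that the lubridate year function is available via the tidyverse:

# A sketch of the 2022 annual roll-up. Assumes `UZA Name` was renamed to
# metro_area in Task 1; adjust the column names if yours differ.
# Note: if you recoded Mode in Task 2, FINANCIALS still uses the NTD mode
# codes, so you may need to apply the same recoding there (or recode after
# joining) for the join below to find matches.
USAGE_2022_ANNUAL <- USAGE |>
    filter(year(month) == 2022) |>
    group_by(`NTD ID`, Agency, metro_area, Mode) |>
    summarize(UPT = sum(UPT),
              VRM = sum(VRM)) |>
    ungroup()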

Once you have created this new table, you can merge it to the FINANCIALS data as follows:

USAGE_AND_FINANCIALS <- left_join(USAGE_2022_ANNUAL, 
           FINANCIALS, 
           join_by(`NTD ID`, Mode)) |>
    drop_na()

Note that the name fields differ between the ridership and financials tables, so it’s a good thing we had the unique identifier NTD ID to rely on.

We are now finally ready to return to our original question about farebox recovery.

Task 6: Farebox Recovery Among Major Systems

Using the USAGE_AND_FINANCIALS table, answer the following questions:

  1. Which transit system (agency and mode) had the most UPT in 2022?
  2. Which transit system (agency and mode) had the highest farebox recovery, defined as the ratio of Total Fares to Expenses?
  3. Which transit system (agency and mode) has the lowest expenses per UPT?
  4. Which transit system (agency and mode) has the highest total fares per UPT?
  5. Which transit system (agency and mode) has the lowest expenses per VRM?
  6. Which transit system (agency and mode) has the highest total fares per VRM?

You may wish to restrict your answer to major transit systems, which you can define as those with at least 400,000 UPT per annum; a sketch of one such calculation follows.
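As a sketch, the farebox recovery calculation (question 2) restricted to major systems might look like the following; the other questions swap in different ratios, such as Expenses divided by UPT:

# A sketch of farebox recovery among major systems: restrict to systems
# with at least 400,000 annual UPT, compute Total Fares / Expenses, and
# keep the top row. The 400,000 cutoff follows the suggestion above.
USAGE_AND_FINANCIALS |>
    filter(UPT >= 400000) |>
    mutate(farebox_recovery = `Total Fares` / Expenses) |>
    select(Agency, Mode, farebox_recovery) |>
    arrange(desc(farebox_recovery)) |>
    slice(1)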

Based on all of this, what do you believe to be the most efficient transit system in the country? (Your answer may differ depending on which form of ‘efficiency’ you care most about.)


This work ©2024 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.