```r
if(!file.exists("data/mp01/nyc_payroll_export.csv")){
  dir.create("data/mp01", showWarnings=FALSE, recursive=TRUE)

  ENDPOINT <- "https://data.cityofnewyork.us/resource/k397-673e.json"

  if(!require("httr2")) install.packages("httr2")
  library(httr2)
  if(!require("jsonlite")) install.packages("jsonlite")
  library(jsonlite)
  if(!require("dplyr")) install.packages("dplyr")
  library(dplyr)
  if(!require("readr")) install.packages("readr")
  library(readr)

  BATCH_SIZE <- 50000
  OFFSET <- 0
  END_OF_EXPORT <- FALSE
  ALL_DATA <- list()

  while(!END_OF_EXPORT){
    cat("Requesting items", OFFSET, "to", BATCH_SIZE + OFFSET, "\n")

    req <- request(ENDPOINT) |>
      req_url_query(`$limit`  = BATCH_SIZE,
                    `$offset` = OFFSET)

    resp <- req_perform(req)

    batch_data <- fromJSON(resp_body_string(resp))

    ALL_DATA <- c(ALL_DATA, list(batch_data))

    if(NROW(batch_data) != BATCH_SIZE){
      END_OF_EXPORT <- TRUE
      cat("End of Data Export Reached\n")
    } else {
      OFFSET <- OFFSET + BATCH_SIZE
    }
  }

  ALL_DATA <- bind_rows(ALL_DATA)

  cat("Data export complete:", NROW(ALL_DATA), "rows and", NCOL(ALL_DATA), "columns.")

  write_csv(ALL_DATA, "data/mp01/nyc_payroll_export.csv")
}
```
STA 9750 Mini-Project #01: Welcome to the Commission to Analyze Taxpayer Spending (CATS)
Due Dates
- Released to Students: 2025-02-13
- Initial Submission: 2025-02-26 11:45pm ET on GitHub and Brightspace
- Peer Feedback Assigned: 2025-02-27 on GitHub
- Peer Feedback Due: 2025-03-05 06:45pm ET on GitHub

Estimated Time to Complete: 9 Hours
Estimated Time for Peer Feedback: 1 Hour
Welcome to STA 9750 Mini Projects!
In the STA 9750 Mini-Projects, you will perform basic data analyses intended to model best practices for your course final project. (Note, however, that these are mini-projects; your final course project is expected to be far more extensive than any single MP.)
For purposes of MPs, we are dividing the basic data analytic workflow into several major stages:
- Data Ingest and Cleaning: Given a single data source, read it into `R` and transform it to a reasonably useful standardized format.
- Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
- Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
- Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
- Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:
| | Ingest and Cleaning | Combination and Alignment | Descriptive Statistical Analysis | Visualization |
|---|---|---|---|---|
| Mini-Project #01 | ✓ | | | |
| Mini-Project #02 | ✓ | ✓ | ½ | |
| Mini-Project #03 | ½ | ✓ | ✓ | ✓ |
| Mini-Project #04 | ✓ | ✓ | ✓ | ✓ |
In early stages of the course, such as this MP, I will ‘scaffold’ much of the analysis for you, leaving only those stages we have discussed in class for you to fill in. As the course progresses, the mini-projects will be more self-directed and results less standardized.
Rubric
STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. Specifically, variants of the following rubric will be used for the mini-projects:
| Course Element | Excellent (9-10) | Great (7-8) | Good (5-6) | Adequate (3-4) | Needs Improvement (1-2) | Extra Credit |
|---|---|---|---|---|---|---|
| Written Communication | Report is well-written and flows naturally. Motivation for key steps is clearly explained to the reader without excessive detail. Key findings are highlighted and appropriately given context. | Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context. | Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted. | Writing is intelligible, but has some grammatical errors. Key findings are obscured. | Report exhibits significant weakness in written communication. Key points are difficult to discern. | Report includes extra context beyond instructor-provided information. |
| Project Skeleton | Code completes all instructor-provided tasks correctly. | Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete. | Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Responses to three instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Less than half of the instructor-provided tasks were successfully completed. | Report exhibits particularly creative insights beyond instructor specifications. |
| Formatting & Display | Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Table has a suitable caption. | Column names and digits are well-chosen, but formatting could be improved. | Bad column names (opaque variable names or other undefined acronyms). | Unfiltered 'data dump' instead of curated table. | No tables. | Report includes one or more high-quality graphics (created using `R`). |
| Code Quality | Code is (near) flawless. Code passes all checks. | Comments give context of the analysis, not simply defining functions used in a particular line. | Code has well-chosen variable names and basic comments. | Code executes properly, but is difficult to read. | Code fails to execute properly. | Code takes advantage of advanced Quarto features to improve presentation of results. |
| Data Preparation | Automatic (10/10). Out of scope for this mini-project. | | | | | Report modifies instructor-provided import code to use additional columns or data sources in a way that creates novel insights. |
Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.
Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!
Submission Instructions
After completing the analysis, write up your findings, showing all of your code, using a dynamic `quarto` document and post it to your course repository. The `qmd` file should be named `mp01.qmd` so the rendered document can be found at `docs/mp01.html` in the student's repository and served at the URL:1

https://<GITHUB_ID>.github.io/STA9750-2025-SPRING/mp01.html

Once you confirm this website works (substituting `<GITHUB_ID>` for the actual GitHub username provided to the professor in MP#00, of course), open a new issue at:

https://github.com/<GITHUB_ID>/STA9750-2025-SPRING/issues/new
Title the issue `STA 9750 <GITHUB_ID> MiniProject #01` and fill in the following text for the issue:
Hi @michaelweylandt!
I've uploaded my work for MiniProject #**01** - check it out!
https://<GITHUB_ID>.github.io/STA9750-2025-SPRING/mp01.html
Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.
Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.
NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.
NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.
Mini-Project #01: Welcome to the Commission to Analyze Taxpayer Spending (CATS)
Congratulations! You have just been appointed as a senior technical analyst working with New York City's new Commission to Analyze Taxpayer Spending (CATS). As a technical analyst, you are tasked with helping the Commissioners understand New York City's expenses and identifying opportunities to spend taxpayer monies more effectively. Specifically, the Commission chair, Mr. Keno Slum, has asked you to analyze the City payroll and to identify instances in which senior agency officials make significantly more than rank-and-file city employees.
In this mini-project, you will analyze City payroll data and write a report highlighting possible savings to be submitted to the CATS Commissioners. In this mini-project, you will:
- Begin to work with NYC Open Data
- Practice use of `dplyr` for analysis of tabular data
- Practice use of `quarto` and reproducible research tools for effective communication of data analysis results
Recall that you are evaluated on writing and communication in these Mini-Projects. You are required to write a report in the prescribed style, here an internal research briefing. A submission that performs the instructor-specified tasks, but does not give appropriate context and commentary, will score very poorly on the relevant rubric elements.
In particular, if a submission is not in “white paper” style, peer evaluators should judge it to have “Good” quality Written Communication (at best) as key findings are not conveyed appropriately.
Quarto's code folding functionality is useful for "hiding" code so that it doesn't break the flow of your writing.
Acquiring Payroll Data
The code at the top of this document will download the city payroll data and create a file `nyc_payroll_export.csv` in a `data/mp01` directory. If this doesn't work for whatever reason, you can download the data directly from NYC OpenData, though you will need to make sure it is in a suitable location and format for use in this mini-project.2
You do not (yet) need to understand the code above, so please use the course discussion board if you have trouble getting it working.
Using the code above, acquire the latest NYC Payroll Data.
Do Not `git add` Data Files

Make sure that `git` is set to ignore data files, such as the one created above. Check the `git` pane in `RStudio` and make sure that `nyc_payroll_export.csv` does not appear. (If you set up your `.gitignore` file correctly in MP#00, it should already be ignored.) If it is appearing, you may need to edit your `.gitignore` file.

Removing a large data file from `git` is possible, but difficult. Don't get into a bad state!
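If `nyc_payroll_export.csv` does appear in the `git` pane, an entry like the following in your `.gitignore` (the `data/` path matches the directory created by the download code) will keep the entire data directory out of version control:

```
# Ignore locally downloaded data files
data/
```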
Importing Data into `R` and Preparing for Analysis
Before we can analyze this data, we need to get it into `R` and in a suitable format. The `read_csv` function from the `readr` package can be used to read `csv` files into `R` (imagine that!). The above download code will handle most of the irregularities in the city data,3 so we only need to make a few changes before beginning our analysis. In particular, let's change some of the string columns to more conventional capitalization. The `str_to_title` function from the `stringr` package isn't perfect (name capitalization rules can be rather idiosyncratic), but it will get us pretty close.
Read your data into `R` using the `read_csv` function from the `readr` package. Before continuing, use a `mutate` command and the `str_to_title` function from the `stringr` package to convert the following columns to more conventional formatting.
- Agency Name
- Last Name
- First Name
- Work Location (Borough)
- Title / Job Description
- Leave Status
It is good practice to always visually check your data to make sure it is properly formatted. You can use the `glimpse` function for a quick look here. We will use more refined formatting below.
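Putting the import and clean-up steps together, a minimal sketch might look like the following. The snake_case column names (e.g., `agency_name`, `leave_status_as_of_june_30`) are assumptions based on the download code's output; check yours with `glimpse` and adjust as needed.

```r
library(readr)
library(dplyr)
library(stringr)

# Read the exported file and title-case the main string columns.
# NOTE: the column names below are assumed, not guaranteed -- verify them
# against your own export before running.
payroll <- read_csv("data/mp01/nyc_payroll_export.csv") |>
  mutate(across(
    c(agency_name, last_name, first_name,
      work_location_borough, title_description, leave_status_as_of_june_30),
    str_to_title
  ))

glimpse(payroll)  # quick visual check of column types and sample values
```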
Initial Exploration
Identifying Individual Records
Any time you are analyzing a new data set, you should begin by making sure you understand what the relevant columns are. In this case, we have basic descriptions of each column from the data page. We seem, however, to lack a few things you would normally want in a data set like this: in particular, we don’t have a unique ID for employees (a “primary key” in database parlance). This will make it harder for us to track individuals across years or across agencies (if they have two jobs at different agencies in the same fiscal year); we can approximate a unique identifier with First+Middle+Last, but it won’t be perfect.
Let's begin by reviewing the career of the current mayor, Eric L. Adams. Use a combination of `filter`, `rename`, and `arrange` to create a table like the following. (You may also need to use `group_by` and `summarize`.)
| Fiscal Year | Position | Agency | Total Salary |
|---|---|---|---|
| 2020 | Game Keeper | Parks and Recreation | 15000 |
| 2021 | Fish Sorter | Environmental Protection | 25000 |
| 2022 | Mayor | Office of the Mayor | 35000 |
| 2023 | Hamburgler | McDonald's on 23rd | 20000 |
(Obviously, these numbers and details are not correct.) Note that the Mayoral term does not line up with the City's fiscal year, so you can choose how best to combine years in which Mr. Adams held more than one job. Once you have created the career table for Mr. Adams, you can use the `DT` package to create an attractive visualization of your results.
For example,
```r
library(DT)
library(scales)
library(readr)   # read_csv
library(dplyr)   # mutate

tbl_txt <- "
Fiscal Year, Position, Agency, Total Salary
2020, Game Keeper, Parks and Recreation, 15000
2021, Fish Sorter, Environmental Protection, 25000
2022, Mayor, Office of the Mayor, 35000
2023, Hamburgler, McDonald's on 23rd, 20000"

read_csv(tbl_txt) |>
  mutate(`Total Salary` = dollar(`Total Salary`)) |>
  datatable(options=list(searching=FALSE,
                         paging=FALSE,
                         info=FALSE))
```
The `DT` package wraps the JavaScript `datatables` library to provide interactive tables. This library has many options to control formatting. Later in this course, you will also encounter the `gt` package to create complex tables natively in `R`.
Create an employee salary table for Eric L. Adams similar to that shown above, but with real salary and employment records.
Calculating Aggregate Salaries
For high-ranking officials like Mayor Adams, total compensation is a fixed salary. For other NYC employees, total compensation is computed using their hourly wage and the total hours worked (both regular and overtime). As a general rule, overtime is paid at a 1.5x premium, so an employee who worked 40 hours of scheduled time and 20 hours of overtime at a base pay of $25 per hour will be paid:

\[ \$25 \times (40 + 20 \times 1.5) = \$1750 \]
Other employees are paid a "day rate" - that is, a fixed amount per day worked. For purposes of this exercise, you can convert hours worked to days at a fixed rate of 7.5 hours per day. That is, if an employee is paid $100 per day and reports 1500 hours worked, you can estimate their pay as:
\[ \$100 * \frac{1500}{7.5} = \$20,000 \]
Use these calculations to compute actual total compensation for each employee record in our data set. You will need to use a `case_when()` function inside a `mutate()` command to handle different pay structures.
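As a sketch of that `mutate()` + `case_when()` pattern on toy data: the `pay_basis` labels ("per Annum", "per Hour", "per Day") are assumptions here -- check the actual codes in your data, e.g. with `distinct(payroll, pay_basis)`, before relying on them.

```r
library(dplyr)

toy <- tibble(
  base_salary   = c(100000, 25, 100),
  pay_basis     = c("per Annum", "per Hour", "per Day"),
  regular_hours = c(NA, 40, 1500),
  ot_hours      = c(NA, 20, 0)
)

toy <- toy |>
  mutate(total_comp = case_when(
    pay_basis == "per Annum" ~ base_salary,                                    # fixed salary
    pay_basis == "per Hour"  ~ base_salary * (regular_hours + 1.5 * ot_hours), # 1.5x OT premium
    pay_basis == "per Day"   ~ base_salary * (regular_hours + ot_hours) / 7.5  # 7.5-hour days
  ))

toy$total_comp  # 100000, 1750, 20000 -- matching the worked examples above
```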
Now that we have computed total compensation for each city employee, we are ready to begin our analysis of the city payroll. Before we go further, answer some general questions about this data set to make sure you are comfortable with it.
Answer the following questions about city payroll data. In your final write-up, include these in the form of a “quick facts” bullet as part of your introduction.
- Which job title has the highest base rate of pay? (If needed, assume a standard 2000-hour work year and no overtime.)
- Which individual & in what year had the single highest city total payroll (regular and overtime combined)?
- Which individual worked the most overtime hours in this data set?
- Which agency has the highest average total annual payroll (base and overtime pay per employee)?
- Which agency has the most employees on payroll in each year?
- Which agency has the highest overtime usage (compared to regular hours)?
- What is the average salary of employees who work outside the five boroughs? (That is, whose `work_location_borough` is not one of the five counties.)
- How much has the city's aggregate payroll grown over the past 10 years?
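Most of these questions follow the same group-then-summarize pattern. A hypothetical sketch for the "highest average total annual payroll by agency" question, on toy data with illustrative agency names and a `total_comp` column assumed from the previous step:

```r
library(dplyr)

toy <- tibble(
  agency_name = c("Agency A", "Agency A", "Agency B"),
  total_comp  = c(50000, 70000, 90000)
)

# Average compensation per agency, keeping only the top agency
toy |>
  group_by(agency_name) |>
  summarize(avg_comp = mean(total_comp), .groups = "drop") |>
  slice_max(avg_comp, n = 1)   # Agency B, avg_comp 90000
```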
Our data set includes columns `regular_gross_paid`, `total_ot_paid`, and `total_other_paid` that seem like they could be used for this mini-project. If you look closely, you will see that these amounts do not match what you might calculate by hand. This is because these include adjustments for i) use of deferred compensation plans; ii) pre-tax benefit withholding; and iii) various ad hoc adjustments negotiated as part of collective bargaining agreements. Understanding all of these is far beyond the scope of this project, so you should use the simplified calculations described above.
Policy Analysis
CATS has asked you to analyze three possible policy changes and assess their impact on overall spending. Your supervisor has suggested two policies and has asked you to create a third for analysis. For each policy,
- compute its impact on city payroll, i.e., determine how the total payroll expenses would have changed if that policy had been in place historically;
- determine any other staffing adjustments required to implement that policy, e.g., hiring more employees; and
- make a recommendation to the CATS commissioners on whether this policy should be adopted.
Policy I: Capping Salaries at Mayoral Level
Many governments require that no subordinate employee be paid more than the chief executive (mayor, governor, president). CATS is considering recommending that the city adopt such a policy. To analyze it:
- Compute the total mayor pay for each fiscal year and identify employees who made more than this amount in the same fiscal year.
- Determine total savings if these employees’ compensation were capped at the mayor’s salary.
- Identify which agencies and job titles (if any) would bear the brunt of this policy.
Analyze the impact of capping total aggregate compensation for any employee at the level of the mayor’s annual salary. Make a recommendation to CATS based on your findings.
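One way to structure the cap calculation is to build a small per-year table of mayoral pay and join it back to the employee records. A sketch on toy data (all figures and table names, e.g. `mayor_pay`, are illustrative):

```r
library(dplyr)

emps <- tibble(
  fiscal_year = c(2023, 2023),
  total_comp  = c(300000, 150000)
)
mayor_pay <- tibble(fiscal_year = 2023, mayor_comp = 250000)

emps |>
  left_join(mayor_pay, by = "fiscal_year") |>
  mutate(savings = pmax(total_comp - mayor_comp, 0)) |>  # only above-cap pay counts
  summarize(total_savings = sum(savings))                # 50000
```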
Policy II: Increasing Staffing to Reduce Overtime Expenses
A major driver of payroll expenditures is the “1.5x” premium associated with overtime work. That is, it may be cheaper to hire two employees to each work 30 hours to complete a task than to hire one employee to work 40 regular hours and 20 overtime hours to complete the same task.4 The CATS Commissioners are considering urging certain city agencies to hire more employees and to reduce the amount of overtime used.
Analyze the potential upside of increasing employment to reduce overtime.
- For each combination of agency and job title, identify the total number of overtime hours worked and see how many full-time employees it would take to replace that much overtime.
- For each combination of agency and job title, calculate the total savings possible by converting all overtime hours to regular-time hours for (new) employees.
- Determine the aggregate savings possible by agency. This will let the CATS Commission recommend the agencies where this hiring action would have the largest benefit.
Note that much of this analysis must be done on an agency + job title basis. If, e.g., NYPD is paying lots of overtime to Sergeants, hiring additional IT Specialists will not help. Similarly, hiring additional IT Specialists into NYPD will not help reduce the need for IT Specialists at the NYC Housing Authority. After completing this analysis at the agency + job title basis, you can aggregate (over job titles) to the agency level, as this is the most likely level at which hiring policy adjustments can be made.
Analyze the impact of authorizing agencies to hire more employees with the intent to reduce overtime expenses. Make sure to report the total potential savings, the total number of employees needed, and the agencies and job titles where such a change in policy would have the largest impact. The CATS Commissioners are particularly interested in a job title analysis as it may be hard to hire additional employees in certain positions.
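The core arithmetic can be sketched as follows on toy data: the 2000-hour work year mirrors the assumption suggested earlier in this project, and the column names are illustrative. Savings come from avoiding the 0.5x premium portion of overtime pay.

```r
library(dplyr)

toy <- tibble(
  agency_name = c("Agency A", "Agency B"),
  title       = c("Title X", "Title Y"),
  hourly_rate = c(40, 30),
  ot_hours    = c(4000, 1000)
)

toy |>
  mutate(
    fte_needed = ot_hours / 2000,              # full-time hires to absorb the OT
    savings    = ot_hours * hourly_rate * 0.5  # the avoided 0.5x overtime premium
  ) |>
  group_by(agency_name) |>
  summarize(fte_needed = sum(fte_needed),
            savings    = sum(savings),
            .groups    = "drop")
```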
Policy III: Create Your Own Policy Proposal
Create your own policy proposal and analyze it in the same manner as the previous two policies. You are encouraged to draw upon real-world proposals to reduce city payroll.
Deliverable: Policy Analysis “White Paper”
Write up your findings in the form of a “white paper” to be shared with the CATS Commissioners.5 Your submission should include all three policy analyses as well as sufficient background and context for a reader who is not already familiar with NYC pay structures. Note any assumptions or limitations of your analysis and explain how they might introduce error into your projections. You are encouraged to add images and figures as necessary, as well as linking to prior published analyses of NYC payroll.
Extra Credit
There is no extra credit available on this mini-project beyond that specified in the rubric above. In particular, peer evaluators are authorized to give up to 8 points (total across all categories) for
- Reports that are particularly well-written and include, at a minimum, citations to existing analyses and policy proposal documents raised by local think-tanks and political candidates. Such citations are made stronger by a comparison of results and outcomes: you shouldn’t just say Think-Tank A made a similar proposal, but you should also see if your results differ with theirs and, if so, report any source(s) of the difference.
- Policy proposals that are particularly creative and effective. When awarding extra credit, peer evaluators should be sure to assess the political feasibility of proposals. A proposal to, e.g., eliminate all public schools would certainly reduce the city’s spending on teachers but stands no chance of actually being implemented. A good policy proposal is both effective and effectible.
This work ©2025 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.
Footnotes
1. Throughout this section, replace `<GITHUB_ID>` with your GitHub ID from Mini-Project #00, making sure to remove the angle brackets. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

2. For some reason, this data takes up to seven or eight minutes to download from the city website, even though it's really not that complicated or massive. Don't fret.↩︎

3. In particular, if you use the CSV export, the names of the columns in the data file will be "non-syntactic." This means they are tricky to use in `R` and require use of backticks throughout your code. You will want to change these column names before proceeding. The `rename` function can be useful here:

   ```r
   DATA <- DATA |> rename(column_name=`Column Name`, other_column=`Other Column`)
   ```

   Note that you will need to change all column names in this fashion.↩︎

4. A commonly-raised criticism of public sector unions, such as those covering NYC employees, is that they make it difficult to hire additional employees, thereby directing more overtime hours to their (current) membership. I do not know of any reputable analysis supporting or refuting this claim in the context of NYC, though I would be interested in reading one if you come across one in your background reading.↩︎

5. This recent white paper from the Rockefeller Institute regarding updating NYS' public school funding formula is a good example of the genre. Clearly your report does not need to be this long or this detailed.↩︎