```r
if(!dir.exists(file.path("data", "mp01"))){
  dir.create(file.path("data", "mp01"), showWarnings=FALSE, recursive=TRUE)
}

GLOBAL_TOP_10_FILENAME <- file.path("data", "mp01", "global_top10_alltime.csv")
if(!file.exists(GLOBAL_TOP_10_FILENAME)){
  download.file("https://www.netflix.com/tudum/top10/data/all-weeks-global.tsv",
                destfile=GLOBAL_TOP_10_FILENAME)
}

COUNTRY_TOP_10_FILENAME <- file.path("data", "mp01", "country_top10_alltime.csv")
if(!file.exists(COUNTRY_TOP_10_FILENAME)){
  download.file("https://www.netflix.com/tudum/top10/data/all-weeks-countries.tsv",
                destfile=COUNTRY_TOP_10_FILENAME)
}
```
STA 9750 Mini-Project #01: Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix
Due Dates
- Released to Students: 2025-09-16
- Initial Submission: 2025-10-03 11:59pm ET on GitHub and Brightspace

Peer Feedback:
- Peer Feedback Assigned: 2025-10-06 on GitHub
- Peer Feedback Due: 2025-10-13 11:59pm ET on GitHub
Estimated Time to Complete: 9 Hours
Estimated Time for Peer Feedback: 1 Hour
Welcome to STA 9750 Mini Projects!
In the STA 9750 Mini-Projects, you will perform basic data analyses intended to model best practices for your course final project. (Note, however, that these are mini-projects; your final course project is expected to be far more extensive than any single MP.)
For purposes of MPs, we are dividing the basic data analytic workflow into several major stages:
- Data Ingest and Cleaning: Given a single data source, read it into `R` and transform it to a reasonably useful standardized format.
- Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
- Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
- Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
- Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:
| | Ingest and Cleaning | Combination and Alignment | Descriptive Statistical Analysis | Visualization |
|---|---|---|---|---|
| Mini-Project #01 | ✓ | | | |
| Mini-Project #02 | ✓ | ✓ | ½ | |
| Mini-Project #03 | ½ | ✓ | ✓ | ✓ |
| Mini-Project #04 | ✓ | ✓ | ✓ | ✓ |
In early stages of the course, such as this MP, I will ‘scaffold’ much of the analysis for you, leaving only those stages we have discussed in class for you to fill in. As the course progresses, the mini-projects will be more self-directed and results less standardized.
Rubric
STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course staff. Specifically, variants of the following rubric will be used for the mini-projects:
| Course Element | Excellent (9-10) | Great (7-8) | Good (5-6) | Adequate (3-4) | Needs Improvement (1-2) | Extra Credit |
|---|---|---|---|---|---|---|
| Written Communication | Report is well-written and flows naturally. Motivation for key steps is clearly explained to reader without excessive detail. Key findings are highlighted and appropriately given context. | Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context. | Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted. | Writing is intelligible, but has some grammatical errors. Key findings are obscured. | Report exhibits significant weakness in written communication. Key points are difficult to discern. | Report includes extra context beyond instructor-provided information. |
| Project Skeleton | Code completes all instructor-provided tasks correctly. | Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete. | Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Responses to three instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Less than half of the instructor-provided tasks were successfully completed. | Report exhibits particularly creative insights beyond instructor specifications. |
| Formatting & Display | Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Table has a suitable caption. | Column names and digits are well-chosen, but formatting could be improved. | Bad column names (opaque variable names or other undefined acronyms). | Unfiltered 'data dump' instead of curated table. | No tables. | Report includes one or more high-quality graphics (created using `R`). |
| Code Quality | Code is (near) flawless. Code passes all … | Comments give context of the analysis, not simply defining functions used in a particular line. | Code has well-chosen variable names and basic comments. | Code executes properly, but is difficult to read. | Code fails to execute properly. | Code takes advantage of advanced Quarto features to improve presentation of results. |
| Data Preparation | Automatic (10/10). Out of scope for this mini-project. | | | | | Report modifies instructor-provided import code to use additional columns or data sources in a way that creates novel insights. |
Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.
Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!
Submission Instructions
After completing the analysis, write up your findings, showing all of your code, using a dynamic Quarto document and post it to your course repository. The `qmd` file should be named `mp01.qmd` so the rendered document can be found at `docs/mp01.html` in the student's repository and served at the URL:1
https://<GITHUB_ID>.github.io/STA9750-2025-FALL/mp01.html
Once you confirm this website works (substituting `<GITHUB_ID>` for the actual GitHub username provided to the professor in MP#00, of course), open a new issue at https://github.com/michaelweylandt/STA9750-2025-FALL/issues/new.

Title the issue `STA 9750 <GITHUB_ID> MiniProject #01` and fill in the following text for the issue:
```
Hi @michaelweylandt!

I've uploaded my work for MiniProject #**01** - check it out!

https://<GITHUB_ID>.github.io/STA9750-2025-FALL/mp01.html
```
Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.
Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.
NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.
NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.
Mini-Project #01: Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix
Congratulations! You have just started in an exciting new job opportunity: you are a data analyst at Netflix, tasked with supporting the Public Relations team. As you may know, Netflix continues to invest heavily in original television and film content, with the goal of producing successful “gourmet cheeseburgers”, i.e., high-quality content that is still accessible to and popular with a global mass-market audience. Accordingly, the PR team wants to put out a series of press releases highlighting recent successes. In this project, you will mine Netflix’s public Top 10 data to identify and quantify these successes.2
In this mini-project, you will:
- Practice use of
dplyr
for analysis of tabular data - Practice use of
quarto
and Reproducible Research Tools for Effective Communication of Data Analysis Results - Begin your professional data science portfolio.
Recall that you are evaluated on writing and communication in these Mini-Projects. You are required to write a report in the prescribed style, culminating in a set of press releases highlighting the success of recent Netflix original releases. A submission that performs the instructor-specified tasks, but does not give appropriate context and commentary, will score very poorly on the relevant rubric elements.
In particular, if a submission does not include “press releases” and only answers the instructor prompts in narrative text, peer evaluators should judge it to have “Good” quality Written Communication (at best) as key findings are not conveyed appropriately.
Quarto’s code folding functionality is useful for “hiding” code so that it doesn’t break the flow of your writing.
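Code folding is enabled in the document's YAML header. As a sketch, one possible configuration (field names per Quarto's HTML format options) looks like:

```yaml
format:
  html:
    code-fold: true   # collapse code blocks behind a "Code" toggle
```

With this setting, every code chunk renders collapsed by default, and readers can expand it on demand.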
You can also make use of Quarto's `contents` shortcode to present code and findings in an order other than how the code should be executed. This is particularly useful if you want to include a figure or table in an "Executive Summary" at the top of your submission.
Acquire Data
For this project, we will deal with two separate data files, taken from Netflix’s TuDum Top 10. These are:
- Global Top 10
- Country-wide Top 10
The code at the top of this document will download the Netflix data and create TSV3 files in a `data/mp01` directory. If this doesn't work for whatever reason, you can download the data directly from Netflix, though you will need to make sure you are getting the right files and storing them in a location and format suitable for this mini-project.
You do not (yet) need to understand the code above, so please use the course discussion board if you have trouble getting it working.
Using the code above, acquire the latest Netflix Top 10 Data. Copy the code into your Quarto document and make sure it runs successfully.
`git add` Data Files
Make sure that `git` is set to ignore data files, such as the ones created above. Check the `git` pane in RStudio and make sure that `global_top10_alltime.csv` and `country_top10_alltime.csv` do not appear. (If you set up your `.gitignore` file correctly in MP#00, they should already be ignored.) If they are appearing, you may need to edit your `.gitignore` file.
Removing a large data file from `git` is possible, but difficult. Don't get into a bad state!
Data Import and Preparation
Before we can analyze this data, we need to get it into `R` and in a suitable format. The `read_tsv` function from the `readr` package can be used to read `tsv` files into `R` (imagine that!). This data comes to us quite well-formatted, so we will only need to make one small alteration.

Let's start by reading in the data and using the `str()` and `glimpse()` functions to examine its structure:
```r
if(!require("tidyverse")) install.packages("tidyverse")
library(readr)
library(dplyr)

GLOBAL_TOP_10 <- read_tsv(GLOBAL_TOP_10_FILENAME)
```

```
Rows: 8640 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (3): category, show_title, season_title
dbl  (5): weekly_rank, weekly_hours_viewed, runtime, weekly_views, cumulativ...
date (1): week
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
```r
str(GLOBAL_TOP_10)
```

```
spc_tbl_ [8,640 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ week                      : Date[1:8640], format: "2025-08-17" "2025-08-17" ...
 $ category                  : chr [1:8640] "Films (English)" "Films (English)" "Films (English)" "Films (English)" ...
 $ weekly_rank               : num [1:8640] 1 2 3 4 5 6 7 8 9 10 ...
 $ show_title                : chr [1:8640] "KPop Demon Hunters" "Night Always Comes" "My Oxford Year" "Happy Gilmore 2" ...
 $ season_title              : chr [1:8640] "N/A" "N/A" "N/A" "N/A" ...
 $ weekly_hours_viewed       : num [1:8640] 43300000 20800000 20800000 13800000 10200000 8900000 8200000 8200000 3900000 4300000 ...
 $ runtime                   : num [1:8640] 1.67 1.83 1.88 1.97 1.57 ...
 $ weekly_views              : num [1:8640] 26000000 11300000 11000000 7000000 6500000 5400000 5100000 4100000 2700000 2600000 ...
 $ cumulative_weeks_in_top_10: num [1:8640] 9 1 3 4 8 2 1 2 1 19 ...
 - attr(*, "spec")=
  .. cols(
  ..   week = col_date(format = ""),
  ..   category = col_character(),
  ..   weekly_rank = col_double(),
  ..   show_title = col_character(),
  ..   season_title = col_character(),
  ..   weekly_hours_viewed = col_double(),
  ..   runtime = col_double(),
  ..   weekly_views = col_double(),
  ..   cumulative_weeks_in_top_10 = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
```
```r
glimpse(GLOBAL_TOP_10)
```

```
Rows: 8,640
Columns: 9
$ week                       <date> 2025-08-17, 2025-08-17, 2025-08-17, 2025-0…
$ category                   <chr> "Films (English)", "Films (English)", "Film…
$ weekly_rank                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, …
$ show_title                 <chr> "KPop Demon Hunters", "Night Always Comes",…
$ season_title               <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "…
$ weekly_hours_viewed        <dbl> 43300000, 20800000, 20800000, 13800000, 102…
$ runtime                    <dbl> 1.6667, 1.8333, 1.8833, 1.9667, 1.5667, 1.6…
$ weekly_views               <dbl> 26000000, 11300000, 11000000, 7000000, 6500…
$ cumulative_weeks_in_top_10 <dbl> 9, 1, 3, 4, 8, 2, 1, 2, 1, 19, 2, 6, 1, 1, …
```
From these print-outs, we can see various useful pieces of information:
- The number of rows and columns
- The names of each column
- The type of each column
Make sure you can find each of these in both print-outs.
Looking more closely, I only see one issue: when the `season_title` is missing (e.g., for a movie), it is read into `R` as the string `"N/A"` instead of a proper `NA` value. This, however, isn't too hard to fix using a `mutate` command.
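For concreteness, one way such a `mutate` fix might look is sketched below; this uses `dplyr::na_if()`, though other approaches (e.g., `if_else()`) work equally well:

```r
library(dplyr)

# Replace the literal string "N/A" in season_title with a proper NA value
GLOBAL_TOP_10 <- GLOBAL_TOP_10 |>
  mutate(season_title = na_if(season_title, "N/A"))
```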
The per-country data set is similarly organized and also has an issue with reading `"N/A"` values. While we could repeat the above transformation on this second data set, we can also address this at the time of data import by passing an additional argument to `read_tsv`.
Modify the call to `read_tsv` above to read the per-country data file into `R` as an object called `COUNTRY_TOP_10`. Add an additional argument to `read_tsv` to ensure that `"N/A"` values are read as `R`'s native `NA` values after reading the data into `R`.

Hint: `read_tsv` has many optional arguments. Use the function documentation to find the one you want. This is the same page that `?read_tsv` gives you, but presented in a slightly more readable format.
After completing these two tasks, you should have two data objects, `COUNTRY_TOP_10` and `GLOBAL_TOP_10`, and the `season_title` column of both data sets should have proper `NA` values.
Initial Data Exploration
Before preparing our press releases, we will do a bit of Exploratory Data Analysis (EDA). EDA serves many purposes in data science (quality control, hypothesis generation, outlier identification, etc.), but perhaps the most important is simply knowing what information can be found in a novel data set. Now that our data is imported and cleaned, it's time to start our EDA.
When faced with a new data set, it is tempting to look only at the first few rows to get a sense of the data: `R` does this by default. In practice, I recommend viewing a random selection of rows instead. This won't guarantee you find any issues, but it increases the probability of finding issues in older parts of a data set.
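A random sample of rows is easy to pull with `dplyr`; as a sketch, `slice_sample()` is one of several ways to do this:

```r
library(dplyr)

# Inspect 10 randomly chosen rows rather than only the first few
GLOBAL_TOP_10 |>
  slice_sample(n = 10)
```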
While we could continue investigating our data using `R`'s basic print-outs, this is a good time to introduce a new package, `DT`, which can create more attractive visualizations of your results. The `DT` package wraps the JavaScript `datatables` library to provide interactive tables. This library has many options to control formatting. Later in this course, you will also encounter the `gt` package, which can be used to create complex tables natively in `R`.

You can use the `DT` package as follows:
This table has several issues:
- The column names are not well-formatted. They are formatted as `R` variables instead of as proper "Title Case" names, e.g., `weekly_hours_viewed` should be `Weekly Hours Viewed`.
- Large numbers, e.g., `weekly_hours_viewed`, are presented without commas or other legibility aids.
Let’s address both of these:
```r
library(stringr)
library(DT)   # provides datatable() and formatRound()

format_titles <- function(df){
  colnames(df) <- str_replace_all(colnames(df), "_", " ") |> str_to_title()
  df
}

GLOBAL_TOP_10 |>
  format_titles() |>
  head(n=20) |>
  datatable(options=list(searching=FALSE, info=FALSE)) |>
  formatRound(c('Weekly Hours Viewed', 'Weekly Views'))
```
This is closer to a publication-quality table. We should also drop the `season_title` column since, here at least, we are only showing films.
```r
GLOBAL_TOP_10 |>
  select(-season_title) |>
  format_titles() |>
  head(n=20) |>
  datatable(options=list(searching=FALSE, info=FALSE)) |>
  formatRound(c('Weekly Hours Viewed', 'Weekly Views'))
```
Finally, showing runtime in hours is a bit awkward, so let’s convert this to minutes:
```r
GLOBAL_TOP_10 |>
  mutate(`runtime_(minutes)` = round(60 * runtime)) |>
  select(-season_title,
         -runtime) |>
  format_titles() |>
  head(n=20) |>
  datatable(options=list(searching=FALSE, info=FALSE)) |>
  formatRound(c('Weekly Hours Viewed', 'Weekly Views'))
```
Much better.
We are now ready to begin some EDA.
Using `dplyr` tools, answer the following questions. If your answer requires several rows, display your result as an interactive table following the example above (make sure to make it 'publication quality'). If your answer is a single name or value, use Quarto's inline code functionality to place the values in a sentence; that is, you should answer in complete sentences, written as normal text with inline code for computed values. Note that, depending on the question, you will need to use either the global table or the per-country table. You will not need to 'combine' tables to answer any of these questions.
- How many different countries does Netflix operate in? (You can use the viewing history as a proxy for countries in which Netflix operates.)
- Which non-English-language film has spent the most cumulative weeks in the global Top 10? How many weeks did it spend?
- What is the longest film (English or non-English) to have ever appeared in the Netflix global Top 10? How long is it in minutes? Note that Netflix does not provide runtime for programs before a certain date, so your answer here may be a bit limited.
- For each of the four categories, what program has the most total hours of global viewership?
- Which TV show had the longest run in a country's Top 10? How long was this run and in what country did it occur?
- Netflix provides over 200 weeks of service history for all but one country in our data set. Which country is this, and when did Netflix cease operations in that country?
- What is the total viewership of the TV show Squid Game? Note that there are three seasons total and we are looking for the total number of hours watched across all seasons.
- The movie Red Notice has a runtime of 1 hour and 58 minutes. Approximately how many views did it receive in 2021? Hint: The `year()` function from the `lubridate` package may be helpful.
- How many films reached Number 1 in the US but did not debut there? That is, find films that first appeared on the Top 10 chart at, e.g., Number 4 but then became more popular and eventually hit Number 1. What is the most recent film to pull this off? Hint: You will want to create a new variable to identify films that topped the charts at any point during their run.
- Which TV show/season hit the Top 10 in the most countries in its debut week? In how many countries did it chart?
Preparing Press Releases
Now that you have explored this data, it’s time to prepare some press releases. Each press release should have a catchy headline and about one paragraph of body text.4 Each press release should have at least 3 facts.
Press Release 1: Upcoming Season of Stranger Things
Netflix will release the fifth and final season of its hit show Stranger Things at the end of 2025. In preparation for the release, prepare a press release highlighting the broad impact of the previous four seasons. Your press release should refer to the total viewership, the length of popularity (how many weeks in the Top 10), and the multinational appeal of previous seasons. You should also compare the impact of Stranger Things to other popular English-language TV shows to give a sense of its success.
Write a press release promoting season 5 of Stranger Things. Be sure to include a catchy headline.
Press Release 2: Commercial Success in India
As the most populated country in the world, India represents a major growth opportunity for Netflix. Prepare a press release highlighting Netflix's recent successes in Hindi-language films and TV. If you assume that all Hindi-language viewing is from customers in India (a strong assumption!), you can use this viewership data to estimate the size of Netflix's customer base in India: compute a few points and discuss the long-term growth trends (if any).
Hint: To get a list of possible films to highlight, look for programs which were very popular in India but did not chart in the US.
Write a press release touting the success of Netflix in India and highlighting recent subscriber growth in the region. Be sure to include a catchy headline.
Press Release 3: Open Topic
Finally, create a press release on a topic of your choosing. Your press release can use either data set (or both) and may focus on TV or films. Your press release should be grounded in a meaningful business purpose, e.g., highlighting success, growth, or opportunities in a region. Your press release may be announcing a new initiative (e.g., a new country in which Netflix is becoming available) or new content, or celebrating something that has already been released. The motivation of your press release should be clear from its content.
Write a press release on a topic of your choosing. Be sure to include a catchy headline.
Extra Credit Opportunities
Peer evaluators may award small amounts of extra credit pursuant to the rubric above.
Particularly entertaining press releases may receive up to 2 points of extra credit, and highly insightful press releases (particularly around Press Release 3) may receive up to 2 points of extra credit. One point of extra credit is available for the inclusion of a creative visual element: this may be a statistical chart (generated using `R`) or some sort of artistic or thematic visualization generated in other software. (Statistical charts generated in other software, e.g., an Excel bar chart, are not eligible for extra credit.)
No more than 4 points of extra credit may be awarded to a single submission, no matter how excellent.
This work ©2025 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.
Footnotes

1. Throughout this section, replace `<GITHUB_ID>` with your GitHub ID from Mini-Project #00, making sure to remove the angle brackets. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.
2. For this mini-project, we are limited to the data Netflix makes available to the general public. This data, taking the form of weekly Top 10 charts, necessarily omits the long tail of Netflix shows that are not hits, perhaps to minimize the reputational damage of big-budget flops. We are also only able to access global view counts and will have to make some (rather ridiculous) assumptions about per-country popularity measures. If you actually worked at Netflix, you would certainly have far more detailed versions of this data.
3. Tab-Separated Values. TSVs are like CSVs, but are a bit more robust when dealing with text data, since some titles may have commas in the name.
4. I can't find Netflix's press release archive online, but here is an example from HBO Max trumpeting the success of the trailer for The Last of Us Season 2. This style uses bullet points instead of a text paragraph, but you can get the general idea.