```r
if(!file.exists("data/mp01/nyc_payroll_export.csv")){
  dir.create("data/mp01", showWarnings=FALSE, recursive=TRUE)

  ENDPOINT <- "https://data.cityofnewyork.us/resource/k397-673e.json"

  if(!require("httr2")) install.packages("httr2")
  library(httr2)
  if(!require("jsonlite")) install.packages("jsonlite")
  library(jsonlite)
  if(!require("dplyr")) install.packages("dplyr")
  library(dplyr)
  if(!require("readr")) install.packages("readr")
  library(readr)

  BATCH_SIZE <- 50000
  OFFSET <- 0
  END_OF_EXPORT <- FALSE
  ALL_DATA <- list()

  while(!END_OF_EXPORT){
    cat("Requesting items", OFFSET, "to", BATCH_SIZE + OFFSET, "\n")

    req <- request(ENDPOINT) |>
      req_url_query(`$limit`  = BATCH_SIZE,
                    `$offset` = OFFSET)

    resp <- req_perform(req)

    batch_data <- fromJSON(resp_body_string(resp))

    ALL_DATA <- c(ALL_DATA, list(batch_data))

    if(NROW(batch_data) != BATCH_SIZE){
      END_OF_EXPORT <- TRUE
      cat("End of Data Export Reached\n")
    } else {
      OFFSET <- OFFSET + BATCH_SIZE
    }
  }

  ALL_DATA <- bind_rows(ALL_DATA)

  cat("Data export complete:", NROW(ALL_DATA), "rows and", NCOL(ALL_DATA), "columns.")

  write_csv(ALL_DATA, "data/mp01/nyc_payroll_export.csv")
}
```
STA 9750 Mini-Project #01: Welcome to the Commission to Analyze Taxpayer Spending (CATS)
Due Dates
- Released to Students: 2025-02-13
- Initial Submission: 2025-02-26 11:45pm ET on GitHub and Brightspace
- Peer Feedback Assigned: 2025-02-27 on GitHub
- Peer Feedback Due: 2025-03-05 06:45pm ET on GitHub

Estimated Time to Complete: 9 Hours
Estimated Time for Peer Feedback: 1 Hour
Welcome to STA 9750 Mini Projects!
In the STA 9750 Mini-Projects, you will perform basic data analyses intended to model best practices for your course final project. (Note, however, that these are mini-projects; your final course project is expected to be far more extensive than any single MP.)
For purposes of MPs, we are dividing the basic data analytic workflow into several major stages:
- Data Ingest and Cleaning: Given a single data source, read it into `R` and transform it to a reasonably useful standardized format.
- Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
- Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
- Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
- Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
In this course, our primary focus is on the first four stages: you will take other courses that develop analytical and modeling techniques for a variety of data types. As we progress through the course, you will eventually be responsible for the first four steps. Specifically, you are responsible for the following stages of each mini-project:
| | Ingest and Cleaning | Combination and Alignment | Descriptive Statistical Analysis | Visualization |
|---|---|---|---|---|
| Mini-Project #01 | ✓ | | | |
| Mini-Project #02 | ✓ | ✓ | ½ | |
| Mini-Project #03 | ½ | ✓ | ✓ | ✓ |
| Mini-Project #04 | ✓ | ✓ | ✓ | ✓ |
In early stages of the course, such as this MP, I will ‘scaffold’ much of the analysis for you, leaving only those stages we have discussed in class for you to fill in. As the course progresses, the mini-projects will be more self-directed and results less standardized.
Rubric
STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. Specifically, variants of the following rubric will be used for the mini-projects:
| Course Element | Excellent (9-10) | Great (7-8) | Good (5-6) | Adequate (3-4) | Needs Improvement (1-2) | Extra Credit |
|---|---|---|---|---|---|---|
| Written Communication | Report is well-written and flows naturally. Motivation for key steps is clearly explained to the reader without excessive detail. Key findings are highlighted and appropriately given context. | Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context. | Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted. | Writing is intelligible, but has some grammatical errors. Key findings are obscured. | Report exhibits significant weakness in written communication. Key points are difficult to discern. | Report includes extra context beyond instructor-provided information. |
| Project Skeleton | Code completes all instructor-provided tasks correctly. | Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete. | Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Responses to three instructor-provided tasks are skipped, incorrect, or otherwise incomplete. | Less than half of the instructor-provided tasks were successfully completed. | Report exhibits particularly creative insights beyond instructor specifications. |
| Formatting & Display | Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Table has a suitable caption. | Column names and digits are well-chosen, but formatting could be improved. | Bad column names (opaque variable names or other undefined acronyms). | Unfiltered 'data dump' instead of curated table. | No tables. | Report includes one or more high-quality graphics (created using `R`). |
| Code Quality | Code is (near) flawless. Code passes all checks. | Comments give context of the analysis, not simply defining functions used in a particular line. | Code has well-chosen variable names and basic comments. | Code executes properly, but is difficult to read. | Code fails to execute properly. | Code takes advantage of advanced Quarto features to improve presentation of results. |
| Data Preparation | Automatic (10/10). Out of scope for this mini-project. | | | | | Report modifies instructor-provided import code to use additional columns or data sources in a way that creates novel insights. |
Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.
Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!
Submission Instructions
After completing the analysis, write up your findings, showing all of your code, using a dynamic `quarto` document and post it to your course repository. The `qmd` file should be named `mp01.qmd` so the rendered document can be found at `docs/mp01.html` in the student's repository and served at the URL:1

https://<GITHUB_ID>.github.io/STA9750-2025-SPRING/mp01.html

Once you confirm this website works (substituting `<GITHUB_ID>` for the actual GitHub username provided to the professor in MP#00, of course), open a new issue at:

https://github.com/<GITHUB_ID>/STA9750-2025-SPRING/issues/new
Title the issue `STA 9750 <GITHUB_ID> MiniProject #01` and fill in the following text for the issue:
Hi @michaelweylandt!
I've uploaded my work for MiniProject #**01** - check it out!
https://<GITHUB_ID>.github.io/STA9750-2025-SPRING/mp01.html
Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.
Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.
NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.
NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.
Mini-Project #01: Welcome to the Commission to Analyze Taxpayer Spending (CATS)
Congratulations! You have just been appointed as a senior technical analyst working with New York City's new Commission to Analyze Taxpayer Spending (CATS). As a technical analyst, you are tasked with helping the Commissioners understand New York City's expenses and identifying opportunities to spend taxpayer monies more effectively. Specifically, the Commission chair, Mr. Keno Slum, has asked you to analyze the City payroll and to identify instances in which senior agency officials make significantly more than rank-and-file city employees.
In this mini-project, you will analyze City payroll data and write a report highlighting possible savings to be submitted to the CATS Commissioners. In this mini-project, you will:
- Begin to work with NYC Open Data
- Practice use of `dplyr` for analysis of tabular data
- Practice use of `quarto` and reproducible research tools for effective communication of data analysis results
Recall that you are evaluated on writing and communication in these Mini-Projects. You are required to write a report in the prescribed style, here an internal research briefing. A submission that performs the instructor-specified tasks, but does not give appropriate context and commentary, will score very poorly on the relevant rubric elements.
In particular, if a submission is not in “white paper” style, peer evaluators should judge it to have “Good” quality Written Communication (at best) as key findings are not conveyed appropriately.
Quarto's code folding functionality is useful for "hiding" code so that it doesn't break the flow of your writing.
Acquiring Payroll Data
The code at the top of this document will download the city payroll data and create a file `nyc_payroll_export.csv` in a `data/mp01` directory. If this doesn't work for whatever reason, you can download the data directly from NYC OpenData, though you will need to make sure it is in a suitable location and format for use in this mini-project.2
You do not (yet) need to understand the code above, so please use the course discussion board if you have trouble getting it working.
Using the code above, acquire the latest NYC Payroll Data.
Do Not `git add` Data Files

Make sure that `git` is set to ignore data files, such as the one created above. Check the `git` pane in `RStudio` and make sure that `nyc_payroll_export.csv` does not appear. (If you set up your `.gitignore` file correctly in MP#00, it should already be ignored.) If it is appearing, you may need to edit your `.gitignore` file.

Removing a large data file from `git` is possible, but difficult. Don't get into a bad state!
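If `nyc_payroll_export.csv` does appear in the `git` pane, an entry like the following in your `.gitignore` (the `data/` path matches the directory created by the download code) will keep the entire data directory out of version control:

```
# Ignore locally downloaded data files
data/
```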
Importing Data into `R` and Preparing for Analysis
Before we can analyze this data, we need to get it into `R` and in a suitable format. The `read_csv` function from the `readr` package can be used to read `csv` files into `R` (imagine that!). The above download code will handle most of the irregularities in the city data,3 so we only need to make a few changes before beginning our analysis. In particular, let's change some of the string columns to more conventional capitalization. The `str_to_title` function from the `stringr` package isn't perfect (name capitalization rules can be rather idiosyncratic), but it will get us pretty close.
Read your data into `R` using the `read_csv` function from the `readr` package. Before continuing, use a `mutate` command and the `str_to_title` function from the `stringr` package to convert the following columns to more conventional formatting.
- Agency Name
- Last Name
- First Name
- Work Location (Borough)
- Title / Job Description
- Leave Status
It is good practice to always visually check your data to make sure it is properly formatted. You can use the `glimpse` function for a quick look here. We will use more refined formatting below.
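Putting the import and clean-up steps together, a minimal sketch might look like the following. The snake_case column names (e.g., `agency_name`, `leave_status_as_of_june_30`) are assumptions based on the download code's output; check yours with `glimpse` and adjust as needed.

```r
library(readr)
library(dplyr)
library(stringr)

# Read the exported file and title-case the main string columns.
# NOTE: the column names below are assumed, not guaranteed -- verify them
# against your own export before running.
payroll <- read_csv("data/mp01/nyc_payroll_export.csv") |>
  mutate(across(
    c(agency_name, last_name, first_name,
      work_location_borough, title_description, leave_status_as_of_june_30),
    str_to_title
  ))

glimpse(payroll)  # quick visual check of column types and sample values
```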
Initial Exploration
Identifying Individual Records
Any time you are analyzing a new data set, you should begin by making sure you understand what the relevant columns are. In this case, we have basic descriptions of each column from the data page. We seem, however, to lack a few things you would normally want in a data set like this: in particular, we don’t have a unique ID for employees (a “primary key” in database parlance). This will make it harder for us to track individuals across years or across agencies (if they have two jobs at different agencies in the same fiscal year); we can approximate a unique identifier with First+Middle+Last, but it won’t be perfect.
Let's begin by reviewing the career of the current mayor, Eric L. Adams. Use a combination of `filter`, `rename`, and `arrange` to create a table like the following. (You may also need to use `group_by` and `summarize`.)
| Fiscal Year | Position | Agency | Total Salary |
|---|---|---|---|
| 2020 | Game Keeper | Parks and Recreation | 15000 |
| 2021 | Fish Sorter | Environmental Protection | 25000 |
| 2022 | Mayor | Office of the Mayor | 35000 |
| 2023 | Hamburgler | McDonald's on 23rd | 20000 |
(Obviously, these numbers and details are not correct.) Note that the Mayoral term does not line up with the City's fiscal year, so you can choose how best to combine years in which Mr. Adams held more than one job. Once you have created the career table for Mr. Adams, you can use the `DT` package to create an attractive visualization of your results.
For example,
```r
library(DT)
library(scales)
library(readr)   # read_csv
library(dplyr)   # mutate

tbl_txt <- "
Fiscal Year, Position, Agency, Total Salary
2020, Game Keeper, Parks and Recreation, 15000
2021, Fish Sorter, Environmental Protection, 25000
2022, Mayor, Office of the Mayor, 35000
2023, Hamburgler, McDonald's on 23rd, 20000"

read_csv(tbl_txt) |>
  mutate(`Total Salary` = dollar(`Total Salary`)) |>
  datatable(options=list(searching=FALSE,
                         paging=FALSE,
                         info=FALSE))
```
The `DT` package wraps the JavaScript `datatables` library to provide interactive tables. This library has many options to control formatting. Later in this course, you will also encounter the `gt` package to create complex tables natively in `R`.
Create an employee salary table for Eric L. Adams similar to that shown above, but with real salary and employment records.
Calculating Aggregate Salaries
For high-ranking officials like Mayor Adams, total compensation is a fixed salary. For other NYC employees, total compensation is computed using their hourly wage and the total hours worked (both regular and overtime). As a general rule, overtime is paid at a 1.5x premium, so an employee who worked 40 hours of scheduled time and 20 hours of overtime at a base pay of $25 per hour will be paid:

\[ \$25 \times (40 + 20 \times 1.5) = \$1750 \]
Other employees are paid a "day rate" - that is, a fixed amount per day worked. For purposes of this exercise, you can convert hours worked to days at a fixed rate of 7.5 hours per day. That is, if an employee is paid $100 per day and reports 1500 hours worked, you can estimate their pay as:
\[ \$100 * \frac{1500}{7.5} = \$20,000 \]
Use these calculations to compute actual total compensation for each employee record in our data set. You will need to use a `case_when()` function inside a `mutate()` command to handle different pay structures.
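As a sketch of that `mutate()` + `case_when()` pattern on toy data: the `pay_basis` labels ("per Annum", "per Hour", "per Day") are assumptions here -- check the actual codes in your data, e.g. with `distinct(payroll, pay_basis)`, before relying on them.

```r
library(dplyr)

toy <- tibble(
  base_salary   = c(100000, 25, 100),
  pay_basis     = c("per Annum", "per Hour", "per Day"),
  regular_hours = c(NA, 40, 1500),
  ot_hours      = c(NA, 20, 0)
)

toy <- toy |>
  mutate(total_comp = case_when(
    pay_basis == "per Annum" ~ base_salary,                                    # fixed salary
    pay_basis == "per Hour"  ~ base_salary * (regular_hours + 1.5 * ot_hours), # 1.5x OT premium
    pay_basis == "per Day"   ~ base_salary * (regular_hours + ot_hours) / 7.5  # 7.5-hour days
  ))

toy$total_comp  # 100000, 1750, 20000 -- matching the worked examples above
```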
Now that we have computed total compensation for each city employee, we are ready to begin our analysis of the city payroll. Before we go further, answer some general questions about this data set to make sure you are comfortable with it.
Answer the following questions about city payroll data. In your final write-up, include these in the form of a “quick facts” bullet as part of your introduction.
- Which job title has the highest base rate of pay? (If needed, assume a standard 2000-hour work year and no overtime.)
- Which individual & in what year had the single highest city total payroll (regular and overtime combined)?
- Which individual worked the most overtime hours in this data set?
- Which agency has the highest average total annual payroll (base and overtime pay per employee)?
- Which agency has the most employees on payroll in each year?
- Which agency has the highest overtime usage (compared to regular hours)?
- What is the average salary of employees who work outside the five boroughs? (That is, whose `work_location_borough` is not one of the five counties.)
- How much has the city's aggregate payroll grown over the past 10 years?
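Most of these questions follow the same group-then-summarize pattern. A hypothetical sketch for the "highest average total annual payroll by agency" question, on toy data with illustrative agency names and a `total_comp` column assumed from the previous step:

```r
library(dplyr)

toy <- tibble(
  agency_name = c("Agency A", "Agency A", "Agency B"),
  total_comp  = c(50000, 70000, 90000)
)

# Average compensation per agency, keeping only the top agency
toy |>
  group_by(agency_name) |>
  summarize(avg_comp = mean(total_comp), .groups = "drop") |>
  slice_max(avg_comp, n = 1)   # Agency B, avg_comp 90000
```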
Our data set includes columns `regular_gross_paid`, `total_ot_paid`, and `total_other_paid` that seem like they could be used for this mini-project. If you look closely, you will see that these amounts do not match what you might calculate by hand. This is because these include adjustments for i) use of deferred compensation plans; ii) pre-tax benefit withholding; and iii) various ad hoc adjustments negotiated as part of collective bargaining agreements. Understanding all of these is far beyond the scope of this project, so you should use the simplified calculations described above.
Policy Analysis
CATS has asked you to analyze three possible policy changes and assess their impact on overall spending. Your supervisor has suggested two policies and has asked you to create a third for analysis. For each policy,
- compute its impact on city payroll, i.e., determine how the total payroll expenses would have changed if that policy had been in place historically;
- determine any other staffing adjustments required to implement that policy, e.g., hiring more employees; and
- make a recommendation to the CATS commissioners on whether this policy should be adopted.
Policy I: Capping Salaries at Mayoral Level
Many governments require that no subordinate employee be paid more than the chief executive (mayor, governor, president). CATS is considering recommending that the city adopt such a policy. To analyze it:
- Compute the total mayor pay for each fiscal year and identify employees who made more than this amount in the same fiscal year.
- Determine total savings if these employees’ compensation were capped at the mayor’s salary.
- Identify which agencies and job titles (if any) would bear the brunt of this policy.
Analyze the impact of capping total aggregate compensation for any employee at the level of the mayor’s annual salary. Make a recommendation to CATS based on your findings.
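One way to structure the cap calculation is to build a small per-year table of mayoral pay and join it back to the employee records. A sketch on toy data (all figures and table names, e.g. `mayor_pay`, are illustrative):

```r
library(dplyr)

emps <- tibble(
  fiscal_year = c(2023, 2023),
  total_comp  = c(300000, 150000)
)
mayor_pay <- tibble(fiscal_year = 2023, mayor_comp = 250000)

emps |>
  left_join(mayor_pay, by = "fiscal_year") |>
  mutate(savings = pmax(total_comp - mayor_comp, 0)) |>  # only above-cap pay counts
  summarize(total_savings = sum(savings))                # 50000
```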
Policy II: Increasing Staffing to Reduce Overtime Expenses
A major driver of payroll expenditures is the “1.5x” premium associated with overtime work. That is, it may be cheaper to hire two employees to each work 30 hours to complete a task than to hire one employee to work 40 regular hours and 20 overtime hours to complete the same task.4 The CATS Commissioners are considering urging certain city agencies to hire more employees and to reduce the amount of overtime used.
Analyze the potential upside of increasing employment to reduce overtime.
- For each combination of agency and job title, identify the total number of overtime hours worked and see how many full-time employees it would take to replace that much overtime.
- For each combination of agency and job title, calculate the total savings possible by converting all overtime hours to regular-time hours for (new) employees.
- Determine the aggregate savings possible by agency. This will let the CATS Commission recommend the agencies where this hiring action would have the largest benefit.
Note that much of this analysis must be done on an agency + job title basis. If, e.g., NYPD is paying lots of overtime to Sergeants, hiring additional IT Specialists will not help. Similarly, hiring additional IT Specialists into NYPD will not help reduce the need for IT Specialists at the NYC Housing Authority. After completing this analysis at the agency + job title basis, you can aggregate (over job titles) to the agency level, as this is the most likely level at which hiring policy adjustments can be made.
Analyze the impact of authorizing agencies to hire more employees with the intent to reduce overtime expenses. Make sure to report the total potential savings, the total number of employees needed, and the agencies and job titles where such a change in policy would have the largest impact. The CATS Commissioners are particularly interested in a job title analysis as it may be hard to hire additional employees in certain positions.
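The core arithmetic can be sketched as follows on toy data: the 2000-hour work year mirrors the assumption suggested earlier in this project, and the column names are illustrative. Savings come from avoiding the 0.5x premium portion of overtime pay.

```r
library(dplyr)

toy <- tibble(
  agency_name = c("Agency A", "Agency B"),
  title       = c("Title X", "Title Y"),
  hourly_rate = c(40, 30),
  ot_hours    = c(4000, 1000)
)

toy |>
  mutate(
    fte_needed = ot_hours / 2000,              # full-time hires to absorb the OT
    savings    = ot_hours * hourly_rate * 0.5  # the avoided 0.5x overtime premium
  ) |>
  group_by(agency_name) |>
  summarize(fte_needed = sum(fte_needed),
            savings    = sum(savings),
            .groups    = "drop")
```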
Policy III: Create Your Own Policy Proposal
Create your own policy proposal and analyze it in the same manner as the previous two policies. You are encouraged to draw upon real-world proposals to reduce city payroll.
Deliverable: Policy Analysis “White Paper”
Write up your findings in the form of a “white paper” to be shared with the CATS Commissioners.5 Your submission should include all three policy analyses as well as sufficient background and context for a reader who is not already familiar with NYC pay structures. Note any assumptions or limitations of your analysis and explain how they might introduce error into your projections. You are encouraged to add images and figures as necessary, as well as linking to prior published analyses of NYC payroll.
Extra Credit
There is no extra credit available on this mini-project beyond that specified in the rubric above. In particular, peer evaluators are authorized to give up to 8 points (total across all categories) for
- Reports that are particularly well-written and include, at a minimum, citations to existing analyses and policy proposal documents raised by local think-tanks and political candidates. Such citations are made stronger by a comparison of results and outcomes: you shouldn’t just say Think-Tank A made a similar proposal, but you should also see if your results differ with theirs and, if so, report any source(s) of the difference.
- Policy proposals that are particularly creative and effective. When awarding extra credit, peer evaluators should be sure to assess the political feasibility of proposals. A proposal to, e.g., eliminate all public schools would certainly reduce the city’s spending on teachers but stands no chance of actually being implemented. A good policy proposal is both effective and effectible.
This work ©2025 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.
Footnotes
1. Throughout this section, replace `<GITHUB_ID>` with your GitHub ID from Mini-Project #00, making sure to remove the angle brackets. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

2. For some reason, this data takes up to seven or eight minutes to download from the city website, even though it's really not that complicated or massive. Don't fret.↩︎

3. In particular, if you use the CSV export, the names of the columns in the data file will be "non-syntactic." This means they are tricky to use in `R` and require use of backticks throughout your code. You will want to change these column names before proceeding. The `rename` function can be useful here:

   ```r
   DATA <- DATA |> rename(column_name=`Column Name`, other_column=`Other Column`)
   ```

   Note that you will need to change all column names in this fashion.↩︎

4. A commonly-raised criticism of public sector unions, such as those covering NYC employees, is that they make it difficult to hire additional employees, thereby directing more overtime hours to their (current) membership. I do not know of any reputable analysis supporting or refuting this claim in the context of NYC, though I would be interested in reading one if you come across one in your background reading.↩︎

5. This recent white paper from the Rockefeller Institute regarding updating NYS' public school funding formula is a good example of the genre. Clearly your report does not need to be this long or this detailed.↩︎