STA/OPR 9750 Mini-Project #02: The Business of Show Business


Due Dates

  • Released to Students: 2024-09-26
  • Initial Submission: 2024-10-23 11:45pm ET on GitHub and Brightspace
  • Peer Feedback:
    • Peer Feedback Assigned: 2024-10-24 on GitHub
    • Peer Feedback Due: 2024-10-30 11:45pm ET on GitHub

Introduction

Welcome to Mini-Project #02. In this project, you will play the role of a Hollywood development executive; that is, you are the executive in charge of coming up with new movie ideas. Historically, development executives would source the “life rights” necessary to make “based on a true story” movies, would secure production options on promising new novels, and would partner with owners of established intellectual property (IP) to develop movie adaptations. Recently, however, the development process has been criticized by Hollywood insiders and audiences alike for over-reliance on rote sequels. Our goal is to develop a set of data-driven ideas for new movies. Before doing so, however, we will dive into Hollywood history to identify key characteristics of successful movies, to identify successful filmmakers and actors, and to examine some of Hollywood’s most famous flops.

Student Responsibilities

Recall our basic analytic workflow and table of student responsibilities:

  • Data Ingest and Cleaning: Given a single data source, read it into R and transform it to a reasonably useful standardized format.
  • Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
  • Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
  • Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
  • Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
Students’ Responsibilities in Mini-Project Analyses (table): rows are Mini-Projects #01–#04; columns are Ingest and Cleaning, Combination and Alignment, Descriptive Statistical Analysis, and Visualization. Mini-Projects #02 and #03 each carry a half (“½”) responsibility in one column; as discussed below, Mini-Project #02 focuses on combination/alignment and descriptive analysis, with visualization in a supporting role.

In this mini-project, you are mainly responsible for data alignment and basic statistical analyses. While not the main focus of this mini-project, you are also expected to provide basic data visualizations to support your findings; the grading rubric, below, emphasizes the dplyr tools used in this project, but reports without any visualization will be penalized severely. Note that data visualization will play a larger role in Mini-Project #03.

As before, I will provide code to automatically download and read in the data used for this project. Also note that, as compared with Mini-Project #01, I am providing significantly less ‘scaffolding’: students are more responsible for directing their own analyses.

Rubric

STA/OPR 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. Specifically, variants of the following rubric will be used for the mini-projects:

Mini-Project Grading Rubric
Mini-Project Grading Rubric

  • Written Communication
    • Excellent (9-10): Report is well-written and flows naturally. Motivation for key steps is clearly explained to the reader without excessive detail. Key findings are highlighted and appropriately given context.
    • Great (7-8): Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context.
    • Good (5-6): Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted.
    • Adequate (3-4): Writing is intelligible, but has some grammatical errors. Key findings are obscured.
    • Needs Improvement (1-2): Report exhibits significant weakness in written communication. Key points are difficult to discern.
    • Extra Credit: Report includes extra context beyond instructor-provided information.
  • Project Skeleton
    • Excellent (9-10): Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are particularly insightful and creative.
    • Great (7-8): Code completes all instructor-provided tasks satisfactorily.
    • Good (5-6): Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete.
    • Adequate (3-4): Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete.
    • Needs Improvement (1-2): Responses to three or more instructor-provided tasks are skipped, incorrect, or otherwise incomplete.
    • Extra Credit: Report exhibits particularly creative insights drawn from thorough student-initiated analyses.
  • Formatting & Display
    • Excellent (9-10): Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Figures are ‘publication-quality’, with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, captions, etc.
    • Great (7-8): Tables are well-formatted, but still have room for improvement. Figures are above ‘exploratory-quality’, but do not reach full ‘publication-quality’.
    • Good (5-6): Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style. Figures are suitable to support claims made, but are ‘exploratory-quality’, reflecting minimal effort to customize and ‘polish’ beyond ggplot2 defaults.
    • Adequate (3-4): Unfiltered ‘data dump’ instead of curated table. Baseline figures that do not fully support claims made.
    • Needs Improvement (1-2): Report lacks basic tables; OR report lacks basic figures.
    • Extra Credit: Report includes one or more high-quality graphics (created using R) using tools beyond static basic ggplot2. These can be created using extensions to ggplot2 or specialty packages for interactive graphics.
  • Code Quality
    • Excellent (9-10): Code is (near) flawless. Code passes all styler and lintr type analyses without issue. Comments give context of the analysis, not simply defining functions used in a particular line.
    • Great (7-8): Code has well-chosen variable names and basic comments.
    • Good (5-6): Code executes properly, but is difficult to read.
    • Adequate (3-4) to Needs Improvement (1-2): Code fails to execute properly.
    • Extra Credit: Code takes advantage of advanced Quarto features to improve presentation of results.
  • Data Preparation
    • All levels: Automatic (10/10). Out of scope for this mini-project.
    • Extra Credit: Report modifies instructor-provided import code to use additional columns or data sources in a way that creates novel insights.

Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.

Because students are encouraged to use STA/OPR 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!

Submission Instructions

After completing the analysis, write up your findings, showing all of your code, using a dynamic Quarto document and post it to your course repository. The qmd file should be named mp02.qmd so the rendered document can be found at docs/mp02.html in your repository and served at the URL:

https://<GITHUB_ID>.github.io/STA9750-2024-FALL/mp02.html

Once you confirm this website works (substituting <GITHUB_ID> for the actual GitHub username provided to the professor in MP#00, of course), open a new issue at

https://github.com/<GITHUB_USERNAME>/STA9750-2024-FALL/issues/new .

Title the issue STA/OPR 9750 <GITHUB_USERNAME> MiniProject #02 and fill in the following text for the issue:

Hi @michaelweylandt!

I've uploaded my work for MiniProject #02 - check it out!

https://<GITHUB_USERNAME>.github.io/STA9750-2024-FALL/mp02.html

Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.

Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.

NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.

NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.

Mini-Project #02

Data

For this project, we will use data from the Internet Movie Database (IMDb). Specifically, we will use the tables from the IMDb non-commercial release. These files are made freely available by IMDb for non-commercial use.

The following code will automatically download and load these files into R:

get_imdb_file <- function(fname){
    BASE_URL <- "https://datasets.imdbws.com/"
    fname_ext <- paste0(fname, ".tsv.gz")
    if(!file.exists(fname_ext)){
        FILE_URL <- paste0(BASE_URL, fname_ext)
        download.file(FILE_URL, 
                      destfile = fname_ext)
    }
    as.data.frame(readr::read_tsv(fname_ext, lazy=FALSE))
}

NAME_BASICS      <- get_imdb_file("name.basics")
TITLE_BASICS     <- get_imdb_file("title.basics")
TITLE_EPISODES   <- get_imdb_file("title.episode")
TITLE_RATINGS    <- get_imdb_file("title.ratings")
TITLE_CREW       <- get_imdb_file("title.crew")
TITLE_PRINCIPALS <- get_imdb_file("title.principals")

Note that these are large files and it will take some time for them to download the first time. Because these files are so large, it will also take a little while to read them. If you want to speed up this stage, you can cache the code chunk that reads the files. This will ‘save’ the result of the chunk and only require it to be re-executed when it is changed.
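For example, in a Quarto document the loading chunk can be cached by adding a chunk option; the option name (`cache`) is standard knitr, but treat the exact setup as a sketch:

```r
#| cache: true
# This chunk re-runs only when its code changes; otherwise the saved
# result is restored from disk, skipping the slow download-and-parse step.
NAME_BASICS <- get_imdb_file("name.basics")
```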

Data Sub-Sampling

This data is large enough that we’re going to need to immediately start down-selecting to get to a data set that we can analyze fluidly. For our NAME_BASICS table, we’ll restrict our attention to people with at least two “known for” credits.1

NAME_BASICS <- NAME_BASICS |> 
    filter(str_count(knownForTitles, ",") > 1)

IMDb has a long tail of obscure movies:

TITLE_RATINGS |>
    ggplot(aes(x=numVotes)) + 
    geom_histogram(bins=30) +
    xlab("Number of IMDB Ratings") + 
    ylab("Number of Titles") + 
    ggtitle("Majority of IMDB Titles Have Less than 100 Ratings") + 
    theme_bw() + 
    scale_x_log10(label=scales::comma) + 
    scale_y_continuous(label=scales::comma)

To keep our computers from working too hard, let’s throw out any title with fewer than 100 ratings. It’s not too hard to see that this drops about 75% of the entire data set:

TITLE_RATINGS |>
    pull(numVotes) |>
    quantile()
     0%     25%     50%     75%    100% 
      5      11      26     101 2942823 
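Rather than reading the share off the quantile table, you can also compute it directly; a quick sketch on a toy stand-in for TITLE_RATINGS (the vote counts here are made up for illustration):

```r
library(dplyr)

# Toy stand-in for TITLE_RATINGS; the real table has millions of rows
toy_ratings <- tibble(
    tconst   = c("tt1", "tt2", "tt3", "tt4"),
    numVotes = c(5, 26, 101, 2942823)
)

# Share of titles a `numVotes >= 100` filter would drop
toy_ratings |>
    summarize(share_dropped = mean(numVotes < 100))
```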

Applying this drop, we significantly reduce the size of our data set:

TITLE_RATINGS <- TITLE_RATINGS |>
    filter(numVotes >= 100)

We want to perform the same filtering on our other TITLE_* tables. This is a rare use case for semi_join. Recall that a semi_join keeps only the rows that have a match in the second table, but does not actually add any columns.

TITLE_BASICS <- TITLE_BASICS |>
    semi_join(TITLE_RATINGS, 
              join_by(tconst == tconst))

TITLE_CREW <- TITLE_CREW |>
    semi_join(TITLE_RATINGS, 
              join_by(tconst == tconst))

TITLE_EPISODES_1 <- TITLE_EPISODES |>
    semi_join(TITLE_RATINGS, 
              join_by(tconst == tconst))
TITLE_EPISODES_2 <- TITLE_EPISODES |>
    semi_join(TITLE_RATINGS, 
              join_by(parentTconst == tconst))

TITLE_EPISODES <- bind_rows(TITLE_EPISODES_1,
                            TITLE_EPISODES_2) |>
    distinct()

TITLE_PRINCIPALS <- TITLE_PRINCIPALS |>
    semi_join(TITLE_RATINGS, join_by(tconst == tconst))


rm(TITLE_EPISODES_1)
rm(TITLE_EPISODES_2)

At this point, we’ve filtered down our data significantly and are ready to begin analysis in earnest. Note that our sub-sampling may induce some ‘dangling’ references: some of the people dropped from the NAME_BASICS table may only appear in one famous movie, and we’ve likely lost their info.
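If the distinction between join types is still fuzzy, here is a minimal illustration (toy tables, not the IMDb data) contrasting semi_join with inner_join:

```r
library(dplyr)

songs <- tibble(song   = c("Hey Jude", "Imagine", "Yesterday"),
                artist = c("Beatles", "Lennon", "Beatles"))
bands <- tibble(artist = "Beatles", formed = 1960)

# inner_join adds the `formed` column from `bands`:
inner_join(songs, bands, join_by(artist == artist))

# semi_join only filters `songs`; no new columns appear:
semi_join(songs, bands, join_by(artist == artist))
```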

Processing Large Data

Even with this processing, this is a non-trivial amount of data, requiring approximately 2 GB of memory. If your computer is significantly struggling to perform this pre-processing, the instructor may be able to provide smaller data files upon request. (Even on my quite modern laptop, the initial processing phase takes a few minutes: by ‘significant struggling’, I’m referring to processing taking upwards of half an hour or exhausting all available memory.) Please contact the instructor and TA through the course discussion board to discuss this possibility.

Processing large data sets is a skill, however, so we’re starting with the large data set to help you practice it.

Exports of the pre-processed data can be found on the course GitHub repo. If your computer is struggling to handle the full data set, you may choose to use these instead. The readr::read_csv function handles zip compression transparently, but you will need to modify get_imdb_file above to:

  1. Point to my GitHub instead of the IMDB archive
  2. Use .csv.zip files instead of .tsv.gz

Note also that, to get the compressed files small enough to store on GitHub, I had to apply more filtering than the code above uses. Make sure to note if you are using this extra-filtered extract so that a reader knows why you might be getting different answers.
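A sketch of those two modifications is below; `<COURSE_REPO_URL>` is a placeholder, not a real address — substitute the raw-file URL of the course GitHub repository:

```r
# Modified helper for the pre-filtered course extracts.
# <COURSE_REPO_URL> is a placeholder for the course repo's raw-file URL.
get_imdb_file <- function(fname){
    BASE_URL  <- "<COURSE_REPO_URL>/"          # 1. point at the course repo
    fname_ext <- paste0(fname, ".csv.zip")     # 2. use .csv.zip extracts
    if(!file.exists(fname_ext)){
        FILE_URL <- paste0(BASE_URL, fname_ext)
        download.file(FILE_URL,
                      destfile = fname_ext)
    }
    # read_csv decompresses the zip file transparently
    as.data.frame(readr::read_csv(fname_ext, lazy = FALSE))
}
```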

Initial Exploration

At this point, let’s start examining our data more closely. Use the glimpse function to examine each table, taking care to note the type or mode of each column. For this data set, most columns appear to be read in as character (string) vectors, even when they should be numeric. This can occur when “null” values are represented in some non-standard way. For instance, in these files, we see that missing values are represented as \\N. R does not know that these are NA values and so retains them as strings.2

To fix this, we need to use:

  1. the mutate command, since we’re changing the type of a column
  2. the as.numeric command to change the type of the column.

We can clean the NAME_BASICS table as follows:

NAME_BASICS <- NAME_BASICS |>
    mutate(birthYear = as.numeric(birthYear),
           deathYear = as.numeric(deathYear))
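To see why the explicit coercion is needed, note what happens when one of IMDb’s \\N sentinel strings is coerced (suppressWarnings hides R’s “NAs introduced by coercion” warning):

```r
raw_years <- c("1899", "1924", "\\N")   # "\\N" is IMDb's missing-value marker
suppressWarnings(as.numeric(raw_years))
#> [1] 1899 1924   NA
```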
Task 1: Column Type Correction

Correct the column types of the TITLE tables using a combination of mutate and the coercion functions as.numeric and as.logical.

Another non-tidy aspect of this data is that it combines multiple pieces of information in a single cell separated by commas. We already saw a bit of this in the NAME_BASICS table, where both the primaryProfession and knownForTitles columns combine multiple values.

glimpse(NAME_BASICS)
Rows: 3,175,526
Columns: 6
$ nconst            <chr> "nm0000001", "nm0000002", "nm0000003", "nm0000004", …
$ primaryName       <chr> "Fred Astaire", "Lauren Bacall", "Brigitte Bardot", …
$ birthYear         <dbl> 1899, 1924, 1934, 1949, 1918, 1915, 1899, 1924, 1925…
$ deathYear         <dbl> 1987, 2014, NA, 1982, 2007, 1982, 1957, 2004, 1984, …
$ primaryProfession <chr> "actor,miscellaneous,producer", "actress,soundtrack,…
$ knownForTitles    <chr> "tt0072308,tt0050419,tt0053137,tt0027125", "tt003738…

We can use the separate_longer_delim function to break these into multiple rows: for example

NAME_BASICS |> separate_longer_delim(knownForTitles, ",") |> slice_head(n=10)
      nconst     primaryName birthYear deathYear
1  nm0000001    Fred Astaire      1899      1987
2  nm0000001    Fred Astaire      1899      1987
3  nm0000001    Fred Astaire      1899      1987
4  nm0000001    Fred Astaire      1899      1987
5  nm0000002   Lauren Bacall      1924      2014
6  nm0000002   Lauren Bacall      1924      2014
7  nm0000002   Lauren Bacall      1924      2014
8  nm0000002   Lauren Bacall      1924      2014
9  nm0000003 Brigitte Bardot      1934        NA
10 nm0000003 Brigitte Bardot      1934        NA
                    primaryProfession knownForTitles
1        actor,miscellaneous,producer      tt0072308
2        actor,miscellaneous,producer      tt0050419
3        actor,miscellaneous,producer      tt0053137
4        actor,miscellaneous,producer      tt0027125
5  actress,soundtrack,archive_footage      tt0037382
6  actress,soundtrack,archive_footage      tt0075213
7  actress,soundtrack,archive_footage      tt0117057
8  actress,soundtrack,archive_footage      tt0038355
9   actress,music_department,producer      tt0057345
10  actress,music_department,producer      tt0049189

To preserve flexibility, let’s not fully separate NAME_BASICS just yet, but you will need to use separate_longer_delim to answer various questions.
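As a small self-contained illustration of the pattern you will need (toy data, not the IMDb tables):

```r
library(dplyr)
library(tidyr)

toy_names <- tibble(
    primaryName       = c("A. Actor", "B. Director"),
    primaryProfession = c("actor,producer", "director")
)

# One row per (person, profession) pair, then count each profession
toy_names |>
    separate_longer_delim(primaryProfession, ",") |>
    count(primaryProfession)
```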

Using your knowledge of dplyr functionality, answer the following questions:

Task 2: Instructor-Provided Questions
  1. How many movies are in our data set? How many TV series? How many TV episodes?

  2. Who is the oldest living person in our data set?

  3. There is one TV Episode in this data set with a perfect 10/10 rating and at least 200,000 IMDb ratings. What is it? What series does it belong to?

  4. What four projects is the actor Mark Hamill most known for?

  5. What TV series, with more than 12 episodes, has the highest average rating?

  6. The TV series Happy Days (1974-1984) gives us the common idiom “jump the shark”. The phrase comes from a controversial fifth-season episode (aired in 1977) in which a lead character literally jumped over a shark on water skis. Idiomatically, it is used to refer to the moment when a once-great show becomes ridiculous and rapidly loses quality.

    Is it true that episodes from later seasons of Happy Days have lower average ratings than the early seasons?

Hint: It may be useful to create a “map” of which columns map to which tables before attempting these questions. While these can be quite formal, even some basic sketches on a scratch piece of paper are often quite clarifying.

Quantifying Success

Our goal is to propose successful new movies. To do so, we need a way of measuring the success of a movie given only IMDb ratings.3 While there’s no “magic number” for success, it is logical to assume that a successful project will have both a high average IMDb rating, indicating quality, and a large number of ratings, indicating broad awareness in the public.

Task 3: Custom Success Metric

Design a ‘success’ measure for IMDb entries, reflecting both quality and broad popular awareness. Implement your success metric using a mutate operator to add a new column to the TITLE_RATINGS table.

Validate your success metric as follows:

  1. Choose the top 5-10 movies on your metric and confirm that they were indeed box office successes.
  2. Choose 3-5 movies with large numbers of IMDb votes that score poorly on your success metric and confirm that they are indeed of low quality.
  3. Choose a prestige actor or director and confirm that they have many projects with high scores on your success metric.
  4. Perform at least one other form of ‘spot check’ validation.
  5. Come up with a numerical threshold for a project to be a ‘success’; that is, determine a value \(v\) such that movies above \(v\) are all “solid” or better.
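There is no single right answer here — designing the metric is the task — but one hypothetical starting point is to damp the rating by the logarithm of the vote count, so that well-rated but obscure titles do not dominate. A sketch on made-up data:

```r
library(dplyr)

# Toy data; the log-weighted metric below is one illustration, not
# the "correct" answer to Task 3.
toy_ratings <- tibble(
    tconst        = c("tt_hit", "tt_niche", "tt_flop"),
    averageRating = c(9.0, 9.5, 3.0),
    numVotes      = c(2000000, 150, 500000)
)

toy_ratings |>
    mutate(success = averageRating * log10(numVotes)) |>
    arrange(desc(success))
```

Note how the widely seen, well-rated title outranks both the obscure gem and the widely seen flop — exactly the behavior a success metric should exhibit.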

You will use your success metric and threshold to complete the rest of this Mini-Project. You may, if you wish, restrict your attention to movies for the remainder of your analysis, though a good development executive should also consider making TV series.

Examining Success by Genre and Decade

Now that you have a working proxy for success, it’s time to look at trends in success over time. Answer the following questions. Your responses should include at least 2 graphics.

Task 4: Trends in Success Over Time

Using questions like the following, identify a good “genre” for your next film. You do not need to answer these questions precisely, but they may help guide your thinking.

  1. What was the genre with the most “successes” in each decade?
  2. What genre consistently has the most “successes”? What genre used to reliably produce “successes” and has fallen out of favor?
  3. What genre has produced the most “successes” since 2010? Does it have the highest success rate or does it only have a large number of successes because there are many productions in that genre?
  4. What genre has become more popular in recent years?

Based on your findings, select a genre for your next project. Note that you may wish to avoid an “oversaturated” genre; you just need to make the argument that your proposal is a good investment, not necessarily the most generic studio-produced, focus-grouped, committee-designed choice, so feel free to lean in to your own artistic preferences, as long as you can make an argument for them.
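One building block these questions share is grouping titles by decade; a common idiom (shown here on toy data) is integer “flooring” of the release year:

```r
library(dplyr)

toy_titles <- tibble(
    primaryTitle = c("Jaws", "Alien", "Heat"),
    startYear    = c(1975, 1979, 1995)
)

# floor(year / 10) * 10 maps 1975 and 1979 to 1970, 1995 to 1990
toy_titles |>
    mutate(decade = floor(startYear / 10) * 10)
```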

Successful Personnel in the Genre

Now that you have selected a target genre, identify two actors and one director who will anchor your project. You want to identify key personnel who have worked in the genre before, with at least modest success, and who have at least one major success to their credit.

As you develop your team, you may want to consider the following possibilities:

  • An older established actor and an up-and-coming actor
  • An actor/director pair who have been successful together
  • An actor/director pair who are both highly successful but have never worked together
  • A pair of established actors who have had success in many genres

As you select your key personnel, consider what IMDb says they are known for; this will be useful in developing your marketing materials.

Task 5: Key Personnel

Identify (at least) two actors and one director who you will target as the key talent for your movie. Write a short “pitch” as to why they are likely to be successful. You should support your pitch with at least one graphic and one table.

Nostalgia and Remakes

Now that you have found a target genre and key talent for your project, you need a story. Like any good development executive, your first instinct should be to produce a remake of a classic film in the genre.

Task 6: Finding a Classic Movie to Remake

Find a classic movie to remake with your key talent. The original should have a large number of IMDb ratings, a high average rating, and not have been remade in the past 25 years.4

Once you have found your classic movie to remake, confirm whether key actors, directors, or writers from the original are still alive. If so, you need to contact your legal department to ensure they can secure the rights to the project. You may also want to include the classic actors as “fan service.”
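Following the simplification in footnote 4 — it suffices to check that no movie with the same name has been made in the past 25 years — the recency check can be sketched as below (toy data; the 2024 cutoff is an assumption tied to this project’s release year):

```r
library(dplyr)

toy_movies <- tibble(
    primaryTitle = c("Classic Heist", "Classic Heist", "Other Film"),
    startYear    = c(1962, 2018, 1960)
)

# Keep only titles whose most recent production is over 25 years old
toy_movies |>
    group_by(primaryTitle) |>
    summarize(last_made = max(startYear)) |>
    filter(last_made < 2024 - 25)
```

Here “Classic Heist” is excluded because of its 2018 remake, while “Other Film” remains a remake candidate.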

Putting It Together

Task 7: Write and Deliver Your Pitch

Now that you have completed your analysis, write an “elevator pitch” of approximately 200-250 words for your proposed Hollywood project. This is the pitch you will bring to the studio head (your boss); if the studio head likes your pitch, you will be given a small sum of money to start securing the story rights and locking down tentative deals with key talent.

Your pitch needs to synthesize the analysis above into two to three quick and compelling points. (E.g., “The market for animated young adult horror musicals has grown 200% in the past decade” or “Over 90% of Director D’s movies are successes.”) You need to present the strongest argument for each element of your pitch, including genre, director, actors, and story.

If your boss approves the pitch, you will need to have a brief trailer ready for the next quarterly earnings call. The marketing department has asked that you prepare a classic ’90s-style teaser for them. Adapt the following clichéd formula for your pitch:

From director D, the visionary mind behind N1; from actor A, beloved star of N2; and from actor A2, Hollywood icon of genre G, comes the timeless tale N3: a story of TOPIC, TOPIC, and TOPIC. Coming soon to a theater near you.

If you’re creatively-minded, you could have some fun here using Generative tools to draft a script or mock up a movie poster for your pitch.

General Remarks

As you approach this project, recall there are no right or wrong answers. You are exploring data looking for exciting and actionable findings. You have several key decisions to make and you can support them with data, but the decisions are ultimately yours. This project is an exercise both in the “nuts-and-bolts” of analyzing a large data set and in using data to inform and refine what is ultimately still a “gut feeling” qualitative business decision.

As you iterate on this project, you will see that seemingly small choices can produce very different results. That’s ok! As data analysts, we are constantly faced with small and essentially arbitrary decisions. An important “meta-skill” is knowing which of these decisions radically change our findings and which are meaningless. (An arbitrary decision with no impact on the bottom line is harmless; an arbitrary decision that could entirely change the plan for the next ten years is a problem.) Our responsibility is to clearly communicate these choices to our partners and clients: then we can receive their feedback on which way they would like to proceed.

Working in tools like Quarto and R helps here: if we provide clean and reproducible code, it should be easy to modify to see how our final conclusions are changed. Graphics also play an essential role in this form of clear communication: a ‘point estimate’ like “Action A is the best” is far less interpretable than a chart showing the predicted outcomes of several different actions.

As you approach this project, focus on justifying and communicating the choices you make. Structure your argument to communicate both key findings and uncertainties around them. Think about how you can use both document structure (headings vs subsections) and graphics to communicate with both clarity and nuance.

Good luck!


This work ©2024 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.

Footnotes

  1. It’s not entirely transparent how IMDb decides what projects an actor or director is “known for”. Still, it’s a reasonable filter that leaves us with more than enough to work with for this project.↩︎

  2. Recall that strings can contain essentially any data type and so are a safe fall-back. For instance, a column containing 1 and a can be losslessly represented by the string vector c("1", "a") but coercion to the numeric vector c(1, NA) is lossy. R tries very hard not to destroy any information and so it doesn’t perform this conversion for us unless we explicitly request it.↩︎

  3. Sadly, I couldn’t find permissively licensed movie box office data. If you are aware of some, please let me know!↩︎

  4. In order to see that a movie has not been recently remade, it is sufficient to confirm that no movie has been made with the same name in the past 25 years.↩︎