STA 9750 Mini-Project #03: Creating the Ultimate Playlist

Due Dates

  • Released to Students: 2025-03-20
  • Initial Submission: 2025-04-23 11:45pm ET on GitHub and Brightspace
  • Peer Feedback:
    • Peer Feedback Assigned: 2025-04-24 on GitHub
    • Peer Feedback Due: 2025-04-30 11:45pm ET on GitHub

Estimated Time to Complete: 5 Hours

Estimated Time for Peer Feedback: 1 Hour


Introduction

Welcome to Mini-Project #03! In this project, you will dive into the world of music analytics in an attempt to create The Ultimate Playlist. Specifically, we will explore two data exports made available by Spotify to identify i) the most popular songs on the platform and ii) the characteristics of those songs. From this data, you will create the ultimate playlist. Note that, while this project is inspired by the work of the Great Sage, Mr Barney Stinson, pioneer of the “All Rise” playlist, you can create whatever type of playlist you want, as long as it is Ultimate.

Also note that this mini-project is intended to be a bit less demanding than Mini-Project #02. At this point in the course, you should be diving into your Course Project, which should consume the majority of your out-of-class time dedicated to this course for the remainder of the semester.

Student Responsibilities

Recall our basic analytic workflow and table of student responsibilities:

  • Data Ingest and Cleaning: Given a single data source, read it into R and transform it to a reasonably useful standardized format.
  • Data Combination and Alignment: Combine multiple data sources to enable insights not possible from a single source.
  • Descriptive Statistical Analysis: Take a data table and compute informative summary statistics from both the entire population and relevant subgroups.
  • Data Visualization: Generate insightful data visualizations to spur insights not attainable from point statistics.
  • Inferential Statistical Analysis and Modeling: Develop relevant predictive models and statistical analyses to generate insights about the underlying population and not simply the data at hand.
Students’ Responsibilities in Mini-Project Analyses

(Table: rows Mini-Project #01 through #04; columns Ingest and Cleaning, Combination and Alignment, Descriptive Statistical Analysis, and Visualization, indicating which stages are the students’ responsibility; “½” marks partial responsibility.)

In this project, I am no longer providing code to download and read the necessary data files. The data files I have selected for this mini-project are relatively easy to work with and should not pose a significant challenge, particularly after our in-class discussion of Data Import. See the modified rubric below, which now includes a grade for data import.

Rubric

STA 9750 Mini-Projects are evaluated using peer grading with meta-review by the course GTAs. Specifically, variants of the following rubric will be used for the mini-projects:

Mini-Project Grading Rubric
Written Communication

  • Excellent (9-10): Report is well-written and flows naturally. Motivation for key steps is clearly explained to the reader without excessive detail. Key findings are highlighted and appropriately given context.
  • Great (7-8): Report has no grammatical or writing issues. Writing is accessible and flows naturally. Key findings are highlighted, but lack suitable motivation and context.
  • Good (5-6): Report has no grammatical or writing issues. Key findings are present but insufficiently highlighted.
  • Adequate (3-4): Writing is intelligible, but has some grammatical errors. Key findings are obscured.
  • Needs Improvement (1-2): Report exhibits significant weakness in written communication. Key points are difficult to discern.
  • Extra Credit: Report includes extra context beyond instructor-provided information.

Project Skeleton

  • Excellent (9-10): Code completes all instructor-provided tasks correctly. Responses to open-ended tasks are particularly insightful and creative.
  • Great (7-8): Code completes all instructor-provided tasks satisfactorily.
  • Good (5-6): Response to one instructor-provided task is skipped, incorrect, or otherwise incomplete.
  • Adequate (3-4): Responses to two instructor-provided tasks are skipped, incorrect, or otherwise incomplete.
  • Needs Improvement (1-2): Responses to three or more instructor-provided tasks are skipped, incorrect, or otherwise incomplete.
  • Extra Credit: Report exhibits particularly creative insights drawn from thorough student-initiated analyses.

Formatting & Display

  • Excellent (9-10): Tables and figures are full ‘publication-quality’. Report includes at least one animated visualization designed to effectively communicate findings.
  • Great (7-8): Tables have well-formatted column names, suitable numbers of digits, and attractive presentation. Figures are ‘publication-quality’, with suitable axis labels, well-chosen structure, attractive color schemes, titles, subtitles, captions, etc.
  • Good (5-6): Tables are well-formatted, but still have room for improvement. Figures are above ‘exploratory-quality’, but do not reach full ‘publication-quality’.
  • Adequate (3-4): Tables lack significant ‘polish’ and need improvement in substance (filtering and down-selecting of presented data) or style. Figures are suitable to support claims made, but are ‘exploratory-quality’, reflecting minimal effort to customize and ‘polish’ beyond ggplot2 defaults.
  • Needs Improvement (1-2): Unfiltered ‘data dump’ instead of curated table. Baseline figures that do not fully support claims made.
  • Extra Credit: Report includes interactive (not just animated) visual elements.

Code Quality

  • Excellent (9-10): Code is (near) flawless. Code passes all styler and lintr type analyses without issue.
  • Great (7-8): Comments give context of the analysis, not simply defining functions used in a particular line.
  • Good (5-6): Code has well-chosen variable names and basic comments.
  • Adequate (3-4): Code executes properly, but is difficult to read.
  • Needs Improvement (1-2): Code fails to execute properly.
  • Extra Credit: Code takes advantage of advanced Quarto features to improve presentation of results.

Data Preparation

  • Excellent (9-10): Data import is fully-automated and efficient, taking care to only download from web sources if not available locally.
  • Great (7-8): Data is imported and prepared effectively, in an automated fashion with minimal hard-coding of URLs and file paths.
  • Good (5-6): Data is imported and prepared effectively, though source and destination file names are hard-coded.
  • Adequate (3-4): Data is imported in a manner likely to have errors.
  • Needs Improvement (1-2): Data is hard-coded and not imported from an external source.
  • Extra Credit: Report uses additional data sources in a way that creates novel insights.

Note that this rubric is designed with copious opportunities for extra credit if students go above and beyond the instructor-provided scaffolding. Students pursuing careers in data analytics are strongly encouraged to go beyond the strict ambit of the mini-projects to i) further refine their skills; ii) learn additional techniques that can be used in the final course project; and iii) develop a more impressive professional portfolio.

Because students are encouraged to use STA 9750 mini-projects as the basis for a professional portfolio, the basic skeleton of each project will be released under a fairly permissive usage license. Take advantage of it!

Submission Instructions

After completing the analysis, write up your findings, showing all of your code, in a dynamic Quarto document and post it to your course repository. The qmd file should be named mp03.qmd so that the rendered document can be found at docs/mp03.html in your repository and served at the URL:[1]

https://<GITHUB_ID>.github.io/STA9750-2025-SPRING/mp03.html

Once you confirm this website works (substituting your actual GitHub username, as provided to the professor in MP#00, for <GITHUB_ID>), open a new issue at

https://github.com/<GITHUB_ID>/STA9750-2025-SPRING/issues/new .

Title the issue STA 9750 <GITHUB_ID> MiniProject #03 and fill in the following text for the issue:

Hi @michaelweylandt!

I've uploaded my work for MiniProject #**03** - check it out!

https://<GITHUB_ID>.github.io/STA9750-2025-SPRING/mp03.html

Once the submission deadline passes, the instructor will tag classmates for peer feedback in this issue thread.

Additionally, a PDF export of this report should be submitted on Brightspace. To create a PDF from the uploaded report, simply use your browser’s ‘Print to PDF’ functionality.

NB: The analysis outline below specifies key tasks you need to perform within your write up. Your peer evaluators will check that you complete these. You are encouraged to do extra analysis, but the bolded Tasks are mandatory.

NB: Your final submission should look like a report, not simply a list of facts answering questions. Add introductions, conclusions, and your own commentary. You should be practicing both raw coding skills and written communication in all mini-projects. There is little value in data points stated without context or motivation.

Mini-Project #03: Creating the Ultimate Playlist

Data Acquisition

We will use two Spotify data exports:

  1. A data set of songs and their characteristics
  2. An export of user-created playlists

Interestingly, Spotify no longer makes these datasets available directly, but nothing ever leaves the internet and we can use mirrors of the original data posted by other data scientists.

Song Characteristics

First, we can download the song properties data from the mirror posted by GitHub user gabminamedez. Download the data and import it into R.

Task 1: Song Characteristics Dataset

Write a function called load_songs to

  1. download the Spotify song analytics dataset (if needed) from https://raw.githubusercontent.com/gabminamedez/spotify-data/refs/heads/master/data.csv
  2. read it into R.

Your function should return a well-formatted data frame.

To be responsible, your download code should:

  1. Only download the file if it is not already present
  2. Create a directory titled data/mp03 if one is not already present
  3. Save the file in data/mp03 and, if necessary, decompress it.
  4. Use built-in R functions like download.file

See prior mini-projects for examples of responsible downloading code.
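
As a starting point, here is a minimal sketch of what such a function might look like using only base R; the local file name songs.csv is an arbitrary choice, and you may prefer readr::read_csv or additional cleaning steps in your own version.

load_songs <- function(){
    dir_path  <- "data/mp03"
    file_path <- file.path(dir_path, "songs.csv")  # arbitrary local name
    url <- "https://raw.githubusercontent.com/gabminamedez/spotify-data/refs/heads/master/data.csv"

    # Create the data directory and download the file only if needed
    if(!dir.exists(dir_path)) dir.create(dir_path, recursive = TRUE)
    if(!file.exists(file_path)) download.file(url, destfile = file_path)

    read.csv(file_path)
}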

The artists column of this data set is a bit oddly formatted: it contains multiple artists in a “list-type” format. For example, the song “Blinding Lights” has artists equal to ['The Weeknd'], while the song “Uptown Funk (feat. Bruno Mars)” has artists equal to ['Mark Ronson', 'Bruno Mars']. The following code will split the artists across multiple rows, yielding, e.g., two rows for Uptown Funk, one each for Mark Ronson and Bruno Mars.

library(tidyr)
library(stringr)
clean_artist_string <- function(x){
    # Strip the brackets and quotes left over after splitting on commas
    str_replace_all(x, "\\['", "") |> 
        str_replace_all("'\\]", "") |>
        str_replace_all(" '", "") |>
        str_replace_all("'$", "")  # drop the stray quote left on non-final artists
}
SONGS |> 
  separate_longer_delim(artists, ",") |>
  mutate(artist = clean_artist_string(artists)) |>
  select(-artists)

Adapt it as needed for your analysis.

Playlists

Next, we’ll download the Spotify Million Playlist dataset from GitHub user DevinOgrady. Because this dataset is large, DevinOgrady has uploaded it as a series of JSON files in his spotify_million_playlist_dataset repository. Write a function to download all files from the data1 directory of this repository, store them locally, and read them into R. As above, your download function should be responsible, downloading each file only once and only when it is not already available locally.

Task 2: Playlist Dataset

Write a function called load_playlists to

  1. download the Spotify million playlist dataset (if needed);
  2. read all files into R; and
  3. concatenate them into a list object.

As before, your download code should be “responsible” and avoid unnecessary duplicate downloads. (If your code is not ‘responsible’, GitHub may temporarily block you for making too many requests. Be careful!)

You may not hard-code individual file names. You need to use a loop-type construct and create individual file names programmatically.
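
One possible skeleton for such a function is sketched below. It assumes the slices in the data1 directory follow the original mpd.slice.<start>-<end>.json naming convention, that the repository’s default branch is main, and that each slice stores its playlists under a top-level playlists element; verify all of these against the repository itself. It also downloads only the first n_slices slices, so extend it to cover however many files the repository actually contains.

load_playlists <- function(n_slices = 1){
    dir_path <- "data/mp03/playlists"
    if(!dir.exists(dir_path)) dir.create(dir_path, recursive = TRUE)

    # Assumed URL pattern, branch, and slice naming -- verify against the repository
    base_url <- "https://raw.githubusercontent.com/DevinOgrady/spotify_million_playlist_dataset/main/data1"

    playlists <- vector("list", n_slices)
    for(i in seq_len(n_slices)){
        start <- (i - 1) * 1000
        fname <- sprintf("mpd.slice.%d-%d.json", start, start + 999)
        fpath <- file.path(dir_path, fname)

        # Download each slice only if it is not already present locally
        if(!file.exists(fpath)){
            download.file(paste0(base_url, "/", fname), destfile = fpath, quiet = TRUE)
        }
        playlists[[i]] <- jsonlite::read_json(fpath)$playlists
    }

    # Concatenate the playlists from all slices into a single list
    do.call(c, playlists)
}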

Unlike the song characteristics data, this is hierarchical data, not trivially represented as a tidy data frame. To proceed we will need to process the playlists data into a more standard format. In particular, we want a table of the following columns:

  • Playlist Name (playlist_name)
  • Playlist ID (playlist_id)
  • Playlist Position (playlist_position)
  • Playlist Followers (playlist_followers)
  • Artist Name (artist_name)
  • Artist ID (artist_id)
  • Track Name (track_name)
  • Track ID (track_id)
  • Album Name (album_name)
  • Album ID (album_id)
  • Duration (duration)

where each row is one “track” from a playlist. (Certain songs will be repeated because they appear on multiple playlists.)

Task 3: ‘Rectangle’ the Playlist Data

Using functions from the tidyr, purrr, and dplyr packages, convert the playlist data from Task 2 into the rectangular format described above.

To clean up the ID columns, you can use the following function to strip the spotify:type: prefix.

strip_spotify_prefix <- function(x){
    library(stringr)
    # Keep only the final ID portion of URIs of the form spotify:<type>:<id>
    str_extract(x, ".*:.*:(.*)", group=1)
}
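
One possible approach to the rectangling is sketched below. It assumes load_playlists() returns a list of playlists carrying the usual Million Playlist fields (name, pid, num_followers, and a tracks list whose entries have pos, artist_name, artist_uri, track_name, track_uri, album_name, album_uri, and duration_ms); check these names against your own import and adapt as needed.

library(dplyr)
library(purrr)

rectangle_playlists <- function(playlists){
    # Build one tibble per playlist (one row per track), then stack them
    map(playlists, function(pl){
        tibble(
            playlist_name      = pl$name,
            playlist_id        = pl$pid,
            playlist_followers = pl$num_followers,
            playlist_position  = map_int(pl$tracks, "pos"),
            artist_name        = map_chr(pl$tracks, "artist_name"),
            artist_id          = strip_spotify_prefix(map_chr(pl$tracks, "artist_uri")),
            track_name         = map_chr(pl$tracks, "track_name"),
            track_id           = strip_spotify_prefix(map_chr(pl$tracks, "track_uri")),
            album_name         = map_chr(pl$tracks, "album_name"),
            album_id           = strip_spotify_prefix(map_chr(pl$tracks, "album_uri")),
            duration           = map_dbl(pl$tracks, "duration_ms")
        )
    }) |>
        bind_rows()
}

playlist_tracks <- rectangle_playlists(load_playlists())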

Initial Exploration

Now that your data is imported and cleaned, it is time to begin exploring it and seeing how comprehensive it is. (Note that these exports were created at different times, so they will not have fully overlapping coverage.)

Task 4: Initial Exploration
  1. How many distinct tracks and artists are represented in the playlist data?
  2. What are the 5 most popular tracks in the playlist data? (A sample query for this question is sketched after this list.)
  3. What is the most popular track in the playlist data that does not have a corresponding entry in the song characteristics data?
  4. According to the song characteristics data, what is the most “danceable” track? How often does it appear in a playlist?
  5. Which playlist has the longest average track length?
  6. What is the most popular playlist on Spotify?
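
These are all standard dplyr summaries of the tables you built above. As an illustration, question 2 might be sketched roughly as follows, assuming the rectangled table from Task 3 is called playlist_tracks:

library(dplyr)

# Five tracks that appear most often across playlists (rows count appearances)
playlist_tracks |>
    count(track_id, track_name, sort = TRUE, name = "n_appearances") |>
    slice_head(n = 5)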

Building a Playlist from Anchor Songs

To begin building your playlist, pick one or two “anchor” songs that you like. (You can adjust these if you don’t like how your playlist comes out.)

You can find songs that work well in a playlist with these songs using various heuristics:

  1. What other songs commonly appear on playlists alongside this song?

  2. What other songs are in the same key[2] and have a similar tempo? (This makes it easy for a skilled DJ to transition from one song to the next.)

  3. What other songs were released by the same artist?

  4. What other songs were released in the same year and have similar levels of acousticness, danceability, etc.?

Task 6: Finding Related Songs

Implement the four heuristics above plus one of your own to find potential songs that belong on a playlist with your anchor songs. You should identify at least 20 candidates, at least 8 of which are not “popular” based on the threshold you set above.
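
As an illustration of the first heuristic, a co-occurrence query might be sketched as follows, again assuming the playlist_tracks table from Task 3 and a hypothetical anchor_track_id holding the track ID of one of your anchor songs:

library(dplyr)

# Playlists that contain the anchor song
anchor_playlists <- playlist_tracks |>
    filter(track_id == anchor_track_id) |>
    distinct(playlist_id)

# Other tracks on those playlists, ranked by how often they co-occur
playlist_tracks |>
    semi_join(anchor_playlists, by = "playlist_id") |>
    filter(track_id != anchor_track_id) |>
    count(track_id, track_name, artist_name, sort = TRUE)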

Create Your Playlist

Given your anchor songs and playlist candidates, it is now time to filter down to a playlist of around 12 songs, suitably ordered. Using all the data available to you, determine an optimal playlist. Your playlist should include at least 2 songs you were not previously familiar with and at least 3 which are not “popular.”

Once you have created your playlist, visualize how it evolves across the various quantitative metrics Spotify provides. Does it, per Stinson, ‘rise and fall’ or is it ‘all rise’?
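
One way to sketch such a visualization with ggplot2, assuming a tibble ultimate_playlist with one row per track, a track_order column giving play order, and the Spotify metrics joined in as columns:

library(ggplot2)
library(tidyr)

ultimate_playlist |>
    pivot_longer(c(danceability, energy, valence, tempo),
                 names_to = "metric", values_to = "value") |>
    ggplot(aes(x = track_order, y = value)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ metric, scales = "free_y") +
    labs(x = "Track order", y = NULL,
         title = "Does the playlist 'rise and fall' or is it 'all rise'?")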

Task 7: Curate and Analyze Your Ultimate Playlist

Curate and analyze the Ultimate Playlist per the instructions above. In addition to its musical structure, make sure to consider thematic unity (e.g., songs about cars) and give your playlist a creative name.

Deliverable: The Ultimate Playlist

Now that you have created the Ultimate Playlist, it is time to nominate it for the Internet’s Best Playlist award. Write a title, description, and “design principles” for your playlist. Make sure to include at least one visualization that argues why your playlist is Ultimate. Describe how you used statistical and visual analysis to create this playlist, taking care to showcase lesser-known pieces that are integral to your playlist’s structure.

Extra Credit

In addition to the rubric-specified extra credit, extra credit may be given for the following:

  • Up to 2 points: use of Quarto’s video support to embed recordings of tracks from the playlist.

This work ©2025 by Michael Weylandt is licensed under a Creative Commons BY-NC-SA 4.0 license.

Footnotes

  1. Throughout this section, replace <GITHUB_ID> with your GitHub ID from Mini-Project #00, making sure to remove the angle brackets. Note that the automated course infrastructure will be looking for precise formatting, so follow these instructions closely.↩︎

  2. As coded by the key column, 0 is C, 1 is C#, 2 is D, etc. For non-musicians, this just means that the two songs share some harmonic structure that makes it easy to tie them together.↩︎