STA 9750 Week 12 In-Class Activity: Strings

Review and Warm-Up

Hadley Wickham is the author of many of the “tidy tools” we use in this course. He is also an excellent bartender and chef. In this exercise, we are going to scrape his cocktail recipe book, which can be found at https://cocktails.hadley.nz/.

Our goal is to create a data frame that records all 150+ recipes on this site (as rows) and the different ingredients (as columns). This week, we are going to pull the different recipes into R; next week, we are going to process the text and create our final data frame (so stay tuned!).

Working with your project team, first go through the following steps to build a scraping strategy:

Poke around the website to see how it is organized. Is there a single page listing all of the cocktails? If not, how else can you make sure that you’ve explored the entire site?
Once you know how you’re going to explore the whole site, use your browser tools to see if you can identify an HTML element that corresponds to a single recipe. (This element will occur several times per page.) Remember that you want to select “as small as possible” but no smaller.
Once you have found the right HTML element for a recipe, identify an HTML element that corresponds to i) the title; and ii) individual ingredients.
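
One way to rehearse this selector hunt without hitting the site is to run candidate selectors against a tiny hand-written page. The sketch below assumes that each recipe is an article element containing an h2 title and li ingredients; the toy HTML is made up for illustration, so confirm the real element names with your browser tools.

```r
library(rvest)

# Hypothetical stand-in for one page of the site (not the real markup)
toy_page <- minimal_html('
  <article class="cocktail" id="daiquiri">
    <div class="title"><h2>Daiquiri</h2></div>
    <ul><li>2 oz rum</li><li>1 oz lime juice</li></ul>
  </article>
  <article class="cocktail" id="air-mail">
    <div class="title"><h2>Air mail</h2></div>
    <ul><li>3 oz champagne</li></ul>
  </article>')

# One element per recipe
recipes <- toy_page |> html_elements("article")

# html_element() (singular) returns exactly one match per recipe
titles <- recipes |> html_element("h2") |> html_text2()

# html_elements() (plural) returns every ingredient inside one recipe
first_ingredients <- recipes[[1]] |> html_elements("li") |> html_text2()
```

If a selector returns more (or fewer) elements than there are recipes on the page, it is probably too broad (or too narrow).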

For this task, you will likely see several recipes more than once. Don’t worry about this for now - we can remove the duplicates with distinct() later in our analysis. It’s better to be over-inclusive than under-inclusive.

After you have built your plan, it’s time to turn it into code.

Write code to get a list of all pages you will need to process. Construct full URLs for your future requests.

Solution

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(httr2)
library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

BASE_URL <- "https://cocktails.hadley.nz/"

PAGES <- read_html(BASE_URL) |> 
            html_elements("nav a") |> 
            html_attr("href")

PAGE_URLS <- paste0(BASE_URL, PAGES)


PAGE_URLS

 [1] "https://cocktails.hadley.nz/ingredient-absinthe.html"             
 [2] "https://cocktails.hadley.nz/ingredient-allspice-liqueur.html"     
 [3] "https://cocktails.hadley.nz/ingredient-ancho-reyes.html"          
 [4] "https://cocktails.hadley.nz/ingredient-angostura.html"            
 [5] "https://cocktails.hadley.nz/ingredient-aperol.html"               
 [6] "https://cocktails.hadley.nz/ingredient-apricot-liqueur.html"      
 [7] "https://cocktails.hadley.nz/ingredient-averna.html"               
 [8] "https://cocktails.hadley.nz/ingredient-banana-liqueur.html"       
 [9] "https://cocktails.hadley.nz/ingredient-benedictine.html"          
[10] "https://cocktails.hadley.nz/ingredient-blackberry-liqueur.html"   
[11] "https://cocktails.hadley.nz/ingredient-blanc-vermouth.html"       
[12] "https://cocktails.hadley.nz/ingredient-bourbon.html"              
[13] "https://cocktails.hadley.nz/ingredient-brandy.html"               
[14] "https://cocktails.hadley.nz/ingredient-campari.html"              
[15] "https://cocktails.hadley.nz/ingredient-champagne.html"            
[16] "https://cocktails.hadley.nz/ingredient-cherry-liqueur.html"       
[17] "https://cocktails.hadley.nz/ingredient-coconut-cream.html"        
[18] "https://cocktails.hadley.nz/ingredient-coffee-liqueur.html"       
[19] "https://cocktails.hadley.nz/ingredient-cream.html"                
[20] "https://cocktails.hadley.nz/ingredient-cr-me-de-cacao.html"       
[21] "https://cocktails.hadley.nz/ingredient-cr-me-de-cassis.html"      
[22] "https://cocktails.hadley.nz/ingredient-cynar.html"                
[23] "https://cocktails.hadley.nz/ingredient-dry-vermouth.html"         
[24] "https://cocktails.hadley.nz/ingredient-elderflower-liqueur.html"  
[25] "https://cocktails.hadley.nz/ingredient-espresso.html"             
[26] "https://cocktails.hadley.nz/ingredient-fernet.html"               
[27] "https://cocktails.hadley.nz/ingredient-gin.html"                  
[28] "https://cocktails.hadley.nz/ingredient-ginger-liqueur.html"       
[29] "https://cocktails.hadley.nz/ingredient-grapefruit-bitters.html"   
[30] "https://cocktails.hadley.nz/ingredient-grapefruit-juice.html"     
[31] "https://cocktails.hadley.nz/ingredient-green-chartreuse.html"     
[32] "https://cocktails.hadley.nz/ingredient-grenadine.html"            
[33] "https://cocktails.hadley.nz/ingredient-herbstura.html"            
[34] "https://cocktails.hadley.nz/ingredient-honey-syrup.html"          
[35] "https://cocktails.hadley.nz/ingredient-lemon-juice.html"          
[36] "https://cocktails.hadley.nz/ingredient-lillet-blanc.html"         
[37] "https://cocktails.hadley.nz/ingredient-lime-juice.html"           
[38] "https://cocktails.hadley.nz/ingredient-luxardo-bitter-bianco.html"
[39] "https://cocktails.hadley.nz/ingredient-maple-syrup.html"          
[40] "https://cocktails.hadley.nz/ingredient-maraschino-liqueur.html"   
[41] "https://cocktails.hadley.nz/ingredient-meletti.html"              
[42] "https://cocktails.hadley.nz/ingredient-mezcal.html"               
[43] "https://cocktails.hadley.nz/ingredient-orange-bitters.html"       
[44] "https://cocktails.hadley.nz/ingredient-orange-liqueur.html"       
[45] "https://cocktails.hadley.nz/ingredient-orgeat.html"               
[46] "https://cocktails.hadley.nz/ingredient-peychauds.html"            
[47] "https://cocktails.hadley.nz/ingredient-pimms.html"                
[48] "https://cocktails.hadley.nz/ingredient-pineapple.html"            
[49] "https://cocktails.hadley.nz/ingredient-pineapple-juice.html"      
[50] "https://cocktails.hadley.nz/ingredient-ramazotti.html"            
[51] "https://cocktails.hadley.nz/ingredient-rhubarb-bitters.html"      
[52] "https://cocktails.hadley.nz/ingredient-rhum-agricole.html"        
[53] "https://cocktails.hadley.nz/ingredient-rum.html"                  
[54] "https://cocktails.hadley.nz/ingredient-rye.html"                  
[55] "https://cocktails.hadley.nz/ingredient-scotch.html"               
[56] "https://cocktails.hadley.nz/ingredient-sherry.html"               
[57] "https://cocktails.hadley.nz/ingredient-simple-syrup.html"         
[58] "https://cocktails.hadley.nz/ingredient-sloe-gin.html"             
[59] "https://cocktails.hadley.nz/ingredient-soda-water.html"           
[60] "https://cocktails.hadley.nz/ingredient-swedish-punsch.html"       
[61] "https://cocktails.hadley.nz/ingredient-sweet-vermouth.html"       
[62] "https://cocktails.hadley.nz/ingredient-tequila.html"              
[63] "https://cocktails.hadley.nz/ingredient-velvet-falernum.html"      
[64] "https://cocktails.hadley.nz/ingredient-vodka.html"                
[65] "https://cocktails.hadley.nz/ingredient-yellow-chartreuse.html"    
[66] "https://cocktails.hadley.nz/tag-3-star.html"                      
[67] "https://cocktails.hadley.nz/tag-adventurous.html"                 
[68] "https://cocktails.hadley.nz/tag-bubbles.html"                     
[69] "https://cocktails.hadley.nz/tag-daiquiri.html"                    
[70] "https://cocktails.hadley.nz/tag-fizz.html"                        
[71] "https://cocktails.hadley.nz/tag-flip.html"                        
[72] "https://cocktails.hadley.nz/tag-manhattan.html"                   
[73] "https://cocktails.hadley.nz/tag-negroni.html"                     
[74] "https://cocktails.hadley.nz/tag-sazerac.html"                     
[75] "https://cocktails.hadley.nz/tag-sour.html"                        
[76] "https://cocktails.hadley.nz/tag-tiki.html"                        
[77] "https://cocktails.hadley.nz/tag-vieux-carr.html"

Write a function that takes a single URL and extracts all recipes on that page as HTML elements.

Try this out on a fixed URL and confirm that it gets the right number of recipes.

Solution

library(tidyverse)
library(httr2)
library(rvest)
get_recipes <- function(url){
    read_html(url) |> html_elements("article")
}

We can test this against the “bubbles” tag page and see that it works as expected:

BUBBLES <- get_recipes("https://cocktails.hadley.nz/tag-bubbles.html")

BUBBLES

{xml_nodeset (7)}
[1] <article class="cocktail" id="air-mail"><div class="title">\n      <h2>\n ...
[2] <article class="cocktail" id="campari-spritz"><div class="title">\n       ...
[3] <article class="cocktail" id="dandelo"><div class="title">\n      <h2>\n  ...
[4] <article class="cocktail" id="french-75"><div class="title">\n      <h2>\ ...
[5] <article class="cocktail" id="hinky-dinks-fizzy"><div class="title">\n    ...
[6] <article class="cocktail" id="kir-royale"><div class="title">\n      <h2> ...
[7] <article class="cocktail" id="negroni-sbagliato"><div class="title">\n    ...

Write a function that takes in a single recipe and returns a small data frame with the title and ingredients:

Solution

library(tidyverse)
library(httr2)
library(rvest)
process_recipe <- function(art){
    title <- art |> html_element(".title") |> html_text2()
    ingredients <- art |> html_elements("li") |> html_text2()
    
    tibble(title=title, ingredients=ingredients)
}

We can again test this using the recipes we extracted above:

process_recipe(BUBBLES[[1]])

# A tibble: 4 × 2
  title    ingredients      
  <chr>    <chr>            
1 Air mail 3 oz champagne   
2 Air mail 1 oz rum, white  
3 Air mail ½ oz lime juice  
4 Air mail ¼ oz simple syrup

Write a function that combines your results from the previous two steps to build a data frame with all recipes on a single page.

Try this out on a fixed URL and confirm it works as expected.

Solution

library(tidyverse)
library(httr2)
library(rvest)

process_url <- function(url){
    html_recipes <- get_recipes(url)
    
    RECIPES <- data.frame()
    
    for(r in html_recipes){
        RECIPES <- rbind(RECIPES, process_recipe(r))
    }
    
    RECIPES
}

We can test this on the page of bubbly recipes:

process_url("https://cocktails.hadley.nz/tag-bubbles.html")

# A tibble: 29 × 2
   title                    ingredients               
   <chr>                    <chr>                     
 1 "Air mail"               3 oz champagne            
 2 "Air mail"               1 oz rum, white           
 3 "Air mail"               ½ oz lime juice           
 4 "Air mail"               ¼ oz simple syrup         
 5 "Campari spritz\n\nfizz" 3 oz champagne            
 6 "Campari spritz\n\nfizz" 1½ oz Campari             
 7 "Campari spritz\n\nfizz" 1 oz soda water           
 8 "Campari spritz\n\nfizz" 2 drops lime acid         
 9 "Dandelo"                2 oz luxardo bitter bianco
10 "Dandelo"                ½ oz lemon juice          
# ℹ 19 more rows

Apply your function from the previous step to the list of URLs from the first step so that you get all of the recipes on the site. Combine your results into one large data frame that looks something like:

Name	Ingredient
Daiquiri	2 oz Rum
Daiquiri	1 oz Lime Juice
Daiquiri	0.75 oz Simple Syrup

To be polite, you may want to add req_cache to your requests so you are not requesting the same page many times over.

Solution

library(tidyverse)
library(httr2)
library(rvest)

ALL_RECIPES <- data.frame()

for(page in PAGE_URLS){
    ALL_RECIPES <- rbind(ALL_RECIPES, process_url(page))
}

ALL_RECIPES

# A tibble: 3,548 × 2
   title                                ingredients          
   <chr>                                <chr>                
 1 "Bachelor"                           1 oz rum, dark       
 2 "Bachelor"                           1 oz Meletti         
 3 "Bachelor"                           ¼ oz absinthe        
 4 "Bachelor"                           1 dash orange bitters
 5 "Bachelor"                           1 dash Angostura     
 6 "Banana Manhattan"                   1½ oz rum, aged      
 7 "Banana Manhattan"                   ½ oz sweet vermouth  
 8 "Banana Manhattan"                   ½ oz banana liqueur  
 9 "Banana Manhattan"                   1 dash absinthe      
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows
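
The solution above uses read_html() directly, which does not go through httr2, so the req_cache suggestion does not apply to it as written. A minimal sketch of a cached variant follows; the names CACHE_DIR, polite_request, and get_recipes_cached, as well as the user-agent string, are hypothetical. Whether a response is actually reused depends on the caching headers the server sends.

```r
library(httr2)
library(rvest)

# Hypothetical cache location; any writable directory works
CACHE_DIR <- file.path(tempdir(), "cocktail-cache")

# Build (but do not send) a request with caching and a descriptive user agent
polite_request <- function(url){
    request(url) |>
        req_cache(path = CACHE_DIR) |>
        req_user_agent("STA 9750 in-class exercise")
}

# Drop-in alternative to get_recipes() that goes through httr2
get_recipes_cached <- function(url){
    polite_request(url) |>
        req_perform() |>
        resp_body_html() |>
        html_elements("article")
}
```

Calling get_recipes_cached("https://cocktails.hadley.nz/tag-bubbles.html") should return the same node set as get_recipes().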

FP Solution

If you are comfortable with (or want to become comfortable with) the functional programming tools of purrr, you can write this entire analysis in one swoop:

library(tidyverse)
library(httr2)  # request(), req_url_path(), req_perform(), resp_body_html()
library(rvest)
process_recipe <- function(recipe){
    title <- recipe |> html_element(".title") |> html_text2()
    ingredients <- recipe |> html_elements("li") |> html_text2()
    
    tibble(title=title, ingredients=ingredients)
}

BASE_URL <- "https://cocktails.hadley.nz/"

read_html(BASE_URL) |> 
    html_elements("nav a") |> 
    html_attr("href") |>
    map(\(p) request(BASE_URL) |> req_url_path(p)) |>
    map(req_perform) |>
    map(resp_body_html) |> 
    map(html_elements, "article") |> 
    reduce(c) |>
    map(process_recipe) |> 
    list_rbind()

# A tibble: 3,548 × 2
   title                                ingredients          
   <chr>                                <chr>                
 1 "Bachelor"                           1 oz rum, dark       
 2 "Bachelor"                           1 oz Meletti         
 3 "Bachelor"                           ¼ oz absinthe        
 4 "Bachelor"                           1 dash orange bitters
 5 "Bachelor"                           1 dash Angostura     
 6 "Banana Manhattan"                   1½ oz rum, aged      
 7 "Banana Manhattan"                   ½ oz sweet vermouth  
 8 "Banana Manhattan"                   ½ oz banana liqueur  
 9 "Banana Manhattan"                   1 dash absinthe      
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows

This is clearly quite terse and would benefit from some additional comments, but I’m omitting them so you can try to deconstruct this code yourself if you want to practice functional programming idioms.

At this point, we’re almost done, but we would really like to transform a table like:

Name	Ingredient
Daiquiri	2 oz Rum
Daiquiri	1 oz Lime Juice
Daiquiri	0.75 oz Simple Syrup

into

Name	Ingredient	Amount
Daiquiri	Rum	2
Daiquiri	Lime Juice	1
Daiquiri	Simple Syrup	0.75

by splitting the amount (number) from the ingredient (string). This type of manipulation is the major goal of today’s class.
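
As a preview, the Daiquiri table above can be split with two stringr calls. This is only a sketch: it assumes every row has the shape "number oz name", which is not true of the real data (fractions and other units are handled later in the activity).

```r
library(tibble)
library(dplyr)
library(stringr)

daiquiri <- tibble(
    Name = "Daiquiri",
    Ingredient = c("2 oz Rum", "1 oz Lime Juice", "0.75 oz Simple Syrup")
)

daiquiri_split <- daiquiri |>
    mutate(
        # Leading run of digits and decimal points, converted to a number
        Amount = as.numeric(str_extract(Ingredient, "^[0-9.]+")),
        # Drop the leading amount and the "oz" unit, keeping only the name
        Ingredient = str_remove(Ingredient, "^[0-9.]+ oz ")
    )

daiquiri_split
```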

Regular Expression Practice

Complete the following exercises using functionality from the stringr package.

In the following sentence, extract all plural nouns¹:

todo <- "Yesterday, I needed to buy four cups of flour, a piece of Parmesan cheese, two gallons of ice cream, and a six-pack of bottled (non-alcoholic) beers."

library(stringr)
todo <- "Yesterday, I needed to buy four cups of flour, a piece of Parmesan cheese, two gallons of ice cream, and a six-pack of bottled (non-alcoholic) beers."
str_extract_all(todo, " [A-Za-z]+s[., ]", simplify=TRUE)

If you recall character classes from the pre-assignment, this can also be written as:

or, if we want to trim the whitespace and punctuation:

With simplify=TRUE, str_extract_all gives back our answer in a slightly unusual format (a character matrix with one row per input string, rather than a list) but it will suffice for now.
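
A sketch of the two variants described above, assuming POSIX character classes such as [[:alpha:]] and the word-boundary anchor \\b. These exact patterns are an assumption and not necessarily the ones used in class.

```r
library(stringr)

todo <- "Yesterday, I needed to buy four cups of flour, a piece of Parmesan cheese, two gallons of ice cream, and a six-pack of bottled (non-alcoholic) beers."

# Same idea as before, with character classes instead of explicit ranges;
# the surrounding space and punctuation are still part of each match
with_classes <- str_extract_all(todo, " [[:alpha:]]+s[[:punct:][:space:]]", simplify=TRUE)

# Word boundaries match positions, not characters, so nothing needs trimming
trimmed <- str_extract_all(todo, "\\b[[:alpha:]]+s\\b", simplify=TRUE)

trimmed
```

Note that this is only a crude heuristic for "plural noun": it finds words ending in s, which happens to work for this sentence.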

In the following sentence, compute the total number of fruits on my shopping list:

shopping <- "Today, I need to purchase 3 apples, 5 limes, and 2 lemons."

library(stringr)
shopping <- "Today, I need to purchase 3 apples, 5 limes, and 2 lemons."
sum(as.numeric(str_extract_all(shopping, "\\d+", simplify=TRUE)))

The following text is adapted from the Taylor Swift Wikipedia page, with some changes made to the punctuation to make things easier.

Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. A subject of widespread public interest, she has influenced the music industry and popular culture through her artistry, especially in songwriting, and entrepreneurship. She is an advocate of artists rights and womens empowerment. Swift began professional songwriting at age 14. She signed with Big Machine Records in 2005 and achieved prominence as a country pop singer with the albums Taylor Swift (2006) and Fearless (2008). Their singles ‘Teardrops on My Guitar’, ‘Love Story’, and ‘You Belong with Me’ were crossover successes on country and pop radio formats and brought Swift mainstream fame. She experimented with rock and electronic styles on her next albums, Speak Now (2010) and Red (2012), respectively, with the latter featuring her first Billboard Hot 100 number-one single, ‘We Are Never Ever Getting Back Together’. Swift recalibrated her image from country to pop with 1989 (2014), a synth-pop album containing the chart-topping songs ‘Shake It Off’, ‘Blank Space’, and ‘Bad Blood’. Media scrutiny inspired the hip-hop-influenced Reputation (2017) and its number-one single ‘Look What You Made Me Do’. After signing with Republic Records in 2018, Swift released the eclectic pop album Lover (2019) and the autobiographical documentary Miss Americana (2020). She explored indie folk styles on the 2020 albums Folklore and Evermore, subdued electropop on Midnights (2022), and re-recorded four albums subtitled Taylors Version after a dispute with Big Machine. These albums spawned the number-one songs ‘Cruel Summer’, ‘Cardigan’, ‘Willow’, ‘Anti-Hero’, ‘All Too Well’, and ‘Is It Over Now?’. Her Eras Tour (2023-2024) and its accompanying concert film became the highest-grossing tour and concert film of all time, respectively. Swift has directed videos and films such as Folklore: The Long Pond Studio Sessions (2020) and All Too Well: The Short Film (2021). 
Swift is one of the worlds best-selling artists, with 200 million records sold worldwide as of 2019. She is the most-streamed artist on Spotify, the highest-grossing female touring act, and the first billionaire with music as the main source of income. Six of her albums have opened with over one million sales in a week. The 2023 Time Person of the Year, Swift has appeared on lists such as Rolling Stones 100 Greatest Songwriters of All Time, Billboards Greatest of All Time Artists, and Forbes Worlds 100 Most Powerful Women. Her accolades include 14 Grammy Awards, a Primetime Emmy Award, 40 American Music Awards, 39 Billboard Music Awards, and 23 MTV Video Music Awards; she has won the Grammy Award for Album of the Year, the MTV Video Music Award for Video of the Year, and the IFPI Global Recording Artist of the Year a record four times each.

How many times does Taylor Swift’s last name appear?

library(stringr)
swift <- "Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. A subject of widespread public interest, she has influenced the music industry and popular culture through her artistry, especially in songwriting, and entrepreneurship. She is an advocate of artists rights and womens empowerment. Swift began professional songwriting at age 14. She signed with Big Machine Records in 2005 and achieved prominence as a country pop singer with the albums Taylor Swift (2006) and Fearless (2008). Their singles 'Teardrops on My Guitar', 'Love Story', and 'You Belong with Me' were crossover successes on country and pop radio formats and brought Swift mainstream fame. She experimented with rock and electronic styles on her next albums, Speak Now (2010) and Red (2012), respectively, with the latter featuring her first Billboard Hot 100 number-one single, 'We Are Never Ever Getting Back Together'. Swift recalibrated her image from country to pop with 1989 (2014), a synth-pop album containing the chart-topping songs 'Shake It Off', 'Blank Space', and 'Bad Blood'. Media scrutiny inspired the hip-hop-influenced Reputation (2017) and its number-one single 'Look What You Made Me Do'. After signing with Republic Records in 2018, Swift released the eclectic pop album Lover (2019) and the autobiographical documentary Miss Americana (2020). She explored indie folk styles on the 2020 albums Folklore and Evermore, subdued electropop on Midnights (2022), and re-recorded four albums subtitled Taylors Version after a dispute with Big Machine. These albums spawned the number-one songs 'Cruel Summer', 'Cardigan', 'Willow', 'Anti-Hero', 'All Too Well', and 'Is It Over Now?'. Her Eras Tour (2023-2024) and its accompanying concert film became the highest-grossing tour and concert film of all time, respectively. Swift has directed videos and films such as Folklore: The Long Pond Studio Sessions (2020) and All Too Well: The Short Film (2021). 
Swift is one of the worlds best-selling artists, with 200 million records sold worldwide as of 2019. She is the most-streamed artist on Spotify, the highest-grossing female touring act, and the first billionaire with music as the main source of income. Six of her albums have opened with over one million sales in a week. The 2023 Time Person of the Year, Swift has appeared on lists such as Rolling Stones 100 Greatest Songwriters of All Time, Billboards Greatest of All Time Artists, and Forbes Worlds 100 Most Powerful Women. Her accolades include 14 Grammy Awards, a Primetime Emmy Award,  40 American Music Awards, 39 Billboard Music Awards, and 23 MTV Video Music Awards; she has won the Grammy Award for Album of the Year, the MTV Video Music Award for Video of the Year, and the IFPI Global Recording Artist of the Year a record four times each."
str_count(swift, "Swift")

In the above quote, how many different years (strings of exactly 4 digits) appear?

library(stringr)
swift <- "Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. A subject of widespread public interest, she has influenced the music industry and popular culture through her artistry, especially in songwriting, and entrepreneurship. She is an advocate of artists rights and womens empowerment. Swift began professional songwriting at age 14. She signed with Big Machine Records in 2005 and achieved prominence as a country pop singer with the albums Taylor Swift (2006) and Fearless (2008). Their singles 'Teardrops on My Guitar', 'Love Story', and 'You Belong with Me' were crossover successes on country and pop radio formats and brought Swift mainstream fame. She experimented with rock and electronic styles on her next albums, Speak Now (2010) and Red (2012), respectively, with the latter featuring her first Billboard Hot 100 number-one single, 'We Are Never Ever Getting Back Together'. Swift recalibrated her image from country to pop with 1989 (2014), a synth-pop album containing the chart-topping songs 'Shake It Off', 'Blank Space', and 'Bad Blood'. Media scrutiny inspired the hip-hop-influenced Reputation (2017) and its number-one single 'Look What You Made Me Do'. After signing with Republic Records in 2018, Swift released the eclectic pop album Lover (2019) and the autobiographical documentary Miss Americana (2020). She explored indie folk styles on the 2020 albums Folklore and Evermore, subdued electropop on Midnights (2022), and re-recorded four albums subtitled Taylors Version after a dispute with Big Machine. These albums spawned the number-one songs 'Cruel Summer', 'Cardigan', 'Willow', 'Anti-Hero', 'All Too Well', and 'Is It Over Now?'. Her Eras Tour (2023-2024) and its accompanying concert film became the highest-grossing tour and concert film of all time, respectively. Swift has directed videos and films such as Folklore: The Long Pond Studio Sessions (2020) and All Too Well: The Short Film (2021). 
Swift is one of the worlds best-selling artists, with 200 million records sold worldwide as of 2019. She is the most-streamed artist on Spotify, the highest-grossing female touring act, and the first billionaire with music as the main source of income. Six of her albums have opened with over one million sales in a week. The 2023 Time Person of the Year, Swift has appeared on lists such as Rolling Stones 100 Greatest Songwriters of All Time, Billboards Greatest of All Time Artists, and Forbes Worlds 100 Most Powerful Women. Her accolades include 14 Grammy Awards, a Primetime Emmy Award,  40 American Music Awards, 39 Billboard Music Awards, and 23 MTV Video Music Awards; she has won the Grammy Award for Album of the Year, the MTV Video Music Award for Video of the Year, and the IFPI Global Recording Artist of the Year a record four times each."
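# \\b on both sides enforces exactly four digits; unique() counts each year once
str_extract_all(swift, "\\b\\d{4}\\b")[[1]] |> unique() |> length()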

Extract the names of all songs mentioned in the biography above. (Note that song names are surrounded by single quotes.)

You will need to use a lazy regular expression.

library(stringr)
swift <- "Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. A subject of widespread public interest, she has influenced the music industry and popular culture through her artistry, especially in songwriting, and entrepreneurship. She is an advocate of artists rights and womens empowerment. Swift began professional songwriting at age 14. She signed with Big Machine Records in 2005 and achieved prominence as a country pop singer with the albums Taylor Swift (2006) and Fearless (2008). Their singles 'Teardrops on My Guitar', 'Love Story', and 'You Belong with Me' were crossover successes on country and pop radio formats and brought Swift mainstream fame. She experimented with rock and electronic styles on her next albums, Speak Now (2010) and Red (2012), respectively, with the latter featuring her first Billboard Hot 100 number-one single, 'We Are Never Ever Getting Back Together'. Swift recalibrated her image from country to pop with 1989 (2014), a synth-pop album containing the chart-topping songs 'Shake It Off', 'Blank Space', and 'Bad Blood'. Media scrutiny inspired the hip-hop-influenced Reputation (2017) and its number-one single 'Look What You Made Me Do'. After signing with Republic Records in 2018, Swift released the eclectic pop album Lover (2019) and the autobiographical documentary Miss Americana (2020). She explored indie folk styles on the 2020 albums Folklore and Evermore, subdued electropop on Midnights (2022), and re-recorded four albums subtitled Taylors Version after a dispute with Big Machine. These albums spawned the number-one songs 'Cruel Summer', 'Cardigan', 'Willow', 'Anti-Hero', 'All Too Well', and 'Is It Over Now?'. Her Eras Tour (2023-2024) and its accompanying concert film became the highest-grossing tour and concert film of all time, respectively. Swift has directed videos and films such as Folklore: The Long Pond Studio Sessions (2020) and All Too Well: The Short Film (2021). 
Swift is one of the worlds best-selling artists, with 200 million records sold worldwide as of 2019. She is the most-streamed artist on Spotify, the highest-grossing female touring act, and the first billionaire with music as the main source of income. Six of her albums have opened with over one million sales in a week. The 2023 Time Person of the Year, Swift has appeared on lists such as Rolling Stones 100 Greatest Songwriters of All Time, Billboards Greatest of All Time Artists, and Forbes Worlds 100 Most Powerful Women. Her accolades include 14 Grammy Awards, a Primetime Emmy Award,  40 American Music Awards, 39 Billboard Music Awards, and 23 MTV Video Music Awards; she has won the Grammy Award for Album of the Year, the MTV Video Music Award for Video of the Year, and the IFPI Global Recording Artist of the Year a record four times each."
str_extract_all(swift, "'.*?'", simplify=TRUE)
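
To see why the lazy quantifier matters, compare greedy and lazy matching on a short string with two quoted titles (the string x is a made-up excerpt for illustration):

```r
library(stringr)

x <- "'Love Story' and 'Blank Space'"

# Greedy: .* grabs as much as possible, so the match runs from the
# first quote all the way to the last quote
greedy <- str_extract_all(x, "'.*'")[[1]]

# Lazy: .*? grabs as little as possible, so each match stops at the
# next closing quote
lazy <- str_extract_all(x, "'.*?'")[[1]]

greedy
lazy
```

The greedy pattern returns the whole string as one match, while the lazy pattern returns the two titles separately.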

Scraping Practice I: Cocktails

Last week, we began to scrape Hadley’s Cocktails with an (eventual) goal of creating a “spreadsheet” of recipes by ingredients.

We found the following:

library(tidyverse)  # map(), list_rbind(), and the dplyr/stringr/tidyr verbs used below
library(rvest)
BASE_URL <- "https://cocktails.hadley.nz/"

PAGES <- read_html(BASE_URL) |> 
    html_elements("nav a") |> 
    html_attr("href")

read_article <- function(article){
    title <- article |> html_element("h2") |> html_text()
    ingredients <- article |> html_elements("li") |> html_text()
    
    data.frame(title=title, ingredient=ingredients)
}

read_page <- function(stub){
    URL <- paste0(BASE_URL, stub)
    COCKTAILS <- read_html(URL) |> html_elements("article")
    
    map(COCKTAILS, read_article) |> list_rbind()
}

RECIPES_LONG <- map(PAGES, read_page) |> list_rbind()

Take this output and use stringr and tidyr to complete the transition to a well-formatted “wide” set of recipes.

Clean up the title column using str_trim and remove duplicate rows:

Solution

RECIPES_LONG <- RECIPES_LONG |>
    mutate(title = str_trim(title)) |>
    distinct()

Split the ingredient column into three new columns:
- Amount
- Unit
- Ingredient Name

We will approach this in three steps:

1. First, write a function to pull out the “unit” (oz, dash, leaves, etc.).
  
  The units found in this data are oz, dash, drop, t, chunk, leaves, and cm.
  
  Hint: Use str_extract and require a space before and after the unit so that you don’t pick up the letter t inside ingredient names.

Solution

RECIPES_LONG <- RECIPES_LONG |>
    mutate(unit = str_extract(ingredient, 
                              " (oz|dash|drop|t|chunk|leaves|cm) ", 
                              group=1))
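
The effect of the surrounding spaces can be checked on a few strings. In this sketch, "mint sprig" is a made-up ingredient with no unit, included to show the false positive:

```r
library(stringr)

x <- c("1 oz Meletti", "1 t sugar", "mint sprig")

# Without spaces, the bare "t" alternative matches the t inside "mint"
no_spaces <- str_extract(x, "(oz|dash|drop|t|chunk|leaves|cm)")

# With spaces, only a free-standing unit matches; group=1 returns the
# unit itself without the surrounding spaces
with_spaces <- str_extract(x, " (oz|dash|drop|t|chunk|leaves|cm) ", group=1)

no_spaces
with_spaces
```

The first call reports a unit of t for "mint sprig"; the second correctly returns NA.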

2. Next, pull out the "number" part at the beginning of each ingredient.
Note that some fractional amounts are included, so the following function
may be useful:

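library(stringr)

# Convert Unicode fractions to decimals. Each "digit + fraction" case (1½)
# is handled before the bare fraction so that "1½" becomes "1.5", not "10.5".
standardize_number <- function(x){
    x |> str_replace("(\\d)½", "\\1.5") |> 
        str_replace("½", "0.5") |>
        str_replace("(\\d)¾", "\\1.75") |> 
        str_replace("¾", "0.75") |>
        str_replace("(\\d)¼", "\\1.25") |> 
        str_replace("¼", "0.25")
}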

x <- c("½ oz allspice liqueur", "¾ oz Campari", "1¾ oz gin")
standardize_number(x)

[1] "0.5 oz allspice liqueur" "0.75 oz Campari"        
[3] "1.75 oz gin"

Solution

RECIPES_LONG <- RECIPES_LONG |>
    mutate(std_ingredient = standardize_number(ingredient), 
           number = str_extract(std_ingredient, "(^[.[:digit:]]+) ", group=1), 
           number = as.double(number)) |> 
    select(-ingredient)

3. Finally, everything that you haven’t already extracted can be assumed to be the name of an ingredient. Use str_remove twice to get the ingredient names.

Solution

RECIPES_LONG <- RECIPES_LONG |>
    mutate(ingredient = std_ingredient |> 
               str_remove(as.character(number)) |>  # temporarily return to string
               str_remove(unit) |> 
               str_trim()) |>
    select(-std_ingredient)

You should wind up with a table that looks something like

Cocktail	Ingredient	Unit	Amount
Bachelor	rum, dark	oz	1
Bachelor	Meletti	oz	1

Combine the ingredient and unit columns so that Meletti and oz in two separate columns becomes Meletti (oz) in a single column.

Solution

RECIPES_LONG <- RECIPES_LONG |>
    mutate(ingredient = paste0(ingredient, " (", unit, ")")) |> 
    select(-unit)

Use a pivot_* function to create a new wide table with each ingredient as a column.

Hint: Which pivot operation do you want to use here? How should the empty cells be treated? Look at the values_fill argument.

Note that there are some duplicate rows when the same ingredient is used twice in a single cocktail. You might want to simply add up the relevant numbers before proceeding.

Solution

RECIPES_LONG |>
    group_by(title, ingredient) |>
    summarize(number = sum(number)) |>
    ungroup() |>
    pivot_wider(names_from = ingredient, values_from=number, values_fill=0)

`summarise()` has grouped output by 'title'. You can override using the
`.groups` argument.

# A tibble: 161 × 168
   title            `NA (NA)` `brandy (oz)` maraschino liqueur (…¹ `mezcal (oz)`
   <chr>                <dbl>         <dbl>                  <dbl>         <dbl>
 1 1910                     2          0.75                    0.5          0.75
 2 A wish for Grace         0          0                       0            0   
 3 A-go flip                1          0                       0            0   
 4 Abricot vieux            0          0                       0            0   
 5 Accoutrement             5          0                       0            0   
 6 Across 110th st…         2          0                       0            0   
 7 Air mail                 0          0                       0            0   
 8 Aku aku                  5          0                       0            0   
 9 Alexander               NA          1                       0            0   
10 Algonquin               NA          0                       0            0   
# ℹ 151 more rows
# ℹ abbreviated name: ¹`maraschino liqueur (oz)`
# ℹ 163 more variables: `sweet vermouth (oz)` <dbl>, `Angostura (dash)` <dbl>,
#   `Madeira (oz)` <dbl>, `lemon juice (oz)` <dbl>,
#   `orange liqueur (oz)` <dbl>, `rum, Smith & Cross (oz)` <dbl>,
#   `simple syrup (oz)` <dbl>, `Angostura (oz)` <dbl>, `rum (oz)` <dbl>,
#   `sherry, pedro ximinez (oz)` <dbl>, `simple syrup, demerera (oz)` <dbl>, …

This isn’t quite perfect (note the NA (NA) column and the NA values it contains), but it’s close.
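
The NA (NA) column most likely comes from ingredients that have no amount or no unit: str_remove returns NA whenever its pattern is NA, and paste0 then turns the missing pieces into the text "NA". One possible guard is sketched below on made-up rows. The helper remove_if_present is hypothetical (it is not part of stringr) and simply falls back to the input when nothing could be removed.

```r
library(stringr)
library(dplyr)

# Made-up rows: one with an amount and unit, one garnish with neither
std_ingredient <- c("1 oz Meletti", "lemon twist")
number <- c(1, NA)
unit   <- c("oz", NA)

# The problem: an NA pattern gives an NA result
naive <- str_remove(std_ingredient, unit)

# Hypothetical helper: keep the original string when the pattern is missing
remove_if_present <- function(x, pattern) coalesce(str_remove(x, pattern), x)

ingredient <- std_ingredient |>
    remove_if_present(as.character(number)) |>
    remove_if_present(unit) |>
    str_trim()
ingredient
```

Unit-less ingredients still need a decision about how to label them, which the combined solution handles with a case_when on the unit.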

Here is a compact solution for the whole exercise:

Combined Solution

library(tidyverse)
library(rvest)
library(httr2)  # provides request(), req_url_path(), req_perform(), resp_body_html()
process_recipe <- function(recipe){
    title <- recipe |> html_element(".title") |> html_text2()
    ingredients <- recipe |> html_elements("li") |> html_text2()
    
    tibble(title=title, ingredient=ingredients)
}

BASE_URL <- "https://cocktails.hadley.nz/"

RECIPES_LONG <- read_html(BASE_URL) |> 
    html_elements("nav a") |> 
    html_attr("href") |>
    map(\(p) request(BASE_URL) |> req_url_path(p)) |>
    map(req_perform) |>
    map(resp_body_html) |> 
    map(html_elements, "article") |> 
    reduce(c) |>
    map(process_recipe) |> 
    list_rbind()

get_ingredient_name <- function(x){
    str_remove_all(x, "[0-9¼½¾]|( oz )|( dashes )|( dash )|( drops )|( drop )|( t )|( chunks )|( chunk )|( leaves )|( leaf )|( cm )")
}

standardize_number <- function(x){
    x |> str_replace("(\\d)½", "\\1.5") |> 
        str_replace("½", "0.5") |>
        str_replace("(\\d)¾", "\\1.75") |> 
        str_replace("¾", "0.75") |>
        str_replace("(\\d)¼", "\\1.25") |> 
        str_replace("¼", "0.25")
}

x <- c("½ oz allspice liqueur", "¾ oz Campari", "1¾ oz gin")
standardize_number(x)

[1] "0.5 oz allspice liqueur" "0.75 oz Campari"        
[3] "1.75 oz gin"

RECIPES_LONG |>
    # Remove duplicates from import process
    # When we import each ingredient page, we get duplicates
    # as drinks are listed on multiple ingredient pages.
    distinct() |>
    mutate(title  = str_trim(title), 
           std_ingredient = standardize_number(ingredient), 
           amount = str_extract(std_ingredient, "(^[.[:digit:]]+) ", group=1), 
           amount = as.double(amount), 
           unit = str_extract(ingredient, 
                              " (oz|dash|drop|t|chunk|leaves|cm) ", 
                              group=1),
           ingredient = str_remove_all(ingredient, unit), 
           ingredient = str_remove_all(ingredient, as.character(amount)),
           ingredient = str_trim(ingredient)) |>
    # For some ingredients, e.g. a lemon twist, the implied
    # but unstated quantity is 1
    mutate(amount = case_when(
        is.na(amount) ~ 1, 
        TRUE ~ amount), 
        ingredient = str_to_title(ingredient)) |>
    rename(Cocktail = title, 
           Amount = amount, 
           Unit = unit, 
           Ingredient = ingredient) |>
    mutate(Ingredient = case_when(
        is.na(Unit) ~ Ingredient, # Handle unit-less ingredients
        TRUE ~ paste0(Ingredient, " (", Unit, ")"))) |>
    # Spanish Coffee lists orange liqueur twice, so let's 
    # add up repeated ingredients before pivoting.
    # (I think this is the only one)
    group_by(Cocktail, Ingredient) |>
    summarize(Amount = sum(Amount)) |>
    ungroup() |>
    pivot_wider(id_cols = Cocktail, 
                names_from = Ingredient, 
                values_from = Amount, 
                values_fill = 0) |>
    select("Cocktail", sort(tidyselect::peek_vars()))

`summarise()` has grouped output by 'Cocktail'. You can override using the
`.groups` argument.

# A tibble: 293 × 268
   Cocktail    ½  Allspice Liqueur …¹ `½  Angostura (oz)` ½  Apricot Liqueur (…²
   <chr>                        <dbl>               <dbl>                  <dbl>
 1 "1910"                           0                   0                    0  
 2 "A wish fo…                      0                   0                    0  
 3 "A wish fo…                      0                   0                    0  
 4 "A-go flip"                      0                   0                    0  
 5 "A-go flip…                      0                   0                    0  
 6 "Abricot v…                      0                   0                    0.5
 7 "Abricot v…                      0                   0                    0.5
 8 "Accoutrem…                      0                   0                    0  
 9 "Accoutrem…                      0                   0                    0  
10 "Across 11…                      0                   0                    0  
# ℹ 283 more rows
# ℹ abbreviated names: ¹`½  Allspice Liqueur (oz)`, ²`½  Apricot Liqueur (oz)`
# ℹ 264 more variables: `½  Banana Liqueur (oz)` <dbl>,
#   `½  Benedictine (oz)` <dbl>, `½  Blackberry Liqueur (oz)` <dbl>,
#   `½  Brandy (oz)` <dbl>, `½  Cachaca, Barrel Aged (oz)` <dbl>,
#   `½  Campari (oz)` <dbl>, `½  Cherry Liqueur (oz)` <dbl>,
#   `½  Coffee Liqueur (oz)` <dbl>, `½  Creme De Cacao (oz)` <dbl>, …
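
Notice that the column names above still begin with ½. In the combined pipeline, amount is parsed from std_ingredient, but the two str_remove_all calls run on the original ingredient column, where the amount is still written as ½, so the text "0.5" never matches. Removing a unit such as t by value also deletes every t in the ingredient name. One possible fix, sketched here on made-up strings rather than the scraped data, is to strip the leading amount and unit by pattern instead. The unit list is copied from the pipeline above; the variable names are only for illustration.

```r
library(stringr)

# Made-up inputs in the same shape as the scraped ingredient strings
ingredient <- c("½ oz allspice liqueur", "1 t absinthe", "lemon twist")

# Stand-in for standardize_number(), reduced to the one fraction used here
std <- str_replace(ingredient, "½", "0.5")

amount <- str_extract(std, "^([.[:digit:]]+) ", group = 1)
unit   <- str_extract(std, " (oz|dash|drop|t|chunk|leaves|cm) ", group = 1)

# Remove the leading amount, then a leading unit word; strings with
# neither are left unchanged rather than becoming NA
name <- std |>
    str_remove("^[.[:digit:]]+ ") |>
    str_remove("^(oz|dash|drop|t|chunk|leaves|cm) ") |>
    str_trim()
name
```

Because both patterns are anchored at the start of the string, the t in "twist" and the letters inside "absinthe" are untouched.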

Scraping Practice II: Quotes

Let’s analyze https://quotes.toscrape.com, a website designed for practicing web scraping.

Scrape the contents of that website (note that the quotes continue over multiple pages) and answer the following questions. You can check most of these “by hand”, but you need to compute your answers in code!

How many pages of quotes are on this site?

Hint: Write code to keep paging until the “Next” button is no longer present.

Solution

library(httr2)
library(rvest)
library(tidyverse)

N_PAGES <- 1
URL <- "https://quotes.toscrape.com/"

repeat{
    missing_next_button <- request(URL) |>
        req_url_path("page", N_PAGES) |>
        req_perform() |>
        resp_body_html() |>
        html_element(".next") |>
        is.na()
    
    if(missing_next_button){
        break
    } else {
        N_PAGES <- N_PAGES + 1
    }
}

When the loop finishes, N_PAGES holds the number of pages of quotes on this site.
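
The stopping test can be tried offline before hitting the real site. The sketch below uses rvest's minimal_html on two made-up page fragments (the real site's markup may differ) and, as an alternative to html_element followed by is.na, counts the matches returned by html_elements. The function has_next is a hypothetical helper name.

```r
library(rvest)

# Made-up fragments standing in for a middle page and the last page
middle_page <- minimal_html(
    '<ul class="pager"><li class="next"><a href="/page/3/">Next</a></li></ul>'
)
last_page <- minimal_html(
    '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'
)

# html_elements() returns an empty set when nothing matches the selector
has_next <- function(page) length(html_elements(page, ".next")) > 0

has_next(middle_page)
has_next(last_page)
```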

How many quotes are on this website (all pages)?

Solution

library(httr2)
library(tidyverse)
library(rvest)

N_TOTAL_QUOTES <- 0

for(page in 1:N_PAGES){
    n_quotes <- request(URL) |>
        req_url_path("page", page) |>
        req_perform() |>
        resp_body_html() |>
        html_elements(".quote") |>
        length()
    
    N_TOTAL_QUOTES <- N_TOTAL_QUOTES + n_quotes
    
}

or using purrr-style programming:

library(httr2)
library(tidyverse)
library(rvest)

seq(N_PAGES) |>
    map(\(p) request(URL) |> req_url_path("page", p)) |>
    map(req_perform) |>
    map(resp_body_html) |>
    map(html_elements, ".quote") |>
    map_int(length) |>
    sum()

[1] 100

How many quotes are tagged “Death”?

Solution

library(httr2)
library(tidyverse)
library(rvest)

N_DEATH_QUOTES <- 0

for(page in 1:N_PAGES){
    death_quotes <- request(URL) |>
        req_url_path("page", page) |>
        req_perform() |>
        resp_body_html() |>
        html_elements(".quote .tags") |>
        html_text2() |>
        str_detect("death") |>
        sum()
    
    N_DEATH_QUOTES <- N_DEATH_QUOTES + death_quotes
    
}

N_DEATH_QUOTES

[1] 4

or using purrr-style programming:

library(httr2)
library(tidyverse)
library(rvest)

seq(N_PAGES) |>
    map(\(p) request(URL) |> req_url_path("page", p)) |>
    map(req_perform) |>
    map(resp_body_html) |>
    map(html_elements, ".quote .tags") |>
    map(html_text2) |>
    map_int(\(x) sum(str_detect(x, "death"))) |>
    sum()

[1] 4
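
Note that str_detect(x, "death") matches any tag text containing that substring, so a hypothetical tag such as death-penalty would also be counted. If exact matches are wanted, one option is to compare the individual tag elements instead. The sketch below runs on made-up markup with an assumed .tag class; check the real page structure before relying on it.

```r
library(rvest)
library(stringr)

# Made-up page with three quotes; only the first is tagged exactly "death"
page <- minimal_html('
  <div class="quote"><div class="tags"><a class="tag">death</a> <a class="tag">life</a></div></div>
  <div class="quote"><div class="tags"><a class="tag">death-penalty</a></div></div>
  <div class="quote"><div class="tags"><a class="tag">humor</a></div></div>
')

quotes <- html_elements(page, ".quote")

# Substring match on the combined tag text (the approach used above)
substring_count <- quotes |>
    html_element(".tags") |>
    html_text2() |>
    str_detect("death") |>
    sum()

# Exact match: does any single tag of this quote equal the target?
has_exact_tag <- function(quote, tag) {
    any(html_text2(html_elements(quote, ".tag")) == tag)
}

exact_count <- sum(vapply(
    seq_along(quotes),
    \(i) has_exact_tag(quotes[[i]], "death"),
    logical(1)
))

c(substring = substring_count, exact = exact_count)
```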

How many quotes are by (or at least are attributed to) Albert Einstein?

Solution

library(httr2)
library(tidyverse)
library(rvest)

N_EINSTEIN_QUOTES <- 0

for(page in 1:N_PAGES){
    einstein_quotes <- request(URL) |>
        req_url_path("page", page) |>
        req_perform() |>
        resp_body_html() |>
        html_elements(".quote .author") |>
        html_text2() |>
        str_detect("Einstein") |>
        sum()
    
    N_EINSTEIN_QUOTES <- N_EINSTEIN_QUOTES + einstein_quotes
    
}

N_EINSTEIN_QUOTES

[1] 10

or using purrr-style programming:

library(httr2)
library(tidyverse)
library(rvest)

seq(N_PAGES) |>
    map(\(p) request(URL) |> req_url_path("page", p)) |>
    map(req_perform) |>
    map(resp_body_html) |>
    map(html_elements, ".quote .author") |>
    map(html_text2) |>
    map_int(\(x) sum(str_detect(x, "Einstein"))) |>
    sum()

[1] 10
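
An alternative to writing one loop per question is to turn each page into a small table and then answer questions with dplyr. The sketch below shows the idea on made-up markup (the class names .text and .author are taken from the selectors used above; the quotes themselves are invented). Calling html_element on the set of .quote nodes returns exactly one result per quote, so the columns stay aligned.

```r
library(rvest)
library(dplyr)

# Made-up page with three quotes
page <- minimal_html('
  <div class="quote"><span class="text">Quote one</span><small class="author">Albert Einstein</small></div>
  <div class="quote"><span class="text">Quote two</span><small class="author">Jane Austen</small></div>
  <div class="quote"><span class="text">Quote three</span><small class="author">Albert Einstein</small></div>
')

quotes <- html_elements(page, ".quote")

# One row per quote
QUOTES <- tibble(
    text   = quotes |> html_element(".text") |> html_text2(),
    author = quotes |> html_element(".author") |> html_text2()
)

# Number of quotes per author, most quoted first
author_counts <- QUOTES |> count(author, sort = TRUE)
author_counts
```

With the real site, the per-page tables could be combined with list_rbind, as in the cocktail example.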

What is the longest quote (by number of characters)? The str_length and which.max functions will be helpful.

You can use them like this:
```
x <- c("short", "medium", "very quite long")
longest <- x[which.max(str_length(x))]
```

Solution

library(httr2)
library(tidyverse)
library(rvest)

LONGEST_QUOTE <- ""

for(page in 1:N_PAGES){
    all_quotes <- request(URL) |>
        req_url_path("page", page) |>
        req_perform() |>
        resp_body_html() |>
        html_elements(".quote .text") |>
        html_text2()
        
    longest_on_page <- all_quotes[which.max(str_length(all_quotes))]
    
    if(str_length(longest_on_page) > str_length(LONGEST_QUOTE)){
        LONGEST_QUOTE <- longest_on_page
    }
}

cat("The longest quote is: ")

The longest quote is:

cat(LONGEST_QUOTE)

“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”

cat("which is ", str_length(LONGEST_QUOTE), " characters long.")

which is  1084  characters long.

Of all authors quoted, who has the earliest (estimated) birthday?

Hint: Use the as.Date function to parse these dates. E.g.,
```
x <- "March 14, 1879"
as.Date(x, "%b %d, %Y")
```
```
[1] "1879-03-14"
```
Here, the second argument is a format string, which specifies how the date is written: %b matches the month name, %d the day of the month, and %Y the four-digit year.

Solution

library(httr2)
library(tidyverse)
library(rvest)

ALL_AUTHOR_LINKS <- c()

for(page in 1:N_PAGES){
    author_links <- request(URL) |>
        req_url_path("page", page) |>
        req_perform() |>
        resp_body_html() |>
        html_elements(".quote a") |>
        html_attr("href") |>
        str_subset("author")
    
    ALL_AUTHOR_LINKS <- c(ALL_AUTHOR_LINKS, paste0(URL, author_links))
    
}

ALL_AUTHOR_LINKS <- unique(ALL_AUTHOR_LINKS)

AUTHORS <- map(ALL_AUTHOR_LINKS, read_html) |>
    map(\(h) data.frame(name=h |> html_element(".author-title") |> html_text2(),
                        date=h |> html_element(".author-born-date") |> html_text2())) |>
    list_rbind() |>
    mutate(date = as.Date(date, "%b %d, %Y"))

AUTHORS |> slice_min(date)

         name       date
1 Jane Austen 1775-12-16

Footnotes

While English pluralization rules are tricky, you can just find the words ending with an s.