STA 9750 Week 11 In-Class Activity: HTML Import

Slides

Review Practice

The Open Trivia Database provides a simple API for trivia questions. You are going to use this API to create a trivia game for you and some friends.

  1. Before beginning, explore the question bank and the API call generator to understand the types of data available and the structure of the API call.
  2. Use the API call generator to create a URL to download 5 multiple-choice questions on a topic of your choosing. Copy this URL.
  3. Now, use the httr2 package in R to make a request to that URL. At first, simply paste the whole URL into request() and confirm it works. Then, break the URL into its constituent parts using the req_url_path() and req_url_query() functions to construct the URL. (We won’t be parametrizing the API call in this brief exercise, but this is important if you are going to make several API calls programmatically.)
library(tidyverse)
library(httr2)
req1 <- request("https://opentdb.com/api.php?amount=50&category=18&type=multiple") 

# Decompose the URL into its various components: 
req2 <- request("https://opentdb.com") |>
          req_url_path("api.php") |>
          req_url_query(amount   = 50, 
                        category = 18, 
                        type     = "multiple")

# Confirm equivalence
all.equal(req1, req2)
[1] TRUE

I’m using Computer Science (Category 18), but you can use others.
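The parametrization mentioned in Step 3 usually means wrapping the request construction in a function, so the same code can serve many API calls. A minimal sketch - the trivia_request() name here is my own, not part of httr2 or the Open Trivia API:

```r
library(httr2)

# Hypothetical helper: build an Open Trivia DB request for any
# amount/category/type combination.
trivia_request <- function(amount, category, type = "multiple") {
    request("https://opentdb.com") |>
        req_url_path("api.php") |>
        req_url_query(amount   = amount,
                      category = category,
                      type     = type)
}

# The hand-built req2 above is then just:
req3 <- trivia_request(amount = 50, category = 18)
```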

Now that we have built our request objects, we can use req_perform() to actually make the request:

resp <- req2 |> req_perform()
resp
<httr2_response>
GET https://opentdb.com/api.php?amount=50&category=18&type=multiple
Status: 200 OK
Content-Type: application/json
Body: In memory (12669 bytes)
  4. The default response format is JSON, so use the resp_body_json() function to convert the response into R. Print the response to screen and examine the format. This will help you know what you need to do to manipulate (“rectangle”) the data into a useful structure.

    Hint: To avoid printing so much to screen that it’s hard to make sense of the output, you may wish to modify your API call to temporarily only request 3-5 responses, instead of 50, even if your eventual goal is to get more responses than that.

request("https://opentdb.com") |>
          req_url_path("api.php") |>
          req_url_query(amount   = 3, 
                        category = 18, 
                        type     = "multiple") |>
    # Some stuff to respect API rate limiting
    req_throttle(capacity=2) |>
    req_retry(max_tries = 5) |>
    req_perform() |>
    resp_body_json() |>
    str()
List of 2
 $ response_code: int 0
 $ results      :List of 3
  ..$ :List of 6
  .. ..$ type             : chr "multiple"
  .. ..$ difficulty       : chr "medium"
  .. ..$ category         : chr "Science: Computers"
  .. ..$ question         : chr "What was the first commerically available computer processor?"
  .. ..$ correct_answer   : chr "Intel 4004"
  .. ..$ incorrect_answers:List of 3
  .. .. ..$ : chr "Intel 486SX"
  .. .. ..$ : chr "TMS 1000"
  .. .. ..$ : chr "AMD AM386"
  ..$ :List of 6
  .. ..$ type             : chr "multiple"
  .. ..$ difficulty       : chr "easy"
  .. ..$ category         : chr "Science: Computers"
  .. ..$ question         : chr "What does the &quot;MP&quot; stand for in MP3?"
  .. ..$ correct_answer   : chr "Moving Picture"
  .. ..$ incorrect_answers:List of 3
  .. .. ..$ : chr "Music Player"
  .. .. ..$ : chr "Multi Pass"
  .. .. ..$ : chr "Micro Point"
  ..$ :List of 6
  .. ..$ type             : chr "multiple"
  .. ..$ difficulty       : chr "medium"
  .. ..$ category         : chr "Science: Computers"
  .. ..$ question         : chr "All of the following programs are classified as raster graphics editors EXCEPT:"
  .. ..$ correct_answer   : chr "Inkscape"
  .. ..$ incorrect_answers:List of 3
  .. .. ..$ : chr "Paint.NET"
  .. .. ..$ : chr "GIMP"
  .. .. ..$ : chr "Adobe Photoshop"

We see here that the response comes in a list of two components: response_code and results, with the content we want being in the results. The 3 results look pretty straightforward, except for the incorrect_answers list, which we will need to keep an eye on.

  5. Use the pluck function to extract the relevant part of the JSON response. The response will have several similarly formatted response objects. Use the map(as_tibble) |> list_rbind() paradigm to convert these into a large data frame.
request("https://opentdb.com") |>
    req_url_path("api.php") |>
    req_url_query(amount   = 3, 
                  category = 18, 
                  type     = "multiple") |>
    # Some stuff to respect API rate limiting
    req_throttle(capacity=2) |>
    req_retry(max_tries = 5) |>
    req_perform() |>
    resp_body_json() |>
    pluck("results") |>
    map(as_tibble) |>
    list_rbind()
# A tibble: 9 × 6
  type     difficulty category         question correct_answer incorrect_answers
  <chr>    <chr>      <chr>            <chr>    <chr>          <list>           
1 multiple hard       Science: Comput… Which k… Secret sharin… <chr [1]>        
2 multiple hard       Science: Comput… Which k… Secret sharin… <chr [1]>        
3 multiple hard       Science: Comput… Which k… Secret sharin… <chr [1]>        
4 multiple medium     Science: Comput… .rs is … Serbia         <chr [1]>        
5 multiple medium     Science: Comput… .rs is … Serbia         <chr [1]>        
6 multiple medium     Science: Comput… .rs is … Serbia         <chr [1]>        
7 multiple medium     Science: Comput… How man… 23             <chr [1]>        
8 multiple medium     Science: Comput… How man… 23             <chr [1]>        
9 multiple medium     Science: Comput… How man… 23             <chr [1]>        

Everything looks pretty good here, with two notes:

  • The incorrect_answers column appears to be a list type column
  • Each of our responses was repeated across three rows.
  6. Use the unnest and pivot_wider functions to convert this data into a more usable structure.

    When applied to a list column, the unnest function will try to convert it into a more standard vector-type column.

    Hint: Before using pivot_wider, you will need to create a new column with column names for the pivoted data.

request("https://opentdb.com") |>
    req_url_path("api.php") |>
    req_url_query(amount   = 3, 
                  category = 18, 
                  type     = "multiple") |>
    # Some stuff to respect API rate limiting
    req_throttle(capacity=2) |>
    req_retry(max_tries = 5) |>
    req_perform() |>
    resp_body_json() |>
    pluck("results") |>
    map(as_tibble) |>
    list_rbind() |>
    unnest(incorrect_answers) |>
    group_by(question) |>
    mutate(pivot_name = paste0("incorrect-", row_number())) |>
    pivot_wider(names_from = pivot_name, 
                values_from = incorrect_answers) |>
    select(-type, 
           -category, 
           -difficulty)
# A tibble: 3 × 5
# Groups:   question [3]
  question              correct_answer `incorrect-1` `incorrect-2` `incorrect-3`
  <chr>                 <chr>          <chr>         <chr>         <chr>        
1 HTML is what type of… Markup Langua… Macro Langua… Programming … Scripting La…
2 Laserjet and inkjet … Non-impact pr… Impact print… Daisywheel p… Dot matrix p…
3 America Online (AOL)… Quantum Link   CompuServe    Prodigy       GEnie        

You can use this to play a bit of trivia amongst your group. Set amount higher for more questions.
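To actually quiz each other from the R console, something along these lines works. This is a sketch assuming the pivoted data frame above, with its correct_answer and incorrect-1 through incorrect-3 columns; the ask_question() name is mine:

```r
# Present one question with its four answers in random order and
# return (invisibly) which position holds the correct answer.
ask_question <- function(row) {
    answers <- sample(c(row$correct_answer,
                        row$`incorrect-1`,
                        row$`incorrect-2`,
                        row$`incorrect-3`))
    cat(row$question, "\n")
    cat(paste0("  ", seq_along(answers), ". ", answers), sep = "\n")
    invisible(which(answers == row$correct_answer))
}
```

Looping over the rows of your trivia table (with a simple for loop, or purrr::transpose()) then turns the data frame into a game.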

Web Scraping

In today’s class, we’re going to work through three web scraping exercises:

  1. Reading a simple static table
  2. Extending our ‘CUNY Map’ from Lab #01
  3. Scraping Cocktail Recipes 🍸 from Hadley Wickham

Exercise 1: Reading HTML Tables

Using rvest, read the following table into R. Your results should be a data frame that captures the structure of the table.

Table 1: Selected Statistics Courses Offered at Baruch College
Department Course Number Course Name
STA 2000 Business Statistics I
STA 3000 Statistical Computing
STA 3154 Business Statistics II
STA 3950 Data Mining
STA 4155 Regression & Forecasting Models
STA 4157 Analysis of Variance: Principles & Applications
STA 4950 Machine Learning and AI
STA 9700 Modern Regression Analysis
STA 9705 Multivariate Statistical Methods
STA 9708 Managerial Statistics
STA 9710 Statistical Methods for Sampling and Audit
STA 9715 Applied Probability
STA 9719 Foundations of Statistical Inference
STA 9750 Software Tools for Data Analysis
STA 9890 Statistical Learning for Data Mining
  1. Use the Selector Gadget or similar to identify the HTML elements defining this table. You can do this in one of two ways:

    • By a specific named element, accessed as html_element("#name")
    • By a specific element type, accessed as html_element("type")
  2. Use the read_html function to read this entire page into R, then use the html_element function with the selector from the previous step to extract the table.

  3. Use the html_table function to convert this table into a data frame.

This page is pretty simple, so a little bit of element selection and html_table will suffice. For more complex tables, you may need to do more work both to select the specific table and to ‘clean up’ that table once it is in R.

library(rvest)
read_html("https://michael-weylandt.com/STA9750/labs/lab11.html") |>
    html_element("#tbl-baruch-stats") |> 
    html_table()

# OR - Since this page has only a single table
read_html("https://michael-weylandt.com/STA9750/labs/lab11.html") |>
    html_element("table") |> 
    html_table()
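To illustrate the sort of ‘clean up’ a messier table might need, here is a sketch on made-up data - real pages often repeat the header row inside the body and use awkward multi-word column names:

```r
library(dplyr)

# Stand-in for a messier html_table() result: a repeated header row
# and multi-word column names.
messy <- tibble::tibble(
    `Course Number` = c("Course Number", "2000", "9750"),
    `Course Name`   = c("Course Name", "Business Statistics I",
                        "Software Tools for Data Analysis")
)

clean <- messy |>
    filter(`Course Number` != "Course Number") |>  # drop the repeated header
    rename(course_number = `Course Number`,
           course_name   = `Course Name`) |>
    mutate(course_number = as.integer(course_number))
```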

Exercise 2: CUNY Map

In this exercise, we’re going to take our Baruch map from Lab #01 and extend it to pull all CUNY campuses. This exercise is intended to help you practice scraping data from Wikipedia, which is a good example of a relatively plain HTML site with a bit of extra JavaScript that can make it tricky to find the element(s) you want.

The following code is a cleaned-up version of the map creation activity from Lab #01. Review it before beginning this activity:

library(tidyverse)
library(rvest)
library(leaflet)

baruch_info <- read_html("https://en.wikipedia.org/wiki/Baruch_College") |> 
    html_element(".geo") |> 
    html_text() |> 
    str_split_1(";") |> 
    as.numeric() |> 
    set_names(c("latitude", "longitude")) |> 
    bind_rows()

leaflet() |>
    addTiles() |>
    setView(lat = baruch_info$latitude, 
            lng = baruch_info$longitude, 
            zoom=17) |>
    addPopups(lat = baruch_info$latitude, 
              lng = baruch_info$longitude, 
               "Look! It's <b>Baruch College</b>!") |>
    print()
  1. Building on the code above, write a function that takes the Wikipedia URL for a CUNY as an argument and returns a data frame with two columns:

    • Latitude
    • Longitude

    Confirm that your function works by applying it to Baruch’s Wikipedia page.

get_cuny_coordinates <- function(url){
    read_html(url) |> 
        html_element(".geo") |> 
        html_text() |> 
        str_split_1(";") |> 
        as.numeric() |> 
        set_names(c("latitude", "longitude")) |> 
        bind_rows()
}

get_cuny_coordinates("https://en.wikipedia.org/wiki/Baruch_College")
# A tibble: 1 × 2
  latitude longitude
     <dbl>     <dbl>
1     40.7     -74.0
  2. Investigate the Wikipedia List of CUNYs and identify which table can be used to get links to each CUNY’s page.
Wikipedia Table Scraping

Wikipedia has tons of excellent data tables collected by dedicated volunteers. It can be an incredibly useful source, but there are a few difficulties in getting data out of Wikipedia. In particular, Wikipedia uses a rich JavaScript library to make tables slightly interactive (column sorting, etc.), which modifies the page HTML. R will get the “raw” HTML of the page, so you will need to either disable JavaScript temporarily (tricky) or make sure you are looking at the actual page source that R will see (“View Source”).

As a general rule, try to use standard HTML elements (table, tbody) instead of fancier alternatives when possible.

  3. Parse the table to pull links to each CUNY’s individual Wikipedia page.
CUNYs <- read_html("https://en.wikipedia.org/wiki/List_of_City_University_of_New_York_institutions") |> 
    html_element("tbody") |>
    html_elements("tr td:nth-child(2)") |>
    html_elements("a") 

CUNYs <- tibble(name = CUNYs |> html_text(), 
                link = CUNYs |> html_attr("href"))

If you are comfortable with the purrr family of functions, this can be written a bit more compactly as:

CUNYs <- read_html("https://en.wikipedia.org/wiki/List_of_City_University_of_New_York_institutions") |> 
    html_element("tbody") |>
    html_elements("tr td:nth-child(2)") |>
    html_elements("a") |> 
    map(\(tr) tibble(name = tr |> html_text(), 
                     link = tr |> html_attr("href"))) |> 
    bind_rows()
  4. Modify the links extracted from the Wikipedia table to form full URLs.

    Proper URL handling is a bit subtle - it’s a common avenue for security holes - but a simple paste0 will suffice here.
CUNYs <- CUNYs |> mutate(link = paste0("https://en.wikipedia.org", link))
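If you want something sturdier than string pasting, the xml2 package (which rvest builds on) provides url_absolute() to resolve a relative link against a base URL following the standard resolution rules:

```r
library(xml2)

# url_absolute() follows standard URL resolution rules, so it also
# handles links like "../page" or "//host/page" correctly.
url_absolute("/wiki/Baruch_College", "https://en.wikipedia.org")
#> [1] "https://en.wikipedia.org/wiki/Baruch_College"
```

Since url_absolute() is vectorized over its first argument, it can be used directly inside the mutate() call above in place of paste0().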
  5. Apply your HTML parsing function from Step 1 to each CUNY page using the map function or the rowwise() grouping structure:

    Hint: You may need to use unnest to convert a list column into something more usable.

CUNYs <- CUNYs |> rowwise() |> mutate(info = get_cuny_coordinates(link))
## Or with `map`
CUNYs <- CUNYs |> mutate(info = map(link, get_cuny_coordinates))

## Then unnest the info column

CUNYs <- CUNYs |> unnest(info)

CUNYs
# A tibble: 26 × 4
   name                                             link      latitude longitude
   <chr>                                            <chr>        <dbl>     <dbl>
 1 Baruch College                                   https://…     40.7     -74.0
 2 The City College                                 https://…     40.8     -74.0
 3 Graduate Center                                  https://…     40.7     -74.0
 4 Graduate School of Public Health & Health Policy https://…     40.8     -73.9
 5 Guttman Community College                        https://…     40.8     -74.0
 6 Hunter College                                   https://…     40.8     -74.0
 7 John Jay College of Criminal Justice             https://…     40.8     -74.0
 8 Macaulay Honors College                          https://…     40.8     -74.0
 9 Manhattan Community College                      https://…     40.7     -74.0
10 Newmark Graduate School of Journalism            https://…     40.8     -74.0
# ℹ 16 more rows
  6. Use your results to build a CUNY-wide map.
MAP <- leaflet() |> 
    addTiles() |> 
    addMarkers(lng = CUNYs$longitude, 
               lat=CUNYs$latitude, 
               popup=CUNYs$name)

MAP

Exercise 3: Cocktails Data

Hadley Wickham is the author of many of the “tidy tools” we use in this course. He is also an excellent bartender and chef. In this exercise, we are going to web scrape his cocktail recipe book which can be found at https://cocktails.hadley.nz/.

Our goal is to create a data frame that records all 150+ recipes on this site (as rows) and the different ingredients (as columns). This week, we are going to pull the different recipes into R; next week we are going to process the text and create our final data frame (so stay tuned!).

Working with your project team, first go through the following steps to build a scraping strategy:

  1. Poke around the website to see how it is organized. Is there a single page listing all of the cocktails? If not, how else can you make sure that you’ve explored the entire site?
  2. Once you know how you’re going to explore the whole site, use your browser tools to see if you can identify an HTML element that corresponds to a single recipe. (This element will occur several times per page) Remember that you want to select “as small as possible” but no smaller.
  3. Once you have found the right HTML element for a recipe, identify an HTML element that corresponds to i) the title; and ii) individual ingredients.

For this task, you will likely see several recipes more than once. Don’t worry about this for now - we can use distinct() to drop the duplicates later in our analysis. It’s better to be over-inclusive than under-inclusive.
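The de-duplication itself is a one-liner once the combined data frame exists; on made-up data:

```r
library(dplyr)

# A recipe that appears on two ingredient pages shows up twice...
dupes <- tibble::tibble(
    title       = c("Daiquiri", "Daiquiri", "Daiquiri", "Daiquiri"),
    ingredients = c("2 oz rum", "1 oz lime juice",
                    "2 oz rum", "1 oz lime juice")
)

# ...and distinct() keeps exactly one copy of each row.
dupes |> distinct()
```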

After you have built your plan, it’s time to start putting this all to code.

  1. Write code to get a list of all pages you will need to process. Construct full URLs for your future requests.
library(tidyverse)
library(httr2)
library(rvest)

BASE_URL <- "https://cocktails.hadley.nz/"

PAGES <- read_html(BASE_URL) |> 
            html_elements("nav a") |> 
            html_attr("href")

PAGE_URLS <- paste0(BASE_URL, PAGES)


PAGE_URLS
 [1] "https://cocktails.hadley.nz/ingredient-absinthe.html"             
 [2] "https://cocktails.hadley.nz/ingredient-allspice-liqueur.html"     
 [3] "https://cocktails.hadley.nz/ingredient-ancho-reyes.html"          
 [4] "https://cocktails.hadley.nz/ingredient-angostura.html"            
 [5] "https://cocktails.hadley.nz/ingredient-aperol.html"               
 [6] "https://cocktails.hadley.nz/ingredient-apricot-liqueur.html"      
 [7] "https://cocktails.hadley.nz/ingredient-averna.html"               
 [8] "https://cocktails.hadley.nz/ingredient-banana-liqueur.html"       
 [9] "https://cocktails.hadley.nz/ingredient-benedictine.html"          
[10] "https://cocktails.hadley.nz/ingredient-blackberry-liqueur.html"   
[11] "https://cocktails.hadley.nz/ingredient-blanc-vermouth.html"       
[12] "https://cocktails.hadley.nz/ingredient-bourbon.html"              
[13] "https://cocktails.hadley.nz/ingredient-brandy.html"               
[14] "https://cocktails.hadley.nz/ingredient-campari.html"              
[15] "https://cocktails.hadley.nz/ingredient-champagne.html"            
[16] "https://cocktails.hadley.nz/ingredient-cherry-liqueur.html"       
[17] "https://cocktails.hadley.nz/ingredient-coconut-cream.html"        
[18] "https://cocktails.hadley.nz/ingredient-coffee-liqueur.html"       
[19] "https://cocktails.hadley.nz/ingredient-cream.html"                
[20] "https://cocktails.hadley.nz/ingredient-cr-me-de-cacao.html"       
[21] "https://cocktails.hadley.nz/ingredient-cr-me-de-cassis.html"      
[22] "https://cocktails.hadley.nz/ingredient-cynar.html"                
[23] "https://cocktails.hadley.nz/ingredient-dry-vermouth.html"         
[24] "https://cocktails.hadley.nz/ingredient-elderflower-liqueur.html"  
[25] "https://cocktails.hadley.nz/ingredient-espresso.html"             
[26] "https://cocktails.hadley.nz/ingredient-fernet.html"               
[27] "https://cocktails.hadley.nz/ingredient-gin.html"                  
[28] "https://cocktails.hadley.nz/ingredient-ginger-liqueur.html"       
[29] "https://cocktails.hadley.nz/ingredient-grapefruit-bitters.html"   
[30] "https://cocktails.hadley.nz/ingredient-grapefruit-juice.html"     
[31] "https://cocktails.hadley.nz/ingredient-green-chartreuse.html"     
[32] "https://cocktails.hadley.nz/ingredient-grenadine.html"            
[33] "https://cocktails.hadley.nz/ingredient-herbstura.html"            
[34] "https://cocktails.hadley.nz/ingredient-honey-syrup.html"          
[35] "https://cocktails.hadley.nz/ingredient-lemon-juice.html"          
[36] "https://cocktails.hadley.nz/ingredient-lillet-blanc.html"         
[37] "https://cocktails.hadley.nz/ingredient-lime-juice.html"           
[38] "https://cocktails.hadley.nz/ingredient-luxardo-bitter-bianco.html"
[39] "https://cocktails.hadley.nz/ingredient-maple-syrup.html"          
[40] "https://cocktails.hadley.nz/ingredient-maraschino-liqueur.html"   
[41] "https://cocktails.hadley.nz/ingredient-meletti.html"              
[42] "https://cocktails.hadley.nz/ingredient-mezcal.html"               
[43] "https://cocktails.hadley.nz/ingredient-orange-bitters.html"       
[44] "https://cocktails.hadley.nz/ingredient-orange-liqueur.html"       
[45] "https://cocktails.hadley.nz/ingredient-orgeat.html"               
[46] "https://cocktails.hadley.nz/ingredient-peychauds.html"            
[47] "https://cocktails.hadley.nz/ingredient-pimms.html"                
[48] "https://cocktails.hadley.nz/ingredient-pineapple.html"            
[49] "https://cocktails.hadley.nz/ingredient-pineapple-juice.html"      
[50] "https://cocktails.hadley.nz/ingredient-ramazotti.html"            
[51] "https://cocktails.hadley.nz/ingredient-rhubarb-bitters.html"      
[52] "https://cocktails.hadley.nz/ingredient-rhum-agricole.html"        
[53] "https://cocktails.hadley.nz/ingredient-rum.html"                  
[54] "https://cocktails.hadley.nz/ingredient-rye.html"                  
[55] "https://cocktails.hadley.nz/ingredient-scotch.html"               
[56] "https://cocktails.hadley.nz/ingredient-sherry.html"               
[57] "https://cocktails.hadley.nz/ingredient-simple-syrup.html"         
[58] "https://cocktails.hadley.nz/ingredient-sloe-gin.html"             
[59] "https://cocktails.hadley.nz/ingredient-soda-water.html"           
[60] "https://cocktails.hadley.nz/ingredient-swedish-punsch.html"       
[61] "https://cocktails.hadley.nz/ingredient-sweet-vermouth.html"       
[62] "https://cocktails.hadley.nz/ingredient-tequila.html"              
[63] "https://cocktails.hadley.nz/ingredient-velvet-falernum.html"      
[64] "https://cocktails.hadley.nz/ingredient-vodka.html"                
[65] "https://cocktails.hadley.nz/ingredient-yellow-chartreuse.html"    
[66] "https://cocktails.hadley.nz/tag-3-star.html"                      
[67] "https://cocktails.hadley.nz/tag-adventurous.html"                 
[68] "https://cocktails.hadley.nz/tag-bubbles.html"                     
[69] "https://cocktails.hadley.nz/tag-daiquiri.html"                    
[70] "https://cocktails.hadley.nz/tag-fizz.html"                        
[71] "https://cocktails.hadley.nz/tag-flip.html"                        
[72] "https://cocktails.hadley.nz/tag-manhattan.html"                   
[73] "https://cocktails.hadley.nz/tag-negroni.html"                     
[74] "https://cocktails.hadley.nz/tag-sazerac.html"                     
[75] "https://cocktails.hadley.nz/tag-sour.html"                        
[76] "https://cocktails.hadley.nz/tag-tiki.html"                        
[77] "https://cocktails.hadley.nz/tag-vieux-carr.html"                  
  2. Write a function that takes a single URL and extracts all recipes on that page as HTML elements.

    Try this out on a fixed URL and confirm that it gets the right number of recipes.

library(tidyverse)
library(httr2)
library(rvest)
get_recipes <- function(url){
    read_html(url) |> html_elements("article")
}

We can test this against this page and see that it works as expected:

BUBBLES <- get_recipes("https://cocktails.hadley.nz/tag-bubbles.html")

BUBBLES
{xml_nodeset (7)}
[1] <article class="cocktail" id="air-mail"><div class="title">\n      <h2>\n ...
[2] <article class="cocktail" id="campari-spritz"><div class="title">\n       ...
[3] <article class="cocktail" id="dandelo"><div class="title">\n      <h2>\n  ...
[4] <article class="cocktail" id="french-75"><div class="title">\n      <h2>\ ...
[5] <article class="cocktail" id="hinky-dinks-fizzy"><div class="title">\n    ...
[6] <article class="cocktail" id="kir-royale"><div class="title">\n      <h2> ...
[7] <article class="cocktail" id="negroni-sbagliato"><div class="title">\n    ...
  3. Write a function that takes in a single recipe and returns a small data frame with the title and ingredients:
library(tidyverse)
library(httr2)
library(rvest)
process_recipe <- function(art){
    title <- art |> html_element(".title") |> html_text2()
    ingredients <- art |> html_elements("li") |> html_text2()
    
    tibble(title=title, ingredients=ingredients)
}

We can again test this using the recipes we extracted above:

process_recipe(BUBBLES[[1]])
# A tibble: 4 × 2
  title    ingredients      
  <chr>    <chr>            
1 Air mail 3 oz champagne   
2 Air mail 1 oz rum, white  
3 Air mail ½ oz lime juice  
4 Air mail ¼ oz simple syrup
  4. Write a function that combines your results from the previous two steps to build a data frame with all recipes on a single page.

    Try this out on a fixed URL and confirm it works as expected.

library(tidyverse)
library(httr2)
library(rvest)

process_url <- function(url){
    html_recipes <- get_recipes(url)
    
    RECIPES <- data.frame()
    
    for(r in html_recipes){
        RECIPES <- rbind(RECIPES, process_recipe(r))
    }
    
    RECIPES
}

We can test this on the page of bubbly recipes:

process_url("https://cocktails.hadley.nz/tag-bubbles.html")
# A tibble: 29 × 2
   title                    ingredients               
   <chr>                    <chr>                     
 1 "Air mail"               3 oz champagne            
 2 "Air mail"               1 oz rum, white           
 3 "Air mail"               ½ oz lime juice           
 4 "Air mail"               ¼ oz simple syrup         
 5 "Campari spritz\n\nfizz" 3 oz champagne            
 6 "Campari spritz\n\nfizz" 1½ oz Campari             
 7 "Campari spritz\n\nfizz" 1 oz soda water           
 8 "Campari spritz\n\nfizz" 2 drops lime acid         
 9 "Dandelo"                2 oz luxardo bitter bianco
10 "Dandelo"                ½ oz lemon juice          
# ℹ 19 more rows
  5. Use your function from the previous step with the list of URLs from Step 1 so that you get all of the recipes on the site. Combine your results into one large data frame that looks something like:

    Name Ingredient
    Daiquiri 2 oz Rum
    Daiquiri 1 oz Lime Juice
    Daiquiri 0.75 oz Simple Syrup

    To be polite, you may want to add req_cache() to your requests so you are not requesting the same page many times over.

library(tidyverse)
library(httr2)
library(rvest)

ALL_RECIPES <- data.frame()

for(page in PAGE_URLS){
    ALL_RECIPES <- rbind(ALL_RECIPES, process_url(page))
}

ALL_RECIPES
# A tibble: 3,548 × 2
   title                                ingredients          
   <chr>                                <chr>                
 1 "Bachelor"                           1 oz rum, dark       
 2 "Bachelor"                           1 oz Meletti         
 3 "Bachelor"                           ¼ oz absinthe        
 4 "Bachelor"                           1 dash orange bitters
 5 "Bachelor"                           1 dash Angostura     
 6 "Banana Manhattan"                   1½ oz rum, aged      
 7 "Banana Manhattan"                   ½ oz sweet vermouth  
 8 "Banana Manhattan"                   ½ oz banana liqueur  
 9 "Banana Manhattan"                   1 dash absinthe      
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows
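One way to work req_cache() (per the hint above) into this solution is to fetch each page through httr2 instead of calling read_html() on the URL directly. This is a sketch with an arbitrary cache directory; note that req_cache() honors the server’s caching headers, so how much is actually reused depends on the site:

```r
library(httr2)
library(rvest)

# Fetch a page through httr2 with an on-disk cache, then hand the
# body to rvest. tempdir() only persists for the current R session;
# use a fixed directory for a longer-lived cache.
read_cached_html <- function(url) {
    request(url) |>
        req_cache(file.path(tempdir(), "cocktails-cache")) |>
        req_perform() |>
        resp_body_html()
}

# get_recipes() then only needs its first call swapped:
get_recipes <- function(url) {
    read_cached_html(url) |> html_elements("article")
}
```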
  6. Save your code for next week. Next week we will investigate string processing and learn how to turn a table like:

    Name Ingredient
    Daiquiri 2 oz Rum
    Daiquiri 1 oz Lime Juice
    Daiquiri 0.75 oz Simple Syrup

    into

    Name Ingredient Amount
    Daiquiri Rum 2
    Daiquiri Lime Juice 1
    Daiquiri Simple Syrup 0.75

If you are comfortable with (or want to become comfortable with) the functional programming tools of purrr, you can write this entire analysis in one swoop:

library(tidyverse)
library(httr2)
library(rvest)
process_recipe <- function(recipe){
    title <- recipe |> html_element(".title") |> html_text2()
    ingredients <- recipe |> html_elements("li") |> html_text2()
    
    tibble(title=title, ingredients=ingredients)
}

BASE_URL <- "https://cocktails.hadley.nz/"

read_html(BASE_URL) |> 
    html_elements("nav a") |> 
    html_attr("href") |>
    map(\(p) request(BASE_URL) |> req_url_path(p)) |>
    map(req_perform) |>
    map(resp_body_html) |> 
    map(html_elements, "article") |> 
    reduce(c) |>
    map(process_recipe) |> 
    list_rbind()
# A tibble: 3,548 × 2
   title                                ingredients          
   <chr>                                <chr>                
 1 "Bachelor"                           1 oz rum, dark       
 2 "Bachelor"                           1 oz Meletti         
 3 "Bachelor"                           ¼ oz absinthe        
 4 "Bachelor"                           1 dash orange bitters
 5 "Bachelor"                           1 dash Angostura     
 6 "Banana Manhattan"                   1½ oz rum, aged      
 7 "Banana Manhattan"                   ½ oz sweet vermouth  
 8 "Banana Manhattan"                   ½ oz banana liqueur  
 9 "Banana Manhattan"                   1 dash absinthe      
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows

This is clearly quite terse and would benefit from some additional comments, but I’m omitting them so you can try to deconstruct this code yourself if you want to practice functional programming idioms.