The Open Trivia Database provides a simple API for trivia questions. You are going to use this API to create a trivia game for you and some friends.
Before beginning, explore the question bank and the API call generator to understand the types of data available and the structure of the API call.
Use the API call generator to create a URL to download 5 multiple-choice questions on a topic of your choosing. Copy this URL.
Now, use the httr2 package in R to make a request to that URL. At first, simply paste the whole URL into request() and confirm it works. Then break the URL into its constituent parts, using the req_url_path() and req_url_query() functions to construct it. (We won’t be parametrizing the API call in this brief exercise, but this is important to do if you are going to make several API calls programmatically.)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```r
library(httr2)

req1 <- request("https://opentdb.com/api.php?amount=50&category=18&type=multiple")

# Decompose the URL into its various components:
req2 <- request("https://opentdb.com") |>
  req_url_path("api.php") |>
  req_url_query(amount = 50, category = 18, type = "multiple")

# Confirm equivalence
all.equal(req1, req2)
```
[1] TRUE
I’m using Computer Science (Category 18), but you can use others.
Now that we have built our request objects, we can use req_perform() to actually make the request:
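In code, that looks something like this minimal sketch (the variable name resp is my choice, not fixed by the exercise):

```r
# Perform the request and keep the response object for the later steps
resp <- req_perform(req2)
resp
```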
<httr2_response>
GET https://opentdb.com/api.php?amount=50&category=18&type=multiple
Status: 200 OK
Content-Type: application/json
Body: In memory (12669 bytes)
The default response format is JSON, so use the resp_body_json() function to convert the response into R. Print the response to screen and examine the format. This will help you know what you need to do to manipulate (“rectangle”) the data into a useful structure.
Hint: To avoid printing so much to screen that it’s hard to make sense of the output, you may wish to modify your API call to temporarily only request 3-5 responses, instead of 50, even if your eventual goal is to get more responses than that.
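One way to do this, assuming resp holds the performed response from above (re-requested with a small amount, per the hint):

```r
# Parse the JSON body into nested R lists; str() gives a compact view
resp |>
  resp_body_json() |>
  str()
```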
List of 2
$ response_code: int 0
$ results :List of 3
..$ :List of 6
.. ..$ type : chr "multiple"
.. ..$ difficulty : chr "medium"
.. ..$ category : chr "Science: Computers"
.. ..$ question : chr "What was the first commerically available computer processor?"
.. ..$ correct_answer : chr "Intel 4004"
.. ..$ incorrect_answers:List of 3
.. .. ..$ : chr "Intel 486SX"
.. .. ..$ : chr "TMS 1000"
.. .. ..$ : chr "AMD AM386"
..$ :List of 6
.. ..$ type : chr "multiple"
.. ..$ difficulty : chr "easy"
.. ..$ category : chr "Science: Computers"
.. ..$ question : chr "What does the \"MP\" stand for in MP3?"
.. ..$ correct_answer : chr "Moving Picture"
.. ..$ incorrect_answers:List of 3
.. .. ..$ : chr "Music Player"
.. .. ..$ : chr "Multi Pass"
.. .. ..$ : chr "Micro Point"
..$ :List of 6
.. ..$ type : chr "multiple"
.. ..$ difficulty : chr "medium"
.. ..$ category : chr "Science: Computers"
.. ..$ question : chr "All of the following programs are classified as raster graphics editors EXCEPT:"
.. ..$ correct_answer : chr "Inkscape"
.. ..$ incorrect_answers:List of 3
.. .. ..$ : chr "Paint.NET"
.. .. ..$ : chr "GIMP"
.. .. ..$ : chr "Adobe Photoshop"
We see here that the response is a list with two components: response_code and results; the content we want is in results. The three results look pretty straightforward, except for the incorrect_answers list, which we will need to keep an eye on.
Use the pluck function to extract the relevant part of the JSON response. The response will have several similarly formatted response objects. Use the map(as_tibble) |> list_rbind() paradigm to convert these into a large data frame.
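Putting those pieces together might look like the following sketch, again assuming resp from above:

```r
resp |>
  resp_body_json() |>
  pluck("results") |>   # drop response_code, keep the questions
  map(as_tibble) |>     # one small tibble per question
  list_rbind()          # stack them into a single data frame
```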
# A tibble: 9 × 6
type difficulty category question correct_answer incorrect_answers
<chr> <chr> <chr> <chr> <chr> <list>
1 multiple hard Science: Comput… Which k… Secret sharin… <chr [1]>
2 multiple hard Science: Comput… Which k… Secret sharin… <chr [1]>
3 multiple hard Science: Comput… Which k… Secret sharin… <chr [1]>
4 multiple medium Science: Comput… .rs is … Serbia <chr [1]>
5 multiple medium Science: Comput… .rs is … Serbia <chr [1]>
6 multiple medium Science: Comput… .rs is … Serbia <chr [1]>
7 multiple medium Science: Comput… How man… 23 <chr [1]>
8 multiple medium Science: Comput… How man… 23 <chr [1]>
9 multiple medium Science: Comput… How man… 23 <chr [1]>
Everything looks pretty good here, with two notes:
The incorrect_answers column appears to be a list-type column.
Each of our responses was repeated across three rows.
Use the unnest and pivot_wider functions to convert this data into a more usable structure.
When applied to a list column, the unnest function will try to convert it into a more standard vector-type column.
Hint: Before using pivot_wider, you will need to create a new column with column names for the pivoted data.
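Here is one possible sketch, assuming the stacked data frame from the previous step is named questions (a hypothetical name):

```r
questions |>
  unnest(incorrect_answers) |>   # list column -> ordinary character column
  group_by(question) |>
  mutate(wrong = paste0("incorrect_", row_number())) |>   # names for pivoting
  ungroup() |>
  pivot_wider(names_from = wrong, values_from = incorrect_answers)
```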
Using rvest, read the following table into R. Your results should be a data frame that captures the structure of the table.
Table 1: Selected Statistics Courses Offered at Baruch College

| Department | Course Number | Course Name |
|------------|---------------|-------------|
| STA | 2000 | Business Statistics I |
| STA | 3000 | Statistical Computing |
| STA | 3154 | Business Statistics II |
| STA | 3950 | Data Mining |
| STA | 4155 | Regression & Forecasting Models |
| STA | 4157 | Analysis of Variance: Principles & Applications |
| STA | 4950 | Machine Learning and AI |
| STA | 9700 | Modern Regression Analysis |
| STA | 9705 | Multivariate Statistical Methods |
| STA | 9708 | Managerial Statistics |
| STA | 9710 | Statistical Methods for Sampling and Audit |
| STA | 9715 | Applied Probability |
| STA | 9719 | Foundations of Statistical Inference |
| STA | 9750 | Software Tools for Data Analysis |
| STA | 9890 | Statistical Learning for Data Mining |
Use the Selector Gadget or similar to identify the HTML elements defining this table. You can do this in one of two ways:
By a specific named element (its HTML id), accessed as html_element("#name")
By a specific element type, accessed as html_element("type")
Use the read_html function to read this entire page into R, then use the html_element function with the selector from the previous step to extract the table.
Use the html_table function to convert this table into a data frame.
This page is pretty simple, so a little bit of element selection and html_table will suffice. For more complex tables, you may need to do more to select the specific table and more to ‘clean up’ that table once it is in R.
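A minimal sketch of the pipeline, with a placeholder URL standing in for wherever this page lives:

```r
library(rvest)

# Placeholder URL; substitute the actual address of this page
page <- read_html("https://example.com/this-page")

courses <- page |>
  html_element("table") |>   # or a named element such as "#course-table"
  html_table()
courses
```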
In this exercise, we’re going to take our Baruch map from Lab #01 and extend it to pull all CUNY campuses. This exercise is intended to help you practice scraping data from Wikipedia, a good example of a relatively plain HTML site where a bit of extra JavaScript can make it tricky to find the element(s) you want.
The following code is a cleaned-up version of the map creation activity from Lab #01. Review it before beginning this activity:
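That code is not reproduced here, but a hypothetical stand-in for its core helper might look like the sketch below; the name get_cuny_coordinates matches its use later in this exercise, and relying on Wikipedia’s span.geo element is my assumption, not necessarily the Lab #01 approach:

```r
library(rvest)

# Hypothetical stand-in: Wikipedia articles with coordinates carry a
# <span class="geo"> element containing "lat; lon"
get_cuny_coordinates <- function(url) {
  coords <- read_html(url) |>
    html_element("span.geo") |>
    html_text2()

  tibble(
    latitude  = as.numeric(str_split_i(coords, ";", 1)),
    longitude = as.numeric(str_split_i(coords, ";", 2))
  )
}

get_cuny_coordinates("https://en.wikipedia.org/wiki/Baruch_College")
```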
# A tibble: 1 × 2
latitude longitude
<dbl> <dbl>
1 40.7 -74.0
Investigate the Wikipedia List of CUNYs and identify which table can be used to get links to each CUNY’s page.
Wikipedia Table Scraping
Wikipedia has tons of excellent data tables which have been collected by incredible volunteers. It can be an incredibly useful source, but there are a few difficulties in getting data out of Wikipedia. In particular, Wikipedia uses a rich JavaScript library to make tables slightly interactive (column sorting, etc.) that modifies the page HTML. R will get the “raw” HTML of the page, so you will need to either disable JavaScript temporarily (tricky) or make sure you are looking at the actual page source that R will see (“View Source”).
As a general rule, try to use standard HTML elements (table, tbody) instead of fancier alternatives when possible.
Parse the table to pull links to each CUNY’s individual Wikipedia page.
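A sketch, under the assumptions that the campus listing is the first wikitable on the page, that campus names link from its first column, and that the article URL below is right; verify all three against the page source:

```r
CUNY_URL <- "https://en.wikipedia.org/wiki/List_of_City_University_of_New_York_institutions"

# Grab the anchor in the first cell of each row of the campus table
campus_links <- read_html(CUNY_URL) |>
  html_element("table.wikitable") |>
  html_elements("tr td:first-child a")

CUNYs <- tibble(
  name = html_text2(campus_links),
  link = xml2::url_absolute(html_attr(campus_links, "href"), CUNY_URL)
)
```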
Apply your HTML parsing function from Step 1 to each CUNY page using the map function or the rowwise() grouping structure:
Hint: You may need to use unnest to convert a list column into something more usable.
Solution
```r
CUNYs <- CUNYs |>
  rowwise() |>
  mutate(info = get_cuny_coordinates(link))

## Or with `map`
CUNYs <- CUNYs |>
  mutate(info = map(link, get_cuny_coordinates))

## Then unnest the info column
CUNYs <- CUNYs |>
  unnest(info)
CUNYs
```
# A tibble: 26 × 4
name link latitude longitude
<chr> <chr> <dbl> <dbl>
1 Baruch College https://… 40.7 -74.0
2 The City College https://… 40.8 -74.0
3 Graduate Center https://… 40.7 -74.0
4 Graduate School of Public Health & Health Policy https://… 40.8 -73.9
5 Guttman Community College https://… 40.8 -74.0
6 Hunter College https://… 40.8 -74.0
7 John Jay College of Criminal Justice https://… 40.8 -74.0
8 Macaulay Honors College https://… 40.8 -74.0
9 Manhattan Community College https://… 40.7 -74.0
10 Newmark Graduate School of Journalism https://… 40.8 -74.0
# ℹ 16 more rows
Hadley Wickham is the author of many of the “tidy tools” we use in this course. He is also an excellent bartender and chef. In this exercise, we are going to web scrape his cocktail recipe book which can be found at https://cocktails.hadley.nz/.
Our goal is to create a data frame that records all 150+ recipes on this site (as rows) and the different ingredients (as columns). This week, we are going to pull the different recipes into R; next week, we will process the text and create our final data frame (so stay tuned!).
Working with your project team, first go through the following steps to build a scraping strategy:
Poke around the website to see how it is organized. Is there a single page listing all of the cocktails? If not, how else can you make sure that you’ve explored the entire site?
Once you know how you’re going to explore the whole site, use your browser tools to see if you can identify an HTML element that corresponds to a single recipe. (This element will occur several times per page.) Remember that you want to select “as small as possible” but no smaller.
Once you have found the right HTML element for a recipe, identify an HTML element that corresponds to i) the title; and ii) individual ingredients.
For this task, you will likely see several recipes more than once. Don’t worry about this for now; we can distinct() out the duplicates later in our analysis. It’s better to be over-inclusive than under-inclusive.
After you have built your plan, it’s time to start putting this all to code.
Write code to get a list of all pages you will need to process. Construct full URLs for your future requests.
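For example, if the site’s navigation links out to one page per cocktail category (an assumption; confirm with your browser’s inspector):

```r
BASE_URL <- "https://cocktails.hadley.nz/"

# Collect every page linked from the site navigation, as absolute URLs
pages <- read_html(BASE_URL) |>
  html_elements("nav a") |>
  html_attr("href") |>
  xml2::url_absolute(BASE_URL)
```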
We can again test this using the recipes we extracted above:
process_recipe(BUBBLES[[1]])
# A tibble: 4 × 2
title ingredients
<chr> <chr>
1 Air mail 3 oz champagne
2 Air mail 1 oz rum, white
3 Air mail ½ oz lime juice
4 Air mail ¼ oz simple syrup
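The definition of process_recipe is not shown above; one hypothetical shape, assuming each recipe element carries an h2 title and a ul of ingredients, would be:

```r
# Hypothetical sketch; confirm the real element names on the site
process_recipe <- function(recipe) {
  tibble(
    title       = recipe |> html_element("h2") |> html_text2(),
    ingredients = recipe |> html_elements("ul li") |> html_text2()
  )
}
```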
Write a function that combines your results from the previous two steps to build a data frame with all recipes on a single page.
Try this out on a fixed URL and confirm it works as expected.
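A sketch of such a function, assuming each recipe sits in its own section element (again an assumption) and the process_recipe sketch from above:

```r
process_page <- function(url) {
  read_html(url) |>
    html_elements("section") |>   # one element per recipe, hypothetically
    map(process_recipe) |>
    list_rbind()
}

# Hypothetical fixed URL for testing; swap in a real page from Step 1
process_page("https://cocktails.hadley.nz/")
```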
# A tibble: 29 × 2
title ingredients
<chr> <chr>
1 "Air mail" 3 oz champagne
2 "Air mail" 1 oz rum, white
3 "Air mail" ½ oz lime juice
4 "Air mail" ¼ oz simple syrup
5 "Campari spritz\n\nfizz" 3 oz champagne
6 "Campari spritz\n\nfizz" 1½ oz Campari
7 "Campari spritz\n\nfizz" 1 oz soda water
8 "Campari spritz\n\nfizz" 2 drops lime acid
9 "Dandelo" 2 oz luxardo bitter bianco
10 "Dandelo" ½ oz lemon juice
# ℹ 19 more rows
Use your function from the previous step with the list of URLs from Step 1 to get all of the recipes on the site. Combine your results into one large data frame that looks something like:
| Name | Ingredient |
|------|------------|
| Daiquiri | 2 oz Rum |
| Daiquiri | 1 oz Lime Juice |
| Daiquiri | 0.75 oz Simple Syrup |
To be polite, you may want to add req_cache() to your requests so you are not requesting the same page many times over.
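One way to wire that in is to fetch pages through httr2 and hand the parsed HTML to rvest; a sketch, with the cache location and helper name being my choices:

```r
# Fetch a page through httr2 with an on-disk cache, then parse as HTML;
# process_page would call this in place of read_html()
read_html_cached <- function(url) {
  request(url) |>
    req_cache(path = file.path(tempdir(), "cocktails-cache")) |>
    req_perform() |>
    resp_body_html()
}

# Map the page scraper over every URL from Step 1 and stack the results
all_recipes <- map(pages, process_page) |>
  list_rbind()
```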
# A tibble: 3,548 × 2
title ingredients
<chr> <chr>
1 "Bachelor" 1 oz rum, dark
2 "Bachelor" 1 oz Meletti
3 "Bachelor" ¼ oz absinthe
4 "Bachelor" 1 dash orange bitters
5 "Bachelor" 1 dash Angostura
6 "Banana Manhattan" 1½ oz rum, aged
7 "Banana Manhattan" ½ oz sweet vermouth
8 "Banana Manhattan" ½ oz banana liqueur
9 "Banana Manhattan" 1 dash absinthe
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows
Save your code for next week. Next week we will investigate string processing and learn how to turn a table like:
| Name | Ingredient |
|------|------------|
| Daiquiri | 2 oz Rum |
| Daiquiri | 1 oz Lime Juice |
| Daiquiri | 0.75 oz Simple Syrup |
into
| Name | Ingredient | Amount |
|------|------------|--------|
| Daiquiri | Rum | 2 |
| Daiquiri | Lime Juice | 1 |
| Daiquiri | Simple Syrup | 0.75 |
FP Solution
If you are comfortable with (or want to become comfortable with) the functional programming tools of purrr, you can write this entire analysis in one swoop:
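The actual pipeline is not reproduced here, but a sketch of the shape it could take, with every selector an assumption rather than the author’s code, is:

```r
all_recipes <- read_html(BASE_URL) |>
  html_elements("nav a") |>                 # one link per category page (assumed)
  html_attr("href") |>
  xml2::url_absolute(BASE_URL) |>
  map(\(page) {
    read_html(page) |>
      html_elements("section") |>           # one element per recipe (assumed)
      map(\(recipe) tibble(
        title       = recipe |> html_element("h2") |> html_text2(),
        ingredients = recipe |> html_elements("ul li") |> html_text2()
      )) |>
      list_rbind()
  }) |>
  list_rbind() |>
  distinct()   # drop recipes that appear on more than one page
```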
# A tibble: 3,548 × 2
title ingredients
<chr> <chr>
1 "Bachelor" 1 oz rum, dark
2 "Bachelor" 1 oz Meletti
3 "Bachelor" ¼ oz absinthe
4 "Bachelor" 1 dash orange bitters
5 "Bachelor" 1 dash Angostura
6 "Banana Manhattan" 1½ oz rum, aged
7 "Banana Manhattan" ½ oz sweet vermouth
8 "Banana Manhattan" ½ oz banana liqueur
9 "Banana Manhattan" 1 dash absinthe
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows
This is clearly quite terse and would benefit from some additional comments, but I’m omitting them so you can try to deconstruct this code yourself if you want to practice functional programming idioms.