STA 9750 Week 9 In-Class Activity: Data Import
Slides
Review Practice
The data can be found online at https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/candy-power-ranking/candy-data.csv.
The data looks like this:
Rows: 85
Columns: 13
$ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ fruity <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
$ caramel <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ peanutyalmondy <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nougat <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hard <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
$ bar <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ pluribus <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
$ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
Read this data into R using the read_csv function from the readr package; note that the read_csv function can take a URL as an input so you don’t need to directly download this data set.
Once you have downloaded the candy data, create visualizations that address the following questions:
- Do people prefer more sugary candies? (Think OLS)
- Do people prefer more expensive candies? (Think OLS)
- Do people prefer chocolate candies? (Think ANOVA)
For this question, we only know whether a candy is chocolate-based or not, so a linear model is not the most straightforward approach. We can instead use an ANOVA-inspired bar plot.1
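One possible sketch of the import and the three visualizations (the plot choices here are just one reasonable option, not the required solution):

```r
library(readr)
library(ggplot2)

candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/candy-power-ranking/candy-data.csv")

# Sugar vs. win percentage, with an OLS trend line
ggplot(candy, aes(x = sugarpercent, y = winpercent)) +
  geom_point() +
  geom_smooth(method = "lm")

# Price vs. win percentage, same idea
ggplot(candy, aes(x = pricepercent, y = winpercent)) +
  geom_point() +
  geom_smooth(method = "lm")

# ANOVA-inspired comparison: mean win percentage by chocolate status
ggplot(candy, aes(x = factor(chocolate), y = winpercent)) +
  stat_summary(fun = mean, geom = "col") +
  labs(x = "Chocolate?")
```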
Using the File System
Exercises #01
- Use the `getwd()` function to find your current working directory. For the following exercises to be most useful, you will want to ensure that this is your `STA9750-2025-FALL` course directory. If you have somehow changed it, please use the `setwd()` function to return to the proper directory.
- In your `STA9750-2025-FALL` directory, you have a file called `index.qmd`. Use the `path` function from the `fs` package to create a `path` object pointing to that file and then use the `path_abs` function to convert that relative path to an absolute path.
Note that these solutions will give different answers on your personal computer than in this web interface. That is to be expected: we are going outside of R and interacting with the system as a whole.
- A common pattern in working with files is finding all files whose names match a certain pattern. This is known colloquially as globbing. The `dir_ls` function can be used to list all files in a directory. First use it to list all files in your `STA9750-2025-FALL` course directory; then use the `glob` argument to list only the `qmd` files in that directory.
  Hint: The `glob` argument should be a string containing asterisks wherever any match is allowed. For example, `*.csv` will match all CSV files, `abc*` will match all files whose names start with `abc`, and `abc*.csv` will match files whose names start with `abc` and end with `.csv`.
- The `dir_info` function can be used to create a data frame with information about all files in a directory. Use it and `dplyr` functionality to find the largest file in your project folder. Then return its absolute path.
- Often, we want to perform a search over all files and folders in a directory recursively - that is, looking inside subfolders, not just at files in the top level. The `recurse=TRUE` argument to `dir_info` will return information about all files contained in any level of the directory. Use `dir_info` to search the `data` directory you have created to store Mini-Project data and find the smallest data file.
  Hint: You will need to pass a `path` argument to `dir_info` so that you search only your `data` folder and not your entire course directory.
- Determine how much storage is used by the 5 largest files in any part of your project directory.
- Use the `file_exists` function to determine whether you have started Mini-Project #03.
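The file-system exercises above can be sketched roughly as follows (the paths and the Mini-Project file name follow the course layout described above; adjust them to your own setup):

```r
library(fs)
library(dplyr)

getwd()                          # confirm the working directory
path_abs(path("index.qmd"))      # relative path -> absolute path

dir_ls(".", glob = "*.qmd")      # glob: only the .qmd files

# Largest file anywhere in the project, returned as an absolute path
dir_info(".", recurse = TRUE) |>
  slice_max(size, n = 1) |>
  pull(path) |>
  path_abs()

file_exists("mp03.qmd")          # have we started Mini-Project #03?
```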
HTTP and Web Access
Exercises #02
Earlier, we used the fact that read_csv could download a file automatically for us to read in the 538 Candy Data. We are now going to download the same data using httr2.
- Use the `httr2` package to build a suitable `request` object. You should build your request in two steps:
  - Specify the domain using `request`
  - Add the path using `req_url_path`
- Now that you have built a request, perform it and check to make sure your request was successful.
- Because your request was successful, it will have a body. Extract this body as a string and pass it to `read_csv` to read the data into R.
- Modify your analysis to download your `mp01.qmd` file. To find the appropriate URL, you will need to first find the file in the web interface for your GitHub repository and then click the `Raw` button to get a direct link to the file. Make a request to get this data and check whether the request was successful.
- Next, modify our analysis to check whether `mp04.qmd` has already been uploaded. Note that this will throw an error because a `404` is returned, indicating that you have not yet uploaded `mp04.qmd`. We will address this in the next step.
- Obviously, that error is a bit of a problem. Let's change how our request handles errors. The `req_error` function can be used to modify a request to change whether an error is thrown. As its argument, it takes a function that checks for an error. Since we want to never throw an error, define a function with one argument `x` that always returns `FALSE` and pass this to `req_error`. (Note that `req_error` must come before `req_perform` since it changes how a request is performed.) Use this in conjunction with the `resp_is_error` function to check that your request fails.
Using an anonymous function (e.g., `\(x) FALSE`), we can write this a bit more compactly.
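For instance, a sketch of the full check might look like this (the repository path is a placeholder for your own username and repository; `req_error` comes before `req_perform`):

```r
library(httr2)

resp <- request("https://raw.githubusercontent.com") |>
  req_url_path("<username>/STA9750-2025-FALL/main/mp04.qmd") |>  # placeholder path
  req_error(is_error = \(x) FALSE) |>  # never treat any HTTP status as an error
  req_perform()

resp_is_error(resp)  # TRUE if the file is missing (e.g., a 404 was returned)
```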
- Package your code into a function that takes an integer argument and tests whether that mini-project is missing from your GitHub.
  Hint: The `glue` package may be useful to make strings here: `glue("mp0{N}.qmd")` will automatically substitute the value of the variable `N` for you.
- Finally, check which of your MPs have not yet been submitted.
  Unfortunately, your function is not vectorized, so this is not as simple as `func(1:4)`. Instead, we want to apply the same function separately to each element of the vector `c(1, 2, 3, 4)`. This is a use case for the `map` family of functions. Since your function returns a logical value, we'll use `map_lgl` here.
This example is maybe a bit silly, but it is essentially how I check whether your mini-projects have been submitted on time. (I map over students rather than the project numbers.)
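Putting the pieces together, one sketch of such a checker (the GitHub username below is a placeholder you would replace with your own):

```r
library(httr2)
library(glue)
library(purrr)

# Does mini-project N appear to be missing from the GitHub repo?
# NB: "<username>" is a placeholder, not a real account.
mp_missing <- function(N) {
  request("https://raw.githubusercontent.com") |>
    req_url_path(glue("<username>/STA9750-2025-FALL/main/mp0{N}.qmd")) |>
    req_error(is_error = \(x) FALSE) |>  # never throw on HTTP errors
    req_perform() |>
    resp_is_error()                      # TRUE when, e.g., a 404 came back
}

map_lgl(1:4, mp_missing)
```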
API Usage
Next, we are going to practice accessing data from a nice API using httr2. Specifically, we are going to interact with the cranlogs server, which keeps records of the most popular R packages based on download frequency.
Documentation for cranlogs can be found in its GitHub README with a very small example at here.2
The cranlogs documentation gives the following example of how the curl program can call the API from the command line:
curl https://cranlogs.r-pkg.org/downloads/total/last-week/ggplot2
[{"start":"2025-10-22","end":"2025-10-28","downloads":510746,"package":"ggplot2"}]
Even though this is not R code, we can emulate this action in R.
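One simple way to do so (a sketch; this just fetches the endpoint and returns the raw JSON response as text):

```r
# Fetch the same endpoint from R; the result is the unparsed JSON string
readLines("https://cranlogs.r-pkg.org/downloads/total/last-week/ggplot2")
```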
And if we want to get download information for other packages, we can simply modify the final component of the URL, and so on. But this quickly becomes repetitive, and we would prefer a programmatic interface. This is where the httr2 package comes in.
httr2
httr2 takes a “three-stage” approach to handling HTTP requests:
- First, we build a request, specifying the URL to be queried, the mode of that query, and any relevant information (data, passwords, etc.)
- Then, we execute that request to get a response
- Finally, we handle that response, transforming its contents into R objects as appropriate
Let’s look at these one at a time.
Build a Request
We can build a request using the request function:
function (base_url)
{
new_request(base_url)
}
<bytecode: 0x1387b36b8>
<environment: namespace:httr2>
As seen here, we start a request by putting in a “base URL” - this is the unchanging part of the URL that won’t really depend on what query we are making.
For the cranlogs API, we can take the base URL to be
https://cranlogs.r-pkg.org
so our base request is:
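The request object printed below can be reconstructed as (assuming `httr2` is loaded):

```r
library(httr2)

my_req <- request("https://cranlogs.r-pkg.org")
print(my_req)
```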
<httr2_request>
GET https://cranlogs.r-pkg.org
Body: empty
We see here that this is a GET request by default, indicating we would like to retrieve a response from the server, but we are not sending any data of our own (the request body is empty).
We then modify the path of the request to point to the specific resource or endpoint we want to query. For example, if we want to get the “top” R packages for the last day, we can run the following:
my_req <- my_req |>
req_url_path_append("top") |>
req_url_path_append("last-day")
print(my_req)
<httr2_request>
GET https://cranlogs.r-pkg.org/top/last-day
Body: empty
Execute a Request to get a Response
Now that we have built our request, we pass it to req_perform3 to execute (perform) the request:
my_resp <- req_perform(my_req)
print(my_resp)
<httr2_response>
GET https://cranlogs.r-pkg.org/top/last-day
Status: 200 OK
Content-Type: application/json
Body: In memory (454 bytes)
The result of performing this request is a response object. We see several things in this response:
- We received back a “200 OK” response, indicating that our query worked perfectly
- We received back data in a `json` format
- Our results are currently in memory (as opposed to being saved to a file)
Process the Response for Use in R
Since we know our response is in JSON format, we can use the `resp_body_json` function to get the "body" (content) of the response and parse it as JSON:
downloads_raw <- resp_body_json(my_resp)
print(downloads_raw)$start
[1] "2025-10-28T00:00:00.000Z"
$end
[1] "2025-10-28T00:00:00.000Z"
$downloads
$downloads[[1]]
$downloads[[1]]$package
[1] "ggplot2"
$downloads[[1]]$downloads
[1] "92101"
$downloads[[2]]
$downloads[[2]]$package
[1] "dplyr"
$downloads[[2]]$downloads
[1] "74923"
$downloads[[3]]
$downloads[[3]]$package
[1] "rlang"
$downloads[[3]]$downloads
[1] "71041"
$downloads[[4]]
$downloads[[4]]$package
[1] "tibble"
$downloads[[4]]$downloads
[1] "68155"
$downloads[[5]]
$downloads[[5]]$package
[1] "cli"
$downloads[[5]]$downloads
[1] "67979"
$downloads[[6]]
$downloads[[6]]$package
[1] "rmarkdown"
$downloads[[6]]$downloads
[1] "65480"
$downloads[[7]]
$downloads[[7]]$package
[1] "xml2"
$downloads[[7]]$downloads
[1] "63635"
$downloads[[8]]
$downloads[[8]]$package
[1] "magrittr"
$downloads[[8]]$downloads
[1] "60958"
$downloads[[9]]
$downloads[[9]]$package
[1] "lifecycle"
$downloads[[9]]$downloads
[1] "60082"
This gives us the type of data we were looking for!
Note that httr2 is designed for “piped” work, so we can write the entire process as
request("https://cranlogs.r-pkg.org") |>
req_url_path_append("top") |>
req_url_path_append("last-day") |>
req_perform() |>
resp_body_json()
$start
[1] "2025-10-28T00:00:00.000Z"
$end
[1] "2025-10-28T00:00:00.000Z"
$downloads
$downloads[[1]]
$downloads[[1]]$package
[1] "ggplot2"
$downloads[[1]]$downloads
[1] "92101"
$downloads[[2]]
$downloads[[2]]$package
[1] "dplyr"
$downloads[[2]]$downloads
[1] "74923"
$downloads[[3]]
$downloads[[3]]$package
[1] "rlang"
$downloads[[3]]$downloads
[1] "71041"
$downloads[[4]]
$downloads[[4]]$package
[1] "tibble"
$downloads[[4]]$downloads
[1] "68155"
$downloads[[5]]
$downloads[[5]]$package
[1] "cli"
$downloads[[5]]$downloads
[1] "67979"
$downloads[[6]]
$downloads[[6]]$package
[1] "rmarkdown"
$downloads[[6]]$downloads
[1] "65480"
$downloads[[7]]
$downloads[[7]]$package
[1] "xml2"
$downloads[[7]]$downloads
[1] "63635"
$downloads[[8]]
$downloads[[8]]$package
[1] "magrittr"
$downloads[[8]]$downloads
[1] "60958"
$downloads[[9]]
$downloads[[9]]$package
[1] "lifecycle"
$downloads[[9]]$downloads
[1] "60082"
This data is not super helpful for us, since it's in a "list of lists" format. This is not uncommon with JSON responses, and it is usually at this point that we have a bit of work to do to make the data usable. Thankfully, API data is typically well-structured, so this doesn't wind up being too hard. I personally find this type of complex R output a bit hard to parse, so I instead print it as a "string" (the 'raw text' of the unparsed JSON) and use the prettify() function from the jsonlite package to make it extra readable:
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
my_resp |>
resp_body_string() |>
prettify()
{
"start": "2025-10-28T00:00:00.000Z",
"end": "2025-10-28T00:00:00.000Z",
"downloads": [
{
"package": "ggplot2",
"downloads": "92101"
},
{
"package": "dplyr",
"downloads": "74923"
},
{
"package": "rlang",
"downloads": "71041"
},
{
"package": "tibble",
"downloads": "68155"
},
{
"package": "cli",
"downloads": "67979"
},
{
"package": "rmarkdown",
"downloads": "65480"
},
{
"package": "xml2",
"downloads": "63635"
},
{
"package": "magrittr",
"downloads": "60958"
},
{
"package": "lifecycle",
"downloads": "60082"
}
]
}
This is the same data as before, but much easier to read. At this point, we should pause and make an ‘attack plan’ for our analysis. I see several things here:
- I really only want the `"downloads"` part of the response.
- Each element inside `downloads` has the same flat structure, so it can easily be built into a one-row data frame.
- The column names are the same for each `downloads` element, so we will be able to put them into one big easy-to-use data frame.
To do these steps, we will need to use functionality from the purrr package, which we will discuss in more detail next week. For now, it suffices to run:
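A sketch of that pipeline (assuming `purrr` and `tibble` are available):

```r
library(purrr)
library(tibble)

downloads_df <- downloads_raw |>
  pluck("downloads") |>   # keep only the "downloads" portion
  map(as_tibble) |>       # each element becomes a one-row tibble
  list_rbind()            # stack the rows into one data frame
```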
Here, we see that we:
- Pulled out the `"downloads"` portion of the JSON (`pluck`)
- Converted each row to a data frame (`map(as_tibble)`)
- Combined the results rowwise (`list_rbind`)
The result is a very nice little data frame:
downloads_df
# A tibble: 9 × 2
package downloads
<chr> <chr>
1 ggplot2 92101
2 dplyr 74923
3 rlang 71041
4 tibble 68155
5 cli 67979
6 rmarkdown 65480
7 xml2 63635
8 magrittr 60958
9 lifecycle 60082
Exercises #03
Now it’s your turn! In your breakout rooms, try the following:
Make sure you can run all of the code above.
- Modify the above code to get the top 100 R packages.
  This is a minor change to the request only, but you will need to read the documentation to see where and how the request needs to be changed.
- Modify your query to get the daily downloads for the `ggplot2` package over the last month. This will require changes to how you process the response, so be sure to look at the raw JSON first.
  Hint: The `pluck` function can also take a number as input, indicating which list item (by position) to return.
- 'Functionize' your daily downloads query as a function which takes an arbitrary package name and gets its daily downloads.
  Hint: Use a `mutate` command to add the package name as a new column to the resulting data frame and to convert the `day` column to a `Date` object (`day = as.Date(day)`).
- Use your function to get daily downloads for the following R packages: `ggplot2`, `dplyr`, `httr2`, `tibble`, `purrr`, and `tidyr`. Combine your results into a single data frame, then plot the download trends for each package using `ggplot2`.
  Hint: The `map() |> list_rbind()` idiom is useful here as well.
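One possible sketch of the daily-downloads exercise (the endpoint layout follows the cranlogs examples above; the function name, the `pluck` positions, and the JSON field names are assumptions to verify against the raw JSON yourself):

```r
library(httr2)
library(purrr)
library(tibble)
library(dplyr)
library(ggplot2)

# Hypothetical helper: daily downloads for one package over the last month
daily_downloads <- function(pkg) {
  request("https://cranlogs.r-pkg.org") |>
    req_url_path_append("downloads", "daily", "last-month", pkg) |>
    req_perform() |>
    resp_body_json() |>
    pluck(1, "downloads") |>  # first (only) element, then its downloads list
    map(as_tibble) |>
    list_rbind() |>
    mutate(package   = pkg,
           downloads = as.numeric(downloads),
           day       = as.Date(day))
}

pkgs <- c("ggplot2", "dplyr", "httr2", "tibble", "purrr", "tidyr")
all_downloads <- map(pkgs, daily_downloads) |> list_rbind()

ggplot(all_downloads, aes(x = day, y = downloads, color = package)) +
  geom_line()
```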
Footnotes
Note, however, that standard ANOVA is basically just a very restricted linear model; cf. https://lindeloev.github.io/tests-as-linear/.↩︎
In this case, there is a `cranlogs` package for interacting with this API. This type of package is commonly called a "wrapper" because it shields the user from the details of the API and exposes a more idiomatic (and more useful) interface. In general, when you can find an R package that wraps an API, it is a good idea to use it. For certain very complex APIs, e.g., the API that powers Bloomberg Financial Information Services, use of the associated R package is almost mandatory because the underlying API is so complicated. For this in-class exercise, we will use the "raw" API as practice since you can't always assume a nice R package will exist.↩︎

We will only use the basic `req_perform` for now, but `httr2` provides options for parallel execution, delayed execution, etc.↩︎