STA 9750 Week 9 In-Class Activity: Data Import
library(tidyverse)
This week, we are going to practice accessing data from a nice API using httr2. Specifically, we are going to interact with the cranlogs server, which keeps records of the most popular R packages based on download frequency. Documentation for cranlogs can be found in its GitHub README, with a very small example here.1
The cranlogs documentation gives the following example of how the curl program can call the API from the command line:
curl https://cranlogs.r-pkg.org/downloads/total/last-week/ggplot2
[{"start":"2025-03-27","end":"2025-04-02","downloads":421451,"package":"ggplot2"}]
Even though this is not R code, we can emulate this action in R.
library(jsonlite)
fromJSON("https://cranlogs.r-pkg.org/downloads/total/last-week/ggplot2")
start end downloads package
1 2025-03-27 2025-04-02 421451 ggplot2
And if we want to get download information for other packages, we can simply modify the URL:
library(jsonlite)
fromJSON("https://cranlogs.r-pkg.org/downloads/total/last-week/dplyr")
start end downloads package
1 2025-03-27 2025-04-02 373318 dplyr
library(jsonlite)
fromJSON("https://cranlogs.r-pkg.org/downloads/total/last-week/readr")
start end downloads package
1 2025-03-27 2025-04-02 200738 readr
and so on. But this quickly becomes repetitive, and we would prefer a programmatic interface. This is where the httr2 package comes in.
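Before turning to httr2, note that we could already template these URLs by hand. Here is a minimal base-R sketch (the package name is just an illustrative example); it only builds the URL string and makes no network call:

```r
# Build a cranlogs URL programmatically by pasting the package name
# onto the fixed part of the URL (base R only; no network call here).
base_url <- "https://cranlogs.r-pkg.org/downloads/total/last-week/"
pkg      <- "tidyr"   # any CRAN package name
url      <- paste0(base_url, pkg)
print(url)
# We could then pass `url` to jsonlite::fromJSON() as above.
```

This works, but string-pasting URLs gets brittle as queries grow more complex, which is exactly the problem httr2 solves.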
httr2
httr2 takes a “three-stage” approach to handling HTTP requests:
- First, we build a request, specifying the URL to be queried, the mode of that query, and any relevant information (data, passwords, etc.)
- Then, we execute that request to get a response
- Finally, we handle that response, transforming its contents into R objects as appropriate
Let’s look at these one at a time.
Build a Request
We can build a request using the request function:
library(httr2)
request
function (base_url)
{
new_request(base_url)
}
<bytecode: 0x1155519e8>
<environment: namespace:httr2>
As seen here, we start a request by putting in a “base URL” - this is the unchanging part of the URL that won’t really depend on what query we are making.
For the cranlogs
API, we can take the base URL to be
https://cranlogs.r-pkg.org
so our base request is:
my_req <- request("https://cranlogs.r-pkg.org")
print(my_req)
<httr2_request>
GET https://cranlogs.r-pkg.org
Body: empty
We see here that this is a GET request by default, indicating we would like a response from the server, but we are not sending any data of our own along with the request.
We then modify the path of the request to point to the specific resource or endpoint we want to query. For example, if we want to get the “top” R packages for the last day, we can run the following:
my_req <- my_req |>
    req_url_path_append("top") |>
    req_url_path_append("last-day")
print(my_req)
<httr2_request>
GET https://cranlogs.r-pkg.org/top/last-day
Body: empty
Execute a Request to get a Response
Now that we have built our request, we pass it to req_perform2 to execute (perform) the request:
my_resp <- req_perform(my_req)
print(my_resp)
<httr2_response>
GET https://cranlogs.r-pkg.org/top/last-day
Status: 200 OK
Content-Type: application/json
Body: In memory (451 bytes)
The result of performing this request is a response object. We see several things in this response:
- We received back a “200 OK” response, indicating that our query worked perfectly
- We received back data in a json format
- Our results are currently in memory (as opposed to being saved to a file)
Process the Response for Use in R
Since we know our response is in JSON format, we can use the resp_body_json function to get the “body” (content) of the response and parse it as json:
downloads_raw <- resp_body_json(my_resp)
print(downloads_raw)
$start
[1] "2025-04-02T00:00:00.000Z"
$end
[1] "2025-04-02T00:00:00.000Z"
$downloads
$downloads[[1]]
$downloads[[1]]$package
[1] "ggplot2"
$downloads[[1]]$downloads
[1] "97172"
$downloads[[2]]
$downloads[[2]]$package
[1] "tibble"
$downloads[[2]]$downloads
[1] "80421"
$downloads[[3]]
$downloads[[3]]$package
[1] "rlang"
$downloads[[3]]$downloads
[1] "71966"
$downloads[[4]]
$downloads[[4]]$package
[1] "cli"
$downloads[[4]]$downloads
[1] "69508"
$downloads[[5]]
$downloads[[5]]$package
[1] "dplyr"
$downloads[[5]]$downloads
[1] "62663"
$downloads[[6]]
$downloads[[6]]$package
[1] "lifecycle"
$downloads[[6]]$downloads
[1] "58518"
$downloads[[7]]
$downloads[[7]]$package
[1] "tidyverse"
$downloads[[7]]$downloads
[1] "57524"
$downloads[[8]]
$downloads[[8]]$package
[1] "glue"
$downloads[[8]]$downloads
[1] "57178"
$downloads[[9]]
$downloads[[9]]$package
[1] "vctrs"
$downloads[[9]]$downloads
[1] "54492"
This gives us the type of data we were looking for!
Note that httr2
is designed for “piped” work, so we can write the entire process as
request("https://cranlogs.r-pkg.org") |>
req_url_path_append("top") |>
req_url_path_append("last-day") |>
req_perform() |>
resp_body_json()
$start
[1] "2025-04-02T00:00:00.000Z"
$end
[1] "2025-04-02T00:00:00.000Z"
$downloads
$downloads[[1]]
$downloads[[1]]$package
[1] "ggplot2"
$downloads[[1]]$downloads
[1] "97172"
$downloads[[2]]
$downloads[[2]]$package
[1] "tibble"
$downloads[[2]]$downloads
[1] "80421"
$downloads[[3]]
$downloads[[3]]$package
[1] "rlang"
$downloads[[3]]$downloads
[1] "71966"
$downloads[[4]]
$downloads[[4]]$package
[1] "cli"
$downloads[[4]]$downloads
[1] "69508"
$downloads[[5]]
$downloads[[5]]$package
[1] "dplyr"
$downloads[[5]]$downloads
[1] "62663"
$downloads[[6]]
$downloads[[6]]$package
[1] "lifecycle"
$downloads[[6]]$downloads
[1] "58518"
$downloads[[7]]
$downloads[[7]]$package
[1] "tidyverse"
$downloads[[7]]$downloads
[1] "57524"
$downloads[[8]]
$downloads[[8]]$package
[1] "glue"
$downloads[[8]]$downloads
[1] "57178"
$downloads[[9]]
$downloads[[9]]$package
[1] "vctrs"
$downloads[[9]]$downloads
[1] "54492"
This data is not super helpful for us, since it’s in a “list of lists” format. This is not uncommon with json responses, and it is usually at this point that we have a bit of work to do in order to make the data usable. Thankfully, API data is typically well-structured, so this doesn’t wind up being too hard. I personally find this type of complex R output a bit hard to parse, so I instead print it as a “string” (the ‘raw text’ of the unparsed JSON) and use the prettify() function from the jsonlite package to make it extra readable:
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
my_resp |>
    resp_body_string() |>
    prettify()
{
"start": "2025-04-02T00:00:00.000Z",
"end": "2025-04-02T00:00:00.000Z",
"downloads": [
{
"package": "ggplot2",
"downloads": "97172"
},
{
"package": "tibble",
"downloads": "80421"
},
{
"package": "rlang",
"downloads": "71966"
},
{
"package": "cli",
"downloads": "69508"
},
{
"package": "dplyr",
"downloads": "62663"
},
{
"package": "lifecycle",
"downloads": "58518"
},
{
"package": "tidyverse",
"downloads": "57524"
},
{
"package": "glue",
"downloads": "57178"
},
{
"package": "vctrs",
"downloads": "54492"
}
]
}
This is the same data as before, but much easier to read. At this point, we should pause and make an ‘attack plan’ for our analysis. I see several things here:
- I really only want the "downloads" part of the response.
- Each element inside downloads has the same flat structure, so each can be easily built into a one-row data frame.
- The column names are the same for each downloads element, so we will be able to put them into one big easy-to-use data frame.
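To see what this flattening amounts to, here is a small base-R sketch using a tiny hand-made list that mimics the first two downloads elements (the values are copied from the output above; the tidyverse version we will actually use follows):

```r
# A small hand-made list mimicking the "downloads" part of the response
downloads_sample <- list(
  list(package = "ggplot2", downloads = "97172"),
  list(package = "tibble",  downloads = "80421")
)

# Turn each element into a one-row data frame, then stack them rowwise
rows <- lapply(downloads_sample, as.data.frame)
downloads_base <- do.call(rbind, rows)
print(downloads_base)
```

The purrr version of this same pattern is shorter and pipes nicely, which is why we prefer it below.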
To do these steps, we will need to use functionality from the purrr
package, which we will discuss in more detail next week. For now, it suffices to run:
library(purrr)
library(tibble)
downloads_df <- downloads_raw |>
    pluck("downloads") |>
    map(as_tibble) |>
    list_rbind()
Here, we see we
- Pulled out the "downloads" portion of the JSON (pluck)
- Converted each row to a data frame (map(as_tibble))
- Combined the results rowwise (list_rbind)
The result is a very nice little data frame:
downloads_df
# A tibble: 9 × 2
package downloads
<chr> <chr>
1 ggplot2 97172
2 tibble 80421
3 rlang 71966
4 cli 69508
5 dplyr 62663
6 lifecycle 58518
7 tidyverse 57524
8 glue 57178
9 vctrs 54492
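One small wrinkle worth noting: the downloads column comes back as character (<chr>), not numeric, so it should be converted before doing any arithmetic. A minimal base-R sketch using a couple of rows from above (in tidyverse style, this would be mutate(downloads = as.integer(downloads))):

```r
# The API returned download counts as strings; convert before computing
downloads_df <- data.frame(
  package   = c("ggplot2", "tibble"),
  downloads = c("97172", "80421")   # note: character, not integer
)
downloads_df$downloads <- as.integer(downloads_df$downloads)
str(downloads_df)
```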
Your Turn!
Now it’s your turn! In your breakout rooms, try the following:
Make sure you can run all of the code above.
Modify the above code to get the top 100 R packages. This is a minor change to the request only, but you will need to read the documentation to see where and how the request needs to be changed.
request("https://cranlogs.r-pkg.org") |>
req_url_path_append("top") |>
req_url_path_append("last-day") |>
req_url_path_append(100) |>
req_perform() |>
resp_body_json() |>
pluck("downloads") |>
map(as_tibble) |>
list_rbind()
# A tibble: 100 × 2
package downloads
<chr> <chr>
1 ggplot2 97172
2 tibble 80421
3 rlang 71966
4 cli 69508
5 dplyr 62663
6 lifecycle 58518
7 tidyverse 57524
8 glue 57178
9 vctrs 54492
10 jsonlite 54381
# ℹ 90 more rows
Modify your query to get the daily downloads for the ggplot2 package over the last month. This will require changes to how you process the response, so be sure to look at the raw JSON first.
Hint: The pluck function can also take a number as input. This will say which list item (by position) to return.
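To see what pluck-by-position means, here is a tiny base-R illustration: pluck(x, 1) behaves like x[[1]], returning the first element of a list (the sample values are taken from the daily-downloads output below):

```r
# pluck(x, 1) is equivalent to x[[1]]: extract the first list element
x <- list(
  list(day = "2025-03-04", downloads = 68641),
  list(day = "2025-03-05", downloads = 69055)
)
first <- x[[1]]
print(first$day)        # "2025-03-04"
```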
request("https://cranlogs.r-pkg.org") |>
req_url_path_append("downloads") |>
req_url_path_append("daily") |>
req_url_path_append("last-month") |>
req_url_path_append("ggplot2") |>
req_perform() |>
resp_body_json() |>
pluck(1)
$downloads
$downloads[[1]]
$downloads[[1]]$day
[1] "2025-03-04"
$downloads[[1]]$downloads
[1] 68641
$downloads[[2]]
$downloads[[2]]$day
[1] "2025-03-05"
$downloads[[2]]$downloads
[1] 69055
$downloads[[3]]
$downloads[[3]]$day
[1] "2025-03-06"
$downloads[[3]]$downloads
[1] 71064
$downloads[[4]]
$downloads[[4]]$day
[1] "2025-03-07"
$downloads[[4]]$downloads
[1] 59153
$downloads[[5]]
$downloads[[5]]$day
[1] "2025-03-08"
$downloads[[5]]$downloads
[1] 37164
$downloads[[6]]
$downloads[[6]]$day
[1] "2025-03-09"
$downloads[[6]]$downloads
[1] 37480
$downloads[[7]]
$downloads[[7]]$day
[1] "2025-03-10"
$downloads[[7]]$downloads
[1] 68376
$downloads[[8]]
$downloads[[8]]$day
[1] "2025-03-11"
$downloads[[8]]$downloads
[1] 73772
$downloads[[9]]
$downloads[[9]]$day
[1] "2025-03-12"
$downloads[[9]]$downloads
[1] 72439
$downloads[[10]]
$downloads[[10]]$day
[1] "2025-03-13"
$downloads[[10]]$downloads
[1] 70576
$downloads[[11]]
$downloads[[11]]$day
[1] "2025-03-14"
$downloads[[11]]$downloads
[1] 61456
$downloads[[12]]
$downloads[[12]]$day
[1] "2025-03-15"
$downloads[[12]]$downloads
[1] 39683
$downloads[[13]]
$downloads[[13]]$day
[1] "2025-03-16"
$downloads[[13]]$downloads
[1] 40356
$downloads[[14]]
$downloads[[14]]$day
[1] "2025-03-17"
$downloads[[14]]$downloads
[1] 68526
$downloads[[15]]
$downloads[[15]]$day
[1] "2025-03-18"
$downloads[[15]]$downloads
[1] 72816
$downloads[[16]]
$downloads[[16]]$day
[1] "2025-03-19"
$downloads[[16]]$downloads
[1] 71516
$downloads[[17]]
$downloads[[17]]$day
[1] "2025-03-20"
$downloads[[17]]$downloads
[1] 71192
$downloads[[18]]
$downloads[[18]]$day
[1] "2025-03-21"
$downloads[[18]]$downloads
[1] 59108
$downloads[[19]]
$downloads[[19]]$day
[1] "2025-03-22"
$downloads[[19]]$downloads
[1] 37789
$downloads[[20]]
$downloads[[20]]$day
[1] "2025-03-23"
$downloads[[20]]$downloads
[1] 37987
$downloads[[21]]
$downloads[[21]]$day
[1] "2025-03-24"
$downloads[[21]]$downloads
[1] 66001
$downloads[[22]]
$downloads[[22]]$day
[1] "2025-03-25"
$downloads[[22]]$downloads
[1] 71551
$downloads[[23]]
$downloads[[23]]$day
[1] "2025-03-26"
$downloads[[23]]$downloads
[1] 71879
$downloads[[24]]
$downloads[[24]]$day
[1] "2025-03-27"
$downloads[[24]]$downloads
[1] 69295
$downloads[[25]]
$downloads[[25]]$day
[1] "2025-03-28"
$downloads[[25]]$downloads
[1] 57190
$downloads[[26]]
$downloads[[26]]$day
[1] "2025-03-29"
$downloads[[26]]$downloads
[1] 36177
$downloads[[27]]
$downloads[[27]]$day
[1] "2025-03-30"
$downloads[[27]]$downloads
[1] 34255
$downloads[[28]]
$downloads[[28]]$day
[1] "2025-03-31"
$downloads[[28]]$downloads
[1] 58583
$downloads[[29]]
$downloads[[29]]$day
[1] "2025-04-01"
$downloads[[29]]$downloads
[1] 68779
$downloads[[30]]
$downloads[[30]]$day
[1] "2025-04-02"
$downloads[[30]]$downloads
[1] 97172
$start
[1] "2025-03-04"
$end
[1] "2025-04-02"
$package
[1] "ggplot2"
We can then flatten the "downloads" portion exactly as before:
request("https://cranlogs.r-pkg.org") |>
    req_url_path_append("downloads") |>
    req_url_path_append("daily") |>
    req_url_path_append("last-month") |>
    req_url_path_append("ggplot2") |>
    req_perform() |>
    resp_body_json() |>
    pluck(1) |>
    pluck("downloads") |>
    map(as_tibble) |>
    list_rbind()
# A tibble: 30 × 2
   day        downloads
   <chr>          <int>
 1 2025-03-04     68641
 2 2025-03-05     69055
 3 2025-03-06     71064
 4 2025-03-07     59153
 5 2025-03-08     37164
 6 2025-03-09     37480
 7 2025-03-10     68376
 8 2025-03-11     73772
 9 2025-03-12     72439
10 2025-03-13     70576
# ℹ 20 more rows
‘Functionize’ your daily downloads query as a function which takes an arbitrary package name and gets its daily downloads.
Hint: Use a
mutate
command to add the package name as a new column to the resulting data frame and to convert theday
column to aDate
object (day=as.Date(day)
).
library(dplyr)
get_downloads <- function(pkg){
    request("https://cranlogs.r-pkg.org") |>
        req_url_path_append("downloads") |>
        req_url_path_append("daily") |>
        req_url_path_append("last-month") |>
        req_url_path_append(pkg) |>
        req_perform() |>
        resp_body_json() |>
        pluck(1) |>
        pluck("downloads") |>
        map(as_tibble) |>
        list_rbind() |>
        mutate(package = pkg,
               day = as.Date(day))
}
gg_downloads <- get_downloads("ggplot2")
print(gg_downloads)
# A tibble: 30 × 3
day downloads package
<date> <int> <chr>
1 2025-03-04 68641 ggplot2
2 2025-03-05 69055 ggplot2
3 2025-03-06 71064 ggplot2
4 2025-03-07 59153 ggplot2
5 2025-03-08 37164 ggplot2
6 2025-03-09 37480 ggplot2
7 2025-03-10 68376 ggplot2
8 2025-03-11 73772 ggplot2
9 2025-03-12 72439 ggplot2
10 2025-03-13 70576 ggplot2
# ℹ 20 more rows
- Use your function to get daily downloads for the following R packages: ggplot2, dplyr, httr2, tibble, purrr, tidyr, and combine your results into a single data frame. Then plot the download trends for each package using ggplot2.
Hint: You can use the map() |> list_rbind() idiom here as well.
library(ggplot2)
PACKAGES <- c("ggplot2", "dplyr", "httr2", "tibble", "purrr", "tidyr")
map(PACKAGES, get_downloads) |>
list_rbind() |>
ggplot(aes(x=day,
y=downloads,
color=package,
group=package)) +
geom_point() +
geom_line() +
scale_x_date() +
xlab("Date of Download") +
ylab("Number of Package Downloads") +
theme_bw() +
labs(caption="Data from cranlogs.r-pkg.org") +
theme(legend.position="bottom") +
scale_color_brewer(type="qual",
palette=2,
name="Package Name")
Footnotes
1. In this case, there is a cranlogs package for interacting with this API. This type of package is commonly called a “wrapper” because it shields the user from the details of the API and exposes a more idiomatic (and more useful) interface. In general, when you can find an R package that wraps an API, it is a good idea to use it. For certain very complex APIs, e.g., the API that powers Bloomberg Financial Information Services, use of the associated R package is almost mandatory because the underlying API is so complicated. For this in-class exercise, we will use the “raw” API as practice since you can’t always assume a nice R package will exist.↩︎
2. We will only use the basic req_perform for now, but httr2 provides options for parallel execution, delayed execution, etc.↩︎