Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 10 – Tuesday 2026-04-21
Last Updated: 2026-01-14

STA 9750 Week 10

Today: Lecture #08: Advanced ggplot2 – maps, interactivity, animation

  • Proposals ✅
  • Mid-Semester Check-In ⬅️
  • Final Presentation
  • Final Group Report
  • Final Individual Report

Today

  • Course Administration
  • Mid-Semester Check-In Presentations
  • Optional Material: Functional Programming with purrr
  • Wrap-Up
    • Life Tip of the Day

Administrative Business

STA 9750 Mini-Projects

  • Mini-Project #01 ✅ (2026-03-13 at 11:59pm ET)
  • Mini-Project #02 🔄 (2026-04-03 at 11:59pm ET)
    • Submission ✅
    • Peer Feedback ⬅️
  • Mini-Project #03 (2026-04-24 at 11:59pm ET)
    • Submission ⬅️
    • Peer Feedback
  • Mini-Project #04 (2026-05-15 at 11:59pm ET)

Mini-Project #03

MP#03 - TBD

Due 2026-04-24 at 11:59pm ET

Topics covered:

  • Data Import
    • One static file
    • One API call
  • Spatial Data
    • Very basic spatial joins
    • Spatial visualizations (maps!)

Future Mini-Projects

MP#04 - TBD

Due 2026-05-15 at 11:59pm ET

TBD

Topics covered:

  • Data Import
    • HTTP Request Construction
    • HTML Scraping (Tabular)
  • \(t\)-tests
  • Putting Everything Together

Grading in Progress

We will owe you:

  • Mid-Term Check In Grades

Re-Grade Requests

Per course policy:

  • If you feel peer grades are wrong, please request an instructor re-grade
  • New Brightspace ‘quiz’ to request
    • Not an actual quiz…

If you want to discuss your MPs or project in more detail, come to office hours!

Course Support

  • Synchronous
    • MW Office Hours 2x / week: Tuesdays + Thursdays 5-5:45pm
      • Rest of Semester except Thanksgiving (Nov 27th)
  • Asynchronous: Piazza (\(<20\) minute average response time)

Course Project

Course Project should be your main focus for rest of course

  • But you still need to do mini-projects and pre-assignments(!)

Course Project

Final submissions:

Nothing new per se - just more details about grading (4 elements => 10)

Additional notes added on integration of SQs and strategies for estimating causal effects. Ask questions! I’m happy to fill these out further.

Check-In Presentations

Today, I’m looking for:

  • Finalized overarching question
  • Locking in on specific questions
  • Evaluation of relevant data sources
    • Anticipated data or statistical challenges
  • Engagement with existing literature
  • 6 minutes

Mainly, I want to see that you will be able to succeed

After Proposals

Optional discussion of functional programming in R

  • List manipulation (useful for JSON handling)
  • Simple iteration over large data sets
  • Error handling and robustness
  • Parallelization

On to the Show!

Presentation Order

TBD

Wrap-Up

Orientation

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R ⬅️
    • Files and APIs ✅
    • Web Scraping
    • Cleaning and Processing Text
  • Statistical Modeling in R

Upcoming Work

Upcoming work from course calendar

Musical Treat


Optional Material: Functional Programming with purrr

Functional Programming

Functional programming - purity

  • Minimizing book-keeping and side-effects

Can go deep into FP world - we’re just dipping a toe in

  • Iteration: map and friends, reduce, list_*
  • Adverbs: safely, partial, insistently, in_parallel
  • List access: pluck

Functional Programming

Compare:

for(i in seq_along(letters)){
    lower_letter <- letters[i]
    upper_letter <- LETTERS[i]
    
    cat(upper_letter, " is the upper case of ", lower_letter, "\n")
}

and

walk2(LETTERS, letters, ~ cat(.x, " is the upper case of ", .y, "\n"))

No indexing ([]) or explicit loop management

FP in R

In R, FP is principally associated with lists

Recall: a list is a generic container in R (can hold anything, even other lists)

  • Natural tool for parsing JSON (see last week)
  • Many things in R are lists under the hood (including data.frames)

Iteration - map

Often, we want to do the same thing to several different items:

  • E.g., on MP#04, download BLS revisions for each year

map and friends let us avoid loops

  • Handles book-keeping for us
  • List in and list out (by default)

Iteration - map

map(INPUT, FUNCTION)

applies FUNCTION to each element of INPUT and collects the output in a new list

[[1]]
[1] "JANUARY"

[[2]]
[1] "FEBRUARY"

[[3]]
[1] "MARCH"

[[4]]
[1] "APRIL"

[[5]]
[1] "MAY"

[[6]]
[1] "JUNE"

[[7]]
[1] "JULY"

[[8]]
[1] "AUGUST"

[[9]]
[1] "SEPTEMBER"

[[10]]
[1] "OCTOBER"

[[11]]
[1] "NOVEMBER"

[[12]]
[1] "DECEMBER"
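The slide does not show the call that produced this list. A minimal sketch that would give the same result, assuming purrr is attached and that the built-in month.name vector was mapped through toupper:

```r
library(purrr)

# Apply toupper() to each month name; map() always returns a list
upper_months <- map(month.name, toupper)

# Inspect the first element of the resulting list
upper_months[[1]]
```

Because map always returns a list, each result sits in its own [[i]] slot, even though every element here is a single string.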

Typed Iteration

Sometimes, we know the type of values to be returned:

map_* lets us put those into a vector:

 [1] "JANUARY"   "FEBRUARY"  "MARCH"     "APRIL"     "MAY"       "JUNE"     
 [7] "JULY"      "AUGUST"    "SEPTEMBER" "OCTOBER"   "NOVEMBER"  "DECEMBER" 
 [1] 7 8 5 5 3 4 4 6 9 7 8 8
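The two vectors above are consistent with the typed variants map_chr and map_int. This is a sketch under the assumption that the mapped functions were toupper and nchar; the original calls are not shown on the slide.

```r
library(purrr)

# map_chr() collects the results into a character vector instead of a list
upper_months <- map_chr(month.name, toupper)

# map_int() collects the results into an integer vector;
# nchar() returns the number of characters in each name
name_lengths <- map_int(month.name, nchar)

upper_months
name_lengths
```

The typed variants also act as a check: they raise an error if any result is not a single value of the requested type.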

Anonymous Functions

Functions in R can be defined as:

my_function <- function(arg1, arg2="default"){
    function_code(goes * here)
}

But this is clunky

Anonymous functions (“lambdas”) let us define ‘little functions’ more compactly

function (x) 
x + 1

Supposedly, the \ in \(x) looks like the Greek letter \(\lambda\)

Anonymous Functions

Anonymous functions play well with map:

 [1] 3 3 1 1 1 2 1 2 3 2 3 3
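A sketch of the anonymous-function shorthand (available in R 4.1 and later) and one way to reproduce the vowel counts above. The counts match only if lower-case vowels are counted (so the "A" in April is ignored); the base-R gsub approach here is an assumption, not necessarily what the slide used.

```r
library(purrr)

# \(x) is shorthand for function(x)
add_one <- \(x) x + 1
add_one(2)

# Count lower-case vowels in each month name:
# delete every character that is not a lower-case vowel, then count what is left
n_vowels <- map_int(month.name, \(m) nchar(gsub("[^aeiou]", "", m)))
n_vowels
```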

map returning Data Frames

A common idiom is to return a data frame inside map

[[1]]
    month   upper n_vowels
1 January JANUARY        3

[[2]]
     month    upper n_vowels
1 February FEBRUARY        3

[[3]]
  month upper n_vowels
1 March MARCH        1

[[4]]
  month upper n_vowels
1 April APRIL        1

[[5]]
  month upper n_vowels
1   May   MAY        1

[[6]]
  month upper n_vowels
1  June  JUNE        2

[[7]]
  month upper n_vowels
1  July  JULY        1

[[8]]
   month  upper n_vowels
1 August AUGUST        2

[[9]]
      month     upper n_vowels
1 September SEPTEMBER        3

[[10]]
    month   upper n_vowels
1 October OCTOBER        2

[[11]]
     month    upper n_vowels
1 November NOVEMBER        3

[[12]]
     month    upper n_vowels
1 December DECEMBER        3

map returning Data Frames

Combine this list of little DFs rowwise with list_rbind()

       month     upper n_vowels
1    January   JANUARY        3
2   February  FEBRUARY        3
3      March     MARCH        1
4      April     APRIL        1
5        May       MAY        1
6       June      JUNE        2
7       July      JULY        1
8     August    AUGUST        2
9  September SEPTEMBER        3
10   October   OCTOBER        2
11  November  NOVEMBER        3
12  December  DECEMBER        3
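A minimal sketch of the map-then-list_rbind idiom that would yield a table like the one above. The helper name describe_month is hypothetical, and list_rbind requires purrr 1.0.0 or later.

```r
library(purrr)

# Hypothetical helper: build a one-row data frame describing a single month
describe_month <- function(m) {
  data.frame(
    month    = m,
    upper    = toupper(m),
    n_vowels = nchar(gsub("[^aeiou]", "", m))  # lower-case vowels only
  )
}

# map() gives a list of 12 one-row data frames;
# list_rbind() stacks them into a single 12-row data frame
month_table <- map(month.name, describe_month) |>
  list_rbind()

month_table
```

Returning one small data frame per input keeps each step easy to test on a single element before mapping over all of them.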

Mapping Together

Often, we will want to map multiple things together:

  • month.name
  • month.abb
 [1] "Jan is short for January"   "Feb is short for February" 
 [3] "Mar is short for March"     "Apr is short for April"    
 [5] "May is short for May"       "Jun is short for June"     
 [7] "Jul is short for July"      "Aug is short for August"   
 [9] "Sep is short for September" "Oct is short for October"  
[11] "Nov is short for November"  "Dec is short for December" 

Use pmap to go to three or more
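A sketch of mapping over two inputs with map2_chr, which would produce the sentences above, followed by a pmap_chr call over three inputs. The argument names abb, full, and i are illustrative choices.

```r
library(purrr)

# map2_*() walks two vectors in parallel, passing one element of each to the function
short_for <- map2_chr(
  month.abb, month.name,
  \(abb, full) paste(abb, "is short for", full)
)
short_for

# pmap_*() takes a list of inputs and handles three or more;
# with an unnamed list, elements are matched to arguments by position
numbered <- pmap_chr(
  list(month.abb, month.name, seq_along(month.name)),
  \(abb, full, i) paste0(i, ": ", abb, " = ", full)
)
numbered
```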

Mapping and Counting

Use imap to get the index of the element as well:

 [1] "January is month number 1"   "February is month number 2" 
 [3] "March is month number 3"     "April is month number 4"    
 [5] "May is month number 5"       "June is month number 6"     
 [7] "July is month number 7"      "August is month number 8"   
 [9] "September is month number 9" "October is month number 10" 
[11] "November is month number 11" "December is month number 12"
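A sketch of imap_chr that would give the output above. For an unnamed vector such as month.name, the second argument is the position of the element; for a named vector it would be the name instead.

```r
library(purrr)

# First argument: the element; second argument: its index (or name, if named)
month_positions <- imap_chr(
  month.name,
  \(m, i) paste(m, "is month number", i)
)
month_positions
```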

map vs Vectorization

map is most useful when the underlying function can’t be vectorized: e.g., file processing or downloading

# A tibble: 5 × 4
  type    setup                                        punchline              id
  <chr>   <chr>                                        <chr>               <int>
1 general What did the fish say when it hit the wall?  Dam.                    1
2 general How do you make a tissue dance?              You put a little b…     2
3 general What's Forrest Gump's password?              1Forrest1               3
4 general What do you call a belt made out of watches? A waist of time.        4
5 general Why can't bicycles stand on their own?       They are two tired      5
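The call behind the table above is not shown, so the following is a stand-in for the same idea using local files rather than a web API. It is a sketch with made-up data: read.csv reads one file at a time, so map is used to apply it to each path. The file names and the year/value columns are hypothetical.

```r
library(purrr)

# Set up: write three small CSV files to a temporary directory (made-up data)
tmp_dir <- tempfile("map_demo_")
dir.create(tmp_dir)

paths <- map_chr(2021:2023, \(y) {
  path <- file.path(tmp_dir, paste0("data_", y, ".csv"))
  write.csv(data.frame(year = y, value = y - 2000), path, row.names = FALSE)
  path  # return the path so we can read the file back later
})

# read.csv() is not vectorized over files, so map() over the paths
# and stack the results into one data frame
all_years <- map(paths, read.csv) |>
  list_rbind()

all_years
```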

map pipelines

Often we will want to map several times as we perform steps of an analysis.

Cleaner than one big function:

# A tibble: 5 × 4
  type    setup                                        punchline              id
  <chr>   <chr>                                        <chr>               <int>
1 general What did the fish say when it hit the wall?  Dam.                    1
2 general How do you make a tissue dance?              You put a little b…     2
3 general What's Forrest Gump's password?              1Forrest1               3
4 general What do you call a belt made out of watches? A waist of time.        4
5 general Why can't bicycles stand on their own?       They are two tired      5
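A sketch of a multi-step map pipeline on made-up input, where each map call performs one small step. The records and field names are hypothetical; the point is that each stage can be run and inspected on its own.

```r
library(purrr)

# Made-up raw records with inconsistent whitespace
raw_records <- list("  alice,34 ", "bob,27", " carol,41")

people <- raw_records |>
  map(trimws) |>                                    # step 1: strip whitespace
  map(\(s) strsplit(s, ",")[[1]]) |>                # step 2: split into fields
  map(\(f) data.frame(name = f[1],
                      age  = as.integer(f[2]))) |>  # step 3: one-row data frame
  list_rbind()                                      # step 4: combine rows

people
```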

Accessing nested elements

Sometimes, when we have a complex list, we want to pull out certain elements:

  • Pass an index or name to map:
[[1]]
[1] "ggplot2"

[[2]]
[1] "lubridate"

[[3]]
[1] "stringr"

[[4]]
[1] "dplyr"

[[5]]
[1] "readr"

[[6]]
[1] "magrittr"

[[7]]
[1] "tidyr"

[[8]]
[1] "nycflights13"

[[9]]
[1] "rvest"

[[10]]
[1] "purrr"

[[11]]
[1] "haven"

[[12]]
[1] "readxl"

[[13]]
[1] "reprex"

[[14]]
[1] "tibble"

[[15]]
[1] "multidplyr"

[[16]]
[1] "dtplyr"

[[17]]
[1] "hms"

[[18]]
[1] "modelr"

[[19]]
[1] "forcats"

[[20]]
[1] "tidyverse"

[[21]]
[1] "tidytemplate"

[[22]]
[1] "blob"

[[23]]
[1] "ggplot2-docs"

[[24]]
[1] "glue"

[[25]]
[1] "style"

[[26]]
[1] "dbplyr"

[[27]]
[1] "googledrive"

[[28]]
[1] "googlesheets4"

[[29]]
[1] "tidyverse.org"

[[30]]
[1] "datascience-box"
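A sketch of extracting elements by name or position with map. The repos list below is a small hypothetical stand-in for parsed JSON; the names listed above appear to come from a much larger real response that the slide does not show.

```r
library(purrr)

# Hypothetical stand-in for a parsed JSON response: a list of records
repos <- list(
  list(name = "ggplot2", id = 1L, owner = list(login = "tidyverse")),
  list(name = "dplyr",   id = 2L, owner = list(login = "tidyverse")),
  list(name = "purrr",   id = 3L, owner = list(login = "tidyverse"))
)

# A string extracts the element with that name from each record
names_list <- map(repos, "name")

# Typed variants work the same way
names_chr <- map_chr(repos, "name")

# A number extracts by position
first_elements <- map_chr(repos, 1)

# A list of names/positions drills into nested records
owners <- map_chr(repos, list("owner", "login"))
```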

Accessing List Elements

Use pluck to access elements of a list in the same way:

From last week,

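library(httr2)   # request(), req_*(), resp_body_json()
library(purrr)   # pluck(), map(), list_rbind()
library(tibble)  # as_tibble()

request("https://cranlogs.r-pkg.org") |>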
    req_url_path_append("top") |>
    req_url_path_append("last-day") |>
    req_url_path_append(100) |>
    req_perform() |>
    resp_body_json() |>
    pluck("downloads") |>
    map(as_tibble) |>
    list_rbind()

map for Complex Analysis

Anything can be a column of a data.frame, even another data.frame

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
  <fct>     <list>            
1 Adelie    <tibble [152 × 7]>
2 Gentoo    <tibble [124 × 7]>
3 Chinstrap <tibble [68 × 7]> 

data is a set of 3 different data frames (one per species)

map for Complex Analysis

Use map to fit the same model to each data frame separately:

# A tibble: 3 × 3
# Groups:   species [3]
  species   data               model 
  <fct>     <list>             <list>
1 Adelie    <tibble [152 × 7]> <lm>  
2 Gentoo    <tibble [124 × 7]> <lm>  
3 Chinstrap <tibble [68 × 7]>  <lm>  

map for Complex Analyses

Continue using map to analyze each species-model separately

# A tibble: 3 × 4
# Groups:   species [3]
  species   coefficients slope  r_sq
  <fct>     <list>       <dbl> <dbl>
1 Adelie    <dbl [2]>     32.8 0.219
2 Gentoo    <dbl [2]>     54.6 0.494
3 Chinstrap <dbl [2]>     34.6 0.412

So flipper_len explains the most body_mass variation in Gentoo penguins.
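The three tables above are consistent with a nest-then-map workflow like the sketch below. It assumes R 4.5 or later, where penguins ships in the datasets package with the body_mass and flipper_len column names used in these slides, plus dplyr and tidyr; the exact calls on the original slides are not shown.

```r
library(dplyr)
library(tidyr)
library(purrr)

penguin_models <- penguins |>   # datasets::penguins (R >= 4.5)
  group_by(species) |>
  nest() |>                     # one row per species; the rest goes in list column `data`
  mutate(
    # Fit the same regression to each species' data frame
    model        = map(data, \(df) lm(body_mass ~ flipper_len, data = df)),
    # Pull pieces out of each fitted model
    coefficients = map(model, coef),
    slope        = map_dbl(coefficients, "flipper_len"),
    r_sq         = map_dbl(model, \(m) summary(m)$r.squared)
  )

penguin_models |>
  select(species, coefficients, slope, r_sq)
```

Keeping the data, the model, and the summaries in the same row makes it hard to mix up which result belongs to which species.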

Modifying Map Behavior

When passing functions to map, we might want to handle errors

  • Use an adverb to modify a function

Adverbs: safely

If you have a function that sometimes throws errors, wrap it in safely

 [1] 7 8 5 5 3 4 4 6 9 7 8 8

Adverbs: safely

 [1]  7  8  5  5  3  4 NA  6  9 NA NA  8

Adverb: possibly

The safely |> map("result") combo is common, so purrr provides the helper possibly:

 [1]  7  8  5  5 NA  4  4  6  9  7  8  8
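A sketch of safely and possibly side by side. The outputs above look like they came from a function that fails at random; to keep this example reproducible, the hypothetical picky_nchar below fails deterministically on months starting with "J".

```r
library(purrr)

# Hypothetical function that errors on some inputs
picky_nchar <- function(x) {
  if (startsWith(x, "J")) stop("Months starting with J are not allowed")
  nchar(x)
}

# safely() returns list(result = ..., error = ...) instead of throwing;
# `otherwise` is the result to use when an error occurs
safe_nchar   <- safely(picky_nchar, otherwise = NA_integer_)
results      <- map(month.name, safe_nchar)
lengths_safe <- map_int(results, "result")

# possibly() skips the list wrapper and returns the fallback value directly
possible_nchar   <- possibly(picky_nchar, otherwise = NA_integer_)
lengths_possible <- map_int(month.name, possible_nchar)

lengths_possible
```

Use safely when you want to inspect the error objects afterwards, and possibly when a fallback value is enough.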

Adverbs: insistently

For functions that fail sporadically (e.g., web access), try insistently:

read_html_insist <- insistently(read_html)

read_html_insist("https://a.sketchy.site")

Will try 3 times by default

(cf. sites that don’t work reliably, like in MP#02)

Adverbs: slowly

Some websites will get mad if you query too often: slowly makes sure the wrapped function isn’t called too frequently

read_html_slow <- slowly(read_html)

read_html_slow("https://a.rate-limited.site")

Default is once per second.
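A self-contained sketch of both adverbs that runs without network access. The flaky function is hypothetical: it fails on its first two calls and then succeeds. The rate argument, built with purrr's rate_delay helper, sets the pause and (for insistently) the maximum number of attempts.

```r
library(purrr)

# Hypothetical flaky function: fails twice, then succeeds
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  "success"
}

# Retry up to 5 times, waiting 0.1 seconds between attempts
insistent_flaky <- insistently(
  flaky,
  rate  = rate_delay(pause = 0.1, max_times = 5),
  quiet = TRUE
)
result <- insistent_flaky()

# Wait at least 0.2 seconds between successive calls
slow_sqrt <- slowly(sqrt, rate = rate_delay(pause = 0.2), quiet = TRUE)
roots     <- map_dbl(c(1, 4, 9), slow_sqrt)
```

Adverbs compose, so a function can be wrapped in both (for example, slowly around insistently) when a site is both rate-limited and unreliable.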

Adverbs: in_parallel

For parallel processing, use the in_parallel adverb:

[1] 2 3 4 5
[1] 2 3 4 5

Argument to in_parallel needs to be an anonymous function

Adverbs: in_parallel

Compare:

   user  system elapsed 
  0.001   0.000   4.022 
   user  system elapsed 
  0.001   0.001   1.006 
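The timings above are consistent with four one-second tasks run first sequentially and then on four workers. A sketch under these assumptions: purrr 1.1.0 or later (where in_parallel was introduced), the mirai package installed to supply the background workers, and a one-second Sys.sleep standing in for real work.

```r
library(purrr)
library(mirai)

# Sequential baseline: four one-second tasks, one after another
sequential_time <- system.time(
  sequential <- map_dbl(1:4, \(x) { Sys.sleep(1); x + 1 })
)

# Start 4 background R processes for parallel work
daemons(4)

# Same task, with the anonymous function wrapped in in_parallel()
parallel_time <- system.time(
  parallel <- map_dbl(1:4, in_parallel(\(x) { Sys.sleep(1); x + 1 }))
)

# Shut the background processes down again
daemons(0)

sequential_time
parallel_time
```

The function passed to in_parallel runs in a separate R process, so it should be self-contained and not rely on objects or packages that exist only in the main session.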

Adverbs: in_parallel

Parallelization is not magic:

  • Most useful for IO bound tasks (reading files, downloads)
  • Too much parallelization slows things down (“thrashing”)
  • Need to be careful about error handling
    • Use safely so if one step errors, you don’t lose everything

List Access with pluck

Given a list, the pluck function will pull out elements:

  • list_obj |> pluck(n) will pull out the \(n^{\text{th}}\) element
  • list_obj |> pluck("name") will pull out the element named "name"
  • list_obj |> pluck(func) will apply the “accessor” func

List Access with pluck

lm: Linear regression (and ANOVA)

my_regression <- lm(body_mass ~ flipper_len, data=penguins)

summary(my_regression)

Call:
lm(formula = body_mass ~ flipper_len, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5780.831    305.815  -18.90   <2e-16 ***
flipper_len    49.686      1.518   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

List Access with pluck

Can use pluck + accessors to get the coefficients

my_regression |> pluck(1)
my_regression |> pluck("coefficients")
my_regression |> pluck(coef)
(Intercept) flipper_len 
-5780.83136    49.68557 

Final form is most robust

List Access with pluck

pluck has some nice usability features:

  • Can supply “compound” selections: pluck(1) |> pluck("a") is the same as pluck(1, "a")
  • Can change default value from NULL: pluck("a", .default=NA)
    • Use chuck if you want to error instead of default
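A sketch of these pluck features on a small hypothetical nested list (the element names a and b are made up for illustration):

```r
library(purrr)

# Hypothetical nested list
nested <- list(
  list(a = "first a", b = 1:3),
  list(a = "second a")
)

# Compound selection: the same as nested |> pluck(1) |> pluck("a")
first_a <- pluck(nested, 1, "a")

# Missing elements give NULL by default ...
missing_b <- pluck(nested, 2, "b")

# ... or whatever .default says
missing_b_na <- pluck(nested, 2, "b", .default = NA)

# Accessor functions can be mixed in with names and positions
b_length <- pluck(nested, 1, "b", length)

# chuck() raises an error instead of returning a default
chuck_failed <- tryCatch(
  { chuck(nested, 2, "b"); FALSE },
  error = function(e) TRUE
)
```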

Combining Results

Given a list, we can ‘combine’ elements with the reduce function:

map(month.name, 
    nchar) |>
    reduce(`+`) # reduce(`+`) => sum
[1] 74
map(month.name, 
    nchar) |>
    reduce(max)
[1] 9

Useful for combining many data sets in a ‘mega-join’

Use accumulate to keep intermediate results (a la cumsum)
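A sketch of accumulate and of a reduce-based 'mega-join'. The three small tables are made up, and base R's merge (an inner join by default) is used here as an assumption to keep the example dependency-free; dplyr::inner_join would slot into the same place.

```r
library(purrr)

# accumulate() keeps every intermediate result, like cumsum()
name_lengths  <- map_int(month.name, nchar)
running_total <- accumulate(name_lengths, `+`)
running_total

# Made-up tables sharing an `id` column
tables <- list(
  data.frame(id = 1:3, x = c("a", "b", "c")),
  data.frame(id = 2:4, y = c(TRUE, FALSE, TRUE)),
  data.frame(id = 1:3, z = c(10, 20, 30))
)

# reduce() joins the first two tables, then joins the result to the third, and so on
joined <- reduce(tables, \(left, right) merge(left, right, by = "id"))
joined
```

Only ids present in every table survive the inner joins, so joined keeps just the rows for ids 2 and 3.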

Functional Tools

Not everything fits within purrr tooling

But a lot does!

Use it when helpful:

  • Applying the same analysis many times (map)
  • Recursively combining results (reduce, e.g. an \(n\)-way inner join)
  • Error handling
  • List-structured data (HTML -> more next week!)