Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 10 – Tuesday 2026-04-21
Last Updated: 2026-01-14

STA 9750 Week 10

Today: Lecture #08: Advanced ggplot2 – maps, interactivity, animation

Proposals ✅
Mid-Semester Check-In ⬅️
Final Presentation
Final Group Report
Final Individual Report

Today

Course Administration
Mid-Semester Check-In Presentations
Optional Material: Functional Programming with purrr
Wrap-Up
- Life Tip of the Day

Administrative Business

STA 9750 Mini-Projects

Mini-Project #01 ✅ (2026-03-13 at 11:59pm ET)
Mini-Project #02 🔄 (2026-04-03 at 11:59pm ET)
- Submission ✅
- Peer Feedback ⬅️
Mini-Project #03 (2026-04-24 at 11:59pm ET)
- Submission ⬅️
- Peer Feedback
Mini-Project #04 (2026-05-15 at 11:59pm ET)

Mini-Project #03

MP#03 - TBD

Due 2026-04-24 at 11:59pm ET

Topics covered:

Data Import
- One static file
- One API call
Spatial Data
- Very basic spatial joins
- Spatial visualizations (maps!)

Future Mini-Projects

MP#04 - TBD

Due 2026-05-15 at 11:59pm ET

TBD

Topics covered:

Data Import
- HTTP Request Construction
- HTML Scraping (Tabular)
\(t\)-tests
Putting Everything Together

Grading in Progress

We will owe you:

Mid-Term Check In Grades

Re-Grade Requests

Per course policy:

If you feel peer grades are wrong, please request an instructor re-grade
New Brightspace ‘quiz’ to request
- Not an actual quiz…

If you want to discuss your MPs or project in more detail, come to office hours!

Course Support

Synchronous
- MW Office Hours 2x / week: Tuesdays + Thursdays 5-5:45pm
  - Rest of Semester except Thanksgiving (Nov 27th)
Asynchronous: Piazza (\(<20\) minute average response time)

Course Project

Course Project should be your main focus for rest of course

But you still need to do mini-projects and pre-assignments(!)

Course Project

Final submissions:

Detailed rubrics for group report + individual report
- Final Group Report
- Final Individual Report

Nothing new per se - just more details about grading (4 elements => 10)

Additional notes added on integration of SQs and strategies for estimating causal effects. Ask questions! I’m happy to fill these out further.

Check-In Presentations

Today, I’m looking for:

Finalized overarching question
Locking in on specific questions
Evaluation of relevant data sources
- Anticipated data or statistical challenges
Engagement with existing literature
6 minutes

Mainly, I want to see that you will be able to succeed

After Proposals

Optional discussion of functional programming in R

List manipulation (useful for JSON handling)
Simple iteration over large data sets
Error handling and robustness
Parallelization

On to the Show!

Presentation Order

TBD

Wrap-Up

Orientation

Communicating Results (quarto) ✅
R Basics ✅
Data Manipulation in R ✅
Data Visualization in R ✅
Getting Data into R ⬅️
- Files and APIs ✅
- Web Scraping
- Cleaning and Processing Text
Statistical Modeling in R

Upcoming Work

Upcoming work from course calendar

Mini-Project #02 Peer Feedback due on ~~2026-04-12~~
Pre-Assignment #011 due 2026-04-21 at 6:00pm ET
Mini-Project #03 due on 2026-04-24 at 11:59pm ET

Musical Treat

Optional Material: Functional Programming with `purrr`

Functional Programming

Functional programming - purity

Minimizing book-keeping and side-effects

Can go deep into FP world - we’re just dipping a toe in

Iteration: map and friends, reduce, list_*
Adverbs: safely, partial, insistently, in_parallel
List access: pluck

Functional Programming

Compare:

for(i in seq_along(letters)){
    lower_letter <- letters[i]
    upper_letter <- LETTERS[i]
    
    cat(upper_letter, " is the upper case of ", lower_letter, "\n")
}

and

walk2(LETTERS, letters, ~ cat(.x, " is the upper case of ", .y, "\n"))

No indexing ([]) or explicit loop management

FP in R

In R, FP is principally associated with lists

Recall: a list is a generic container in R (can hold anything, even other lists)

Natural tool for parsing JSON (see last week)
Many things in R are lists under the hood (including data.frames)

Iteration - map

Often, we want to do the same thing to several different items:

E.g., on MP#04, download BLS revisions for each year

map and friends let us avoid loops

Handles book-keeping for us
List in and list out (by default)

Iteration - map

map(INPUT, FUNCTION)

applies FUNCTION to each element of INPUT and collects the output in a new list
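The code for this slide is not shown, so the following is only a minimal sketch of a call that would give the list below. It assumes the input is the built-in month.name vector and the function is toupper.

```r
library(purrr)

# month.name is a built-in character vector of the 12 English month names
upper_months <- map(month.name, toupper)

# The result is a list with one element per input element
upper_months[[1]]
```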

[[1]]
[1] "JANUARY"

[[2]]
[1] "FEBRUARY"

[[3]]
[1] "MARCH"

[[4]]
[1] "APRIL"

[[5]]
[1] "MAY"

[[6]]
[1] "JUNE"

[[7]]
[1] "JULY"

[[8]]
[1] "AUGUST"

[[9]]
[1] "SEPTEMBER"

[[10]]
[1] "OCTOBER"

[[11]]
[1] "NOVEMBER"

[[12]]
[1] "DECEMBER"

Typed Iteration

Sometimes, we know the type of values to be returned:

map_* lets us put those into a vector:
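As a sketch (the original calls are not shown), map_chr and map_int applied to month.name would give vectors like the two below. The choice of toupper and nchar is an assumption that matches the printed values.

```r
library(purrr)

# map_chr() returns a character vector instead of a list
upper_months <- map_chr(month.name, toupper)

# map_int() returns an integer vector; nchar() counts characters
name_lengths <- map_int(month.name, nchar)
```

If the function returns the wrong type (e.g., a string inside map_int), the typed variants raise an error rather than silently converting.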

 [1] "JANUARY"   "FEBRUARY"  "MARCH"     "APRIL"     "MAY"       "JUNE"     
 [7] "JULY"      "AUGUST"    "SEPTEMBER" "OCTOBER"   "NOVEMBER"  "DECEMBER"

 [1] 7 8 5 5 3 4 4 6 9 7 8 8

Anonymous Functions

Functions in R can be defined as:

my_function <- function(arg1, arg2="default"){
    function_code(goes * here)
}

But this is clunky

Anonymous functions (“lambdas”) let us define ‘little functions’ more compactly

function (x) 
x + 1

Supposedly, the `\` in `\(x)` looks like the Greek \(\lambda\)
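A small side-by-side sketch of the two forms; the function names here are illustrative only. The backslash shorthand needs R 4.1 or later.

```r
# Named function, written the long way
add_one_long <- function(x) {
    x + 1
}

# Equivalent anonymous-function shorthand
add_one_short <- \(x) x + 1

add_one_long(2)
add_one_short(2)
```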

Anonymous Functions

Anonymous functions play well with map:
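The code behind the output below is not shown. One sketch that reproduces those counts, assuming stringr's str_count and a case-sensitive lower-case vowel pattern (so the "A" in "April" is not counted):

```r
library(purrr)
library(stringr)

# Count lower-case vowels in each month name with an inline function
n_vowels <- map_int(month.name, \(m) str_count(m, "[aeiou]"))
```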

 [1] 3 3 1 1 1 2 1 2 3 2 3 3

map returning Data Frames

A common idiom is to return a data frame inside map
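A minimal sketch of the idiom, assuming the same month example and a vowel count via stringr's str_count (both are assumptions; the original code is not shown):

```r
library(purrr)
library(stringr)

# Build a one-row data frame for each month
month_info <- map(month.name, \(m) {
    data.frame(month    = m,
               upper    = toupper(m),
               n_vowels = str_count(m, "[aeiou]"))
})

month_info[[1]]
```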

[[1]]
    month   upper n_vowels
1 January JANUARY        3

[[2]]
     month    upper n_vowels
1 February FEBRUARY        3

[[3]]
  month upper n_vowels
1 March MARCH        1

[[4]]
  month upper n_vowels
1 April APRIL        1

[[5]]
  month upper n_vowels
1   May   MAY        1

[[6]]
  month upper n_vowels
1  June  JUNE        2

[[7]]
  month upper n_vowels
1  July  JULY        1

[[8]]
   month  upper n_vowels
1 August AUGUST        2

[[9]]
      month     upper n_vowels
1 September SEPTEMBER        3

[[10]]
    month   upper n_vowels
1 October OCTOBER        2

[[11]]
     month    upper n_vowels
1 November NOVEMBER        3

[[12]]
     month    upper n_vowels
1 December DECEMBER        3

map returning Data Frames

Combine this list of little DFs rowwise with list_rbind()
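A sketch of the full pattern, under the same assumptions about the month example (list_rbind needs purrr 1.0.0 or later):

```r
library(purrr)
library(stringr)

month_table <- map(month.name, \(m) {
    data.frame(month    = m,
               upper    = toupper(m),
               n_vowels = str_count(m, "[aeiou]"))
}) |>
    list_rbind()   # stack the 12 one-row data frames into one
```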

       month     upper n_vowels
1    January   JANUARY        3
2   February  FEBRUARY        3
3      March     MARCH        1
4      April     APRIL        1
5        May       MAY        1
6       June      JUNE        2
7       July      JULY        1
8     August    AUGUST        2
9  September SEPTEMBER        3
10   October   OCTOBER        2
11  November  NOVEMBER        3
12  December  DECEMBER        3

Mapping Together

Often, we will want to map multiple things together:

month.name
month.abb
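One way to pair the two vectors is map2_chr, which walks both inputs in step. This is a sketch; the argument names abb and full are illustrative.

```r
library(purrr)

# The first input feeds the first argument, the second input the second
short_for <- map2_chr(month.abb, month.name,
                      \(abb, full) paste(abb, "is short for", full))
```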

 [1] "Jan is short for January"   "Feb is short for February" 
 [3] "Mar is short for March"     "Apr is short for April"    
 [5] "May is short for May"       "Jun is short for June"     
 [7] "Jul is short for July"      "Aug is short for August"   
 [9] "Sep is short for September" "Oct is short for October"  
[11] "Nov is short for November"  "Dec is short for December"

Use pmap to go to three or more
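A minimal pmap sketch with three inputs, passed as a list. The third input and the output wording are made up for illustration.

```r
library(purrr)

# Elements of the unnamed list are matched to arguments by position
labels <- pmap_chr(
    list(month.abb, month.name, seq_along(month.name)),
    \(abb, full, i) paste0(abb, " (", full, ") is month ", i)
)
```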

Mapping and Counting

Use imap to get the index of the element as well:
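A sketch that would give the output below. For an unnamed input, imap passes the position as the second argument; for a named input it passes the name instead.

```r
library(purrr)

month_numbers <- imap_chr(month.name,
                          \(name, i) paste(name, "is month number", i))
```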

 [1] "January is month number 1"   "February is month number 2" 
 [3] "March is month number 3"     "April is month number 4"    
 [5] "May is month number 5"       "June is month number 6"     
 [7] "July is month number 7"      "August is month number 8"   
 [9] "September is month number 9" "October is month number 10" 
[11] "November is month number 11" "December is month number 12"

map vs Vectorization

map is most useful when the underlying function can’t be vectorized: e.g., file processing or downloading

# A tibble: 5 × 4
  type    setup                                        punchline              id
  <chr>   <chr>                                        <chr>               <int>
1 general What did the fish say when it hit the wall?  Dam.                    1
2 general How do you make a tissue dance?              You put a little b…     2
3 general What's Forrest Gump's password?              1Forrest1               3
4 general What do you call a belt made out of watches? A waist of time.        4
5 general Why can't bicycles stand on their own?       They are two tired      5
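The code that fetched the table above is not shown. As an offline stand-in for the same idea, this sketch maps read.csv over several files; the files and their contents are made up for illustration.

```r
library(purrr)

# Write three small CSV files to a temporary directory
demo_dir <- tempfile("map_demo_")
dir.create(demo_dir)
walk(1:3, \(i) {
    write.csv(data.frame(id = i, value = i^2),
              file.path(demo_dir, paste0("part_", i, ".csv")),
              row.names = FALSE)
})

# read.csv() handles one file at a time, so map over the paths
paths     <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
all_parts <- map(paths, read.csv) |> list_rbind()
```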

map pipelines

Often we will want to map several times as we perform steps of an analysis.

Cleaner than one big function:

# A tibble: 5 × 4
  type    setup                                        punchline              id
  <chr>   <chr>                                        <chr>               <int>
1 general What did the fish say when it hit the wall?  Dam.                    1
2 general How do you make a tissue dance?              You put a little b…     2
3 general What's Forrest Gump's password?              1Forrest1               3
4 general What do you call a belt made out of watches? A waist of time.        4
5 general Why can't bicycles stand on their own?       They are two tired      5
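A sketch of the pipeline style on made-up inputs, with one map per step instead of one function doing everything. The CSV snippets stand in for raw downloads and are not from the lecture.

```r
library(purrr)

# Two small CSV snippets standing in for raw downloads
raw <- list(a = "x\n1\n2\n3", b = "x\n10\n20")

summary_table <- raw |>
    map(\(txt) read.csv(text = txt)) |>                           # step 1: parse
    map(\(df) data.frame(n = nrow(df), mean_x = mean(df$x))) |>   # step 2: summarize
    list_rbind(names_to = "source")                               # step 3: combine
```

Each intermediate list can be inspected on its own, which makes it easier to find the step that failed.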

Accessing nested elements

Sometimes, when we have a complex list, we want to pull out certain elements:

Pass an index or name to map:
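A sketch of both forms on a small nested list shaped like parsed JSON. The list and its values are hypothetical; the output below comes from a larger list whose code is not shown.

```r
library(purrr)

repos <- list(
    list(name = "pkg_a", stars = 10L),
    list(name = "pkg_b", stars = 25L)
)

repo_names_list <- map(repos, "name")      # by name, returned as a list
repo_names      <- map_chr(repos, "name")  # same, as a character vector
repo_stars      <- map_int(repos, 2)       # by position: second element of each entry
```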

[[1]]
[1] "ggplot2"

[[2]]
[1] "lubridate"

[[3]]
[1] "stringr"

[[4]]
[1] "dplyr"

[[5]]
[1] "readr"

[[6]]
[1] "magrittr"

[[7]]
[1] "tidyr"

[[8]]
[1] "nycflights13"

[[9]]
[1] "rvest"

[[10]]
[1] "purrr"

[[11]]
[1] "haven"

[[12]]
[1] "readxl"

[[13]]
[1] "reprex"

[[14]]
[1] "tibble"

[[15]]
[1] "multidplyr"

[[16]]
[1] "dtplyr"

[[17]]
[1] "hms"

[[18]]
[1] "modelr"

[[19]]
[1] "forcats"

[[20]]
[1] "tidyverse"

[[21]]
[1] "tidytemplate"

[[22]]
[1] "blob"

[[23]]
[1] "ggplot2-docs"

[[24]]
[1] "glue"

[[25]]
[1] "style"

[[26]]
[1] "dbplyr"

[[27]]
[1] "googledrive"

[[28]]
[1] "googlesheets4"

[[29]]
[1] "tidyverse.org"

[[30]]
[1] "datascience-box"

Accessing List Elements

Use pluck to access elements of a list in the same way:

From last week,

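library(httr2)   # request(), req_*(), resp_body_json()
library(purrr)   # pluck(), map(), list_rbind()
library(tibble)  # as_tibble()

# Top 100 CRAN downloads over the last day, as one tibble
request("https://cranlogs.r-pkg.org") |>
    req_url_path_append("top") |>
    req_url_path_append("last-day") |>
    req_url_path_append(100) |>
    req_perform() |>
    resp_body_json() |>
    pluck("downloads") |>     # keep only the "downloads" element of the parsed JSON
    map(as_tibble) |>         # one single-row tibble per package
    list_rbind()              # stack them into one table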

map for Complex Analysis

Anything can be a column of a data.frame, even another data.frame
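A sketch of how a nested table like the one below can be built, assuming the penguins data set that ships with R 4.5.0 and later, and tidyr's nest:

```r
library(dplyr)
library(tidyr)

# One row per species; the remaining columns are packed into a list-column
nested <- penguins |>
    group_by(species) |>
    nest()

nested
```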

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
  <fct>     <list>            
1 Adelie    <tibble [152 × 7]>
2 Gentoo    <tibble [124 × 7]>
3 Chinstrap <tibble [68 × 7]>

data is a set of 3 different data frames (one per species)

map for Complex Analysis

Use map to fit the same model to each data frame separately:
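A sketch of the fitting step. The model formula is an assumption based on the regression used elsewhere in these slides, and penguins requires R 4.5.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

with_models <- penguins |>
    group_by(species) |>
    nest() |>
    # Fit one regression per species and store it in a list-column
    mutate(model = map(data, \(d) lm(body_mass ~ flipper_len, data = d)))
```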

# A tibble: 3 × 3
# Groups:   species [3]
  species   data               model 
  <fct>     <list>             <list>
1 Adelie    <tibble [152 × 7]> <lm>  
2 Gentoo    <tibble [124 × 7]> <lm>  
3 Chinstrap <tibble [68 × 7]>  <lm>

map for Complex Analyses

Continue using map to analyze each species-model separately
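A sketch of how the summary columns below might be computed. The column names follow the printed output; the exact original code is not shown.

```r
library(dplyr)
library(tidyr)
library(purrr)

results <- penguins |>
    group_by(species) |>
    nest() |>
    mutate(model        = map(data, \(d) lm(body_mass ~ flipper_len, data = d)),
           coefficients = map(model, coef),                               # list-column of vectors
           slope        = map_dbl(coefficients, \(b) b[["flipper_len"]]), # one number per model
           r_sq         = map_dbl(model, \(m) summary(m)$r.squared)) |>
    select(species, coefficients, slope, r_sq)
```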

# A tibble: 3 × 4
# Groups:   species [3]
  species   coefficients slope  r_sq
  <fct>     <list>       <dbl> <dbl>
1 Adelie    <dbl [2]>     32.8 0.219
2 Gentoo    <dbl [2]>     54.6 0.494
3 Chinstrap <dbl [2]>     34.6 0.412

So flipper_len explains the most body_mass variation in Gentoo penguins.

Modifying Map Behavior

When passing functions to map, we might want to handle errors

Use an adverb to modify a function

Adverbs: safely

If you have a function that sometimes throws errors, wrap it in safely
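A deterministic sketch of safely, using log on a bad input as a stand-in for a function that sometimes fails. It is not the code behind the output below.

```r
library(purrr)

safe_log <- safely(log, otherwise = NA_real_)

inputs  <- list(1, "a", 100)      # log("a") throws an error
outputs <- map(inputs, safe_log)

# Each output is a list with two slots: `result` and `error`
outputs[[2]]$error

# Pull out just the results; the failed call contributes NA
results <- map_dbl(outputs, "result")
```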

 [1] 7 8 5 5 3 4 4 6 9 7 8 8

Adverbs: safely

 [1]  7  8  5  5  3  4 NA  6  9 NA NA  8

Adverb: `possibly`

The safely |> map("result") combo is common, so purrr provides a helper, possibly:
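A deterministic sketch of possibly, again using log on a bad input as a stand-in for a flaky function (not the code behind the output below):

```r
library(purrr)

# Return NA instead of raising an error
possible_log <- possibly(log, otherwise = NA_real_)

results <- map_dbl(list(1, "a", 100), possible_log)
```

Unlike safely, the error itself is discarded, so use possibly only when you don't need to inspect what went wrong.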

 [1]  7  8  5  5 NA  4  4  6  9  7  8  8

Adverbs: insistently

For functions that fail sporadically (e.g., web access), try insistently:

library(purrr)
library(rvest)   # provides read_html()

read_html_insist <- insistently(read_html)

read_html_insist("https://a.sketchy.site")

Will try 3 times by default

(cf. sites that don’t work reliably, as in MP#02)
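An offline sketch of the retry behavior. The flaky function is hypothetical and fails on its first two calls; the short rate_delay pause is chosen only to keep the example fast.

```r
library(purrr)

attempts <- 0

# Hypothetical function that fails twice, then succeeds
flaky <- function() {
    attempts <<- attempts + 1
    if (attempts < 3) stop("temporary failure")
    "ok"
}

# Retry up to 5 times, pausing 0.01 seconds between attempts
flaky_insist <- insistently(flaky,
                            rate  = rate_delay(pause = 0.01, max_times = 5),
                            quiet = TRUE)

result <- flaky_insist()
```

If every attempt fails, the wrapped function raises an error, so it can be combined with possibly or safely.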

Adverbs: slowly

Some websites will get mad if you query them too often: slowly rate-limits a function so that it isn’t called too often

library(purrr)
library(rvest)   # provides read_html()

read_html_slow <- slowly(read_html)

read_html_slow("https://a.rate-limited.site")

Default is once per second.
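A sketch of changing the rate, using a hypothetical stand-in for a web request so the example runs offline:

```r
library(purrr)

# Hypothetical stand-in for a web request
fake_request <- function(x) x * 2

# Wait 0.2 seconds between calls instead of the default 1 second
slow_request <- slowly(fake_request, rate = rate_delay(pause = 0.2))

timing <- system.time(results <- map_dbl(1:3, slow_request))
```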

Adverbs: in_parallel

For parallel processing, use the in_parallel adverb:

[1] 2 3 4 5

[1] 2 3 4 5

Argument to in_parallel needs to be an anonymous function
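A sketch of a sequential versus parallel comparison. It assumes purrr 1.1.0 or later with the mirai and carrier packages installed, and four local worker processes; the sleep stands in for slow IO.

```r
library(purrr)

# Start 4 background worker processes
mirai::daemons(4)

# Sequential: the sleeps add up
seq_time <- system.time(
    seq_res <- map_dbl(1:4, \(x) { Sys.sleep(0.5); x + 1 })
)

# Parallel: the anonymous function is shipped to the workers
par_time <- system.time(
    par_res <- map_dbl(1:4, in_parallel(\(x) { Sys.sleep(0.5); x + 1 }))
)

# Shut the workers down again
mirai::daemons(0)
```

The function runs in a separate process, so it must be self-contained: anything it needs from your session has to be passed in explicitly.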

Adverbs: in_parallel

Compare:

   user  system elapsed 
  0.001   0.000   4.022

   user  system elapsed 
  0.001   0.001   1.006

Adverbs: in_parallel

Parallelization is not magic:

Most useful for IO bound tasks (reading files, downloads)
Too much parallelization slows things down (“thrashing”)
Need to be careful about error handling
- Use safely so if one step errors, you don’t lose everything

List Access with pluck

Given a list, the pluck function will pull out elements:

list_obj |> pluck(n) will pull out the \(n^{\text{th}}\) element
list_obj |> pluck("name") will pull out the element named "name"
list_obj |> pluck(func) will apply the “accessor” func
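The three forms on a small hypothetical list (the names and values are made up for illustration):

```r
library(purrr)

person <- list(name = "Ada", langs = c("R", "SQL"))

second_element <- person |> pluck(2)        # by position
name_element   <- person |> pluck("name")   # by name
element_names  <- person |> pluck(names)    # via an accessor function
```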

List Access with pluck

lm: Linear regression (and ANOVA)

my_regression <- lm(body_mass ~ flipper_len, data=penguins)

summary(my_regression)


Call:
lm(formula = body_mass ~ flipper_len, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5780.831    305.815  -18.90   <2e-16 ***
flipper_len    49.686      1.518   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

List Access with pluck

Can use pluck + accessors to get the coefficients

my_regression |> pluck(1)
my_regression |> pluck("coefficients")
my_regression |> pluck(coef)

(Intercept) flipper_len 
-5780.83136    49.68557

Final form is most robust

List Access with pluck

pluck has some nice usability features:

Can supply “compound” selections: pluck(1) |> pluck("a") is the same as pluck(1, "a")
Can change default value from NULL: pluck("a", .default=NA)
- Use chuck if you want to error instead of default
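A sketch of these features on a hypothetical nested list:

```r
library(purrr)

nested <- list(list(a = "first", b = "second"),
               list(a = "third"))

compound <- pluck(nested, 1, "a")                  # one call, two levels
chained  <- nested |> pluck(1) |> pluck("a")       # same result, two calls

missing_default <- pluck(nested, 2, "b")                # absent element: NULL
missing_na      <- pluck(nested, 2, "b", .default = NA) # absent element: NA

# chuck() raises an error for the same absent element
chuck_outcome <- tryCatch(chuck(nested, 2, "b"),
                          error = \(e) "chuck() raised an error")
```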

Combining Results

Given a list, we can ‘combine’ elements with the reduce function:

map(month.name, 
    nchar) |>
    reduce(`+`) # reduce(`+`) => sum

[1] 74

map(month.name, 
    nchar) |>
    reduce(max)

[1] 9

Useful for combining many data sets in a ‘mega-join’

Use accumulate to keep intermediate results (a la cumsum)
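A sketch of both ideas on made-up tables. Base merge is used here for the inner join; a dplyr join would work the same way.

```r
library(purrr)

# Three small tables sharing an `id` key
tables <- list(
    data.frame(id = 1:3, x = c(10, 20, 30)),
    data.frame(id = 2:4, y = c("b", "c", "d")),
    data.frame(id = 1:3, z = c(TRUE, FALSE, TRUE))
)

# reduce() joins the first two tables, then joins that result with the third
joined <- reduce(tables, \(left, right) merge(left, right, by = "id"))

# accumulate() keeps every intermediate value, like cumsum()
running_total <- accumulate(1:5, `+`)
```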

Functional Tools

Not everything fits within purrr tooling

But a lot does!

Use it when helpful:

Applying the same analysis many times (map)
Recursively combining results (reduce), e.g., an \(n\)-way inner join
Error handling
List-structured data (HTML -> more next week!)
