Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 13 – Thursday 2026-05-07
Last Updated: 2026-05-08

STA 9750 Week 13

Today: Final Project Presentation + Enrichment: Functional Tools in R

These slides can be found online at:

https://michael-weylandt.com/STA9750/slides/slides13.html

Upcoming TODO

Upcoming student responsibilities:

Date        Time        Details
2026-05-14  6:00pm ET   Pre-Assignment #14 Due
2026-05-15  11:59pm ET  Mini-Project #04 Due
2026-05-21  11:59pm ET  Final Project Summary Report Due [Tentative]
2026-05-21  11:59pm ET  Final Project Individual Report Due [Tentative]
2026-05-21  11:59pm ET  Final Project Teammate Peer Evaluations Due [Tentative]
2026-05-24  11:59pm ET  Mini-Project Peer Feedback #04 Due

STA 9750 Week 13

Today: Final Project Presentation + Enrichment: Functional Tools in R

  • Team Formation ✅
  • Project Proposals ✅
  • Check-In Presentations ✅
  • Final Presentations ⬅️
  • Final Reports
    • Individual Technical Report
    • Group Summary Report
    • Final Peer Evaluations

Today

  • Administrative Business
  • Course Project Presentations
  • Optional Enrichment: Functional Programming in R

Course Administration

Mini-Project #04

MP#04 - Going for the Gold

Due 2026-05-15 at 11:59pm ET

Topics Covered:

  • Web scraping and text parsing
  • Statistical testing

Peer feedback due 2026-05-24

Course Support

  • Synchronous - MW Office Hours 2x / week:
    • Wednesdays 5pm: In Person
    • Thursdays 5pm: Zoom
  • Asynchronous: Piazza (\(<45\) minute average response time)

Course Project

End of Semester Course Project:

  • In-Class Final Presentations: Today!
  • Individual Report: 2026-05-21
  • Group Report: 2026-05-21
  • Peer Evaluations: 2026-05-21

See detailed instructions for rubrics, expectations, etc.

Final Report - Group

Non-Technical Summary

Think of yourself as a “consultant” asked by a client to investigate a topic. This is the “Executive Summary” - details in appendix (individual reports)

Should cover roughly the same material as today’s presentation:

  • Can link to individual reports for details
  • Use image links ![]() to reference figures from individual reports
  • “Words and pictures” not “code”

Only one of you needs to submit on Brightspace, but everyone should still open a GH issue.

Final Report - Individual

Technical Appendix to Group Report.

Appendix for lower-level (detail-oriented) staff, not leadership

Still requires writing and context, etc. but this is where I’m going to focus on your code and analysis.

Good Thinking is more important than Good Findings

I am your audience, so don’t waste time explaining basic technical details. I’m more interested in you justifying the choices you made along the way

Code does not count towards word counts. (See helper function for word count)

Final Reports

Group and Individual Reports

  • Submitted via GitHub and Brightspace
  • Everyone submits a separate link to the group report on GH

Deadline on “final exam” day

No late work accepted

Project Peer Feedback

Same peer feedback mechanism as mid-semester check-ins:

  • More questions, same script
source("https://michael-weylandt.com/STA9750/load_helpers.R")
project_peer_evals(cycle = "final")
  • Due on same day as reports 2026-05-21

If you don’t submit, you won’t receive any points!

Final Project Grading

Rubric is set very high to give me flexibility to reward teams that take on big challenges

Hard rubric => Grades are curved generously

Multiple paths to success

If your project is “easy” on an element (data import in particular), that’s great! Don’t spend the effort over-complicating things. Effort is better spent elsewhere

On to the Show!

Presentation Order

Presentation Order  Team
1                   Water Benders (JE+JABB+MTP+JA+AS)
2                   3-1-Fun! (XC+ML+ER+RJSN)
3                   Maniac Braniacs (HHS+KK+FC+DN)
4                   Emissions Impossible (LR+MOG+APTL)
5                   Inspector Gadget (MUO+KN+CM+ID+KM)


Wrap Up

Time to write:

  • Team Formation ✅
  • Project Proposals ✅
  • Check-In Presentations ✅
  • Final Presentations ✅
  • Final Reports ⬅️
    • Individual Technical Report
    • Group Summary Report
    • Final Peer Evaluations

Next week: statistical modeling - useful for some integration aims

Optional enrichment: Functional Programming in R with purrr

Life Tip of the Week

End of the Semester is Approaching

  • End of the semester is rough
    • More important than ever to plan ahead
    • Ask for extensions / accommodations early
    • Help us help you
    • Faculty are slammed as well
  • Don’t ‘grade grub’
    • Makes it harder for professors to curve in your favor
    • Seek extra credit from the syllabus
    • Don’t ask for special treatment - ask how to take advantage of existing opportunities
  • Take care of yourselves
    • Seasonal bugs and allergies going around …

Musical Treat


Optional Material: Functional Programming with purrr

Functional Programming

Functional programming - purity

  • Minimizing book-keeping and side-effects

Can go deep into FP world - we’re just dipping a toe in

  • Iteration: map and friends, reduce, list_*
  • Adverbs: safely, partial, insistently, in_parallel
  • List access: pluck

Functional Programming

Compare:

for(i in seq_along(letters)){
    lower_letter <- letters[i]
    upper_letter <- LETTERS[i]
    
    cat(upper_letter, " is the upper case of ", lower_letter, "\n")
}

and

walk2(LETTERS, letters, ~ cat(.x, " is the upper case of ", .y, "\n"))

No indexing ([]) or explicit loop management

FP in R

In R, FP is principally associated with lists

Recall: a list is a generic container in R (can hold anything, even other lists)

  • Natural tool for parsing JSON (see previous weeks)
  • Many things in R are lists under the hood (including data.frames)
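
A minimal sketch of both points, assuming purrr is loaded; the objects mixed and df are hypothetical names made up for illustration:

```r
library(purrr)

# A list can hold values of different types, including other lists
mixed <- list(number = 1, text = "a", nested = list(1, 2))
nested_is_list <- is.list(mixed$nested)

# A data.frame is a list of equal-length columns under the hood
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df_is_list <- is.list(df)

# Because of that, map functions iterate over the columns of a data.frame
column_classes <- map_chr(df, class)
column_classes
```

This is why map over a data frame works column by column rather than row by row.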

Iteration - map

Often, we want to do the same thing to several different items:

  • E.g., on MP#04, download Olympic results for each year

map and friends let us avoid loops

  • Handles book-keeping for us
  • List in and list out (by default)

Iteration - map

map(INPUT, FUNCTION)

applies FUNCTION to each element of INPUT and collects the output in a new list

map(month.name, str_to_upper)
[[1]]
[1] "JANUARY"

[[2]]
[1] "FEBRUARY"

[[3]]
[1] "MARCH"

[[4]]
[1] "APRIL"

[[5]]
[1] "MAY"

[[6]]
[1] "JUNE"

[[7]]
[1] "JULY"

[[8]]
[1] "AUGUST"

[[9]]
[1] "SEPTEMBER"

[[10]]
[1] "OCTOBER"

[[11]]
[1] "NOVEMBER"

[[12]]
[1] "DECEMBER"

Typed Iteration

Sometimes, we know the type of values to be returned:

map_* lets us put those into a vector:

map_chr(month.name, str_to_upper)
 [1] "JANUARY"   "FEBRUARY"  "MARCH"     "APRIL"     "MAY"       "JUNE"     
 [7] "JULY"      "AUGUST"    "SEPTEMBER" "OCTOBER"   "NOVEMBER"  "DECEMBER" 
map_int(month.name, nchar)
 [1] 7 8 5 5 3 4 4 6 9 7 8 8

Anonymous Functions

Functions in R can be defined as:

my_function <- function(arg1, arg2="default"){
    function_code(goes * here)
}

But this is clunky

Anonymous functions (“lambdas”) let us define ‘little functions’ more compactly

plus_1 <- function(x){
    x + 1
}

\(x) x + 1

Supposedly \( looks like the Greek \(\lambda\)

Anonymous Functions

Anonymous functions play well with map:

# String manipulation (Lecture 10) to count vowels
map_int(month.name, \(x) x |> str_to_lower() |> str_count( "[aeiou]")) 
 [1] 3 3 1 2 1 2 1 3 3 3 3 3

map returning Data Frames

A common idiom is to return a data frame inside map

map(month.name, 
    \(x) data.frame(month = x, 
                    upper = str_to_upper(x), 
                    n_vowels = str_count(str_to_lower(x), "[aeiou]")))
[[1]]
    month   upper n_vowels
1 January JANUARY        3

[[2]]
     month    upper n_vowels
1 February FEBRUARY        3

[[3]]
  month upper n_vowels
1 March MARCH        1

[[4]]
  month upper n_vowels
1 April APRIL        2

[[5]]
  month upper n_vowels
1   May   MAY        1

[[6]]
  month upper n_vowels
1  June  JUNE        2

[[7]]
  month upper n_vowels
1  July  JULY        1

[[8]]
   month  upper n_vowels
1 August AUGUST        3

[[9]]
      month     upper n_vowels
1 September SEPTEMBER        3

[[10]]
    month   upper n_vowels
1 October OCTOBER        3

[[11]]
     month    upper n_vowels
1 November NOVEMBER        3

[[12]]
     month    upper n_vowels
1 December DECEMBER        3

map returning Data Frames

Combine this list of little DFs rowwise with list_rbind()

map(month.name, 
    \(x) data.frame(month = x, 
                    upper = str_to_upper(x), 
                    n_vowels = str_count(str_to_lower(x), "[aeiou]"))) |>
    list_rbind()
       month     upper n_vowels
1    January   JANUARY        3
2   February  FEBRUARY        3
3      March     MARCH        1
4      April     APRIL        2
5        May       MAY        1
6       June      JUNE        2
7       July      JULY        1
8     August    AUGUST        3
9  September SEPTEMBER        3
10   October   OCTOBER        3
11  November  NOVEMBER        3
12  December  DECEMBER        3

Mapping Together

Often, we will want to map multiple things together:

  • month.name
  • month.abb
map2_chr(month.name, month.abb, \(x, y) paste(y, "is short for", x))
 [1] "Jan is short for January"   "Feb is short for February" 
 [3] "Mar is short for March"     "Apr is short for April"    
 [5] "May is short for May"       "Jun is short for June"     
 [7] "Jul is short for July"      "Aug is short for August"   
 [9] "Sep is short for September" "Oct is short for October"  
[11] "Nov is short for November"  "Dec is short for December" 

Use pmap to go to three or more
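
A small sketch of pmap with three inputs, assuming purrr is loaded. The month_days vector is made up for illustration (non-leap-year day counts for the first three months):

```r
library(purrr)

# Hypothetical third input: days in each of the first three months
month_days <- c(31, 28, 31)

# pmap takes a list of inputs and passes them to the function in parallel
labels <- pmap_chr(
    list(month.name[1:3], month.abb[1:3], month_days),
    \(full, abb, days) paste0(full, " (", abb, ") has ", days, " days")
)
labels
```

With an unnamed list, the inputs are matched to the function arguments by position.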

Mapping and Counting

Use imap to get the index of the element as well:

imap_chr(month.name, \(x, y) paste(x, "is month number", y))
 [1] "January is month number 1"   "February is month number 2" 
 [3] "March is month number 3"     "April is month number 4"    
 [5] "May is month number 5"       "June is month number 6"     
 [7] "July is month number 7"      "August is month number 8"   
 [9] "September is month number 9" "October is month number 10" 
[11] "November is month number 11" "December is month number 12"
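
When the input has names, imap passes the name instead of the position as the second argument. A minimal sketch, assuming purrr is loaded; the scores vector is hypothetical:

```r
library(purrr)

# Hypothetical named input
scores <- c(alice = 90, bob = 85)

# First argument is the value, second is the name (or the index if unnamed)
report <- imap_chr(scores, \(value, name) paste(name, "scored", value))
report
```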

map vs Vectorization

map is most useful when the underlying function can’t be vectorized: e.g., file processing or downloading

library(jsonlite); library(glue)
map(1:5, 
    \(n) as_tibble(fromJSON(glue("https://official-joke-api.appspot.com/jokes/{n}")))) |>
    list_rbind()
# A tibble: 5 × 4
  type    setup                                        punchline              id
  <chr>   <chr>                                        <chr>               <int>
1 general What did the fish say when it hit the wall?  Dam.                    1
2 general How do you make a tissue dance?              You put a little b…     2
3 general What's Forrest Gump's password?              1Forrest1               3
4 general What do you call a belt made out of watches? A waist of time.        4
5 general Why can't bicycles stand on their own?       They are two tired      5
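
By contrast, when a function is already vectorized, map adds nothing. A quick sketch comparing the two on the earlier nchar example, assuming purrr is loaded:

```r
library(purrr)

# nchar is vectorized: one call handles the whole vector
vectorized <- nchar(month.name)

# map_int calls nchar once per element and reassembles the results
mapped <- map_int(month.name, nchar)

# Both approaches give the same lengths
same_result <- all(vectorized == mapped)
same_result
```

Prefer the vectorized call when one exists; save map for functions that only handle one input at a time.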

map pipelines

Often we will want to map several times as we perform steps of an analysis.

Cleaner than one big function:

1:5 |>
    map(\(n) glue("https://official-joke-api.appspot.com/jokes/{n}")) |>
    map(fromJSON) |>
    map(as_tibble) |>
    list_rbind()
# A tibble: 5 × 4
  type    setup                                        punchline              id
  <chr>   <chr>                                        <chr>               <int>
1 general What did the fish say when it hit the wall?  Dam.                    1
2 general How do you make a tissue dance?              You put a little b…     2
3 general What's Forrest Gump's password?              1Forrest1               3
4 general What do you call a belt made out of watches? A waist of time.        4
5 general Why can't bicycles stand on their own?       They are two tired      5

Accessing nested elements

Sometimes, when we have a complex list, we want to pull out certain elements:

  • Pass an index or name to map:
library(gh)
gh("/orgs/tidyverse/repos") |>
    map("name")
[[1]]
[1] "ggplot2"

[[2]]
[1] "lubridate"

[[3]]
[1] "stringr"

[[4]]
[1] "dplyr"

[[5]]
[1] "readr"

[[6]]
[1] "magrittr"

[[7]]
[1] "tidyr"

[[8]]
[1] "nycflights13"

[[9]]
[1] "rvest"

[[10]]
[1] "purrr"

[[11]]
[1] "haven"

[[12]]
[1] "readxl"

[[13]]
[1] "reprex"

[[14]]
[1] "tibble"

[[15]]
[1] "multidplyr"

[[16]]
[1] "dtplyr"

[[17]]
[1] "hms"

[[18]]
[1] "modelr"

[[19]]
[1] "forcats"

[[20]]
[1] "tidyverse"

[[21]]
[1] "tidytemplate"

[[22]]
[1] "blob"

[[23]]
[1] "ggplot2-docs"

[[24]]
[1] "glue"

[[25]]
[1] "style"

[[26]]
[1] "dbplyr"

[[27]]
[1] "googledrive"

[[28]]
[1] "googlesheets4"

[[29]]
[1] "tidyverse.org"

[[30]]
[1] "datascience-box"

Accessing List Elements

Use pluck to access elements of a list in the same way:

From API Examples:

library(httr2) # provides request(), req_*(), and resp_body_json()
request("https://cranlogs.r-pkg.org") |>
    req_url_path_append("top") |>
    req_url_path_append("last-day") |>
    req_url_path_append(100) |>
    req_perform() |>
    resp_body_json() |>
    pluck("downloads") |>
    map(as_tibble) |>
    list_rbind()

map for Complex Analysis

Anything can be a column of a data.frame, even another data.frame

penguins |>
    group_by(species) |>
    nest()
# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
  <fct>     <list>            
1 Adelie    <tibble [152 × 7]>
2 Gentoo    <tibble [124 × 7]>
3 Chinstrap <tibble [68 × 7]> 

The data column is a list of 3 data frames (one per species)

map for Complex Analysis

Use map to fit the same model to each data frame separately:

penguins |>
    group_by(species) |>
    nest() |>
    mutate(model = map(data, \(d) lm(body_mass ~ flipper_len, data=d)))
# A tibble: 3 × 3
# Groups:   species [3]
  species   data               model 
  <fct>     <list>             <list>
1 Adelie    <tibble [152 × 7]> <lm>  
2 Gentoo    <tibble [124 × 7]> <lm>  
3 Chinstrap <tibble [68 × 7]>  <lm>  

map for Complex Analyses

Continue using map to analyze each species-model separately

penguins |>
    group_by(species) |>
    nest() |>
    mutate(model = map(data, \(d) lm(body_mass ~ flipper_len, data=d)), 
           mdl_summary = map(model, summary), 
           coefficients = map(model, coef), 
           slope = map_dbl(coefficients, "flipper_len"), 
           r_sq = map_dbl(mdl_summary, "r.squared")) |>
    select(-data, -model, -mdl_summary)
# A tibble: 3 × 4
# Groups:   species [3]
  species   coefficients slope  r_sq
  <fct>     <list>       <dbl> <dbl>
1 Adelie    <dbl [2]>     32.8 0.219
2 Gentoo    <dbl [2]>     54.6 0.494
3 Chinstrap <dbl [2]>     34.6 0.412

So flipper_len explains the most body_mass variation in Gentoo penguins.

Modifying Map Behavior

When passing functions to map, we might want to handle errors

  • Use an adverb to modify a function

Adverbs: safely

If you have a function that sometimes throws errors, wrap it in safely

nchar_bad <- function(x){
    if(runif(1) < 0.2) stop("AN ERROR") else nchar(x)
}

map_dbl(month.name, nchar_bad)
Error in `map_dbl()`:
ℹ In index: 2.
Caused by error in `.f()`:
! AN ERROR

Adverbs: safely

nchar_safe <- safely(nchar_bad, otherwise=NA)

map(month.name, nchar_safe) |> 
    map("result") |> # Result of safely() is list, so
    list_c()         # We must map() our pluck()
 [1]  7  8  5 NA  3  4  4  6  9  7  8  8

Adverb: possibly

The safely |> map("result") combo is common, so purrr provides the helper possibly:

nchar_safe <- possibly(nchar_bad, otherwise=NA)

map_int(month.name, nchar_safe) 
 [1] NA  8  5  5  3  4  4  6  9 NA  8  8
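
Since nchar_bad fails at random, a deterministic sketch may make the difference between the two adverbs easier to see. The function sqrt_strict is made up for illustration; purrr is assumed to be loaded:

```r
library(purrr)

# Hypothetical function that always fails on negative input
sqrt_strict <- function(x) {
    if (x < 0) stop("negative input")
    sqrt(x)
}

# safely() returns a list with both a result and an error slot
sqrt_safe <- safely(sqrt_strict, otherwise = NA_real_)
ok  <- sqrt_safe(4)   # result is 2, error is NULL
bad <- sqrt_safe(-1)  # result is NA, error holds the condition object

# possibly() keeps only the result, so typed maps work directly
sqrt_possibly <- possibly(sqrt_strict, otherwise = NA_real_)
roots <- map_dbl(c(4, -1, 9), sqrt_possibly)
roots
```

Use safely when you want to inspect the error afterwards, and possibly when a default value is enough.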

Adverbs: insistently

For functions that fail sporadically (e.g., web access), try insistently:

read_html_insist <- insistently(read_html)

read_html_insist("https://a.sketchy.site")

Will try 3 times by default

(cf. sites that don’t work reliably, as in MP#02)
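
A sketch of insistently without touching the network. The flaky function and its attempts counter are hypothetical stand-ins for an unreliable site, and the rate settings are illustrative choices rather than recommendations:

```r
library(purrr)

# Hypothetical stand-in for an unreliable site: fails twice, then succeeds
attempts <- 0
flaky <- function() {
    attempts <<- attempts + 1
    if (attempts < 3) stop("temporary failure")
    "success"
}

# Retry up to 5 times, pausing 0.1 seconds between tries
flaky_insist <- insistently(flaky,
                            rate = rate_delay(pause = 0.1, max_times = 5),
                            quiet = TRUE)

result <- flaky_insist()
result
```

The rate argument controls how many retries are allowed and how long to wait between them.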

Adverbs: slowly

Some websites will get mad if you query too often: slowly will make sure it isn’t called too often

read_html_slow <- slowly(read_html)

read_html_slow("https://a.rate-limited.site")

Default is once per second.

Adverbs: in_parallel

For parallel processing, use the in_parallel adverb:

mirai::daemons(4) # Run 4 processes in parallel

map_dbl(1:4, \(x) x+1)
[1] 2 3 4 5
map_dbl(1:4, in_parallel(\(x) x + 1))
[1] 2 3 4 5

Argument to in_parallel needs to be an anonymous function

Adverbs: in_parallel

Compare:

system.time(map_dbl(1:4, \(x) {Sys.sleep(1); x+1}))
   user  system elapsed 
  0.004   0.001   4.023 
mirai::daemons(4) # Run 4 processes in parallel
system.time(map_dbl(1:4, in_parallel(\(x) {Sys.sleep(1); x+1})))
   user  system elapsed 
  0.001   0.000   1.009 

Adverbs: in_parallel

Parallelization is not magic:

  • Most useful for IO bound tasks (reading files, downloads)
  • Too much parallelization slows things down (“thrashing”)
  • Need to be careful about error handling
    • Use safely so if one step errors, you don’t lose everything

List Access with pluck

Given a list, the pluck function will pull out elements:

  • list_obj |> pluck(n) will pull out the \(n^{\text{th}}\) element
  • list_obj |> pluck("name") will pull out the element named "name"
  • list_obj |> pluck(func) will apply the “accessor” func
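
A minimal sketch of the three forms on a small list. The course object is made up for illustration; purrr is assumed to be loaded:

```r
library(purrr)

# Hypothetical nested list
course <- list(code = "STA 9750",
               topics = list("R basics", "dplyr", "purrr"))

by_position <- course |> pluck(1)         # first element
by_name     <- course |> pluck("topics")  # element named "topics"

# An accessor function is applied to the object itself
by_function <- course |> pluck(\(x) length(x$topics))
by_function
```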

List Access with pluck

lm: Linear regression (and ANOVA)

my_regression <- lm(body_mass ~ flipper_len, data=penguins)

summary(my_regression)

Call:
lm(formula = body_mass ~ flipper_len, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5780.831    305.815  -18.90   <2e-16 ***
flipper_len    49.686      1.518   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

List Access with pluck

Can use pluck + accessors to get the coefficients

my_regression |> pluck(1)
my_regression |> pluck("coefficients")
my_regression |> pluck(coef)
(Intercept) flipper_len 
-5780.83136    49.68557 

Final form is most robust

List Access with pluck

pluck has some nice usability features:

  • Can supply “compound” selections: pluck(1) |> pluck("a") is the same as pluck(1, "a")
  • Can change default value from NULL: pluck("a", .default=NA)
    • Use chuck if you want to error instead of default
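
A short sketch of these features, assuming purrr is loaded; the nested list is hypothetical:

```r
library(purrr)

# Hypothetical list where the second element lacks a "b" entry
nested <- list(list(a = "first-a", b = "first-b"),
               list(a = "second-a"))

# Compound selection matches chained pluck calls
compound <- pluck(nested, 1, "a")
chained  <- nested |> pluck(1) |> pluck("a")

# Missing elements give NULL unless .default is supplied
missing_null    <- pluck(nested, 2, "b")
missing_default <- pluck(nested, 2, "b", .default = NA)

# chuck() raises an error instead of returning a default
chuck_outcome <- tryCatch(chuck(nested, 2, "b"),
                          error = \(e) "chuck() raised an error")
chuck_outcome
```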

Combining Results

Given a list, we can ‘combine’ elements with the reduce function:

map(month.name, 
    nchar) |>
    reduce(`+`) # reduce(`+`) => sum
[1] 74
map(month.name, 
    nchar) |>
    reduce(max)
[1] 9

Useful for combining many data sets in a ‘mega-join’

Use accumulate to keep intermediate results (a la cumsum)
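
A sketch of both ideas, assuming purrr and dplyr are loaded. The three small tables are made up for illustration:

```r
library(purrr)
library(dplyr)

# Hypothetical tables sharing an id column
heights <- tibble(id = 1:3, height = c(150, 160, 170))
weights <- tibble(id = 1:3, weight = c(50, 60, 70))
ages    <- tibble(id = 2:3, age = c(30, 40))

# reduce joins the first two tables, then joins that result to the third
joined <- list(heights, weights, ages) |>
    reduce(\(x, y) inner_join(x, y, by = "id"))

# accumulate keeps every intermediate value, like cumsum
running_total <- accumulate(1:5, `+`)
running_total
```

Only ids 2 and 3 appear in all three tables, so the inner join keeps two rows.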

Functional Tools

Not everything fits within purrr tooling

But a lot does!

Use it when helpful:

  • Applying the same analysis many times (map)
  • Recursively combining (\(n\)-way inner join)
  • Error handling
  • List-structured data (HTML -> more next week!)