STA 9750
Week 12 Update
2025-05-01

Michael Weylandt

Agenda

Today

  • Administrative Business
  • Brief Review: String Manipulation
  • New Material:
    • Manipulating HTML Text into Data
    • Statistical Inference
  • Wrap Up and Looking Ahead

Orientation

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R
    • Flat Files and APIs ✅
    • Web Scraping ✅
    • Cleaning and Processing Text ⬅️
  • Statistical Modeling in R

Administrative Business

STA 9750 Mini-Project #04

MP#04 online now

  • Due 2025-05-07 at 11:45pm ET (\(\approx\) 3 weeks - 2 remaining)
  • Topic: Political Maps
    • Technical Subject: Table scraping from Wikipedia
  • Format:
    • Political Talking Head (Optional - see notes)
    • GitHub post AND Brightspace submission

STA 9750 Mini-Project #03

MP#03 peer feedback in process

Going Forward

Pre-Assignments

Brightspace - Wednesdays at 11:45pm ET

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading

Next (and final!) pre-assignment is 2025-05-07 at 11:45pm ET

I am behind on reading PA comments:

  • For anything urgent, please contact me directly🙏

Grading

I returned:

  • Mid-Term Check-In Feedback

I still owe you:

  • MP#02 peer meta-review fixes

I will owe you:

  • MP#03 grades and meta-grades

Course Support

  • Synchronous
    • Office Hours 2x / week
      • MW Office Hours on Tuesdays + Thursdays
  • Asynchronous
    • Piazza (\(\approx 20\) minute average response time)

Upcoming

Semester end is coming quickly!

  • MP#04
  • Final presentations
  • Final reports

That’s it!

Feedback Survey

I have posted a course feedback survey at

https://baruch.az1.qualtrics.com/jfe/form/SV_9uyZ4YFsrcRRPIG

Comments very welcome (but not required)

Next Semester Topics

Possible MP ideas:

  • NYC Open Data
  • Sports (Baseball?)
  • Spotify / Music
  • Healthcare / Pharmaceutical (might be tricky…)
  • Video Games
  • Quant Finance / Time Series?
  • Baruch Demographics
  • Job Market
  • Real Estate

Comments

Bad - Trivial:

# Set x to 3
x <- 3

Bad - Opaque:

# Follow instructions
x <- 3

Bad - Redundant / Explaining Code

# Fit a model
mod <- model(x, y)

# Build a query
query_build() |> query_add() |> query_formulate()

Comments

Good - Purpose (“Business Logic”):

# Regulation XX.YY requires us to apply a risk multiplier to output
# As of 2024-11-01, CFO says risk multiplier is 3
# Cf. Email to risk group, subject "New Risk Multiplier"
RISK_MULTIPLIER <- 3

Good - Higher Level Structure (Example from googledrive package):

# https://github.com/gaborcsardi/rencfaq#with-base-r
write_utf8 <- function(text, path = NULL) {
  # sometimes we use writeLines() basically to print something for a snapshot
  if (is.null(path)) {
    return(base::writeLines(text))
  }

  # step 1: ensure our text is utf8 encoded
  utf8 <- enc2utf8(text)
  upath <- enc2utf8(path)

  # step 2: create a connection with 'native' encoding
  # this signals to R that translation before writing
  # to the connection should be skipped
  con <- file(upath, open = "w+", encoding = "native.enc")
  withr::defer(close(con))

  # step 3: write to the connection with 'useBytes = TRUE',
  # telling R to skip translation to the native encoding
  base::writeLines(utf8, con = con, useBytes = TRUE)
}

Comments

More Advice on StackOverflow

Review: String Manipulation

Agenda

  • Unicode Discussion
  • Regex Discussion
  • Regex Exercises

Strings

In R, strings and characters are basically interchangeable

  • Arbitrary “bits of text” that can be stored in a vector
  • Don’t normally need to think about encoding

stringr provides basic tools for string manipulation

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

FAQ: Unicode Resources

FAQ: Regular Expression Tools

FAQ: Substrings and String Splitting

fruits <- c("apples and oranges and pears and bananas", 
            "pineapples and mangos and guavas")

stringr::str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
stringr::str_split_fixed(fruits, "and", n=2)
     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"            

Sub-Strings / Splitting

x <- "Baruch College, CUNY"
stringr::str_sub(x, end=6) # Includes endpoints
[1] "Baruch"
stringr::str_sub(x, start=-4) # Count from end
[1] "CUNY"
x <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
stringr::str_sub(x, end=-7) # Drop last _6_
[1] "Baruch College"   "Brooklyn College"

FAQ: Start and End Anchors

When to use the ^ and $ anchors?

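Start and end of a line. (In stringr, they match the start and end of the whole string unless you use regex(multiline = TRUE).)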

  • Very useful for structured text (computer log outputs)
  • In data analysis, a bit less useful
    • Applied to output of str_split
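A minimal sketch of how the anchors change what counts as a match. The log lines below are made up for illustration and are not from the course materials.

```r
library(stringr)

# Hypothetical log lines, invented for this example
log_lines <- c("ERROR: disk full", "INFO: retrying after ERROR", "ERROR")

# Unanchored: "ERROR" may appear anywhere in the string
anywhere <- str_detect(log_lines, "ERROR")

# ^ requires the match to begin at the start of the string
at_start <- str_detect(log_lines, "^ERROR")

# ^ and $ together require the entire string to match
whole <- str_detect(log_lines, "^ERROR$")

print(rbind(anywhere, at_start, whole))
```

Only the first and third lines start with "ERROR", and only the third consists of "ERROR" and nothing else.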

FAQ: Exclusion + Detection

x <- c("10 blue fish", "three wet goats")
stringr::str_detect(x, "[^0123456789]")
[1] TRUE TRUE

str_detect has a negate option:

stringr::str_detect(x, "[0-9]", negate=TRUE)
[1] FALSE  TRUE
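The first pattern returns TRUE for both strings because `[^0123456789]` matches any single non-digit character, and both strings contain at least one. A short sketch of two ways to ask "does this string contain no digits at all?", reusing the same `x`; the variable names are only for illustration.

```r
library(stringr)

x <- c("10 blue fish", "three wet goats")

# Matches as soon as ONE non-digit character is found
has_non_digit <- str_detect(x, "[^0-9]")

# Option 1: look for a digit, then negate the result
no_digits_negate <- str_detect(x, "[0-9]", negate = TRUE)

# Option 2: anchor so that EVERY character must be a non-digit
no_digits_anchor <- str_detect(x, "^[^0-9]*$")

print(data.frame(x, has_non_digit, no_digits_negate, no_digits_anchor))
```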

FAQ: str_detect vs str_match vs str_extract

  • str_detect: is there a ‘fit’?
  • str_extract: extract the whole ‘fit’
  • str_match: extract specific groups

x <- "Baruch College, CUNY is a wonderful place to work!"
stringr::str_detect(x, "(.*), CUNY")
[1] TRUE
stringr::str_extract(x, "(.*), CUNY")
[1] "Baruch College, CUNY"
stringr::str_match(x, "(.*), CUNY")
     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"

FAQ: Subset Selection + Indexing

str_extract(group=) is useful for complex data extraction.

x <- c("Michael Weylandt teaches STA9750", "KRR teaches STA9891")
pattern <- c("(.*) teaches (.*)")
stringr::str_extract(x, pattern, group=1)
[1] "Michael Weylandt" "KRR"             
stringr::str_extract(x, pattern, group=2)
[1] "STA9750" "STA9891"

(Not sure what negatives do here…)

Also allows named groups:

x <- c("Michael Weylandt teaches STA9750 on Thursday", "KRR teaches STA9891 on Wednesday")
pattern <- c("(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)")
stringr::str_match(x, pattern) |> as.data.frame()
                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday

FAQ: Homoglyphs

x <- c("Η", "H")
tolower(x)
[1] "η" "h"

Why?

uni_info <- Vectorize(function(x) Unicode::u_char_name(utf8ToInt(x)), "x")
uni_info(x)
                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H" 

Particularly nasty with dashes - lean on [[:punct:]] where possible.

x <- c("Em Dash —", "En Dash –", "Hyphen ‐")
stringr::str_remove(x, "[[:punct:]]") # Works
[1] "Em Dash " "En Dash " "Hyphen " 
stringr::str_remove(x, "-")  # Keyboard minus = Fail
[1] "Em Dash —" "En Dash –" "Hyphen ‐" 
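If you only want to target dashes rather than all punctuation, one possible alternative (not part of the original slides) is the Unicode "dash punctuation" property `\p{Pd}`, which the ICU regex engine behind stringr understands. A sketch, with the dash characters written as `\u` escapes so they are unambiguous:

```r
library(stringr)

# Em dash (U+2014), en dash (U+2013), hyphen (U+2010), keyboard hyphen-minus
x <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010", "Minus -")

# \p{Pd} matches any character in the Unicode dash-punctuation category
dash_removed <- str_remove(x, "\\p{Pd}")
print(dash_removed)
```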

FAQ: ? Symbol (Quantifiers)

Quantifiers (multiple matches):

  • .{a,b}: anywhere from a to b copies (inclusive)
  • .{0,b}: no more than b copies
  • .{a,}: at least a copies
  • .?: zero-or-one, same as .{0,1}
  • .*: zero-or-more, same as .{0,}
  • .+: one-or-more, same as .{1,}
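A small sketch of the quantifiers in action on a toy vector (the vector is made up for illustration). The anchors make each pattern describe the whole string, so the counts are easy to read off.

```r
library(stringr)

x <- c("a", "aa", "aaa", "aaaa", "b")

# Greedy: extracts as many copies as allowed, up to 3
two_to_three <- str_extract(x, "a{2,3}")

at_most_two  <- str_detect(x, "^a{0,2}$")  # zero, one, or two copies
at_least_two <- str_detect(x, "^a{2,}$")   # two or more copies
zero_or_one  <- str_detect(x, "^a?$")      # same as ^a{0,1}$
one_or_more  <- str_detect(x, "^a+$")      # same as ^a{1,}$

print(data.frame(x, two_to_three, at_most_two, at_least_two, zero_or_one, one_or_more))
```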

FAQ: stringr vs grep / grepl

Ultimately the same functionality, but stringr has a more consistent interface.

Conversion table online
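A brief sketch of a few base R / stringr pairs, using a made-up vector. The main thing to notice is the argument order: base R puts the pattern first, while stringr puts the string first, which is what makes it pipe-friendly.

```r
library(stringr)

x <- c("apple pie", "banana split", "cherry tart")

# Detection: grepl(pattern, x) vs str_detect(x, pattern)
same_detect <- identical(grepl("an", x), str_detect(x, "an"))

# Replace first match: sub(pattern, replacement, x) vs str_replace(x, pattern, replacement)
same_first <- identical(sub("a", "A", x), str_replace(x, "a", "A"))

# Replace all matches: gsub() vs str_replace_all()
same_all <- identical(gsub("a", "A", x), str_replace_all(x, "a", "A"))

print(c(same_detect, same_first, same_all))
```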

FAQ: Working Columnwise

All stringr functions work well in dplyr pipelines (“vectorized”):

library(dplyr); library(stringr)
df <- data.frame(lower_letters = letters)
df |> mutate(upper_letters = str_to_upper(lower_letters))
   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z

FAQ: How to Convert to UTF-8

If you know the source encoding:

iconv(STRING, from="latin1", to="UTF-8")

If you don’t know the source, ….
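For the known-encoding case, here is a small self-contained sketch. It builds the word "café" from raw Latin-1 bytes (an illustrative input, not from the course data) so the example does not depend on how this file itself is encoded.

```r
# "café" as Latin-1 bytes: the final byte 0xE9 is é in Latin-1
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Re-encode from Latin-1 to UTF-8
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# In UTF-8, é takes two bytes, so the byte length grows from 4 to 5
n_bytes_before <- length(charToRaw(latin1_text))
n_bytes_after  <- length(charToRaw(utf8_text))

# The Unicode code points are now c, a, f, é (U+00E9 = 233)
code_points <- utf8ToInt(utf8_text)
print(code_points)
```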

Review Activity

Regular Expression Practice

As of Thursday morning, the interactive exercises are on the fritz, so you will likely need to copy the exercises into a local RStudio session

Breakout Rooms

Room  Team            Room  Team
1     Team Mystic     5     Money Team + CWo.
2     Subway Metrics  6     Lit Group
3     Noise Busters   7     Cinephiles + VG
4     AI Impact Col   8

New Material

Agenda

  • Completion of Cocktail Exercise
  • Time Permitting: More Scraping
  • Time Permitting: Statistical Inference

Cocktail Exercise

First, we will complete the cocktail scraping exercise from last week.

Instructions and pointers can be found here

Breakout Rooms

Room  Team            Room  Team
1     Team Mystic     5     Money Team + CWo.
2     Subway Metrics  6     Lit Group
3     Noise Busters   7     Cinephiles + VG
4     AI Impact Col   8

Additional Scraping Exercise

Now, complete the second scraping exercise in your small groups

Breakout Rooms

Room  Team            Room  Team
1     Team Mystic     5     Money Team + CWo.
2     Subway Metrics  6     Lit Group
3     Noise Busters   7     Cinephiles + VG
4     AI Impact Col   8

Statistical Inference

Recall the basic theory of statistical tests - “goodness of fit”

  • Select a baseline model (‘null hypothesis’)
  • Select a quantity of interest (‘test statistic’)
  • Determine distribution of test statistic under null hypothesis
  • If observed test statistic is extreme (vis-a-vis null distribution of test statistic):
    • -> “doesn’t fit” and reject null

Statistical Theory

75+ Years of Theory

  • Pick a null + test statistic
    • Compute “null distribution”

\(Z\)-values, \(t\)-values, \(p\)-values, etc.

Typically requires ‘big math’

Alternative:

  • Let a computer do the hard work

Monte Carlo Simulation

Using a computer’s pseudo-random number generator (PRNG)

Repeat:

  • Generate \(X_1, X_2, X_3, \dots\)
  • Compute \(f(X_1), f(X_2), f(X_3), \dots\)

Sample average (LLN)

\[\frac{1}{n} \sum_{i=1}^n f(X_i) \to \mathbb{E}[f(X)]\]

Holds for arbitrary related quantities (quantiles, medians, variances)
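A minimal sketch of the idea, assuming \(X \sim \mathcal{N}(0, 1)\) and \(f(x) = x^2\), both chosen only for illustration. Then \(\mathbb{E}[f(X)] = 1\) and \(f(X)\) follows a \(\chi^2_1\) distribution, so the Monte Carlo answers can be compared with known values. The seed and the number of draws are arbitrary.

```r
set.seed(9750)   # arbitrary seed, for reproducibility only

n_sim <- 100000  # number of Monte Carlo draws (illustrative choice)
draws <- rnorm(n_sim)
f <- function(x) x^2

# Sample average approximates E[f(X)] = Var(X) = 1
mc_mean <- mean(f(draws))

# The same draws also approximate other quantities, such as the median of f(X)
mc_median   <- median(f(draws))
true_median <- qchisq(0.5, df = 1)  # f(X) = X^2 is chi-squared with 1 df

print(c(mc_mean = mc_mean, mc_median = mc_median, true_median = true_median))
```

Increasing `n_sim` tightens both approximations, at a rate of roughly \(1/\sqrt{n}\).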

Monte Carlo Simulation

Example: suppose we have \(X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)\) and we want to test \(H_0: \sigma=1\)

n <- 20
X <- rnorm(n, mean=0, sd=1.25)

sd(X)
[1] 1.093844

How to test?

The Math Way

Per Cochran’s theorem, under \(H_0\) the sample standard deviation satisfies \(S \sim \sqrt{\frac{\chi^2_{n-1}}{n-1}} = \frac{1}{\sqrt{n-1}} \chi_{n-1}\), i.e. \(S\) has a scaled \(\chi\) (not \(\chi^2\)) distribution

library(chi)
critical_value <- qchi(0.95, df=n-1) / sqrt(n-1)
critical_value
[1] 1.259564

So reject \(H_0\) at the 5% level if \(S\) exceeds the critical value (1.26). Here \(S = 1.09\), so we fail to reject \(H_0\)
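The same critical value can be obtained in base R without the chi package, since a \(\chi\) quantile is the square root of the corresponding \(\chi^2\) quantile. A sketch, with the observed value of sd(X) copied from the output above and variable names chosen for illustration:

```r
n <- 20
s_obs <- 1.093844  # sd(X) as reported above

# Square root of the chi-squared quantile, rescaled by the degrees of freedom
critical_value <- sqrt(qchisq(0.95, df = n - 1) / (n - 1))

# One-sided p-value: under H0, (n - 1) * S^2 is chi-squared with n - 1 df
p_value <- pchisq((n - 1) * s_obs^2, df = n - 1, lower.tail = FALSE)

reject <- s_obs > critical_value
print(c(critical_value = critical_value, p_value = p_value, reject = reject))
```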

The Computer Way

To get a critical value

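library(dplyr); library(purrr); library(tibble)

# Draw one sample under the null (sd = 1) and return its sample standard deviation.
# NB: the default n = 25 differs from the n = 20 used to draw X above;
# pass n = length(X) to match the observed sample size.
gen_sample_sd <- function(..., n=25, sd=1){
    sd(rnorm(n, mean=0, sd=sd))
}

# The simulation index is absorbed by `...`; it only drives the 1000 repetitions
tibble(simulation=1:1000) |>
    mutate(test_statistic_null = map_dbl(simulation, gen_sample_sd)) |>
    summarize(quantile(test_statistic_null, 0.95))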
# A tibble: 1 × 1
  `quantile(test_statistic_null, 0.95)`
                                  <dbl>
1                                  1.23

The Computer Way

To get a \(p\)-value:

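library(dplyr); library(purrr); library(tibble)

# Same null simulator as before (default n = 25, whereas X above has n = 20)
gen_sample_sd <- function(..., n=25, sd=1){
    sd(rnorm(n, mean=0, sd=sd))
}

# Monte Carlo p-value: share of null draws whose sd exceeds the observed sd(X)
tibble(simulation=1:1000) |>
    mutate(test_statistic_null = map_dbl(simulation, gen_sample_sd)) |>
    summarize(p_val = mean(test_statistic_null > sd(X)))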
# A tibble: 1 × 1
  p_val
  <dbl>
1 0.222

infer

The infer package automates much of this for common tests

Many examples

Looking Ahead

Upcoming Mini-Projects

  • MP#04: Exploring Recent US Political Shifts

Seeking suggestions for next semester

Upcoming

This Week:

  • MP#03 Peer Feedback
  • Pre Assignment

Longer Term:

  • MP#04
  • Final Presentations

Life Tip of the Week

Register to Vote

If you want to vote in the upcoming NYC Mayoral Primary, it’s time to register to vote:

https://www.vote.nyc/page/register-vote

Primary voting begins in mid-June: need to register 10 days before