STA 9750
Week 12 Update
Tue 2025-11-25
Thu 2025-11-20

Michael Weylandt

Agenda

Today

  • Administrative Business
  • Brief Review: String Manipulation
  • New Material:
    • Manipulating HTML Text into Data
    • Statistical Inference
  • Wrap Up and Looking Ahead

Orientation

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R
    • Flat Files and APIs ✅
    • Web Scraping ✅
    • Cleaning and Processing Text ⬅️
  • Statistical Modeling in R

Administrative Business

STA 9750 Mini-Project #04

MP#04 online now

  • Due 2025-11-21 at 11:59pm ET (\(\approx\) 3 weeks - 2 remaining)
  • Topic: Political Maps
    • Technical Subject: Table scraping from Wikipedia
  • Format:
    • Political Talking Head (Optional - see notes)
    • GitHub post AND Brightspace submission

STA 9750 Mini-Project #03

MP#03 peer feedback in process

Going Forward

Pre-Assignments

Brightspace - Wednesdays at 11:45

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading

Next (and final!) pre-assignment is 2025-12-03 at 11:59pm ET

I am behind on reading PA comments:

  • For anything urgent, please contact me directly🙏

Grading

I returned:

  • Mid-Term Check-In Feedback

I still owe you:

  • MP#02 peer meta-review fixes

I will owe you:

  • MP#03 grades and meta-grades

Course Support

  • Synchronous
    • Office Hours 2x / week
      • MW Office Hours on Tuesdays + Thursdays
  • Asynchronous
    • Piazza (\(\approx 20\) minute average response time)

Upcoming

Semester end is coming quickly!

  • MP#04
  • Final presentations
  • Final reports

That’s it!

Feedback Survey

I have posted a course feedback survey at

https://baruch.az1.qualtrics.com/jfe/form/SV_9uyZ4YFsrcRRPIG

Comments very welcome (but not required)

Next Semester Topics

Possible MP ideas:

  • NYC Open Data
  • Sports (Baseball?)
  • Spotify / Music
  • Healthcare / Pharmaceutical (might be tricky…)
  • Video Games
  • Quant Finance / Time Series?
  • Baruch Demographics
  • Job Market
  • Real Estate

Comments

Bad - Trivial:

Bad - Opaque:

Bad - Redundant / Explaining Code
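The original examples for these three categories are not reproduced here. As a stand-in, a minimal sketch of what each kind of bad comment tends to look like; the `prices` data and the tax multiplier are hypothetical and chosen only for illustration:

```r
# Hypothetical data, used only for illustration
prices <- c(10, 12.5, 8)

# Bad - trivial: the comment repeats what the code already says
n <- length(prices)  # get the length of prices

# Bad - opaque: the comment points at context the reader cannot see
total <- sum(prices) * 1.08875  # fix for the thing from last week

# Bad - redundant / explaining code: narrates the syntax step by step
# loop over prices and add each price to running_total
running_total <- 0
for (p in prices) running_total <- running_total + p
```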

Comments

Good - Purpose (“Business Logic”):
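A minimal sketch of a purpose-style comment, which records why the code does something rather than what it does. The data and the business rule are hypothetical, invented for illustration:

```r
# Hypothetical data and rule, used only for illustration
prices <- c(10, 12.5, 8)
sales_tax_rate <- 0.08875

# Tax is applied to the order total rather than to each item, so that
# rounding to cents happens exactly once per order
order_total <- round(sum(prices) * (1 + sales_tax_rate), 2)
order_total
```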

Good - Higher Level Structure (Example from googledrive package):

Comments

More Advice on StackOverflow

Review: String Manipulation

Agenda

  • Unicode Discussion
  • Regex Discussion
  • Regex Exercises

Strings

In R, strings and characters are basically interchangeable

  • Arbitrary “bits of text” that can be stored in a vector
  • Don’t normally need to think about encoding

stringr provides basic tools for string manipulation

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

FAQ: Unicode Resources

FAQ: Regular Expression Tools

FAQ: Substrings and String Splitting

[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"            
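The calls behind this output are not shown. A minimal sketch that produces output of this shape, assuming an input vector named `fruits` (a hypothetical name) and the split patterns `" and "` and `"and"`:

```r
library(stringr)

# Hypothetical input, chosen to match the printed output
fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)

# Split at every " and ": returns a list with one character vector per input
pieces <- str_split(fruits, " and ")
pieces

# Split into at most n = 2 pieces: returns a character matrix.
# Splitting on "and" (no spaces) leaves the surrounding spaces in the pieces.
first_split <- str_split_fixed(fruits, "and", n = 2)
first_split
```

`str_split` is the natural choice when the number of pieces varies by row; `str_split_fixed` is convenient when a fixed number of columns is wanted.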

Sub-Strings / Splitting

[1] "Baruch"
[1] "CUNY"
[1] "Baruch College"   "Brooklyn College"
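The code for this output is not shown. One way to obtain these three results is with `str_sub`, sketched here under the assumption that the inputs were the strings below (the variable names are hypothetical):

```r
library(stringr)

school <- "Baruch College, CUNY"

first_six <- str_sub(school, end = 6)     # first six characters
last_four <- str_sub(school, start = -4)  # last four characters
first_six
last_four

# Negative positions count from the end of the string,
# so end = -7 drops the six-character suffix ", CUNY"
schools <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
short_names <- str_sub(schools, end = -7)
short_names
```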

FAQ: Start and End Anchors

When to use the ^ and $ anchors?

They match the start and end of a string (or of each line, with regex(multiline = TRUE)).

  • Very useful for structured text (computer log outputs)
  • In data analysis, a bit less useful
    • Applied to output of str_split
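A short sketch of both uses mentioned above: anchors on log-style text, and anchors applied to the pieces returned by `str_split`. The log lines and field names are made up for illustration:

```r
library(stringr)

# Hypothetical log lines
log_lines <- c("ERROR: disk full", "INFO: retrying after ERROR", "INFO: done")

anywhere <- str_detect(log_lines, "ERROR")    # match anywhere in the string
at_start <- str_detect(log_lines, "^ERROR")   # match only at the start
at_end   <- str_detect(log_lines, "done$")    # match only at the end

# Anchors applied to the output of str_split:
# keep only the fields that begin with "id="
fields <- str_split("id=7;name=Ana;idx=3", ";")[[1]]
id_fields <- str_subset(fields, "^id=")
id_fields
```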

FAQ: Exclusion + Detection

[1] TRUE TRUE

str_detect has a negate option:

[1] FALSE  TRUE
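The inputs for these two results are not shown. A sketch of the likely point, using hypothetical inputs: a negated character class such as `[^C]` asks whether the string contains some character other than "C", which is not the same as asking whether "C" is absent.

```r
library(stringr)

# Hypothetical inputs
systems <- c("CUNY", "SUNY")

# Both strings contain at least one character that is not "C"
has_non_c <- str_detect(systems, "[^C]")
has_non_c

# negate = TRUE flips the result: TRUE where "C" does not appear at all
lacks_c <- str_detect(systems, "C", negate = TRUE)
lacks_c
```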

FAQ: str_detect vs str_match vs str_extract

  • str_detect: is there a ‘fit’?
  • str_extract: extract the whole ‘fit’
  • str_match: extract specific groups
[1] TRUE
[1] "Baruch College, CUNY"
     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"
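A sketch that yields output of this shape, assuming the input string and the pattern `"(.*), CUNY"` (both are guesses consistent with the printed results, not taken from the slide):

```r
library(stringr)

school <- "Baruch College, CUNY"
pattern <- "(.*), CUNY"   # one capture group: everything before ", CUNY"

found <- str_detect(school, pattern)    # logical: does the pattern fit?
whole <- str_extract(school, pattern)   # character: the full match
groups <- str_match(school, pattern)    # matrix: full match, then each group

found
whole
groups
```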

FAQ: Subset Selection + Indexing

str_extract(group=) is useful for complex data extraction.

[1] "Michael Weylandt" "KRR"             
[1] "STA9750" "STA9891"
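A sketch of group-based extraction that gives the two vectors above. The input sentences are taken from the table printed later on this slide; the pattern is an assumption, and the `group` argument of `str_extract` requires stringr 1.5.0 or later:

```r
library(stringr)

teaching <- c(
  "Michael Weylandt teaches STA9750 on Thursday",
  "KRR teaches STA9891 on Wednesday"
)

# Assumed pattern: three capture groups separated by fixed words
pattern <- "(.*) teaches (.*) on (.*)"

instructors <- str_extract(teaching, pattern, group = 1)  # first group only
courses <- str_extract(teaching, pattern, group = 2)      # second group only

instructors
courses
```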

(Not sure what negatives do here…)

str_match also allows named groups:

                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday
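A sketch that produces a table of this shape with named capture groups. The pattern is an assumption; named columns in the result of `str_match` require stringr 1.5.0 or later:

```r
library(stringr)

teaching <- c(
  "Michael Weylandt teaches STA9750 on Thursday",
  "KRR teaches STA9891 on Wednesday"
)

# (?<name>...) gives each capture group a name
named_pattern <- "(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)"

# str_match returns a character matrix; the full match has no name,
# so as.data.frame labels that column V1
matches <- str_match(teaching, named_pattern)
match_table <- as.data.frame(matches)
match_table
```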

FAQ: Homoglyphs

[1] "η" "h"
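A sketch of how two characters that look the same can lowercase differently. It assumes the inputs were a Greek capital eta (U+0397) and a Latin capital H, which is consistent with the output above:

```r
library(stringr)

# Written with escapes so the difference is visible in the source
look_alikes <- c("\u0397", "H")   # Greek capital eta, Latin capital H

lowered <- str_to_lower(look_alikes)
lowered

# The two inputs render almost identically but are different code points
code_points <- unname(sapply(look_alikes, utf8ToInt))
code_points
same_character <- look_alikes[1] == look_alikes[2]
same_character
```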

Why?

                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H" 

Particularly nasty with dashes - lean on [[:punct:]] where possible.

[1] "Em Dash " "En Dash " "Hyphen " 
[1] "Em Dash —" "En Dash –" "Hyphen ‐" 
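A sketch of why the character class helps, assuming the inputs were the three strings in the second line of output. A literal ASCII hyphen matches none of them, while `[[:punct:]]` covers all three dash characters:

```r
library(stringr)

# Em dash (U+2014), en dash (U+2013), and Unicode hyphen (U+2010)
dashes <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010")

# The ASCII hyphen-minus "-" is a different character from all three
ascii_hits <- str_detect(dashes, "-")
ascii_hits

# [[:punct:]] matches any punctuation character, including every dash variant
cleaned <- str_remove_all(dashes, "[[:punct:]]")
cleaned
```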

FAQ: ? Symbol (Quantifiers)

Quantifiers (multiple matches):

  • .{a,b}: anywhere from a to b copies (inclusive)
  • .{0,b}: no more than b copies (write the lower bound explicitly; ICU, which stringr uses, documents no {,b} form)
  • .{a,}: at least a copies
  • .?: zero-or-one, same as .{0,1}
  • .*: zero-or-more, same as .{0,}
  • .+: one-or-more, same as .{1,}
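A small sketch of the quantifiers in action. The test strings are made up, and the anchors `^` and `$` force each pattern to account for the whole string:

```r
library(stringr)

# Hypothetical test strings with zero to three copies of "a"
x <- c("ct", "cat", "caat", "caaat")

zero_or_one <- str_detect(x, "^ca?t$")       # 0 or 1 "a"
zero_or_more <- str_detect(x, "^ca*t$")      # any number of "a"
one_or_more <- str_detect(x, "^ca+t$")       # at least 1 "a"
two_to_three <- str_detect(x, "^ca{2,3}t$")  # between 2 and 3 "a"
at_most_one <- str_detect(x, "^ca{0,1}t$")   # same as "?"

rbind(zero_or_one, zero_or_more, one_or_more, two_to_three, at_most_one)
```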

FAQ: stringr vs grep / grepl

Ultimately the same functionality, but stringr has a more consistent interface.

Conversion table online
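A brief side-by-side sketch with made-up inputs. The main practical difference is argument order: base R functions take the pattern first, while stringr functions take the string first, which is what makes them pipe-friendly.

```r
library(stringr)

# Hypothetical inputs
schools <- c("Baruch College, CUNY", "Columbia University")

# Detection: pattern first in base R, string first in stringr
base_detect <- grepl("CUNY", schools)
stringr_detect <- str_detect(schools, "CUNY")

# Replacement of the first match
base_replace <- sub("CUNY", "City University of New York", schools)
stringr_replace <- str_replace(schools, "CUNY", "City University of New York")

identical(base_detect, stringr_detect)
identical(base_replace, stringr_replace)
```

The two families use different regex engines by default, so results can diverge for more exotic patterns even though they agree on simple ones like these.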

FAQ: Working Columnwise

All stringr functions work well in dplyr pipelines (“vectorized”):

   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z
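A sketch of a pipeline that yields a table like the one above, assuming it was built from R's built-in `letters` vector with `str_to_upper` inside `mutate`:

```r
library(dplyr)
library(stringr)

# str_to_upper is vectorized, so it transforms the whole column at once
letter_table <- data.frame(lower_letters = letters) |>
  mutate(upper_letters = str_to_upper(lower_letters))

head(letter_table)
```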

FAQ: How to Convert to UTF-8

If you know the source encoding:

If you don’t know the source, ….
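The code for these two cases is not shown. A minimal sketch of one possible approach: base R's `iconv` when the source encoding is known, and `stringi::stri_enc_detect` as a best-effort guess when it is not. The Latin-1 input below is made up for illustration.

```r
# Hypothetical input: "café" stored as Latin-1 bytes (0xE9 is e-acute)
latin1_bytes <- "caf\xe9"

# Known source encoding: convert explicitly
utf8_text <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
utf8ToInt(utf8_text)   # code points after conversion

# Unknown source encoding: ask for ranked guesses.
# Guesses are heuristic and unreliable on short inputs like this one.
guesses <- stringi::stri_enc_detect(charToRaw(latin1_bytes))
guesses[[1]]
```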

Review Activity

Regular Expression Practice

As of Thursday morning, the online exercises are on the fritz, so you will likely need to copy the exercises into a local RStudio session

Breakout Rooms

Room  Team             Room  Team
1     Team Mystic      5     Money Team + CWo.
2     Subway Metrics   6     Lit Group
3     Noise Busters    7     Cinephiles + VG
4     AI Impact Col    8

New Material

Agenda

  • Completion of Cocktail Exercise
  • Time Permitting: More Scraping
  • Time Permitting: Statistical Inference

Cocktail Exercise

First, we will complete the cocktail scraping exercise from last week.

Instructions and pointers can be found here

Breakout Rooms

Room  Team             Room  Team
1     Team Mystic      5     Money Team + CWo.
2     Subway Metrics   6     Lit Group
3     Noise Busters    7     Cinephiles + VG
4     AI Impact Col    8

Additional Scraping Exercise

Now, complete the second scraping exercise in your small groups

Breakout Rooms

Room  Team             Room  Team
1     Team Mystic      5     Money Team + CWo.
2     Subway Metrics   6     Lit Group
3     Noise Busters    7     Cinephiles + VG
4     AI Impact Col    8

Statistical Inference

Recall the basic theory of statistical tests - “goodness of fit”

  • Select a baseline model (‘null hypothesis’)
  • Select a quantity of interest (‘test statistic’)
  • Determine distribution of test statistic under null hypothesis
  • If observed test statistic is extreme (vis-a-vis null distribution of test statistic):
    • -> “doesn’t fit” and reject null

Statistical Theory

75+ Years of Theory

  • Pick a null + test statistic
    • Compute “null distribution”

\(Z\)-values, \(t\)-values, \(p\)-values, etc.

Typically requires ‘big math’

Alternative:

  • Let a computer do the hard work

Monte Carlo Simulation

Using a computer’s pseudo-random number generator (PRNG)

Repeat:

  • Generate \(X_1, X_2, X_3, \dots\)
  • Compute \(f(X_1), f(X_2), f(X_3), \dots\)

Sample average (LLN)

\[\frac{1}{n} \sum_{i=1}^n f(X_i) \to \mathbb{E}[f(X)]\]

Holds for arbitrary related quantities (quantiles, medians, variances)
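A minimal sketch of the recipe, using \(X \sim \mathcal{N}(0, 1)\) and \(f(x) = x^2\) as an illustrative choice (not from the slides). Here \(\mathbb{E}[f(X)] = 1\), and \(f(X)\) follows a \(\chi^2_1\) distribution, so both the mean and the median can be checked against known values:

```r
set.seed(9750)   # arbitrary seed, for reproducibility only

n_draws <- 100000
f <- function(x) x^2

x <- rnorm(n_draws)   # generate X_1, ..., X_n
fx <- f(x)            # compute f(X_1), ..., f(X_n)

mc_mean <- mean(fx)       # approximates E[f(X)] = 1
mc_median <- median(fx)   # approximates the chi-squared(1) median

c(mc_mean = mc_mean, exact_mean = 1)
c(mc_median = mc_median, exact_median = qchisq(0.5, df = 1))
```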

Monte Carlo Simulation

Example: suppose we have \(X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)\) and we want to test \(H_0: \sigma=1\)

n <- 20
X <- rnorm(n, mean=0, sd=1.25)

sd(X)
[1] 1.353189

How to test?

The Math Way

Per Cochran’s theorem, under \(H_0\), \(S \sim \sqrt{\frac{\chi^2_{n-1}}{n-1}} = \frac{1}{\sqrt{n-1}} \chi_{n-1}\), so \(\sqrt{n-1}\,S\) has a \(\chi\) (not \(\chi^2\)) distribution with \(n-1\) degrees of freedom

library(chi)
critical_value <- qchi(0.95, df=n-1) / sqrt(n-1)
critical_value
[1] 1.259564

So reject \(H_0\) if \(S\) is above the critical value (1.26); here \(S \approx 1.35\), so we reject
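The same critical value can be obtained without the chi package, since a \(\chi\) quantile is the square root of the corresponding \(\chi^2\) quantile. A sketch in base R, reusing the observed standard deviation printed earlier:

```r
n <- 20

# 95% quantile of S under H0: sqrt of the chi-squared quantile over (n - 1)
critical_value <- sqrt(qchisq(0.95, df = n - 1) / (n - 1))
critical_value

# Observed statistic as reported above
observed_sd <- 1.353189
reject_null <- observed_sd > critical_value
reject_null
```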

The Computer Way

To get a critical value

# A tibble: 1 × 1
  `quantile(test_statistic_null, 0.95)`
                                  <dbl>
1                                  1.26
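The simulation code is not shown. A minimal sketch of one way to get this critical value by Monte Carlo, in base R rather than the tibble pipeline the output suggests; the seed and the number of replications are arbitrary choices:

```r
set.seed(9750)   # arbitrary seed, for reproducibility only

n <- 20
n_sims <- 10000

# Simulate the test statistic (sample sd) many times under H0: sigma = 1
test_statistic_null <- replicate(n_sims, sd(rnorm(n, mean = 0, sd = 1)))

# The 95% quantile of the simulated null distribution is the critical value
mc_critical_value <- unname(quantile(test_statistic_null, 0.95))
mc_critical_value
```

The result should land close to the value from the math-based calculation, with some simulation noise that shrinks as `n_sims` grows.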

The Computer Way

To get a \(p\)-value:

# A tibble: 1 × 1
  p_val
  <dbl>
1 0.008
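A sketch of the corresponding \(p\)-value calculation: the fraction of simulated null statistics at least as large as the observed one. It uses base R, an arbitrary seed, and the observed standard deviation printed earlier, so the number it gives need not match the one shown above:

```r
set.seed(9750)   # arbitrary seed, for reproducibility only

n <- 20
n_sims <- 10000
observed_sd <- 1.353189   # observed statistic as reported earlier

# Null distribution of the sample sd under H0: sigma = 1
test_statistic_null <- replicate(n_sims, sd(rnorm(n, mean = 0, sd = 1)))

# One-sided Monte Carlo p-value
p_val <- mean(test_statistic_null >= observed_sd)
p_val

# Exact counterpart from the chi-squared distribution, for comparison
exact_p <- pchisq((n - 1) * observed_sd^2, df = n - 1, lower.tail = FALSE)
exact_p
```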

infer

The infer package automates much of this for common tests

Many examples

Looking Ahead

Upcoming Mini-Projects

  • MP#04: Political Maps

Seeking suggestions for next semester

Upcoming

This Week:

  • MP#03 Peer Feedback
  • Pre Assignment

Longer Term:

  • MP#04
  • Final Presentations

Life Tip of the Week

Register to Vote

If you want to vote in the upcoming NYC Mayoral Primary, it’s time to register to vote:

https://www.vote.nyc/page/register-vote

Primary voting begins in mid-June: need to register 10 days before