STA 9750
Week 12 Update
Tue 2025-11-25
Thu 2025-11-20

Michael Weylandt

Agenda

Today

Administrative Business
Brief Review: String Manipulation
New Material:
- Manipulating HTML Text into Data
- Statistical Inference
Wrap Up and Looking Ahead

Orientation

Communicating Results (quarto) ✅
R Basics ✅
Data Manipulation in R ✅
Data Visualization in R ✅
Getting Data into R
- Flat Files and APIs ✅
- Web Scraping ✅
- Cleaning and Processing Text ⬅️
Statistical Modeling in R

Administrative Business

STA 9750 Mini-Project #04

MP#04 online now

Due 2025-11-21 at 11:59pm ET ($\approx$ 3 weeks - 2 remaining)
Topic: Political Maps
- Technical Subject: Table scraping from Wikipedia
Format:
- Political Talking Head (Optional - see notes)
- GitHub post AND Brightspace submission

STA 9750 Mini-Project #03

MP#03 peer feedback in process

Due Sunday evening
Evaluate per rubric
Make sure to follow feedback template

Going Forward

Pre-Assignments

Brightspace - Wednesdays at 11:45

Reading, typically on course website
Brightspace auto-grades
- I have to manually change to completion grading

Next (and final!) pre-assignment is 2025-12-03 at 11:59pm ET

I am behind on reading PA comments:

For anything urgent, please contact me directly🙏

Grading

I returned:

Mid-Term Check-In Feedback

I still owe you:

MP#02 peer meta-review fixes

I will owe you:

MP#03 grades and meta-grades

Course Support

Synchronous
- Office Hours 2x / week
  - MW Office Hours on Tuesdays + Thursdays
Asynchronous
- Piazza ($\approx 20$ minute average response time)

Upcoming

Semester end is coming quickly!

MP#04
Final presentations
Final reports

That’s it!

Feedback Survey

I have posted a course feedback survey at

https://baruch.az1.qualtrics.com/jfe/form/SV_9uyZ4YFsrcRRPIG

Comments very welcome (but not required)

Next Semester Topics

Possible MP ideas:

NYC Open Data
Sports (Baseball?)
Spotify / Music
Healthcare / Pharmaceutical (might be tricky…)
Video Games
Quant Finance / Time Series?
Baruch Demographics
Job Market
Real Estate

Comments

Bad - Trivial:

Bad - Opaque:

Bad - Redundant / Explaining Code
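The three "bad" categories above can be illustrated with a short sketch. These snippets are hypothetical examples written for illustration; they are not the examples from the original slide.

```r
# Trivial: restates what the code plainly says
x <- 5  # set x to 5

# Opaque: meaningless without context the reader does not have
threshold <- 0.37  # per the meeting

# Redundant / explaining code: narrates the syntax rather than the intent
total <- sum(c(1, 2, 3))  # call sum() on the vector c(1, 2, 3)
```

In each case the comment could be deleted without the reader losing anything.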

Comments

Good - Purpose (“Business Logic”):

Good - Higher Level Structure (Example from googledrive package):
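The googledrive snippet itself is not reproduced here. As a stand-in, the following hypothetical sketch shows both "good" styles: a comment that records the purpose of a rule, and section markers that expose the structure of a script. The grading policy, names, and numbers are invented for illustration.

```r
# Purpose ("business logic"): late work loses 10% per day, capped at 50%,
# so that a very late submission is still worth handing in.
late_penalty <- function(days_late) pmin(0.10 * days_late, 0.50)

# ---- 1. Load data ----
scores <- data.frame(
  student   = c("A", "B"),
  raw       = c(90, 80),
  days_late = c(0, 3)
)

# ---- 2. Apply late policy ----
scores$final <- scores$raw * (1 - late_penalty(scores$days_late))
```

The first comment explains why the rule exists; the section markers let a reader skim the script without reading every line.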

Comments

More Advice on StackOverflow

Review: String Manipulation

Agenda

Unicode Discussion
Regex Discussion
Regex Exercises

Strings

In R, strings and characters are basically interchangeable

Arbitrary “bits of text” that can be stored in a vector
Don’t normally need to think about encoding

stringr provides basic tools for string manipulation

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

FAQ: Unicode Resources

Unicode Tables: unicodeplus.com/
Taco Emoji History
Taco Emoji Controversy

FAQ: Regular Expression Tools

Testing Regular Expressions Interactively: regex101.com/
Alternative regexr.com/
Automated Regular Expression Builder: regex-generator
AI Regexp Builder: hregexgo.com/

FAQ: Substrings and String Splitting
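The outputs on this slide are consistent with calls like the following. This is a reconstruction: the input vector and the exact calls are assumptions, chosen because they produce the printed results.

```r
library(stringr)

# Assumed input: two "and"-separated lists of fruit
fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")

# Split on every " and ": returns a list with one character vector per input
pieces <- str_split(fruits, " and ")

# Split only at the first "and": returns a character matrix with n = 2 columns.
# The pattern has no surrounding spaces, so the spaces stay in the pieces.
first_split <- str_split_fixed(fruits, "and", n = 2)
```

`str_split()` is the general tool; `str_split_fixed()` is convenient when every string should yield the same number of pieces.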

[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"

     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"

Sub-Strings / Splitting
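One way to obtain the three outputs on this slide is position-based extraction with `str_sub()`. The inputs and calls below are assumptions; other calls (for example `str_remove()`) would give the same results.

```r
library(stringr)

school <- "Baruch College, CUNY"

# First six characters
first_six <- str_sub(school, start = 1, end = 6)   # "Baruch"

# Negative positions count from the end: last four characters
last_four <- str_sub(school, start = -4)           # "CUNY"

# Vectorized: drop the trailing ", CUNY" (six characters) from each element
schools <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
without_suffix <- str_sub(schools, end = -7)
```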

[1] "Baruch"

[1] "CUNY"

[1] "Baruch College"   "Brooklyn College"

FAQ: Start and End Anchors

When to use the ^ and $ anchors?

Start and end of a string (or of each line, in multiline mode).

Very useful for structured text (computer log outputs)
In data analysis, a bit less useful
- Applied to output of str_split
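A minimal sketch of both uses mentioned above, with made-up log lines and a made-up key=value record:

```r
library(stringr)

# Structured text: hypothetical log lines
log_lines <- c("ERROR: disk full", "INFO: retry after ERROR", "WARN: done")

# "^" pins the match to the start, so the second line does not count
starts_with_error <- str_detect(log_lines, "^ERROR")

# "$" pins the match to the end
ends_with_done <- str_detect(log_lines, "done$")

# Anchors applied to the pieces produced by str_split()
fields <- str_split("id=42;status=ok", ";")[[1]]
is_status_field <- str_detect(fields, "^status=")
```

Without the anchor, `"ERROR"` would also match the second log line, which merely mentions an error.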

FAQ: Exclusion + Detection

[1] TRUE TRUE

str_detect has a negate option:

[1] FALSE  TRUE
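The two outputs above are consistent with the following sketch; the input strings and patterns are assumptions. It shows a common pitfall: a negated character class asks "is there any character that is not B?", which is almost always true, whereas `negate = TRUE` asks "does the pattern fail to match?".

```r
library(stringr)

colleges <- c("Baruch College", "Hunter College")

# "[^B]" matches any single character other than "B": true for both strings
has_non_b <- str_detect(colleges, "[^B]")

# negate = TRUE flips the result of detecting "B"
lacks_b <- str_detect(colleges, "B", negate = TRUE)
```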

FAQ: `str_detect` vs `str_match` vs `str_extract`

str_detect: is there a ‘fit’?
str_extract: extract the whole ‘fit’
str_match: extract specific groups
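The three functions can be compared on one string and one pattern. The input and pattern below are assumptions, chosen to be consistent with the outputs on this slide.

```r
library(stringr)

school  <- "Baruch College, CUNY"
pattern <- "(.*), CUNY"   # one capture group: everything before ", CUNY"

# Logical: does the pattern match anywhere?
is_match <- str_detect(school, pattern)

# Character: the full text of the match
full_match <- str_extract(school, pattern)

# Matrix: column 1 is the full match, later columns are the capture groups
groups <- str_match(school, pattern)
```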

[1] TRUE

[1] "Baruch College, CUNY"

     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"

FAQ: Subset Selection + Indexing

str_extract(group=) is useful for complex data extraction.
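A sketch of group-based extraction, assuming stringr 1.5.0 or later (where `str_extract()` accepts a `group` argument). The input sentences are inferred from the data frame printed further down; the pattern is an assumption.

```r
library(stringr)

teaching <- c("Michael Weylandt teaches STA9750 on Thursday",
              "KRR teaches STA9891 on Wednesday")

# Three capture groups: instructor, course, weekday
pattern <- "(.+) teaches (\\w+) on (\\w+)"

# group = k returns only the k-th capture group of each match
instructors <- str_extract(teaching, pattern, group = 1)
courses     <- str_extract(teaching, pattern, group = 2)
```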

[1] "Michael Weylandt" "KRR"

[1] "STA9750" "STA9891"

(Not sure what negatives do here…)

Also allows named groups:
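A sketch with named groups, again assuming stringr 1.5.0 or later. Here `str_match()` labels the matrix columns with the group names, and `as.data.frame()` turns the unnamed full-match column into `V1`. The pattern is an assumption consistent with the printed table.

```r
library(stringr)

teaching <- c("Michael Weylandt teaches STA9750 on Thursday",
              "KRR teaches STA9891 on Wednesday")

# (?<name>...) gives each capture group a name
named_pattern <- "(?<instructor>.+) teaches (?<course>\\w+) on (?<weekday>\\w+)"

schedule <- as.data.frame(str_match(teaching, named_pattern))
```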

                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday

FAQ: Homoglyphs
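Two characters can look identical yet be different code points. A minimal sketch, assuming the slide lower-cased a Greek capital eta and a Latin capital H (written with Unicode escapes so the difference survives copy and paste):

```r
library(stringr)

look_alikes <- c("\u0397", "H")   # GREEK CAPITAL LETTER ETA, LATIN CAPITAL LETTER H

# Lower-casing reveals the difference
lowered <- str_to_lower(look_alikes)

# Code points confirm it: 919 (U+0397) versus 72 (U+0048)
code_points <- sapply(look_alikes, utf8ToInt, USE.NAMES = FALSE)
```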

[1] "η" "h"

Why?

                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H"

Particularly nasty with dashes - lean on [[:punct:]] where possible.
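A sketch of the dash problem, assuming inputs that end in an em dash (U+2014), an en dash (U+2013), and a Unicode hyphen (U+2010). None of these is the ASCII hyphen-minus typed on a keyboard.

```r
library(stringr)

dashes <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010")

# The punctuation class covers all three dash characters
removed_punct <- str_remove(dashes, "[[:punct:]]")

# A literal ASCII "-" matches none of them, so nothing is removed
removed_ascii <- str_remove(dashes, "-")
```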

[1] "Em Dash " "En Dash " "Hyphen "

[1] "Em Dash —" "En Dash –" "Hyphen ‐"

FAQ: `?` Symbol (Quantifiers)

Quantifiers (multiple matches):

.{a, b}: anywhere from a to b copies (inclusive)
.{, b}: no more than b copies
.{a,}: at least a copies
.?: zero-or-one, same as .{0,1}
.*: zero-or-more, same as .{0,}
.+: one-or-more, same as .{1,}
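The quantifiers above can be tried on a toy string. In this sketch the quantified item is the letter `a` rather than `.`; quantifiers are greedy by default, so each takes as many copies as it is allowed.

```r
library(stringr)

two_to_three <- str_extract("aaaa", "a{2,3}")   # at most three copies
at_least_two <- str_extract("aaaa", "a{2,}")    # all four copies
zero_or_one  <- str_extract("aaaa", "a?")       # a single copy
zero_or_more <- str_extract("aaaa", "a*")       # all four copies

# The difference between * and + shows up when there is nothing to match
star_on_b <- str_extract("bbb", "a*")   # empty string: zero copies is a match
plus_on_b <- str_extract("bbb", "a+")   # NA: at least one copy is required
```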

FAQ: `stringr` vs `grep` / `grepl`

Ultimately the same functionality, but stringr has a more consistent interface.

Conversion table online
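A short side-by-side sketch with a made-up input. The main interface difference is argument order: base R puts the pattern first, while stringr puts the string first, which is what makes it pipe-friendly.

```r
library(stringr)

fruit <- c("apple", "banana", "cherry")

# Detection: grepl(pattern, x) versus str_detect(string, pattern)
base_detect    <- grepl("an", fruit)
stringr_detect <- str_detect(fruit, "an")

# Replace first match: sub(pattern, replacement, x) versus
# str_replace(string, pattern, replacement)
base_replace    <- sub("a", "A", fruit)
stringr_replace <- str_replace(fruit, "a", "A")
```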

FAQ: Working Columnwise

All stringr functions work well in dplyr pipelines (“vectorized”):
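The table on this slide is consistent with a pipeline like the following; the exact code is an assumption. It uses the built-in `letters` vector and the native pipe (R 4.1 or later).

```r
library(dplyr)
library(stringr)

# str_to_upper() transforms the whole column at once inside mutate()
letters_df <- data.frame(lower_letters = letters) |>
  mutate(upper_letters = str_to_upper(lower_letters))
```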

   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z

FAQ: How to Convert to UTF-8

If you know the source encoding:
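A minimal sketch using base R's `iconv()`, assuming the source encoding is Latin-1. The example bytes are constructed by hand so the snippet behaves the same regardless of how the file is saved.

```r
# "café" as Latin-1 bytes: the final byte 0xE9 is e-acute in Latin-1
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Re-encode from the known source encoding to UTF-8
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# validUTF8() is a quick sanity check before and after
was_valid <- validUTF8(latin1_text)
is_valid  <- validUTF8(utf8_text)
```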

If you don’t know the source, ….

Review Activity

Regular Expression Practice

As of Thursday morning, this is on the fritz, so you will likely need to copy the exercises into your local RStudio

Breakout Rooms

Room	Team	Room	Team
1	Team Mystic	5	Money Team + CWo.
2	Subway Metrics	6	Lit Group
3	Noise Busters	7	Cinephiles + VG
4	AI Impact Col	8

New Material

Agenda

Completion of Cocktail Exercise
Time Permitting: More Scraping
Time Permitting: Statistical Inference

Cocktail Exercise

First, we will complete the cocktail scraping exercise from last week.

Instructions and pointers can be found here

Breakout Rooms

Room	Team	Room	Team
1	Team Mystic	5	Money Team + CWo.
2	Subway Metrics	6	Lit Group
3	Noise Busters	7	Cinephiles + VG
4	AI Impact Col	8

Additional Scraping Exercise

Now, complete the second scraping exercise in your small groups

Breakout Rooms

Room	Team	Room	Team
1	Team Mystic	5	Money Team + CWo.
2	Subway Metrics	6	Lit Group
3	Noise Busters	7	Cinephiles + VG
4	AI Impact Col	8

Statistical Inference

Recall the basic theory of statistical tests - “goodness of fit”

Select a baseline model (‘null hypothesis’)
Select a quantity of interest (‘test statistic’)
Determine distribution of test statistic under null hypothesis
If observed test statistic is extreme (vis-a-vis null distribution of test statistic):
- -> “doesn’t fit” and reject null

Statistical Theory

75+ Years of Theory

Pick a null + test statistic
- Compute “null distribution”

$Z$-values, $t$-values, $p$-values, etc.

Typically requires ‘big math’

Alternative:

Let a computer do the hard work

Monte Carlo Simulation

Using a computer’s pseudo-random number generator (PRNG)

Repeat:

Generate $X_1, X_2, X_3, \dots$
Compute $f(X_1), f(X_2), f(X_3), \dots$

Sample average (LLN)

\[\frac{1}{n} \sum_{i=1}^n f(X_i) \to \mathbb{E}[f(X)]\]

Holds for arbitrary related quantities (quantiles, medians, variances)
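A small sketch of the idea, using an arbitrarily chosen example: $X \sim \mathcal{N}(0, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f(X)] = 1$ and $f(X)$ has a $\chi^2_1$ distribution. The seed and sample size are assumptions.

```r
set.seed(9750)   # arbitrary seed, only for reproducibility

n_draws <- 100000
x  <- rnorm(n_draws)   # generate X_1, ..., X_n
fx <- x^2              # compute f(X_1), ..., f(X_n)

# Sample average approximates E[f(X)] = 1
mc_mean <- mean(fx)

# The same draws approximate other quantities, e.g. the median of f(X)
mc_median   <- median(fx)
true_median <- qchisq(0.5, df = 1)
```

Increasing `n_draws` shrinks the Monte Carlo error at the usual $1/\sqrt{n}$ rate.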

Monte Carlo Simulation

Example: suppose we have $X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$ and we want to test $H_0: \sigma=1$

n <- 20
X <- rnorm(n, mean=0, sd=1.25)

sd(X)

[1] 1.353189

How to test?

The Math Way

Per Cochran’s theorem, under $H_0$ the sample standard deviation satisfies $S \sim \sqrt{\frac{\chi^2_{n-1}}{n-1}} = \frac{1}{\sqrt{n-1}} \chi_{n-1}$, i.e., $S$ has a scaled $\chi$ (not $\chi^2$) distribution

library(chi)
critical_value <- qchi(0.95, df=n-1) / sqrt(n-1)
critical_value

[1] 1.259564

So reject $H_0$ if $S$ is above the critical value (1.26). Here $S \approx 1.35$, so we reject.
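The same critical value can be obtained without an extra package, since a quantile of the $\chi$ distribution is the square root of the matching $\chi^2$ quantile. A sketch in base R, reusing `n = 20` and the observed `sd(X)` reported above:

```r
n <- 20

# sqrt of the chi-squared quantile, rescaled by the degrees of freedom
critical_value_base <- sqrt(qchisq(0.95, df = n - 1) / (n - 1))

# Exact one-sided p-value for the observed sample standard deviation:
# under H0, (n - 1) * S^2 follows a chi-squared distribution with n - 1 df
observed_sd   <- 1.353189
p_value_exact <- pchisq((n - 1) * observed_sd^2, df = n - 1, lower.tail = FALSE)
```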

The Computer Way

To get a critical value
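A sketch of a simulation that produces output of this shape. The number of replicates, the seed, and the column names other than `test_statistic_null` are assumptions; the original code may differ.

```r
library(dplyr)

set.seed(9750)    # arbitrary seed, only for reproducibility
n      <- 20      # sample size from the example
n_sims <- 5000    # number of simulated null data sets (assumed)

# Simulate n_sims data sets under H0 (sigma = 1) and compute sd() of each
null_distribution <- tibble(
  replicate = rep(seq_len(n_sims), each = n),
  x         = rnorm(n * n_sims, mean = 0, sd = 1)
) |>
  group_by(replicate) |>
  summarize(test_statistic_null = sd(x))

# The 95th percentile of the simulated statistics is the critical value
critical_value_mc <- null_distribution |>
  summarize(quantile(test_statistic_null, 0.95))
```

The result should land close to the theoretical value of 1.26, with some Monte Carlo noise.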

# A tibble: 1 × 1
  `quantile(test_statistic_null, 0.95)`
                                  <dbl>
1                                  1.26

The Computer Way

To get a $p$-value:
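A sketch of the corresponding $p$-value calculation: the fraction of simulated null statistics at least as extreme as the observed one. The seed and number of replicates are assumptions, so the result will vary and need not reproduce the figure printed on this slide.

```r
library(dplyr)

set.seed(9750)    # arbitrary seed, only for reproducibility
n      <- 20      # sample size from the example
n_sims <- 5000    # number of simulated null data sets (assumed)

observed_sd <- 1.353189   # sd(X) reported earlier in the example

p_value_mc <- tibble(
  replicate = rep(seq_len(n_sims), each = n),
  x         = rnorm(n * n_sims, mean = 0, sd = 1)   # draws under H0
) |>
  group_by(replicate) |>
  summarize(test_statistic_null = sd(x)) |>
  # One-sided: how often is the null statistic as large as what we saw?
  summarize(p_val = mean(test_statistic_null >= observed_sd))
```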

# A tibble: 1 × 1
  p_val
  <dbl>
1 0.008

`infer`

The infer package automates much of this for common tests

Many examples
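A minimal sketch of the infer workflow. To stay with a very common case, it tests a mean ($H_0: \mu = 0$) rather than the standard deviation from the running example; the data, seed, and number of replicates are made up for illustration.

```r
library(infer)
library(dplyr)

set.seed(9750)   # arbitrary seed, only for reproducibility
sample_df <- tibble(x = rnorm(30, mean = 0.2, sd = 1))   # simulated data

# Observed test statistic
observed_mean <- sample_df |>
  specify(response = x) |>
  calculate(stat = "mean")

# Null distribution: specify -> hypothesize -> generate -> calculate
null_means <- sample_df |>
  specify(response = x) |>
  hypothesize(null = "point", mu = 0) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# Compare the observed statistic with the null distribution
p_value <- null_means |>
  get_p_value(obs_stat = observed_mean, direction = "two-sided")
```

The four verbs mirror the recipe above: pick a null, pick a test statistic, build the null distribution, and compare.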

Looking Ahead

Upcoming Mini-Projects

MP#04: TBD

Seeking suggestions for next semester

Course Feedback Survey

Upcoming

This Week:

MP#03 Peer Feedback
Pre Assignment

Longer Term:

MP#04
Final Presentations

Life Tip of the Week

Register to Vote

If you want to vote in the upcoming NYC Mayoral Primary, it’s time to register to vote:

https://www.vote.nyc/page/register-vote

Primary voting begins in mid-June: need to register 10 days before
