Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 12

STA 9750 Week 12

Today:

Tuesday Section: 2025-11-25
Thursday Section: 2025-11-20

Lecture #10: Strings, Regular Expressions, and Text Processing

Communicating Results (quarto) ✅
R Basics ✅
Data Manipulation in R ✅
Data Visualization in R ✅
Getting Data into R ⬅️
- Files and APIs ✅
- Web Scraping ✅
- Cleaning and Processing Text ⬅️
Statistical Modeling in R

Today

Course Administration
Warm-Up Exercise
New Material
- String Encodings
- Regular Expressions
- Text Manipulation
- Computational Inference
Wrap-Up
- Life Tip of the Day

Course Administration

GTA

Charles Ramirez is our GTA

Wednesday Office Hours moved to 5:15-7:15 for greater access
- Give a bit of flexibility on the front end for CR to get off work
Grading Meta-Review #02 after Peer Feedback

Mini-Project #04

MP#04 - Just the Fact(-Check)s, Ma’am!

Due 2025-12-05 at 11:59pm ET

Topics covered:

Data Import
- HTTP Request Construction (Week 9)
- Tabular HTML Scraping (Week 11)
$t$-tests
Putting Everything Together

Grading in Progress

I owe you:

Mid-Term Check-In Presentation Feedback
MP#02 Meta-Review
Selected Regrades

Course Support

Synchronous
- MW Office Hours 2x / week: Tuesdays + Thursdays 5pm
  - Rest of Semester except Thanksgiving (Nov 27th)
- GTA Office Hours: Wednesdays at 5:15-7:15pm
Asynchronous: Piazza ($<20$ minute average response time)

Course Project

End of Semester Course Project:

In-Class Final Presentations
- Last week of class or during finals week (optional)
Individual Report
- 2025-12-18
Group Report
- 2025-12-18
Peer Evaluations
- 2025-12-18

See detailed instructions for rubrics, expectations, etc.

Review Exercise

Cocktail Scraping Exercise 🍸

Download all recipes from https://cocktails.hadley.nz/

Steps:

Make a mental map of site & Identify Selectors
Develop a strategy to get all pages
Function to pull out each recipe
Function to parse each recipe into a data frame
Put it all together

Breakout Rooms

Breakout	Teams
1	Cycle Paths (T) + Green Apple (R)
2	Inspector Clouseau (T) + House Busters (R)
3	Weight Watchers (T) + Irish Mafia (R)
4	Point of Interest (T) + Urban Health Insight Group (R)
5	Sounds Good (T) + Wellness Warriors (R)
6	The Mean, Green, Data-Analyzing Team (T) + Gridion Regression (R)
7	Nightshift Analysts (T) + Standard Deviants (R)
8	Stats & The City (R)
9	Kitchen Nightmares (R)

Working with Strings

Strings

In R, strings and characters are basically interchangeable

Arbitrary “bits of text” that can be stored in a vector
Don’t normally need to think about encoding

stringr provides basic tools for string manipulation (str_ functions)

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

Unicode

Unicode is an attempt to standardize all human written language:

So hard!
Moving target
Don’t implement yourself - use libraries

Latest Unicode tables: unicodeplus.com/

Encodings connect Unicode code points with actual bits on your computer: UTF-8 is backward-compatible with ASCII and should be your default

Unicode Controversies

Pistol (U+1F52B) emoji:

Originally a (regular) gun; Apple led the charge to a water pistol, which is now standard

Taco Controversy:

Taco Emoji History
Taco Emoji Controversy

Unicode+UTF-8 - Modern Standard

Best practices:

Use updated Unicode compliant libraries like stringr
Use UTF-8 strings
If your data isn’t UTF-8, make it UTF-8 ASAP
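A minimal sketch of the last point, using base R's iconv(). The Latin-1 input string is a made-up example, not data from the course:

```r
# Hypothetical input: "café" stored as Latin-1 bytes (\xe9 is é in Latin-1)
x <- "caf\xe9"
Encoding(x) <- "latin1"

# Re-encode to UTF-8 as early as possible in the pipeline
y <- iconv(x, from = "latin1", to = "UTF-8")

validUTF8(y)  # check that the converted string is valid UTF-8
Encoding(y)   # declared encoding of the converted string
```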

stringr

The tidyverse package stringr provides tools for string manipulation:

All functions start with str_
“Input” string is always first argument
Reasonably vectorized

stringr + dplyr

All stringr functions work well in dplyr pipelines (“vectorized”):

   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z
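The code behind this table is not shown here. One way to build a table like it, assuming str_to_upper() inside a dplyr mutate():

```r
library(dplyr)
library(stringr)

# `letters` is R's built-in vector of the 26 lowercase letters
df <- data.frame(lower_letters = letters) |>
  mutate(upper_letters = str_to_upper(lower_letters))

df
```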

Substrings and String Splitting

[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"

     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"

See also str_split_i to get only one element of split
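A sketch of calls that give results like the ones above. The input vector is reconstructed from the output, so treat it as an assumption:

```r
library(stringr)

# Input reconstructed from the printed output
x <- c("apples and oranges and pears and bananas",
       "pineapples and mangos and guavas")

pieces <- str_split(x, " and ")         # list: one character vector per input
halves <- str_split_fixed(x, "and", 2)  # matrix: split at the first "and" only
second <- str_split_i(x, " and ", 2)    # only the second piece of each split
```

Splitting on "and" without the surrounding spaces leaves stray whitespace in the pieces, which is why trimming is often the next step.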

Trimming Strings

Common to have excess whitespace around results: str_trim

[1] "pears"  "guavas"
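One possible way to get this result, assuming the same fruit strings as in the splitting example:

```r
library(stringr)

x <- c("apples and oranges and pears and bananas",
       "pineapples and mangos and guavas")

# Splitting on "and" (no spaces) leaves whitespace around each piece
third <- str_split_i(x, "and", 3)

# str_trim() removes leading and trailing whitespace
trimmed <- str_trim(third)
trimmed
```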

Sub-Strings

str_sub to get substrings:

[1] "Baruch"

[1] "CUNY"

[1] "Baruch College"   "Brooklyn College"
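A sketch of str_sub() calls matching the three results above. The inputs are assumed from the output; negative positions count from the end of the string:

```r
library(stringr)

x <- "Baruch College, CUNY"

first6 <- str_sub(x, 1, 6)    # characters 1 through 6
last4  <- str_sub(x, -4, -1)  # the last four characters

schools <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
# End at the 7th character from the end, dropping the trailing ", CUNY"
names_only <- str_sub(schools, 1, -7)
```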

Regular Expressions

Working directly with characters is painful and hard to do properly

Regular Expressions (regex) provide tools for specifying patterns in strings:

Regular => following rules

Regular Expression Tools

Testing Regular Expressions Interactively: regex101.com/
Alternative regexr.com/
Automated Regular Expression Builder: regex-generator
AI Regexp Builder: hregexgo.com/

Regex 101

A basic regex is just a pattern:

a: The regex a will match all strings with an a:

[1]  TRUE FALSE FALSE  TRUE

Longer patterns are more precise:

[1] FALSE FALSE  TRUE  TRUE
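An illustration with a made-up input vector (the original strings are not shown), chosen so the results match the output above:

```r
library(stringr)

# Hypothetical input
x <- c("alpha", "echo", "hotel", "hotel alpha")

has_a     <- str_detect(x, "a")      # any string containing an "a"
has_hotel <- str_detect(x, "hotel")  # only strings containing "hotel"
```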

Replacement

str_replace will replace string with something else:

- str_remove will replace with nothing
- Does first match (cf str_{remove,replace}_all)

[1] "123"        "123456"     "123456,789"

[1] "123"       "123456"    "123456789"
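A sketch of the two calls, with the input inferred from the output above:

```r
library(stringr)

x <- c("123", "123,456", "123,456,789")

first_only <- str_remove(x, ",")      # removes only the first comma
all_commas <- str_remove_all(x, ",")  # removes every comma
```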

Wildcard

The . character is a ‘wildcard’ and matches anything:

[1] FALSE FALSE FALSE  TRUE

(You might have seen a similar use of . in formulas, e.g. y ~ .)
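A small illustration with made-up strings. The pattern c.t needs exactly one character, of any kind, between the c and the t:

```r
library(stringr)

# Hypothetical input
x <- c("ct", "act", "coat", "cut")

# "ct" and "act" have nothing between c and t; "coat" has two characters
matches <- str_detect(x, "c.t")
```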

Alternatives

Alternatives can be expressed using a |:

[1]  TRUE  TRUE FALSE  TRUE

For longer patterns, wrap in parentheses

[1] FALSE  TRUE  TRUE  TRUE
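Two made-up examples (not the original inputs) that produce the same pattern of results:

```r
library(stringr)

# Hypothetical inputs
pets <- c("cat", "dog", "fish", "catfish")
cat_or_dog <- str_detect(pets, "cat|dog")

# Parentheses limit the alternative to one part of a longer pattern
colors <- c("gry", "gray", "grey", "greyhound")
gray_or_grey <- str_detect(colors, "gr(a|e)y")
```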

Ranges

Sometimes we might want to match a wide range of characters; e.g. digits

Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)

Can use a range notation instead: [0-9]

[1]  TRUE FALSE  TRUE FALSE
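For instance, with a made-up input vector:

```r
library(stringr)

# Hypothetical input
x <- c("abc123", "abc", "42", "forty-two")

# TRUE wherever the string contains at least one digit
has_digit <- str_detect(x, "[0-9]")
```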

Ranges

Useful ranges:

[A-Z]: Uppercase letters
[a-z]: Lowercase letters
[0-9]: Digits

Can also ‘hard code’ a range by listing all elements:

[0123456789]
[aeiou]

Ranges

Some useful ranges are hard-coded:

[:alpha:]
[:lower:]
[:upper:]
[:digit:]
[:alnum:]
[:punct:]
[:space:]

I like these - quite clear:

[1]  TRUE FALSE  TRUE FALSE
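These named classes go inside a bracket expression, so the full pattern has double brackets. A sketch with made-up strings:

```r
library(stringr)

# Hypothetical input
x <- c("Baruch", "college", "CUNY", "new york")

# [[:upper:]] = a bracket expression containing the class [:upper:]
has_upper <- str_detect(x, "[[:upper:]]")
```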

Quantifiers

Quantifiers (multiple matches):

.{a,b}: anywhere from a to b copies (inclusive)
.{0,b}: no more than b copies
.{a,}: at least a copies
.?: zero-or-one, same as .{0,1}
.*: zero-or-more, same as .{0,}
.+: one-or-more, same as .{1,}

Quantifiers

Wildcard match optional:

[1] FALSE FALSE  TRUE  TRUE

Strings with numbers:

[1]  TRUE FALSE  TRUE FALSE

Numbers 10 or greater:

[1] FALSE FALSE  TRUE  TRUE
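Sketches for the three cases above. All inputs are made up to reproduce the printed results, and the patterns are one reasonable choice rather than the originals:

```r
library(stringr)

# Wildcard match optional: at most one character between c and t
words <- c("dog", "coat", "ct", "cat")
optional_wildcard <- str_detect(words, "c.?t")

# Strings with numbers: one or more digits somewhere in the string
mixed <- c("abc123", "abc", "42", "forty-two")
has_number <- str_detect(mixed, "[0-9]+")

# Numbers 10 or greater: at least two digits in a row
# (assumes no leading zeros such as "07")
nums <- c("7", "9", "10", "250")
ten_or_more <- str_detect(nums, "[0-9]{2,}")
```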

Start and End Anchors

Anchors let us refer to the start and end of a string:

^: start
$: end

Things starting with a number:

[1] "99 Red Balloons" "5 Years Time"
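A sketch using str_subset(), which keeps only the strings that match. The two non-matching titles are added here for contrast and are not from the original example:

```r
library(stringr)

# Hypothetical input: two of these start with a digit
songs <- c("99 Red Balloons", "Summer of '69",
           "5 Years Time", "Back to December")

# ^ anchors the digit to the very start of the string
starts_with_number <- str_subset(songs, "^[0-9]")
```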

Extracting Matches

Often, we use regex to pull out part of a string:

str_detect: is there a ‘fit’?
str_extract: extract the whole ‘fit’
str_match: extract specific groups

Specify groups with parentheses

Extracting Matches

[1] TRUE

[1] "Baruch College, CUNY"

     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"
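A sketch of the three functions on one string. The input and pattern are assumptions chosen to reproduce the output above:

```r
library(stringr)

x <- "Baruch College, CUNY"
pattern <- "(.*), CUNY"  # the parentheses mark group 1

found <- str_detect(x, pattern)   # is there a fit?
whole <- str_extract(x, pattern)  # the whole fit
parts <- str_match(x, pattern)    # matrix: column 1 = whole fit, column 2 = group 1
```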

Extracting Matches

Very useful for pulling numbers out of text:

[1] "123"  "456"  "7.89"
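One way to get a result like this. The sentences are made up; the pattern allows an optional decimal part:

```r
library(stringr)

# Hypothetical input text
x <- c("I have 123 apples", "and 456 oranges", "costing $7.89 each")

# Digits, optionally followed by a period and more digits
numbers <- str_extract(x, "[0-9]+(\\.[0-9]+)?")
```

The results are still character strings; wrap in as.numeric() to do arithmetic with them.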

Extracting Matches

str_extract(group=) is useful for complex data extraction.

[1] "Michael Weylandt" "KRR"

[1] "STA9750" "STA9891"

Also allows named groups - really helpful!

                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday
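A sketch that reproduces results like these. The input strings are inferred from the data frame above, the patterns are assumptions, and the group argument needs stringr 1.5.0 or later:

```r
library(stringr)

x <- c("Michael Weylandt teaches STA9750 on Thursday",
       "KRR teaches STA9891 on Wednesday")

pattern <- "(.*) teaches (.*) on (.*)"
instructor <- str_extract(x, pattern, group = 1)  # first parenthesized group
course     <- str_extract(x, pattern, group = 2)  # second parenthesized group

# Named groups: (?<name>...) labels the columns of the match matrix
named <- "(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)"
df <- as.data.frame(str_match(x, named))
df
```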

Exclusion

[1] TRUE TRUE

Not quite what we want

str_detect has a negate option:

[1] FALSE  TRUE
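The original example is not shown; here is one made-up case with the same results. A negated class like [^0-9] asks "is there any non-digit character?", which is not the same as "are there no digits?":

```r
library(stringr)

# Hypothetical input
x <- c("has digits 123", "no digits")

# Both strings contain at least one non-digit character
any_non_digit <- str_detect(x, "[^0-9]")

# negate = TRUE flips the result: TRUE where no digit is found
no_digits <- str_detect(x, "[0-9]", negate = TRUE)
```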

Homoglyphs

[1] "η" "h"

Why?

                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H"

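A sketch of the comparison, assuming the two inputs are Greek capital eta (U+0397) and Latin capital H. Base R's utf8ToInt() shows the code points; the function that printed the character names above is not shown, so it is not reproduced here:

```r
library(stringr)

# Written with an escape so the two look-alike letters stay distinguishable
x <- c("\u0397", "H")  # GREEK CAPITAL LETTER ETA, LATIN CAPITAL LETTER H

lower <- str_to_lower(x)  # Greek small eta and Latin small h

# Different code points behind visually identical glyphs
code_points <- sapply(x, utf8ToInt, USE.NAMES = FALSE)

same <- x[1] == x[2]  # the strings are not equal
```
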
Homoglyphs

Particularly nasty with dashes - lean on [[:punct:]] where possible.

[1] "Em Dash " "En Dash " "Hyphen "

[1] "Em Dash —" "En Dash –" "Hyphen ‐"
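A sketch of what likely produced the two results above, assuming the inputs end in an em dash (U+2014), an en dash (U+2013), and a Unicode hyphen (U+2010):

```r
library(stringr)

x <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010")

# The punctuation class covers all three dash-like characters
by_class <- str_remove(x, "[[:punct:]]")

# A typed "-" is the ASCII hyphen-minus, which matches none of them
by_ascii <- str_remove(x, "-")
```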

Why stringr?

Base R has its own set of regular expression functions (grep and friends)

stringr does the same thing, but with a more consistent interface.

Conversion table online
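A few side-by-side pairs, with a made-up input. Note the argument order: base R puts the pattern first, stringr puts the string first:

```r
library(stringr)

# Hypothetical input
x <- c("alpha", "echo")

base_detect <- grepl("a", x)
tidy_detect <- str_detect(x, "a")

base_first <- sub("a", "A", x)          # first match only
tidy_first <- str_replace(x, "a", "A")

base_all <- gsub("a", "A", x)           # every match
tidy_all <- str_replace_all(x, "a", "A")
```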

Regular Expression Practice

With your breakout group, it’s time for some Regular Expression Practice

Regex + Scraping

Regular expressions are incredibly useful when converting HTML text to workable data:

Extract numbers
Extract relevant parts of strings

Regex + Scraping

Common paradigm: html_text2() |> str_remove_all() |> as.numeric()

[1]     8.25  1000.00   500.00 12345.67

Here, [^.[:digit:]] means anything ([]) that is not (^) a period or a digit.
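A sketch of the cleaning step. The character vector stands in for what html_text2() might return; the values are made up to match the output above:

```r
library(stringr)

# Hypothetical scraped text
x <- c("$8.25", "$1,000", "500 USD", "$12,345.67")

# Drop everything that is not a period or a digit, then convert
values <- x |>
  str_remove_all("[^.[:digit:]]") |>
  as.numeric()
```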

Regex + Scraping

Another common paradigm is to extract structured text into a data frame when html_table fails

    species    sex weight
1    Adelie female    200
2    Gentoo   male    500
3 Chinstrap female   1000
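The original input is not shown. This sketch assumes a made-up text format and uses str_match() groups to build the data frame:

```r
library(stringr)

# Hypothetical semi-structured text, e.g. from html_text2()
x <- c("Adelie (female): 200g",
       "Gentoo (male): 500g",
       "Chinstrap (female): 1000g")

# Three groups: species, sex, weight; \\( and \\) match literal parentheses
m <- str_match(x, "(\\w+) \\((\\w+)\\): ([0-9]+)g")

penguins <- data.frame(species = m[, 2],
                       sex     = m[, 3],
                       weight  = as.numeric(m[, 4]))
penguins
```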

Regex + Scraping

Can also be used to manipulate strings within a data frame:

# A tibble: 2 × 4
  enrollment course   dept  numb 
       <dbl> <chr>    <chr> <chr>
1         50 STA 9750 STA   9750 
2         20 STA 9890 STA   9890
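One way to build a table like this, assuming str_extract() with anchors inside mutate(); the original code is not shown:

```r
library(dplyr)
library(stringr)

courses <- tibble(enrollment = c(50, 20),
                  course     = c("STA 9750", "STA 9890")) |>
  mutate(dept = str_extract(course, "^[A-Z]+"),  # letters at the start
         numb = str_extract(course, "[0-9]+$"))  # digits at the end

courses
```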

Cocktail Scraping

With your breakout group, it’s time to finish the cocktail scraping exercise

Cocktails

Wrap Up

Processing Strings in R

Encoding and Unicode
Regular Expressions

Computational Statistical Inference

Upcoming Work

Upcoming work from course calendar

Mini-Project #03 peer feedback due 2025-11-28
Mini-Project #04 due on 2025-12-05 at 11:59pm ET
Mini-Project #04 peer feedback due 2025-12-19
End of Semester Project Submissions

Remaining Topic:

Machine Learning (Predictive Modeling)
- After Thanksgiving 🦃

Musical Treat

Concert season - remember CUNY Student Benefits
