Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 12

STA 9750 Week 12

Today:

  • Tuesday Section: 2025-11-25
  • Thursday Section: 2025-11-20

Lecture #10: Strings, Regular Expressions, and Text Processing

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R ⬅️
    • Files and APIs ✅
    • Web Scraping ✅
    • Cleaning and Processing Text ⬅️
  • Statistical Modeling in R

Today

  • Course Administration
  • Warm-Up Exercise
  • New Material
    • String Encodings
    • Regular Expressions
    • Text Manipulation
    • Computational Inference
  • Wrap-Up
    • Life Tip of the Day

Course Administration

GTA

Charles Ramirez is our GTA

  • Wednesday Office Hours moved to 5:15-7:15 for greater access
    • Give a bit of flexibility on the front end for CR to get off work
  • Grading Meta-Review #02 after Peer Feedback

Mini-Project #04

MP#04 - Just the Fact(-Check)s, Ma’am!

Due 2025-12-05 at 11:59pm ET

Topics covered:

  • Data Import
    • HTTP Request Construction (Week 9)
    • Tabular HTML Scraping (Week 11)
  • \(t\)-tests
  • Putting Everything Together

Grading in Progress

I owe you:

  • Mid-Term Check-In Presentation Feedback
  • MP#02 Meta-Review
  • Selected Regrades

Course Support

  • Synchronous
    • MW Office Hours 2x / week: Tuesdays + Thursdays 5pm
      • Rest of Semester except Thanksgiving (Nov 27th)
    • GTA Office Hours: Wednesdays at 5:15-7:15pm
  • Asynchronous: Piazza (\(<20\) minute average response time)

Course Project

End of Semester Course Project:

  • In-Class Final Presentations
    • Last week of class or during finals week (optional)
  • Individual Report
    • 2025-12-18
  • Group Report
    • 2025-12-18
  • Peer Evaluations
    • 2025-12-18

See detailed instructions for rubrics, expectations, etc.

Review Exercise

Cocktail Scraping Exercise 🍸

Download all recipes from https://cocktails.hadley.nz/

Steps:

  1. Make a mental map of site & Identify Selectors
  2. Develop a strategy to get all pages
  3. Function to pull out each recipe
  4. Function to parse each recipe into a data frame
  5. Put it all together

Breakout Rooms

Breakout Teams
1 Cycle Paths (T) + Green Apple (R)
2 Inspector Clouseau (T) + House Busters (R)
3 Data Miners (T) + Irish Mafia (R)
4 Happy Hour (T) + Urban Health Insight Group (R)
5 Weight Watchers (T) + Wellness Warriors (R)
6 Point of Interest (T) + Gridion Regression (R)
7 Sounds Good (T) + Standard Deviants (R)
8 How We Met Your Landlord (T) + Stats & The City (R)
9 The Mean, Green, Data-Analyzing Team (T) + Restaurant Nightmares (R)
10 Nightshift Analysts (T)

Working with Strings

Strings

In R, strings and characters are basically interchangeable

  • Arbitrary “bits of text” that can be stored in a vector
  • Don’t normally need to think about encoding

stringr provides basic tools for string manipulation (str_ functions)

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

Unicode

Unicode is an attempt to standardize all human written language:

  • So hard!
  • Moving target
  • Don’t implement yourself - use libraries

Latest Unicode tables: unicodeplus.com/

Encodings connect Unicode IDs with actual bits on your computer: UTF-8 is backward-compatible with ASCII and should be your default

Unicode Controversies

Pistol (U+1F52B) emoji:

  • Originally a (regular) gun; Apple led the charge to a water pistol, which is now standard

Taco Controversy:


Unicode+UTF-8 - Modern Standard

Best practices:

  • Use updated Unicode compliant libraries like stringr
  • Use UTF-8 strings
  • If your data isn’t UTF-8, make it UTF-8 ASAP
iconv(STRING, from="latin1", to="UTF-8")
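
A minimal sketch of the conversion above, assuming the incoming data is Latin-1. The sample bytes (the word "café" with a Latin-1 é) are made up for illustration, and are built from raw bytes so the example does not depend on how this file is saved:

```r
# Latin-1 bytes for "café": the final byte 0xE9 is é in Latin-1
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
x <- rawToChar(latin1_bytes)

ok_before <- validUTF8(x)     # FALSE: a lone 0xE9 byte is not valid UTF-8

# Re-encode as UTF-8; iconv() marks the result as UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

ok_after <- validUTF8(y)      # TRUE
code_points <- utf8ToInt(y)   # 99 97 102 233 (233 = U+00E9, é)
```

validUTF8() is a quick check for whether a conversion is needed at all. If the source encoding is guessed wrong, iconv() will still run but produce the wrong characters, so it is worth inspecting a few converted values by eye.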

stringr

The tidyverse package stringr provides tools for string manipulation:

  • All functions start with str_
  • “Input” string is always first argument
  • Reasonably vectorized

stringr + dplyr

All stringr functions work well in dplyr pipelines (“vectorized”):

library(dplyr); library(stringr)
df <- data.frame(lower_letters = letters)
df |> mutate(upper_letters = str_to_upper(lower_letters))
   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z

Substrings and String Splitting

fruits <- c("apples and oranges and pears and bananas", 
            "pineapples and mangos and guavas")

stringr::str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
stringr::str_split_fixed(fruits, "and", n=2)
     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"            

See also str_split_i to get only the i-th piece of each split

Trimming Strings

Common to have excess whitespace around results: str_trim

stringr::str_split_i(fruits, "and", i=3) |> str_trim()
[1] "pears"  "guavas"
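
A short sketch of the trimming options, using a made-up string with extra spaces. str_trim only touches the ends of the string; str_squish (also from stringr) additionally collapses runs of internal whitespace:

```r
library(stringr)

x <- "  pears   and  guavas  "

trimmed  <- str_trim(x)                 # "pears   and  guavas"
left     <- str_trim(x, side = "left")  # "pears   and  guavas  "
squished <- str_squish(x)               # "pears and guavas"
```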

Sub-Strings

str_sub to get substrings:

x <- "Baruch College, CUNY"
stringr::str_sub(x, end=6) # Includes endpoints
[1] "Baruch"
stringr::str_sub(x, start=-4) # Count from end
[1] "CUNY"
x <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
stringr::str_sub(x, end=-7) # Drop last _6_
[1] "Baruch College"   "Brooklyn College"

Regular Expressions

Working directly with characters is painful and hard to do properly

Regular Expressions (regex) provide tools for specifying patterns in strings:

  • Regular => following rules

Regular Expression Tools

Regex 101

A basic regex is just a pattern:

  • a: The regex a will match all strings with an a:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "a")
[1]  TRUE FALSE FALSE  TRUE
  • Longer patterns are more precise:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "fish")
[1] FALSE FALSE  TRUE  TRUE

Replacement

str_replace will replace string with something else:

  • str_remove will replace with nothing
  • Does first match (cf str_{remove,replace}_all)

x <- c("123", "123,456", "123,456,789")
str_remove(x, ",")
[1] "123"        "123456"     "123456,789"
str_remove_all(x, ",")
[1] "123"       "123456"    "123456789"
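
For comparison, a small sketch of str_replace itself on the same vector. The replacement string "_" is an arbitrary choice for illustration:

```r
library(stringr)

x <- c("123", "123,456", "123,456,789")

first_only <- str_replace(x, ",", "_")      # "123" "123_456" "123_456,789"
every_one  <- str_replace_all(x, ",", "_")  # "123" "123_456" "123_456_789"
```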

Wildcard

The . character is a ‘wildcard’ and matches anything:

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, ".fish")
[1] FALSE FALSE FALSE  TRUE

(You might have seen a similar use of . in R formulas, e.g. y ~ .)

Alternatives

Alternatives can be expressed using a |:

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "a|o")
[1]  TRUE  TRUE FALSE  TRUE

For longer patterns, wrap in parentheses

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "(dog|fish)")
[1] FALSE  TRUE  TRUE  TRUE
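
The parentheses matter once the alternatives are followed (or preceded) by more pattern. A sketch, with "dogfish" added to the example vector for illustration:

```r
library(stringr)

pets <- c("cat", "dog", "fish", "catfish", "dogfish")

# Without parentheses: "cat" OR "dogfish"
no_group <- str_detect(pets, "cat|dogfish")    # TRUE FALSE FALSE TRUE TRUE

# With parentheses: ("cat" or "dog") followed by "fish"
grouped  <- str_detect(pets, "(cat|dog)fish")  # FALSE FALSE FALSE TRUE TRUE
```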

Ranges

Sometimes we might want to match a wide range of characters; e.g. digits

Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)

Can use a range notation instead: [0-9]

pets <- c("1 cat", "a dog", "3 fish", "two elephants")
str_detect(pets, "[0-9]")
[1]  TRUE FALSE  TRUE FALSE

Ranges

Useful ranges:

  • [A-Z]: Uppercase letters
  • [a-z]: Lowercase letters
  • [0-9]: Digits

Can also ‘hard code’ a range by listing all elements:

  • [0123456789]
  • [aeiou]
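
A brief sketch of listed and combined ranges in use. The course strings are made up for illustration; str_count counts matches rather than just detecting them:

```r
library(stringr)

pets <- c("cat", "dog", "fish", "catfish")

# A listed set: count the vowels in each string
n_vowels <- str_count(pets, "[aeiou]")    # 1 1 1 2

# Ranges can be combined inside one pair of brackets
has_upper_or_digit <- str_detect(c("STA 9750", "sta"), "[A-Z0-9]")  # TRUE FALSE
```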

Ranges

Some useful ranges are predefined as named character classes:

  • [:alpha:]
  • [:lower:]
  • [:upper:]
  • [:digit:]
  • [:alnum:]
  • [:punct:]
  • [:space:]

I like these - quite clear:

pets <- c("1 cat", "a dog", "3 fish", "two elephants")
str_detect(pets, "[:digit:]")
[1]  TRUE FALSE  TRUE FALSE

Quantifiers

Quantifiers (multiple matches):

  • .{a,b}: anywhere from a to b copies (inclusive)
  • .{0,b}: no more than b copies
  • .{a,}: at least a copies
  • .?: zero-or-one, same as .{0,1}
  • .*: zero-or-more, same as .{0,}
  • .+: one-or-more, same as .{1,}
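
A sketch of how these quantifiers behave, applied to the literal a instead of the wildcard so the effect is easy to see. The input strings are made up; note that quantifiers are greedy and take as many copies as they are allowed:

```r
library(stringr)

x <- c("a", "aa", "aaaa")

two_to_three <- str_extract(x, "a{2,3}")  # NA "aa" "aaa"  (stops at 3 copies)
two_or_more  <- str_extract(x, "a{2,}")   # NA "aa" "aaaa"
zero_or_one  <- str_extract(x, "a?")      # "a" "a" "a"
one_or_more  <- str_extract(x, "a+")      # "a" "aa" "aaaa"

# Zero copies still counts as a match, so a* matches a string with no a at all
star_matches_anything <- str_detect("bbb", "a*")  # TRUE
```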

Quantifiers

Wildcard match optional:

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, ".?fish")
[1] FALSE FALSE  TRUE  TRUE

Strings with numbers:

pets <- c("1 cat", "a dogs", "3 fish", "two birds")
str_detect(pets, "[:digit:]")
[1]  TRUE FALSE  TRUE FALSE

Numbers 10 or greater:

pets <- c("1 cat", "3 dogs", "10 fish", "20 birds")
str_detect(pets, "[:digit:]{2,}")
[1] FALSE FALSE  TRUE  TRUE

Start and End Anchors

Anchors let us refer to the start and end of a string:

  • ^: start
  • $: end

Things starting with a number:

songs <- c("Mambo No 5", "99 Red Balloons", "5 Years Time")
str_subset(songs, "^[:digit:]")
[1] "99 Red Balloons" "5 Years Time"   
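
The end anchor works the same way, and the two anchors together force the pattern to cover the whole string. A sketch, with the title "1979" added to the vector for illustration:

```r
library(stringr)

songs <- c("Mambo No 5", "99 Red Balloons", "5 Years Time", "1979")

# Things ending with a number
ends_digit  <- str_subset(songs, "[:digit:]$")    # "Mambo No 5" "1979"

# Things that are nothing but digits, start to end
only_digits <- str_subset(songs, "^[:digit:]+$")  # "1979"
```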

Extracting Matches

Often, we use regex to pull out part of a string:

  • str_detect: is there a ‘fit’?
  • str_extract: extract the whole ‘fit’
  • str_match: extract specific groups

Specify groups with parentheses

Extracting Matches

x <- "Baruch College, CUNY is a wonderful place to work!"
stringr::str_detect(x, "(.*), CUNY")
[1] TRUE
stringr::str_extract(x, "(.*), CUNY")
[1] "Baruch College, CUNY"
stringr::str_match(x, "(.*), CUNY")
     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"

Extracting Matches

Very useful for pulling numbers out of text:

x <- c("123", "456 (estimated)", "Approximately 7.89")
str_extract(x, "[.[:digit:]]+")
[1] "123"  "456"  "7.89"
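
One caveat worth sketching: [.[:digit:]]+ accepts any run of digits and periods, so it can pick up trailing periods or version-style strings. The stricter pattern below (digits, optionally followed by one period and more digits) is one possible alternative, not the only one, and the inputs are made up:

```r
library(stringr)

x <- c("v1.2.3", "Section 4. Results", "between 7.5 and 10")

loose  <- str_extract(x, "[.[:digit:]]+")       # "1.2.3" "4." "7.5"
strict <- str_extract(x, "[0-9]+(\\.[0-9]+)?")  # "1.2"   "4"  "7.5"

# str_extract only returns the first match; str_extract_all returns all of them
all_numbers <- str_extract_all(x[3], "[0-9]+(\\.[0-9]+)?")[[1]]  # "7.5" "10"
```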

Extracting Matches

str_extract(group=) is useful for complex data extraction.

x <- c("Michael Weylandt teaches STA9750", 
       "KRR teaches STA9891")
pattern <- c("(.*) teaches (.*)")
stringr::str_extract(x, pattern, group=1)
[1] "Michael Weylandt" "KRR"             
stringr::str_extract(x, pattern, group=2)
[1] "STA9750" "STA9891"

Also allows named groups - really helpful!

x <- c("Michael Weylandt teaches STA9750 on Thursday", 
       "KRR teaches STA9891 on Wednesday")
pattern <- c("(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)")
stringr::str_match(x, pattern) |> as.data.frame()
                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday

Exclusion

x <- c("10 blue fish", "three wet goats")
stringr::str_detect(x, "[^0123456789]")
[1] TRUE TRUE

Not quite what we want: [^0123456789] matches any string that contains at least one non-digit, so both strings match

str_detect has a negate option:

stringr::str_detect(x, "[0-9]", negate=TRUE)
[1] FALSE  TRUE
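
An exclusion set can still express "no digits anywhere" if it is anchored at both ends. A sketch showing that this agrees with negate=TRUE (and with plain ! negation) on the example vector:

```r
library(stringr)

x <- c("10 blue fish", "three wet goats")

# Every character, from start to end, must be a non-digit
anchored <- str_detect(x, "^[^0-9]+$")             # FALSE TRUE
negated  <- str_detect(x, "[0-9]", negate = TRUE)  # FALSE TRUE
banged   <- !str_detect(x, "[0-9]")                # FALSE TRUE
```

The two approaches differ on the empty string: the anchored pattern needs at least one character, while the negated version reports TRUE. The negate form is usually easier to read.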

Homoglyphs

x <- c("Η", "H")
tolower(x)
[1] "η" "h"

Why?

uni_info <- Vectorize(\(x) Unicode::u_char_name(utf8ToInt(x)), "x")
uni_info(x)
                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H" 

Homoglyphs

Particularly nasty with dashes - lean on [:punct:] where possible.

x <- c("Em Dash —", "En Dash –", "Hyphen ‐")
stringr::str_remove(x, "[:punct:]") # Works
[1] "Em Dash " "En Dash " "Hyphen " 
stringr::str_remove(x, "-")  # Keyboard minus = Fail
[1] "Em Dash —" "En Dash –" "Hyphen ‐" 
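
If only dashes should be targeted, rather than all punctuation, one option is the Unicode "dash punctuation" category, written \p{Pd} in the ICU regex engine that stringr uses. A sketch that normalizes every dash variant to the keyboard hyphen-minus; the strings use \u escapes so the exact characters are unambiguous, and the "Minus" entry is added for illustration:

```r
library(stringr)

x <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010", "Minus -")

# \p{Pd} matches any character in the Unicode dash punctuation category
is_dash <- str_detect(x, "\\p{Pd}")               # TRUE TRUE TRUE TRUE

# Replace every dash variant with the ASCII hyphen-minus
normalized <- str_replace_all(x, "\\p{Pd}", "-")
```

After normalizing, a plain "-" pattern works on all four strings.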

Why stringr?

Base R has its own set of regular expression functions (grep and friends)

stringr does the same thing, but with a more consistent interface.

Conversion table online
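
A few common pairs, as a sketch rather than a full conversion table. The main interface difference is argument order: base R takes the pattern first, stringr takes the string first (which is what makes it pipe-friendly):

```r
library(stringr)

pets <- c("cat", "dog", "fish", "catfish")

# Detect: grepl() vs str_detect()
base_detect <- grepl("fish", pets)
strr_detect <- str_detect(pets, "fish")

# Keep matching elements: grep(value = TRUE) vs str_subset()
base_subset <- grep("fish", pets, value = TRUE)
strr_subset <- str_subset(pets, "fish")

# Replace first / all matches: sub() and gsub() vs str_remove() and str_remove_all()
base_first <- sub("fish", "", pets)
strr_first <- str_remove(pets, "fish")
base_all   <- gsub("[aeiou]", "", pets)
strr_all   <- str_remove_all(pets, "[aeiou]")
```

Each base/stringr pair returns the same result on this input.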

Regular Expression Practice

With your breakout group, it’s time for some Regular Expression Practice

Regex + Scraping

Regular expressions are incredibly useful when converting HTML text to workable data:

  • Extract numbers
  • Extract relevant parts of strings

Regex + Scraping

Common paradigm: html_text2() |> str_remove_all() |> as.numeric()

prices <- c("8.25", "$1,000", "500 USD", "$12,345.67 (Estimate)")
prices |> str_remove_all("[^.[:digit:]]") |> as.numeric()
[1]     8.25  1000.00   500.00 12345.67

Here, [^.[:digit:]] means anything ([]) that is not (^) a period or a digit.
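
The same paradigm end-to-end, as a sketch on a tiny made-up page. The HTML and the .price class are hypothetical; rvest::minimal_html() builds a page from a string so no network access is needed:

```r
library(rvest)
library(stringr)

# Hypothetical page fragment; a real site will need its own selector
page <- minimal_html('
  <ul>
    <li class="price">$1,000</li>
    <li class="price">500 USD</li>
    <li class="price">$12,345.67 (Estimate)</li>
  </ul>')

prices <- page |>
    html_elements(".price") |>          # select the price nodes
    html_text2() |>                     # pull out their text
    str_remove_all("[^.[:digit:]]") |>  # drop everything but digits and periods
    as.numeric()

prices  # 1000.00 500.00 12345.67
```

This approach assumes a period is only ever a decimal mark; text such as "approx. 500" would need a stricter pattern.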

Regex + Scraping

Another common paradigm is to extract structured text into a data frame when html_table fails

x <- "Adelie female 200g
Gentoo Male 500g
Chinstrap Female 1000g"

str_split(x, "\\n", simplify=TRUE) |> 
    str_match("(?<species>.*) (?<sex>.*) (?<weight>\\d+)g") |> 
    as.data.frame() |>
    select(-V1) |>
    mutate(sex = if_else(str_detect(sex, "[Ff]"), "female", "male"))
    species    sex weight
1    Adelie female    200
2    Gentoo   male    500
3 Chinstrap female   1000

Regex + Scraping

Can also be used to manipulate strings within a data frame:

x <- tribble(
    ~enrollment, ~course,
    50, "STA 9750",
    20, "STA 9890"
)

x |> mutate(dept = str_extract(course, "([:alpha:]{3}) ([:digit:]{4})", group=1),
            numb = str_extract(course, "([:alpha:]{3}) ([:digit:]{4})", group=2))
# A tibble: 2 × 4
  enrollment course   dept  numb 
       <dbl> <chr>    <chr> <chr>
1         50 STA 9750 STA   9750 
2         20 STA 9890 STA   9890 

Cocktail Scraping

With your breakout group, it’s time to finish the cocktail scraping exercise

Cocktails

Wrap Up

Processing Strings in R

  • Encoding and Unicode
  • Regular Expressions

Computational Statistical Inference

Upcoming Work

Upcoming work from course calendar

Remaining Topic:

  • Machine Learning (Predictive Modeling)
    • After Thanksgiving 🦃

Musical Treat


Concert season - remember CUNY Student Benefits