Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 12 – Thursday 2026-04-30
Last Updated: 2026-04-30

STA 9750 Week 12

Today: Lecture #10: Strings, Regular Expressions, and Text Processing

These slides can be found online at:

https://michael-weylandt.com/STA9750/slides/slides12.html

In-class activities can be found at:

https://michael-weylandt.com/STA9750/labs/lab10.html

Upcoming TODO

Upcoming student responsibilities:

Date        Time        Details
2026-05-03  11:59pm ET  Mini-Project Peer Feedback #03 Due
2026-05-07  6:00pm ET   Final Project Presentation Slides Due
2026-05-14  6:00pm ET   Pre-Assignment #14 Due
2026-05-15  11:59pm ET  Mini-Project #04 Due
2026-05-21  11:59pm ET  Final Project Summary Report Due [Tentative]
2026-05-21  11:59pm ET  Final Project Individual Report Due [Tentative]
2026-05-21  11:59pm ET  Final Project Teammate Peer Evaluations Due [Tentative]
2026-05-24  11:59pm ET  Mini-Project Peer Feedback #04 Due

STA 9750 Week 12

Today: Lecture #10: Strings, Regular Expressions, and Text Processing

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R ⬅️
    • Files and APIs ✅
    • Web Scraping ✅
    • Cleaning and Processing Text ⬅️
  • Statistical Modeling in R

Today

  • Course Administration
  • Warm-Up Exercise
  • New Material
    • Strings and Encodings
    • Regular Expressions
    • Text Manipulation
  • Wrap-Up

Course Administration

Mini-Project #04

MP#04 - Going for the Gold

Due 2026-05-15 at 11:59pm ET

Topics covered:

  • Data Import
    • HTTP Request Construction (Week 9)
    • Tabular HTML Scraping (Week 11)
  • \(t\)-tests
  • Putting Everything Together

Course Support

  • Synchronous - MW Office Hours 2x / week:
    • Wednesdays 5pm: In Person
    • Thursdays 5pm: Zoom
  • Asynchronous: Piazza (\(<45\) minute average response time)

Course Project

End of Semester Course Project:

  • In-Class Final Presentations
    • Next Week (2026-05-07)
  • Individual Report: 2026-05-21
  • Group Report: 2026-05-21
  • Peer Evaluations: 2026-05-21

See detailed instructions for rubrics, expectations, etc.

Review Exercise

Apple Rankings

We’re going to parse the page https://applerankings.com to find the best type of apples. See this week’s lab for details.

Breakout Rooms

Breakout Room  Team
1              3-1-Fun! (XC+ML+ER+RJSN)
2              Emissions Impossible (LR+MOG+APTL)
3              Inspector Gadget (MUO+KN+CM+ID+KM)
4              Maniac Braniacs (HHS+KK+FC+DN)
5              Water Benders (JE+JABB+MTP+JA+AS)

’em Apples

Working with Strings

Strings

In R, strings and characters are basically interchangeable

  • Arbitrary “bits of text” that can be stored in a vector
  • Don’t normally need to think about encoding

stringr provides basic tools for string manipulation (str_ functions)

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

Unicode

Unicode is an attempt to standardize all human written language:

  • So hard!
  • Moving target
  • Don’t implement yourself - use libraries

Latest Unicode tables: unicodeplus.com/

Encodings connect Unicode IDs with actual bits on your computer: UTF-8 is backward-compatible with ASCII and should be your default
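As a rough illustration of the difference between a Unicode ID (code point) and the bytes an encoding stores, the base R sketch below inspects two characters. It assumes the R session uses UTF-8; the variable name taco is only for illustration.

```r
# Assumes a UTF-8 R session
taco <- "\U0001F32E"          # taco emoji, written as a Unicode escape

utf8ToInt("A")                # code point of "A" as an integer
utf8ToInt(taco)               # code point U+1F32E as an integer

charToRaw("A")                # "A" is a single byte in UTF-8
charToRaw(taco)               # the emoji takes four bytes in UTF-8

nchar(taco, type = "chars")   # one character ...
nchar(taco, type = "bytes")   # ... stored in several bytes
```

The same code point can map to different bytes under a different encoding, which is why knowing the encoding of incoming data matters.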

Unicode Controversies

Pistol (U+1F52B) emoji:

  • Originally a (regular) gun; Apple led the charge to a water pistol, which is now standard

Unicode Controversies

Taco Controversy 🌮:

Unicode Failures


Unicode+UTF-8 - Modern Standard

Best practices:

  • Use up-to-date, Unicode-compliant libraries like stringr
  • Use UTF-8 strings
  • If your data isn’t UTF-8, make it UTF-8 ASAP
iconv(STRING, from="latin1", to="UTF-8")
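A minimal sketch of that conversion, assuming the incoming bytes really are Latin-1. The input string is made up for illustration; validUTF8() is a base R check for whether a string's bytes form valid UTF-8.

```r
# "caf\xe9" is "café" stored as Latin-1 bytes (0xE9 is é); illustrative input
raw_latin1 <- "caf\xe9"
validUTF8(raw_latin1)          # FALSE: a lone 0xE9 byte is not valid UTF-8

fixed <- iconv(raw_latin1, from = "latin1", to = "UTF-8")
validUTF8(fixed)               # TRUE after conversion
charToRaw(fixed)               # é is now stored as the two bytes c3 a9
```

If the source encoding is guessed wrong, iconv will still produce output, just the wrong characters, so it is worth spot-checking a few converted values.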

stringr

The tidyverse package stringr provides tools for string manipulation:

  • All functions start with str_
  • “Input” string is always first argument
  • Reasonably vectorized

stringr + dplyr

All stringr functions work well in dplyr pipelines (“vectorized”):

library(dplyr); library(stringr)
df <- data.frame(lower_letters = letters)
df |> mutate(upper_letters = str_to_upper(lower_letters))
   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z

Substrings and String Splitting

fruits <- c("apples and oranges and pears and bananas", 
            "pineapples and mangos and guavas")

stringr::str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
stringr::str_split_fixed(fruits, "and", n=2)
     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"            

See also str_split_i to get only one element of split
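A short sketch of str_split_i on the same fruits vector, assuming stringr 1.5.0 or later (where this function is available):

```r
library(stringr)

fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")

# First piece of each split string
str_split_i(fruits, " and ", 1)

# Negative indices count from the end: last piece of each split string
str_split_i(fruits, " and ", -1)
```

Because it returns a plain character vector rather than a list, str_split_i fits directly inside mutate().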

Trimming Strings

Common to have excess whitespace around results: str_trim

split_fruits <- stringr::str_split(fruits, "and") |> purrr::list_c() # list_c comes from purrr
split_fruits
[1] "apples "     " oranges "   " pears "     " bananas"    "pineapples "
[6] " mangos "    " guavas"    

becomes

split_fruits |> str_trim()
[1] "apples"     "oranges"    "pears"      "bananas"    "pineapples"
[6] "mangos"     "guavas"    

Sub-Strings

str_sub to get substrings:

x <- "Baruch College, CUNY"
stringr::str_sub(x, end=6) # Includes endpoints
[1] "Baruch"
stringr::str_sub(x, start=-4) # Count from end
[1] "CUNY"
x <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
stringr::str_sub(x, end=-7) # Drop last _6_
[1] "Baruch College"   "Brooklyn College"
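For completeness, a small sketch combining start and end in one call; positions are 1-based and both endpoints are included:

```r
library(stringr)

x <- "Baruch College, CUNY"
str_length(x)         # total number of characters
str_sub(x, 8, 14)     # characters 8 through 14, inclusive
str_sub(x, -4, -1)    # last four characters, counting from the end
```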

Regular Expressions

Working directly with characters is painful and hard to do properly

Regular Expressions (regex) provide tools for specifying patterns in strings:

  • Regular => following rules

Regular Expression Tools

LLMs are also very good at this if you can specify what you want properly.

Regex 101

A basic regex is just a pattern:

  • a: The regex a will match any string that contains an a:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "a")
[1]  TRUE FALSE FALSE  TRUE
  • Longer patterns are more precise:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "fish")
[1] FALSE FALSE  TRUE  TRUE

Replacement

str_replace replaces a match with something else:

  • str_remove replaces it with nothing
  • Both act on the first match only (cf. str_{remove,replace}_all)

x <- c("123", "123,456", "123,456,789")
str_remove(x, ",")
[1] "123"        "123456"     "123456,789"
str_remove_all(x, ",")
[1] "123"       "123456"    "123456789"
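The same first-match versus all-matches distinction applies to str_replace. A brief sketch with made-up input:

```r
library(stringr)

x <- c("a-b-c", "d-e")
str_replace(x, "-", "+")       # only the first "-" in each string
str_replace_all(x, "-", "+")   # every "-" in each string

# str_remove is shorthand for replacing with the empty string
identical(str_remove(x, "-"), str_replace(x, "-", ""))
```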

Wildcard

The . character is a ‘wildcard’ and matches any single character:

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, ".fish")
[1] FALSE FALSE FALSE  TRUE

(You might have seen a similar use of . in R formulas)

Alternatives

Alternatives can be expressed using a |:

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "a|o")
[1]  TRUE  TRUE FALSE  TRUE

For longer patterns, wrap in parentheses

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "(dog|fish)")
[1] FALSE  TRUE  TRUE  TRUE

Ranges

Sometimes we might want to match a wide range of characters; e.g. digits

Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)

Can use a range notation instead: [0-9]

pets <- c("1 cat", "a dog", "3 fish", "two elephants")
str_detect(pets, "[0-9]")
[1]  TRUE FALSE  TRUE FALSE

Ranges

Useful ranges:

  • [A-Z]: Uppercase letters
  • [a-z]: Lowercase letters
  • [0-9]: Digits

Can also ‘hard code’ a range by listing all elements:

  • [0123456789]
  • [aeiou]

Ranges

Some useful ranges are predefined:

  • [:alpha:]
  • [:lower:]
  • [:upper:]
  • [:digit:]
  • [:alnum:]
  • [:punct:]
  • [:space:]

I like these - quite clear:

pets <- c("1 cat", "a dog", "3 fish", "two elephants")
str_detect(pets, "[:digit:]")
[1]  TRUE FALSE  TRUE FALSE

Quantifiers

Quantifiers (multiple matches):

  • .{a,b}: anywhere from a to b copies (inclusive)
  • .{0,b}: no more than b copies
  • .{a,}: at least a copies
  • .?: zero-or-one, same as .{0,1}
  • .*: zero-or-more, same as .{0,}
  • .+: one-or-more, same as .{1,}
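The list above attaches each quantifier to the wildcard, but a quantifier applies to whatever element comes immediately before it. A small sketch with illustrative spellings, where the quantifier follows a literal u:

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

str_detect(spellings, "colou?r")       # u is optional: zero or one
str_detect(spellings, "colou{1,2}r")   # one or two copies of u
str_detect(spellings, "colou*r")       # any number of u, including none
str_detect(spellings, "colou+r")       # at least one u
```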

Quantifiers

Wildcard match optional:

pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, ".?fish")
[1] FALSE FALSE  TRUE  TRUE

Strings with numbers:

pets <- c("1 cat", "a dogs", "3 fish", "two birds")
str_detect(pets, "[:digit:]")
[1]  TRUE FALSE  TRUE FALSE

Numbers 10 or greater:

pets <- c("1 cat", "3 dogs", "10 fish", "20 birds")
str_detect(pets, "[:digit:]{2,}")
[1] FALSE FALSE  TRUE  TRUE

Regular Expression Practice

With your breakout group, it’s time for some Regular Expression Practice

Start and End Anchors

Anchors let us refer to the start and end of a string:

  • ^: start
  • $: end

Things starting with a number:

songs <- c("Mambo No 5", "99 Red Balloons", "5 Years Time")
str_subset(songs, "^[:digit:]")
[1] "99 Red Balloons" "5 Years Time"   
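The end anchor works the same way, and using both anchors together forces the pattern to cover the entire string. A sketch reusing the songs vector, plus two made-up strings:

```r
library(stringr)

songs <- c("Mambo No 5", "99 Red Balloons", "5 Years Time")

# Things ending with a number
str_subset(songs, "[:digit:]$")

# Both anchors: the whole string must consist of digits
str_detect(c("2026", "STA 9750"), "^[:digit:]+$")
```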

Extracting Matches

Often, we use regex to pull out part of a string:

  • str_detect: is there a ‘fit’?
  • str_extract: extract the whole ‘fit’
  • str_extract(group=) and str_match: extract specific groups

Specify groups with parentheses

"([:digit:]{3})-[:digit:]{3}-[:digit:]{4}"

will extract "646" when applied to "646-312-3257"
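A runnable version of the phone number example, assuming stringr 1.5.0 or later for the group argument:

```r
library(stringr)

phone <- "646-312-3257"
pattern <- "([:digit:]{3})-[:digit:]{3}-[:digit:]{4}"

str_extract(phone, pattern)              # the whole match
str_extract(phone, pattern, group = 1)   # only the first parenthesized group
```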

Extracting Matches

str_detect - is there a match?

x <- "Baruch College, CUNY is a wonderful place to work!"
stringr::str_detect(x, "(.*), CUNY")
[1] TRUE

str_extract - get the matched substring

stringr::str_extract(x, "(.*), CUNY")
[1] "Baruch College, CUNY"

str_match - use capture groups

stringr::str_extract(x, "(.*), CUNY", group=1)
[1] "Baruch College"
stringr::str_match(x, "(.*), CUNY")
     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"

Extracting Matches

Very useful for pulling numbers out of text:

x <- c("123", "456 (estimated)", "Approximately 7.89")
str_extract(x, "[.[:digit:]]+")
[1] "123"  "456"  "7.89"

Extracting Matches

str_extract(group=) is useful for complex data extraction.

x <- c("Michael Weylandt teaches STA9750 on Thursday", 
       "Kamiar Rad teaches STA9891 on Wednesday")
pattern <- c("(.*) teaches ([:alnum:]*)")
stringr::str_extract(x, pattern, group=1)
[1] "Michael Weylandt" "Kamiar Rad"      
stringr::str_extract(x, pattern, group=2)
[1] "STA9750" "STA9891"

Also allows named groups - really helpful!

pattern <- c("(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)")
stringr::str_match(x, pattern) |> 
  as.data.frame()
                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2      Kamiar Rad teaches STA9891 on Wednesday       Kamiar Rad STA9891
    weekday
1  Thursday
2 Wednesday

Exclusion

x <- c("10 blue fish", "three wet goats")
stringr::str_detect(x, "[^0123456789]")
[1] TRUE TRUE

Not quite what we want: [^0123456789] matches any string with at least one non-digit character, and both strings have one

str_detect has a negate option:

stringr::str_detect(x, "[0-9]", negate=TRUE)
[1] FALSE  TRUE
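An alternative sketch that stays within the regex itself: anchoring the negated range to both ends requires every character to be a non-digit, which gives the same answer as negate=TRUE here.

```r
library(stringr)

x <- c("10 blue fish", "three wet goats")

# ^ and $ pin the match to the whole string; [^0-9]* allows only non-digits
str_detect(x, "^[^0-9]*$")
```

The negate option is usually easier to read; the anchored form is mainly useful when the pattern has to be passed to a function without such an option.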

Homoglyphs

x <- c("Η", "H")
str_to_lower(x)
[1] "η" "h"

Why?

uni_info <- Vectorize(\(x) Unicode::u_char_name(utf8ToInt(x)), "x")
uni_info(x)
                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H" 

Homoglyphs

Particularly nasty with dashes - lean on [:punct:] where possible.

x <- c("Em Dash —", "En Dash –", "Hyphen ‐")
stringr::str_remove(x, "[:punct:]") # Works
[1] "Em Dash " "En Dash " "Hyphen " 
stringr::str_remove(x, "-")  # Keyboard minus = Fail
[1] "Em Dash —" "En Dash –" "Hyphen ‐" 
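If the goal is to standardize dashes rather than remove all punctuation, one option is the Unicode property \p{Pd} (dash punctuation), which the ICU engine behind stringr supports. A sketch with made-up strings, written with escapes so the exact characters are unambiguous:

```r
library(stringr)

# \u2014 = em dash, \u2013 = en dash, \u2010 = hyphen, "-" = keyboard hyphen-minus
x <- c("em\u2014dash", "en\u2013dash", "hyphen\u2010ated", "minus-sign")

# Replace every kind of dash with the plain keyboard hyphen-minus
str_replace_all(x, "\\p{Pd}", "-")
```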

Why stringr?

Base R has its own set of regular expression functions (grep and friends)

stringr does the same thing, but with a more consistent interface.

Conversion table online
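A few of the correspondences, as a sketch rather than a full conversion table. Note the reversed argument order, and that base R expects classes like [:digit:] to sit inside another pair of brackets.

```r
library(stringr)

pets <- c("cat", "dog", "fish", "catfish")
x <- c("123", "123,456", "123,456,789")

# Base R puts the pattern first; stringr puts the string first
grepl("fish", pets)              # like str_detect(pets, "fish")
sub(",", "", x)                  # like str_remove(x, ",")
gsub(",", "", x)                 # like str_remove_all(x, ",")
grepl("[[:digit:]]", "3 fish")   # base R form of the [:digit:] class
```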

Regex in Scraping

Let’s practice using these functions in a web-scraping context.

Follow the lab to practice scraping https://quotes.toscrape.com with your breakout group

Items to practice:

  • String detection (does this quote mention X)
  • String manipulation (longest, shortest, etc.)
  • Removing punctuation (str_remove_all)

Regex + Scraping

Regular expressions are incredibly useful when converting HTML text to workable data:

  • Extract numbers
  • Extract relevant parts of strings

Regex + Scraping

Common paradigm: html_text2() |> str_remove_all() |> as.numeric()

prices <- c("8.25", "$1,000", "500 USD", "$12,345.67 (Estimate)")
prices |> str_remove_all("[^.[:digit:]-]") |> as.numeric()
[1]     8.25  1000.00   500.00 12345.67

Here, [^.[:digit:]-] means any character ([]) that is not (^) a period, a digit, or a hyphen (-).
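Keeping the hyphen in the set is what preserves negative values. A sketch that wraps the pattern in a helper; clean_price is a hypothetical name for illustration, not a stringr function, and the inputs are made up:

```r
library(stringr)

# clean_price() is a hypothetical helper, not part of stringr
clean_price <- function(x) {
  x |> str_remove_all("[^.[:digit:]-]") |> as.numeric()
}

clean_price(c("8.25", "$1,000", "-$5.00"))
```

This approach is deliberately blunt: stray periods or hyphens elsewhere in the text (for example in "est.") also survive the cleaning and can turn the result into NA, so it is worth checking for NA values afterwards.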

Regex + Scraping

Another common paradigm is to extract structured text into a data frame when html_table fails

x <- "Adelie female 200g
Gentoo Male 500g
Chinstrap Female 1000g"

str_split(x, "\\n", simplify=TRUE) |> 
    str_match("(?<species>.*) (?<sex>.*) (?<weight>[:digit:]+)g") |> 
    as.data.frame() |>
    select(-V1) |>
    mutate(sex = if_else(str_detect(sex, "[Ff]"), "female", "male"))
    species    sex weight
1    Adelie female    200
2    Gentoo   male    500
3 Chinstrap female   1000

Regex + Scraping

Can also be used to manipulate strings within a data frame:

x <- tribble(
    ~enrollment, ~course,
    50, "STA 9750",
    20, "STA 9890"
)

x |> mutate(dept = str_extract(course, "([:alpha:]{3}) ([:digit:]{4})", group=1),
            numb = str_extract(course, "([:alpha:]{3}) ([:digit:]{4})", group=2))
# A tibble: 2 × 4
  enrollment course   dept  numb 
       <dbl> <chr>    <chr> <chr>
1         50 STA 9750 STA   9750 
2         20 STA 9890 STA   9890 

Cocktail Scraping

With your breakout group, it’s time to finish the cocktail scraping exercise🍸

Aim: Download all recipes from https://cocktails.hadley.nz/

Recall from Last Time:

  1. Make a mental map of site & Identify Selectors
  2. Develop a strategy to get all pages
  3. Function to pull out each recipe
  4. Function to parse each recipe into a data frame
  5. Put it all together

Wrap Up

Processing Strings in R

  • Encoding and Unicode
  • Regular Expressions
  • String Manipulations
  • Capturing with Regex

Musical Treat