Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 12

Today:

  • Tuesday Section: 2025-11-25
  • Thursday Section: 2025-11-20

Lecture #10: Strings, Regular Expressions, and Text Processing

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R ⬅️
    • Files and APIs ✅
    • Web Scraping ✅
    • Cleaning and Processing Text ⬅️
  • Statistical Modeling in R

Today

  • Course Administration
  • Warm-Up Exercise
  • New Material
    • String Encodings
    • Regular Expressions
    • Text Manipulation
    • Computational Inference
  • Wrap-Up
    • Life Tip of the Day

Course Administration

GTA

Charles Ramirez is our GTA

  • Wednesday Office Hours moved to 5:15-7:15 for greater access
    • Give a bit of flexibility on the front end for CR to get off work
  • Grading Meta-Review #02 after Peer Feedback

Mini-Project #04

MP#04 - Just the Fact(-Check)s, Ma’am!

Due 2025-12-05 at 11:59pm ET

Topics covered:

  • Data Import
    • HTTP Request Construction (Week 9)
    • Tabular HTML Scraping (Week 11)
  • \(t\)-tests
  • Putting Everything Together

Grading in Progress

I owe you:

  • Mid-Term Check-In Presentation Feedback
  • MP#02 Meta-Review
  • Selected Regrades

Course Support

  • Synchronous
    • MW Office Hours 2x / week: Tuesdays + Thursdays 5pm
      • Rest of Semester except Thanksgiving (Nov 27th)
    • GTA Office Hours: Wednesdays at 5:15-7:15pm
  • Asynchronous: Piazza (\(<20\) minute average response time)

Course Project

End of Semester Course Project:

  • In-Class Final Presentations
    • Last week of class or during finals week (optional)
  • Individual Report
    • 2025-12-18
  • Group Report
    • 2025-12-18
  • Peer Evaluations
    • 2025-12-18

See detailed instructions for rubrics, expectations, etc.

Review Exercise

Cocktail Scraping Exercise 🍸

Download all recipes from https://cocktails.hadley.nz/

Steps:

  1. Make a mental map of site & Identify Selectors
  2. Develop a strategy to get all pages
  3. Function to pull out each recipe
  4. Function to parse each recipe into a data frame
  5. Put it all together

Breakout Rooms

Breakout Teams
1 Cycle Paths (T) + Green Apple (R)
2 Inspector Clouseau (T) + House Busters (R)
3 Weight Watchers (T) + Irish Mafia (R)
4 Point of Interest (T) + Urban Health Insight Group (R)
5 Sounds Good (T) + Wellness Warriors (R)
6 The Mean, Green, Data-Analyzing Team (T) + Gridion Regression (R)
7 Nightshift Analysts (T) + Standard Deviants (R)
8 Stats & The City (R)
9 Kitchen Nightmares (R)

Working with Strings

Strings

In R, strings and characters are basically interchangeable

  • Arbitrary “bits of text” that can be stored in a vector
  • Don’t normally need to think about encoding

stringr provides basic tools for string manipulation (str_ functions)

stringi provides advanced functionality

String Handling

Easy to get 90% of the way correct - very hard to get 100% correct

Human language is messy - choices are culturally-specific

Unicode standard exists to make it easy (easier…) to do the right thing

Unicode

Unicode is an attempt to standardize all human written language:

  • So hard!
  • Moving target
  • Don’t implement yourself - use libraries

Latest Unicode tables: unicodeplus.com/

Encodings connect Unicode IDs (code points) with actual bits on your computer: UTF-8 is backward-compatible with ASCII and should be your default

Unicode Controversies

Pistol (U+1F52B) emoji:

  • Originally a (regular) gun; Apple led the charge to a water pistol, which is now standard

Taco Controversy:


Unicode+UTF-8 - Modern Standard

Best practices:

  • Use updated Unicode compliant libraries like stringr
  • Use UTF-8 strings
  • If your data isn’t UTF-8, make it UTF-8 ASAP
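As a minimal sketch of the "make it UTF-8 ASAP" advice, the base R example below builds a Latin-1 string from raw bytes and converts it with iconv. The example data ("café" in Latin-1) is an assumption for illustration; in practice the source encoding has to be known or guessed.

```r
# Bytes of "café" in Latin-1: the final byte 0xe9 is "é" (assumed example data)
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
legacy <- rawToChar(latin1_bytes)

validUTF8(legacy)   # FALSE: a lone 0xe9 byte is not valid UTF-8

# Convert to UTF-8 as early as possible in the pipeline
clean <- iconv(legacy, from = "latin1", to = "UTF-8")

validUTF8(clean)    # TRUE
charToRaw(clean)    # 63 61 66 c3 a9: "é" now takes two bytes
```

Once everything is UTF-8, the stringr functions behave consistently regardless of where the data came from.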

stringr

The tidyverse package stringr provides tools for string manipulation:

  • All functions start with str_
  • “Input” string is always first argument
  • Reasonably vectorized

stringr + dplyr

All stringr functions work well in dplyr pipelines (“vectorized”):

   lower_letters upper_letters
1              a             A
2              b             B
3              c             C
4              d             D
5              e             E
6              f             F
7              g             G
8              h             H
9              i             I
10             j             J
11             k             K
12             l             L
13             m             M
14             n             N
15             o             O
16             p             P
17             q             Q
18             r             R
19             s             S
20             t             T
21             u             U
22             v             V
23             w             W
24             x             X
25             y             Y
26             z             Z
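A table like the one above could be produced by a pipeline along these lines. This is a sketch, not necessarily the code used for the slide; it relies on R's built-in `letters` vector and on str_to_upper.

```r
library(dplyr)
library(stringr)

# `letters` is R's built-in vector of the 26 lowercase letters
alphabet <- data.frame(lower_letters = letters) |>
  mutate(upper_letters = str_to_upper(lower_letters))

head(alphabet, 3)
```

Because str_to_upper takes a whole vector and returns a vector of the same length, it drops straight into mutate.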

Substrings and String Splitting

[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
     [,1]          [,2]                            
[1,] "apples "     " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"            

See also str_split_i to get only one element of split
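The outputs above are consistent with calls like the following. The input vector is reconstructed from the printed results, so treat it as an assumption.

```r
library(stringr)

# Assumed input, reconstructed from the output shown above
fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")

# Split on every " and ": returns a list with one character vector per input
pieces <- str_split(fruits, " and ")

# Split into at most n = 2 pieces: returns a character matrix
halves <- str_split_fixed(fruits, "and", n = 2)

# Keep only the i-th piece of each split (requires stringr 1.5.0 or later)
first <- str_split_i(fruits, " and ", 1)
```

Note that splitting on "and" without the surrounding spaces leaves stray whitespace in the pieces, which is visible in the matrix output.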

Trimming Strings

Common to have excess whitespace around results: str_trim

[1] "pears"  "guavas"
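One way to arrive at the result above, assuming the same fruit strings as the splitting example (the input is an assumption reconstructed from the output):

```r
library(stringr)

fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")

# Splitting on "and" (no spaces) leaves whitespace around each piece
third <- str_split_i(fruits, "and", 3)   # " pears " and " guavas"

# str_trim removes leading and trailing whitespace
trimmed <- str_trim(third)               # "pears" and "guavas"
```

str_squish is a related helper that also collapses repeated internal whitespace.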

Sub-Strings

str_sub to get substrings:

[1] "Baruch"
[1] "CUNY"
[1] "Baruch College"   "Brooklyn College"
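A sketch of str_sub calls that give the three results above. The input strings are assumptions chosen to match the output; negative positions count from the end of the string.

```r
library(stringr)

school <- "Baruch College, CUNY"

str_sub(school, 1, 6)    # first six characters: "Baruch"
str_sub(school, -4)      # last four characters: "CUNY"

# Vectorized: drop the trailing ", CUNY" (6 characters) from each string
schools <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
colleges <- str_sub(schools, end = -7)
```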

Regular Expressions

Working directly with characters is painful and hard to do properly

Regular Expressions (regex) provide tools for specifying patterns in strings:

  • Regular => following rules

Regular Expression Tools

Regex 101

A basic regex is just a pattern:

  • a: The regex a will match all strings with an a:
[1]  TRUE FALSE FALSE  TRUE
  • Longer patterns are more precise:
[1] FALSE FALSE  TRUE  TRUE
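The slide's input vector is not shown, so the example below uses a hypothetical `animals` vector chosen to give the same TRUE/FALSE patterns with str_detect.

```r
library(stringr)

# Hypothetical input (not the slide's original data)
animals <- c("cat", "dog", "bird", "blackbird")

# Any string containing an "a"
has_a <- str_detect(animals, "a")        # TRUE FALSE FALSE TRUE

# A longer pattern must match as a contiguous run of characters
has_bird <- str_detect(animals, "bird")  # FALSE FALSE TRUE TRUE
```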

Replacement

str_replace will replace a match with something else:

  • str_remove will replace with nothing
  • Only the first match is affected (cf. str_{remove,replace}_all)

[1] "123"        "123456"     "123456,789"
[1] "123"       "123456"    "123456789"
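The two outputs above are what str_remove and str_remove_all return on comma-separated numbers like these. The input vector is an assumption reconstructed from the results.

```r
library(stringr)

# Assumed input, reconstructed from the output shown above
numbers <- c("123", "123,456", "123,456,789")

# Only the first comma is removed
first_only <- str_remove(numbers, ",")       # "123" "123456" "123456,789"

# Every comma is removed
all_gone <- str_remove_all(numbers, ",")     # "123" "123456" "123456789"

# str_replace / str_replace_all work the same way with a replacement string
spaced <- str_replace_all(numbers, ",", " ")
```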

Wildcard

The . character is a ‘wildcard’ and matches any single character:

[1] FALSE FALSE FALSE  TRUE

(You might have seen a similar usage using formulas)
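A small sketch of the wildcard in action. The `animals` and `files` vectors are hypothetical; the first call gives the same pattern of results as the output above.

```r
library(stringr)

# Hypothetical input (not the slide's original data)
animals <- c("cat", "dog", "bird", "blackbird")

# "b", then any one character, then "a": only "bla" in "blackbird" fits
wild <- str_detect(animals, "b.a")    # FALSE FALSE FALSE TRUE

# To match a literal period, escape it (note the doubled backslash in R)
files <- c("a.b", "acb")
literal_dot <- str_detect(files, "a\\.b")   # TRUE FALSE
```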

Alternatives

Alternatives can be expressed using a |:

[1]  TRUE  TRUE FALSE  TRUE

For longer patterns, wrap in parentheses

[1] FALSE  TRUE  TRUE  TRUE
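A sketch of both forms of alternation, using a hypothetical `animals` vector chosen to reproduce the TRUE/FALSE patterns above:

```r
library(stringr)

# Hypothetical input (not the slide's original data)
animals <- c("cat", "dog", "bird", "blackbird")

# Single-character alternatives: an "a" or an "o" anywhere in the string
a_or_o <- str_detect(animals, "a|o")              # TRUE TRUE FALSE TRUE

# Longer alternatives, wrapped in parentheses
dog_or_bird <- str_detect(animals, "(dog|bird)")  # FALSE TRUE TRUE TRUE
```

The parentheses matter most when the alternation is only part of a larger pattern, e.g. `gr(a|e)y` versus `gra|ey`.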

Ranges

Sometimes we might want to match a wide range of characters; e.g. digits

Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)

Can use a range notation instead: [0-9]

[1]  TRUE FALSE  TRUE FALSE
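For illustration, a hypothetical `items` vector that gives the same result as above, showing that the range and the spelled-out alternatives are equivalent:

```r
library(stringr)

# Hypothetical input (not the slide's original data)
items <- c("1 apple", "two pears", "3 plums", "no fruit")

with_range <- str_detect(items, "[0-9]")                  # TRUE FALSE TRUE FALSE
with_alts  <- str_detect(items, "(0|1|2|3|4|5|6|7|8|9)")  # same result

identical(with_range, with_alts)   # TRUE
```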

Ranges

Useful ranges:

  • [A-Z]: Uppercase letters
  • [a-z]: Lowercase letters
  • [0-9]: Digits

Can also ‘hard code’ a range by listing all elements:

  • [0123456789]
  • [aeiou]

Ranges

Some useful ranges are hard-coded:

  • [:alpha:]
  • [:lower:]
  • [:upper:]
  • [:digit:]
  • [:alnum:]
  • [:punct:]
  • [:space:]

I like these - quite clear:

[1]  TRUE FALSE  TRUE FALSE
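A sketch of the named classes in use. In the patterns below the class name sits inside a bracket expression, so it is written with doubled brackets, e.g. `[[:digit:]]`. All inputs are hypothetical.

```r
library(stringr)

# Hypothetical input (not the slide's original data)
items <- c("1 apple", "two pears", "3 plums", "no fruit")

has_digit <- str_detect(items, "[[:digit:]]")     # TRUE FALSE TRUE FALSE

# Strip punctuation from a string
no_punct <- str_remove_all("Hello, world!", "[[:punct:]]")   # "Hello world"

# Count uppercase letters
n_upper <- str_count("STA 9750", "[[:upper:]]")   # 3
```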

Quantifiers

Quantifiers (multiple matches):

  • .{a,b}: anywhere from a to b copies (inclusive)
  • .{0,b}: no more than b copies
  • .{a,}: at least a copies
  • .?: zero-or-one, same as .{0,1}
  • .*: zero-or-more, same as .{0,}
  • .+: one-or-more, same as .{1,}

Quantifiers

Wildcard match optional:

[1] FALSE FALSE  TRUE  TRUE

Strings with numbers:

[1]  TRUE FALSE  TRUE FALSE

Numbers 10 or greater:

[1] FALSE FALSE  TRUE  TRUE
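The three results above can be reproduced with patterns like these. The input vectors are hypothetical stand-ins, since the slide's original data is not shown.

```r
library(stringr)

# Hypothetical inputs (not the slide's original data)
animals <- c("cat", "dog", "bird", "blackbird")
items   <- c("1 apple", "two pears", "3 plums", "no fruit")
amounts <- c("7", "9", "10", "250")

# Optional wildcard: "b", then zero or one of any character, then "r"
optional <- str_detect(animals, "b.?r")         # FALSE FALSE TRUE TRUE

# One or more digits: strings containing a number
has_number <- str_detect(items, "[0-9]+")       # TRUE FALSE TRUE FALSE

# At least two digits in a row: numbers 10 or greater
two_digits <- str_detect(amounts, "[0-9]{2,}")  # FALSE FALSE TRUE TRUE
```

The last pattern only checks the digit count, so a zero-padded value such as "007" would also match; handle leading zeros separately if they can occur.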

Start and End Anchors

Anchors let us refer to the start and end of a string:

  • ^: start
  • $: end

Things starting with a number:

[1] "99 Red Balloons" "5 Years Time"   
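A sketch using str_subset, which keeps only the strings that match. The `songs` vector is hypothetical; only the two titles in the output above are taken from the slide.

```r
library(stringr)

# Hypothetical input containing the two titles shown above
songs <- c("99 Red Balloons", "Hey Jude", "5 Years Time", "Take Five")

# ^ anchors the match to the start of the string
starts_with_number <- str_subset(songs, "^[0-9]")

# $ anchors the match to the end of the string
ends_with_s <- str_subset(songs, "s$")
```

Without the `^`, the first pattern would match a digit anywhere in the title.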

Extracting Matches

Often, we use regex to pull out part of a string:

  • str_detect is there a ‘fit’?
  • str_extract extract the whole ‘fit’
  • str_match extract specific groups

Specify groups with parentheses

Extracting Matches

[1] TRUE
[1] "Baruch College, CUNY"
     [,1]                   [,2]            
[1,] "Baruch College, CUNY" "Baruch College"
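The three outputs above are consistent with applying one pattern to one string in three ways. Both the input and the pattern below are assumptions chosen to match the printed results.

```r
library(stringr)

school  <- "Baruch College, CUNY"
pattern <- "(.*), CUNY"   # the parentheses define capture group 1

found <- str_detect(school, pattern)   # TRUE: is there a fit?
whole <- str_extract(school, pattern)  # the whole fit: "Baruch College, CUNY"
parts <- str_match(school, pattern)    # matrix: column 1 = whole fit, column 2 = group 1
```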

Extracting Matches

Very useful for pulling numbers out of text:

[1] "123"  "456"  "7.89"
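A sketch of a number-extracting pattern that gives the result above. The input strings are hypothetical, and the pattern is one reasonable choice among many: one or more digits, an optional decimal point, then any further digits.

```r
library(stringr)

# Hypothetical input (not the slide's original data)
text <- c("abc 123", "456 def", "cost 7.89")

numbers <- str_extract(text, "[0-9]+\\.?[0-9]*")   # "123" "456" "7.89"

# The matches are still strings; convert when arithmetic is needed
values <- as.numeric(numbers)
```

This pattern ignores signs, thousands separators, and scientific notation, so it would need extending for messier data.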

Extracting Matches

str_extract(group=) is useful for complex data extraction.

[1] "Michael Weylandt" "KRR"             
[1] "STA9750" "STA9891"

Also allows named groups - really helpful!

                                            V1       instructor  course
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750
2             KRR teaches STA9891 on Wednesday              KRR STA9891
    weekday
1  Thursday
2 Wednesday
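A sketch that reproduces the outputs above. The input sentences are taken from the printed data frame; the patterns are assumptions. The `group` argument of str_extract and named columns from str_match assume stringr 1.5.0 or later.

```r
library(stringr)

classes <- c("Michael Weylandt teaches STA9750 on Thursday",
             "KRR teaches STA9891 on Wednesday")

# Pull out a single capture group by position
instructors <- str_extract(classes, "(.*) teaches (.*) on (.*)", group = 1)
courses     <- str_extract(classes, "(.*) teaches (.*) on (.*)", group = 2)

# Named groups: (?<name>...) labels the matching column of the result
named <- "(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)"
m <- str_match(classes, named)

# The unnamed first column (the whole match) becomes V1 in the data frame
schedule <- as.data.frame(m)
```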

Exclusion

[1] TRUE TRUE

Not quite what we want

str_detect has a negate option:

[1] FALSE  TRUE
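A sketch of the pitfall, with hypothetical input. A negated class `[^0-9]` asks "is there any non-digit character?", which is true for almost every string; `negate = TRUE` asks "is there no digit at all?".

```r
library(stringr)

# Hypothetical input (not the slide's original data)
rooms <- c("Room 101", "Lobby")

# Matches any string with at least one non-digit: not quite what we want
any_non_digit <- str_detect(rooms, "[^0-9]")               # TRUE TRUE

# Strings with no digits anywhere
no_digits <- str_detect(rooms, "[0-9]", negate = TRUE)     # FALSE TRUE

# Equivalent using anchors: the entire string is non-digits
no_digits_anchored <- str_detect(rooms, "^[^0-9]*$")       # FALSE TRUE
```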

Homoglyphs

[1] "η" "h"

Why?

                         Η                          H 
"GREEK CAPITAL LETTER ETA"   "LATIN CAPITAL LETTER H" 

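A sketch of how the look-alike behavior above can be reproduced and diagnosed. It uses base R's utf8ToInt to show code points rather than character names, which is a swapped-in technique and not necessarily how the slide's name lookup was done.

```r
library(stringr)

# "\u0397" is GREEK CAPITAL LETTER ETA; it renders like a Latin capital H
look_alikes <- c("\u0397", "H")

# Lowercasing reveals the difference: Greek eta becomes "\u03b7", not "h"
lowered <- str_to_lower(look_alikes)

# The two strings are not equal, despite looking identical
same <- look_alikes[1] == look_alikes[2]         # FALSE

# Code points: 919 (U+0397) versus 72 (U+0048)
code_points <- sapply(look_alikes, utf8ToInt)
```

Checking code points is a quick first step whenever two strings that look the same refuse to match.
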
Homoglyphs

Particularly nasty with dashes - lean on [[:punct:]] where possible.

[1] "Em Dash " "En Dash " "Hyphen " 
[1] "Em Dash —" "En Dash –" "Hyphen ‐" 
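A sketch consistent with the two outputs above, with the input reconstructed from them. The dashes are written as Unicode escapes so the code does not depend on the file's encoding: U+2014 (em dash), U+2013 (en dash), and U+2010 (hyphen, which is not the ASCII hyphen-minus on the keyboard).

```r
library(stringr)

dashes <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010")

# The punctuation class covers all three dash characters
removed <- str_remove(dashes, "[[:punct:]]")

# A typed "-" is the ASCII hyphen-minus (U+002D) and matches none of them
unchanged <- str_remove(dashes, "-")
```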

Why stringr?

Base R has its own set of regular expression functions (grep and friends)

stringr does the same thing, but with a more consistent interface.

Conversion table online

Regular Expression Practice

With your breakout group, it’s time for some Regular Expression Practice

Regex + Scraping

Regular expressions are incredibly useful when converting HTML text to workable data:

  • Extract numbers
  • Extract relevant parts of strings

Regex + Scraping

Common paradigm: html_text2() |> str_remove_all() |> as.numeric()

[1]     8.25  1000.00   500.00 12345.67

Here, [^.[:digit:]] means anything ([]) that is not (^) a period or a digit.
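A sketch of the cleaning step on its own, without the scraping. The `scraped` strings are hypothetical stand-ins for what html_text2() might return; they are chosen to give the numbers shown above.

```r
library(stringr)

# Hypothetical scraped text (not the slide's original data)
scraped <- c("$8.25", "1,000", "500 USD", "$12,345.67")

values <- scraped |>
  str_remove_all("[^.[:digit:]]") |>   # drop everything except digits and periods
  as.numeric()

values
```

This approach assumes a period is the decimal mark; text using a decimal comma would need a different rule.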

Regex + Scraping

Another common paradigm is to extract structured text into a data frame when html_table fails

    species    sex weight
1    Adelie female    200
2    Gentoo   male    500
3 Chinstrap female   1000
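A sketch of this paradigm using str_match. The raw text format below is entirely hypothetical (the slide's source text is not shown); only the resulting columns follow the table above.

```r
library(stringr)

# Hypothetical scraped lines (the real source format is not shown in the slide)
raw <- c("Adelie (female): 200g",
         "Gentoo (male): 500g",
         "Chinstrap (female): 1000g")

# Three capture groups: species, sex, weight
pattern <- "([A-Za-z]+) \\(([a-z]+)\\): ([0-9]+)g"
m <- str_match(raw, pattern)

# Column 1 is the whole match; columns 2 to 4 are the groups
penguins <- data.frame(
  species = m[, 2],
  sex     = m[, 3],
  weight  = as.numeric(m[, 4])
)
```

Rows that fail to match come back as NA, which makes malformed lines easy to spot.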

Regex + Scraping

Can also be used to manipulate strings within a data frame:

# A tibble: 2 × 4
  enrollment course   dept  numb 
       <dbl> <chr>    <chr> <chr>
1         50 STA 9750 STA   9750 
2         20 STA 9890 STA   9890 
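A tibble like the one above could come from a mutate along these lines. The input columns are reconstructed from the output, and the patterns are one reasonable choice.

```r
library(dplyr)
library(stringr)

courses <- tibble(
  enrollment = c(50, 20),
  course     = c("STA 9750", "STA 9890")
) |>
  mutate(
    dept = str_extract(course, "[A-Z]+"),   # first run of capital letters
    numb = str_extract(course, "[0-9]+")    # first run of digits (still a string)
  )

courses
```

`numb` stays a character column, as in the output above; wrap it in as.integer() if it needs to be numeric.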

Cocktail Scraping

With your breakout group, it’s time to finish the cocktail scraping exercise

Cocktails

Wrap Up

Processing Strings in R

  • Encoding and Unicode
  • Regular Expressions

Computational Statistical Inference

Upcoming Work

Upcoming work from course calendar

Remaining Topic:

  • Machine Learning (Predictive Modeling)
    • After Thanksgiving 🦃

Musical Treat


Concert season - remember CUNY Student Benefits