| Breakout | Team |
|---|---|
| 1 | TBD |
Today: Lecture #10: Strings, Regular Expressions, and Text Processing
Charles Ramirez is our GTA
MP#04 - TBD
Due 2026-05-15 at 11:59pm ET
Topics covered:
I owe you:
End of Semester Course Project:
See detailed instructions for rubrics, expectations, etc.
Cocktail Scraping Exercise 🍸
Download all recipes from https://cocktails.hadley.nz/
Steps:
In R, “string” and “character” are basically interchangeable: strings are stored in character vectors
stringr provides basic tools for string manipulation (str_ functions)
stringi provides advanced functionality
Easy to get 90% of the way correct - very hard to get 100% correct
Human language is messy - choices are culturally-specific
Unicode standard exists to make it easy (easier…) to do the right thing
Unicode is an attempt to standardize all human written language:
Latest Unicode tables: unicodeplus.com/
Encodings connect Unicode IDs (code points) with actual bits on your computer: UTF-8 is backward-compatible with ASCII and should be your default
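A minimal sketch of the difference between a code point and its encoding, using base R only. The example characters are illustrative choices, not taken from the lecture.

```r
# Code point: the Unicode ID of a character, as an integer
cp_A <- utf8ToInt("A")        # 65, i.e. U+0041
cp_e <- utf8ToInt("\u00e9")   # 233, i.e. U+00E9 (e with acute accent)

# Encoding: the bytes UTF-8 uses to store that code point
bytes_A <- charToRaw("A")                 # 41: ASCII keeps its one-byte form
bytes_e <- charToRaw(enc2utf8("\u00e9"))  # c3 a9: two bytes for a non-ASCII character
```

The one-byte result for A is what compatibility with ASCII means in practice.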
Pistol (U+1F52B) emoji:

Taco Controversy:
Best practices:
stringr
The tidyverse package stringr provides tools for string manipulation:
- Function names start with str_
- All stringr functions work well in dplyr pipelines (“vectorized”):
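The letter table printed next is consistent with a pipeline like the following. The original code is not shown, so this is a reconstruction; the name letter_table is an assumption.

```r
library(dplyr)
library(stringr)

# letters is R's built-in vector of the 26 lowercase letters
letter_table <- data.frame(lower_letters = letters) |>
  mutate(upper_letters = str_to_upper(lower_letters))  # applied to the whole column at once

letter_table
```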
lower_letters upper_letters
1 a A
2 b B
3 c C
4 d D
5 e E
6 f F
7 g G
8 h H
9 i I
10 j J
11 k K
12 l L
13 m M
14 n N
15 o O
16 p P
17 q Q
18 r R
19 s S
20 t T
21 u U
22 v V
23 w W
24 x X
25 y Y
26 z Z
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
[,1] [,2]
[1,] "apples " " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"
See also str_split_i to get only one element of split
Common to have excess whitespace around results: str_trim
[1] "pears" "guavas"
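The split outputs above can be reproduced with a sketch like this one. The input vector fruits is reconstructed from the printed results, so treat it as an assumption.

```r
library(stringr)

fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")

# One character vector per input string, returned as a list
pieces <- str_split(fruits, " and ")

# Exactly n columns, returned as a matrix; splitting on "and" leaves stray spaces
halves <- str_split_fixed(fruits, "and", n = 2)

# Third piece of each split, with surrounding whitespace trimmed
third <- str_split_i(fruits, "and", 3) |> str_trim()
```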
str_sub to get substrings:
[1] "Baruch"
[1] "CUNY"
[1] "Baruch College" "Brooklyn College"
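A sketch of str_sub calls that would give the three results above. The input vector colleges is inferred from the output and is an assumption.

```r
library(stringr)

colleges <- c("Baruch College, CUNY", "Brooklyn College, CUNY")

first_six <- str_sub(colleges[1], 1, 6)    # characters 1 through 6
last_four <- str_sub(colleges[1], -4, -1)  # negative positions count from the end
no_suffix <- str_sub(colleges, end = -7)   # drop the final six characters (", CUNY")
```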
Working directly with characters is painful and hard to do properly
Regular Expressions (regex) provide tools for specifying patterns in strings:
A basic regex is just a pattern:
- a: the regex a will match all strings with an a
[1] TRUE FALSE FALSE TRUE
[1] FALSE FALSE TRUE TRUE
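A small sketch of literal pattern matching with str_detect. The vector animals is a hypothetical input, chosen so the results agree with the two outputs above; the lecture's actual input is not shown.

```r
library(stringr)

animals <- c("cat", "dog", "bird", "lizard")  # hypothetical example data

has_a <- str_detect(animals, "a")  # TRUE for strings containing an "a"
has_i <- str_detect(animals, "i")  # TRUE for strings containing an "i"
```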
str_replace will replace the matched part of a string with something else:
- str_remove will replace it with nothing
- Both act on the first match only (cf. str_remove_all, str_replace_all)
[1] "123" "123456" "123456,789"
[1] "123" "123456" "123456789"
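The two outputs above are what you would get from removing commas from number-like strings. A sketch, assuming the input vector x shown here (inferred from the output):

```r
library(stringr)

x <- c("123", "123,456", "123,456,789")

first_only <- str_remove(x, ",")      # removes only the first comma in each string
all_gone   <- str_remove_all(x, ",")  # removes every comma
```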
The . character is a ‘wildcard’ and matches any single character:
[1] FALSE FALSE FALSE TRUE
(You might have seen . used in a similar way in R formulas)
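A short sketch of the wildcard in use. The vector animals and the patterns are hypothetical examples, not the lecture's own.

```r
library(stringr)

animals <- c("cat", "dog", "bird", "lizard")  # hypothetical example data

# "a", then any one character, then "d": only "lizard" contains "ard"
a_dot_d <- str_detect(animals, "a.d")

# "b", then any one character, then "r": only "bird" contains "bir"
b_dot_r <- str_detect(animals, "b.r")
```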
Alternatives can be expressed using a |:
[1] TRUE TRUE FALSE TRUE
For longer patterns, wrap in parentheses
[1] FALSE TRUE TRUE TRUE
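A sketch of alternation, with and without parentheses. The inputs are hypothetical; the first two results happen to agree with the outputs above.

```r
library(stringr)

animals <- c("cat", "dog", "bird", "lizard")  # hypothetical example data

# Single-character alternatives: contains an "a" or an "o"
a_or_o <- str_detect(animals, "a|o")

# Longer alternatives, wrapped in parentheses
not_cat <- str_detect(animals, "(dog|bird|lizard)")

# Parentheses also limit the alternation to part of a pattern
greys <- str_detect(c("gray", "grey", "groy"), "gr(a|e)y")
```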
Sometimes we might want to match a wide range of characters; e.g. digits
Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)
Can use a range notation instead: [0-9]
[1] TRUE FALSE TRUE FALSE
Useful ranges:
- [A-Z]: Uppercase letters
- [a-z]: Lowercase letters
- [0-9]: Digits
Can also ‘hard code’ a range by listing all elements:
- [0123456789]
- [aeiou]
Some useful ranges are predefined (write them inside brackets, e.g. [[:digit:]]):
- [:alpha:]
- [:lower:]
- [:upper:]
- [:digit:]
- [:alnum:]
- [:punct:]
- [:space:]
I like these - quite clear:
[1] TRUE FALSE TRUE FALSE
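A sketch comparing the range, listed, and predefined forms. The vector x is a hypothetical input, chosen so the digit checks agree with the outputs above.

```r
library(stringr)

x <- c("abc123", "abc", "4 score", "none")  # hypothetical example data

digit_range <- str_detect(x, "[0-9]")        # range notation
digit_class <- str_detect(x, "[[:digit:]]")  # predefined class, inside an outer []
any_vowel   <- str_detect(x, "[aeiou]")      # explicitly listed elements
any_space   <- str_detect(x, "[[:space:]]")  # whitespace class
```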
Quantifiers (multiple matches):
- .{a,b}: anywhere from a to b copies (inclusive)
- .{0,b}: no more than b copies
- .{a,}: at least a copies
- .?: zero-or-one, same as .{0,1}
- .*: zero-or-more, same as .{0,}
- .+: one-or-more, same as .{1,}
Wildcard match optional:
[1] FALSE FALSE TRUE TRUE
Strings with numbers:
[1] TRUE FALSE TRUE FALSE
Numbers 10 or greater:
[1] FALSE FALSE TRUE TRUE
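A sketch of quantifiers applied to a character class and to a single letter. The inputs are hypothetical and are not the ones behind the outputs above.

```r
library(stringr)

x <- c("x", "7", "42", "1234")  # hypothetical example data

has_number <- str_detect(x, "[0-9]+")     # one or more digits
two_plus   <- str_detect(x, "[0-9]{2,}")  # at least two digits in a row

# "u?" makes the u optional, so both spellings match
colours <- str_detect(c("color", "colour", "colr"), "colou?r")
```

Note that "at least two digits" is only a rough stand-in for "10 or greater": a string like "07" would also match.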
Anchors let us refer to the start and end of a string:
- ^: start
- $: end
Things starting with a number:
[1] "99 Red Balloons" "5 Years Time"
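A sketch of anchors with str_subset, which keeps only the matching strings. The vector songs is a hypothetical input; only the two titles printed above come from the lecture.

```r
library(stringr)

songs <- c("99 Red Balloons", "Hey Jude", "5 Years Time", "Let It Be")  # partly hypothetical

starts_with_digit <- str_subset(songs, "^[0-9]")  # ^ pins the match to the start
ends_with_s       <- str_subset(songs, "s$")      # $ pins the match to the end
```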
Often, we use regex to pull out part of a string:
- str_detect: is there a ‘fit’?
- str_extract: extract the whole ‘fit’
- str_match: extract specific groups
Specify groups with parentheses
[1] TRUE
[1] "Baruch College, CUNY"
[,1] [,2]
[1,] "Baruch College, CUNY" "Baruch College"
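The three outputs above are consistent with the following sketch. The input sentence and the pattern are assumptions; only the matched text is taken from the printed results.

```r
library(stringr)

x <- "Baruch College, CUNY is a wonderful place to work!"  # hypothetical input
pattern <- "(.*), CUNY"  # one group: everything before ", CUNY"

found   <- str_detect(x, pattern)   # is there a fit?
whole   <- str_extract(x, pattern)  # the whole fit
grouped <- str_match(x, pattern)    # matrix: whole fit, then one column per group
```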
Very useful for pulling numbers out of text:
[1] "123" "456" "7.89"
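A sketch of pulling the first number out of each string. The input vector and the pattern are hypothetical, chosen to give the three values printed above.

```r
library(stringr)

x <- c("abc123", "def456xyz", "total: 7.89 dollars")  # hypothetical example data

# Digits, then an optional decimal point, then any further digits
numbers_chr <- str_extract(x, "[0-9]+\\.?[0-9]*")
numbers_num <- as.numeric(numbers_chr)  # convert once the text is clean
```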
str_extract(group=) is useful for complex data extraction.
[1] "Michael Weylandt" "KRR"
[1] "STA9750" "STA9891"
Also allows named groups - really helpful!
                                            V1       instructor  course   weekday
1 Michael Weylandt teaches STA9750 on Thursday Michael Weylandt STA9750  Thursday
2             KRR teaches STA9891 on Wednesday              KRR STA9891 Wednesday
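A sketch of group extraction that reproduces the results above. The input strings are read off the V1 column; the patterns are assumptions. The group argument of str_extract and named columns from str_match need stringr 1.5.0 or later.

```r
library(stringr)

x <- c("Michael Weylandt teaches STA9750 on Thursday",
       "KRR teaches STA9891 on Wednesday")

# Numbered groups: pull one group at a time
pattern     <- "(.*) teaches (.*) on (.*)"
instructors <- str_extract(x, pattern, group = 1)
courses     <- str_extract(x, pattern, group = 2)

# Named groups: str_match labels each column with the group name
named    <- "(?<instructor>.*) teaches (?<course>.*) on (?<weekday>.*)"
schedule <- as.data.frame(str_match(x, named))  # unnamed first column becomes V1
```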
[1] TRUE TRUE
Not quite what we want
str_detect has a negate option:
[1] FALSE TRUE
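One situation that gives exactly these two outputs is looking for strings with no digits. This is a guess at the lecture's example; the input and patterns are hypothetical.

```r
library(stringr)

x <- c("abc123", "abc")  # hypothetical example data

# "[^0-9]" asks: is there at least one non-digit? True for both strings.
some_non_digit <- str_detect(x, "[^0-9]")

# negate = TRUE asks: does the pattern fail to match anywhere?
no_digits <- str_detect(x, "[0-9]", negate = TRUE)
```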
[1] "η" "h"
Why?
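A sketch that makes the difference visible using code points. The two inputs are written with escapes so they cannot be confused on screen; the variable names are assumptions.

```r
library(stringr)

lookalikes <- c("\u0397", "H")  # Greek capital eta, Latin capital H

lowered     <- str_to_lower(lookalikes)                         # two different lowercase letters
code_points <- sapply(lookalikes, utf8ToInt, USE.NAMES = FALSE) # 919 and 72
same_char   <- lookalikes[1] == lookalikes[2]                   # FALSE despite looking alike
```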
Η H
"GREEK CAPITAL LETTER ETA" "LATIN CAPITAL LETTER H"
Particularly nasty with dashes - lean on [[:punct:]] where possible.
[1] "Em Dash " "En Dash " "Hyphen "
[1] "Em Dash —" "En Dash –" "Hyphen ‐"
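The two outputs above are consistent with the following sketch, assuming the first came from a punctuation class and the second from a plain keyboard hyphen. The input is rebuilt with escapes from the printed strings.

```r
library(stringr)

x <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010")

# The punctuation class covers all three dash characters
no_punct <- str_remove(x, "[[:punct:]]")

# The keyboard hyphen-minus (U+002D) is a different character from all three
no_keyboard_hyphen <- str_remove(x, "-")
```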
Base R has its own set of regular expression functions (grep and friends)
stringr does the same thing, but with a more consistent interface.
Conversion table online
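A few common pairs, as a sketch of the correspondence. The input vector is hypothetical. Note the argument order: base R takes the pattern first, stringr takes the string first.

```r
library(stringr)

x <- c("apple", "banana", "cherry")  # hypothetical example data

detect_base <- grepl("an", x)                   # base R
detect_str  <- str_detect(x, "an")              # stringr

first_base  <- sub("a", "A", x)                 # first match only
first_str   <- str_replace(x, "a", "A")

all_base    <- gsub("a", "A", x)                # every match
all_str     <- str_replace_all(x, "a", "A")
```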
With your breakout group, it’s time for some Regular Expression Practice
Regular expressions are incredibly useful when converting HTML text to workable data:
Common paradigm: html_text2() |> str_remove_all() |> as.numeric()
[1] 8.25 1000.00 500.00 12345.67
Here, [^.[:digit:]] means anything ([]) that is not (^) a period or a digit.
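A sketch of the cleaning step on its own. The vector prices stands in for what html_text2() might return; it is a hypothetical input chosen to give the numbers above.

```r
library(stringr)

prices <- c("$8.25", "$1,000", "500 USD", "$12,345.67")  # hypothetical scraped text

clean <- prices |>
  str_remove_all("[^.[:digit:]]") |>  # drop everything except periods and digits
  as.numeric()
```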
Another common paradigm is to extract structured text into a data frame when html_table fails
species sex weight
1 Adelie female 200
2 Gentoo male 500
3 Chinstrap female 1000
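One way to build a table like the one above, as a sketch. The raw text format and the pattern are hypothetical; the lecture's actual source text and method are not shown. Named columns from str_match need stringr 1.5.0 or later.

```r
library(stringr)

# Hypothetical scraped lines; the real page layout is an assumption
raw <- c("Species: Adelie, Sex: female, Weight: 200",
         "Species: Gentoo, Sex: male, Weight: 500",
         "Species: Chinstrap, Sex: female, Weight: 1000")

m <- str_match(raw, "Species: (?<species>\\w+), Sex: (?<sex>\\w+), Weight: (?<weight>\\d+)")

penguins <- data.frame(
  species = m[, "species"],
  sex     = m[, "sex"],
  weight  = as.numeric(m[, "weight"])  # convert the captured text to a number
)
```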
Can also be used to manipulate strings within a data frame:
# A tibble: 2 × 4
enrollment course dept numb
<dbl> <chr> <chr> <chr>
1 50 STA 9750 STA 9750
2 20 STA 9890 STA 9890
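A sketch that produces a tibble with the same columns using mutate and str_extract. The lecture may have used a different function to do the split; the patterns here are assumptions.

```r
library(dplyr)
library(stringr)

courses <- tibble(enrollment = c(50, 20),
                  course     = c("STA 9750", "STA 9890"))

courses_split <- courses |>
  mutate(dept = str_extract(course, "[A-Z]+"),  # leading run of capital letters
         numb = str_extract(course, "[0-9]+"))  # run of digits, kept as text
```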
With your breakout group, it’s time to finish the cocktail scraping exercise
Processing Strings in R
Computational Statistical Inference
Upcoming work from course calendar
Remaining Topic:
Concert season - remember CUNY Student Benefits