Strings
In R, strings and characters are basically interchangeable
- Arbitrary “bits of text” that can be stored in a vector
- Don’t normally need to think about encoding
stringr provides basic tools for string manipulation (str_ functions)
stringi provides advanced functionality
String Handling
Easy to get 90% of the way correct - very hard to get 100% correct
Human language is messy - choices are culturally-specific
Unicode standard exists to make it easy (easier…) to do the right thing
Unicode
Unicode is an attempt to standardize all human written language:
- So hard!
- Moving target
- Don’t implement yourself - use libraries
Encodings connect Unicode IDs (code points) with actual bits on your computer. UTF-8 is backward-compatible with ASCII and should be your default
Unicode Controversies
Pistol (U+1F52B) emoji:
- Originally a (regular) gun; Apple led the charge to a water pistol, which is now standard
Unicode+UTF-8 - Modern Standard
Best practices:
- Use up-to-date, Unicode-compliant libraries like stringr
- Use UTF-8 strings
- If your data isn’t UTF-8, make it UTF-8 ASAP
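A minimal sketch of converting to UTF-8 with base R, assuming the input is known to be Latin-1. The byte vector is a made-up example (the word "café" in Latin-1), not data from this deck.

```r
# Hypothetical input: the four bytes of "café" encoded as Latin-1
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
x <- rawToChar(latin1_bytes)

validUTF8(x)        # FALSE: a lone 0xe9 byte is not valid UTF-8

# Re-encode, stating the source encoding explicitly
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

validUTF8(x_utf8)   # TRUE
utf8ToInt(x_utf8)   # 99 97 102 233 (233 is U+00E9, e with acute accent)
```

The conversion only works if the source encoding is stated correctly, so it is worth recording where each file came from.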
stringr
The tidyverse package stringr provides tools for string manipulation:
- All functions start with str_
- “Input” string is always first argument
- Reasonably vectorized
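A short sketch of these conventions. The `campuses` vector is a hypothetical input; each call takes the string as its first argument and works on the whole vector at once.

```r
library(stringr)

campuses <- c("baruch", "brooklyn", "hunter")  # hypothetical input

str_length(campuses)         # 6 8 6
str_to_title(campuses)       # "Baruch" "Brooklyn" "Hunter"
str_c(campuses, " college")  # appends the same suffix to every element
```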
stringr + dplyr
All stringr functions work well in dplyr pipelines (“vectorized”):
lower_letters upper_letters
1 a A
2 b B
3 c C
4 d D
5 e E
6 f F
7 g G
8 h H
9 i I
10 j J
11 k K
12 l L
13 m M
14 n N
15 o O
16 p P
17 q Q
18 r R
19 s S
20 t T
21 u U
22 v V
23 w W
24 x X
25 y Y
26 z Z
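A table like the one above could be produced along these lines. This is a sketch, not necessarily the original slide code; it assumes `str_to_upper` inside `mutate` was the intended illustration.

```r
library(dplyr)
library(stringr)

# letters is the built-in vector "a", "b", ..., "z"
letters_df <- data.frame(lower_letters = letters) |>
  mutate(upper_letters = str_to_upper(lower_letters))

letters_df
```

Because `str_to_upper` is vectorized, `mutate` can apply it to the whole column in a single call.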
Substrings and String Splitting
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
[,1] [,2]
[1,] "apples " " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"
See also str_split_i to get only one element of each split
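The outputs above are consistent with calls like the following. This is a sketch: the `fruits` vector is an assumed input, and `str_split_i` requires stringr 1.5.0 or later.

```r
library(stringr)

# Assumed input
fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)

# A list with one character vector per input string
str_split(fruits, " and ")

# A character matrix with exactly n columns; the remainder stays unsplit
str_split_fixed(fruits, "and", n = 2)

# Only the i-th piece of each split
str_split_i(fruits, " and ", i = 1)  # "apples" "pineapples"
```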
Trimming Strings
It is common to have excess whitespace around results; str_trim removes it
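A small sketch of trimming, using a hypothetical `pieces` vector. `str_squish` is a related stringr function that also collapses repeated internal whitespace.

```r
library(stringr)

pieces <- c("apples ", " oranges and pears")  # hypothetical input

str_trim(pieces)                 # "apples" "oranges and pears"
str_trim(pieces, side = "left")  # "apples " "oranges and pears"

str_squish("  too   many   spaces  ")  # "too many spaces"
```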
Sub-Strings
str_sub to get substrings:
[1] "Baruch College" "Brooklyn College"
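One way to get this result, assuming the inputs carried a five-character "CUNY " prefix. The input vector here is hypothetical.

```r
library(stringr)

schools <- c("CUNY Baruch College", "CUNY Brooklyn College")  # assumed input

str_sub(schools, start = 6)         # "Baruch College" "Brooklyn College"
str_sub(schools, start = 1, end = 4)  # "CUNY" "CUNY"
str_sub(schools, start = -7)        # negative positions count from the end: "College" "College"
```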
Regular Expressions
Working directly with characters is painful and hard to do properly
Regular Expressions (regex) provide tools for specifying patterns in strings:
- Regular => following rules
Regex 101
A basic regex is just a pattern:
The regex a matches every string that contains an a:
[1] TRUE FALSE FALSE TRUE
- Longer patterns are more precise:
[1] FALSE FALSE TRUE TRUE
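A sketch with `str_detect`, which returns one TRUE/FALSE per input string. The `fruit` vector is a hypothetical input, so the results differ from the output shown above.

```r
library(stringr)

fruit <- c("apple", "cherry", "kiwi", "banana")  # hypothetical input

str_detect(fruit, "a")    # TRUE FALSE FALSE TRUE
str_detect(fruit, "an")   # FALSE FALSE FALSE TRUE  (longer pattern, fewer matches)
```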
Replacement
str_replace will replace string with something else:
- str_remove will replace with nothing
- Does first match (cf str_{remove,replace}_all)
[1] "123" "123456" "123456,789"
[1] "123" "123456" "123456789"
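The two outputs above are consistent with removing commas from an input like the one below (the vector is an assumption). The first call removes only the first match per string; the `_all` variant removes every match.

```r
library(stringr)

amounts <- c("123", "123,456", "123,456,789")  # assumed input

str_remove(amounts, ",")       # "123" "123456" "123456,789"
str_remove_all(amounts, ",")   # "123" "123456" "123456789"

# str_replace takes an explicit replacement
str_replace_all(amounts, ",", "_")
```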
Wildcard
The . character is a ‘wildcard’ and matches anything:
[1] FALSE FALSE FALSE TRUE
(You might have seen a similar usage using formulas)
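A sketch of the wildcard with a hypothetical input. The pattern `b..t` needs a b, then any two characters, then a t.

```r
library(stringr)

words <- c("bat", "bt", "at", "boat")  # hypothetical input

str_detect(words, "b..t")   # FALSE FALSE FALSE TRUE

# To match a literal period, escape it
str_detect(c("a.b", "axb"), "a\\.b")   # TRUE FALSE
```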
Alternatives
Alternatives can be expressed using a |:
For longer patterns, wrap in parentheses
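A sketch of alternation with hypothetical inputs. Parentheses limit the alternation to part of the pattern.

```r
library(stringr)

animals <- c("hotdog", "catalog", "bird")  # hypothetical input
str_detect(animals, "cat|dog")             # TRUE TRUE FALSE

spellings <- c("gray", "grey", "groy")     # hypothetical input
str_detect(spellings, "gr(a|e)y")          # TRUE TRUE FALSE
```

Without the parentheses, `gra|ey` would mean "gra" or "ey", which is a different pattern.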
Ranges
Sometimes we might want to match a wide range of characters; e.g. digits
Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)
Can use a range notation instead: [0-9]
[1] TRUE FALSE TRUE FALSE
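A sketch showing that the range and the spelled-out alternation agree. The `signs` vector is a hypothetical input.

```r
library(stringr)

signs <- c("room 101", "lobby", "gate 7", "exit")  # hypothetical input

str_detect(signs, "[0-9]")                  # TRUE FALSE TRUE FALSE
str_detect(signs, "(0|1|2|3|4|5|6|7|8|9)")  # same result, harder to read
```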
Ranges
Useful ranges:
[A-Z]: Uppercase letters
[a-z]: Lowercase letters
[0-9]: Digits
Can also ‘hard code’ a range by listing all elements:
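For example, a set of vowels can be listed directly inside the brackets. This is a sketch with a hypothetical input.

```r
library(stringr)

words <- c("sky", "tree", "rhythm", "echo")  # hypothetical input

str_detect(words, "[aeiou]")   # FALSE TRUE FALSE TRUE
str_count(words, "[aeiou]")    # 0 2 0 2
```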
Ranges
Some useful ranges are predefined; write them inside a bracket expression, e.g. [[:digit:]], which works in both stringr and base R:
[:alpha:]
[:lower:]
[:upper:]
[:digit:]
[:alnum:]
[:punct:]
[:space:]
I like these - quite clear:
[1] TRUE FALSE TRUE FALSE
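A sketch of the named classes in use, written in the double-bracket form. The `signs` vector is a hypothetical input.

```r
library(stringr)

signs <- c("room 101", "lobby", "Gate 7", "exit")  # hypothetical input

str_detect(signs, "[[:digit:]]")   # TRUE FALSE TRUE FALSE
str_detect(signs, "[[:upper:]]")   # FALSE FALSE TRUE FALSE
str_detect(signs, "[[:space:]]")   # TRUE FALSE TRUE FALSE
```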
Quantifiers
Quantifiers (multiple matches):
.{a,b}: anywhere from a to b copies (inclusive; no space inside the braces)
.{0,b}: no more than b copies
.{a,}: at least a copies
.?: zero-or-one, same as .{0,1}
.*: zero-or-more, same as .{0,}
.+: one-or-more, same as .{1,}
Quantifiers
Wildcard match optional:
[1] FALSE FALSE TRUE TRUE
Strings with numbers:
[1] TRUE FALSE TRUE FALSE
Numbers 10 or greater:
[1] FALSE FALSE TRUE TRUE
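A sketch of the three ideas above with hypothetical inputs, so the results need not match the outputs shown. "10 or greater" is approximated as "at least two consecutive digits", which ignores leading zeros.

```r
library(stringr)

# Optional character: the u may or may not be present
str_detect(c("color", "colour", "colr"), "colou?r")   # TRUE TRUE FALSE

values <- c("5", "none", "42", "1000")  # hypothetical input

# Strings containing a number: one or more digits
str_detect(values, "[0-9]+")      # TRUE FALSE TRUE TRUE

# Numbers 10 or greater: at least two digits in a row
str_detect(values, "[0-9]{2,}")   # FALSE FALSE TRUE TRUE
```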
Start and End Anchors
Anchors let us refer to the start and end of a string:
Things starting with a number:
[1] "99 Red Balloons" "5 Years Time"
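The result above is consistent with a call like this one, where `^` anchors the match to the start of the string and `$` to the end. The `songs` vector is an assumed input.

```r
library(stringr)

# Assumed input
songs <- c("99 Red Balloons", "Hey Jude", "5 Years Time", "Yellow Submarine")

# Keep only the strings that start with a digit
str_subset(songs, "^[0-9]")   # "99 Red Balloons" "5 Years Time"

# Strings that end with a lowercase e
str_subset(songs, "e$")       # "Hey Jude" "5 Years Time" "Yellow Submarine"
```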
Exclusion
Not quite what we want
str_detect has a negate option:
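A sketch of the difference between negating inside the pattern and negating the result. The `songs` vector is a hypothetical input. A negated class such as `[^0-9]` asks "is there any non-digit character?", which is true of almost every string.

```r
library(stringr)

songs <- c("99 Red Balloons", "Hey Jude", "5 Years Time", "Yellow Submarine")  # hypothetical

# Every title contains at least one non-digit, so this is all TRUE
str_detect(songs, "[^0-9]")                   # TRUE TRUE TRUE TRUE

# Negating the whole test: titles that do NOT start with a digit
str_detect(songs, "^[0-9]", negate = TRUE)    # FALSE TRUE FALSE TRUE
str_subset(songs, "^[0-9]", negate = TRUE)    # "Hey Jude" "Yellow Submarine"
```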
Homoglyphs
Why?
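One way to see why, sketched in base R: two characters can look identical on screen yet be different Unicode code points. The escapes below are used so the two letters cannot be confused in the source.

```r
eta <- "\u0397"  # GREEK CAPITAL LETTER ETA
h   <- "H"       # LATIN CAPITAL LETTER H

eta == h         # FALSE, even though they render alike

utf8ToInt(eta)   # 919 (U+0397)
utf8ToInt(h)     # 72  (U+0048)
```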
Η H
"GREEK CAPITAL LETTER ETA" "LATIN CAPITAL LETTER H"
Homoglyphs
Particularly nasty with dashes - lean on [[:punct:]] where possible.
[1] "Em Dash " "En Dash " "Hyphen "
[1] "Em Dash —" "En Dash –" "Hyphen ‐"
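A sketch that is consistent with the two outputs above, assuming the dashes were written as the Unicode characters U+2014, U+2013 and U+2010. None of them equals the ASCII hyphen-minus typed on a keyboard, but stringr's `[[:punct:]]` covers all three.

```r
library(stringr)

# Assumed input, written with escapes so each dash is unambiguous
dashes <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010")

# The ASCII hyphen-minus matches none of them
str_detect(dashes, "-")                  # FALSE FALSE FALSE

# The punctuation class removes every variant
str_remove_all(dashes, "[[:punct:]]")    # "Em Dash " "En Dash " "Hyphen "
```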
Why stringr?
Base R has its own set of regular expression functions (grep and friends)
stringr does the same thing, but with a more consistent interface.
Conversion table online
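A brief side-by-side sketch with a hypothetical input. The base functions take the pattern first and the string later, while stringr always takes the string first. The two families also use different regex engines (stringr is built on ICU via stringi), so unusual patterns can behave differently.

```r
library(stringr)

amounts <- c("123", "123,456")  # hypothetical input

# Detect a match
grepl(",", amounts)           # base R: pattern first
str_detect(amounts, ",")      # stringr: string first

# Remove all matches
gsub(",", "", amounts)        # base R
str_remove_all(amounts, ",")  # stringr
```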
Regex + Scraping
Regular expressions are incredibly useful when converting HTML text to workable data:
- Extract numbers
- Extract relevant parts of strings
Regex + Scraping
Common paradigm: html_text2() |> str_remove_all() |> as.numeric()
[1] 8.25 1000.00 500.00 12345.67
Here, [^.[:digit:]] means anything ([]) that is not (^) a period or a digit.
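A sketch of the cleaning step on its own, assuming the text has already been pulled out of the page. The `prices` vector is a hypothetical stand-in for what `html_text2()` might return.

```r
library(stringr)

# Hypothetical scraped text
prices <- c("$8.25", "1,000", "500 USD", "$12,345.67")

clean <- prices |>
  str_remove_all("[^.[:digit:]]") |>  # drop everything except periods and digits
  as.numeric()

clean
```

This approach assumes a period is always the decimal mark; text that uses a comma as the decimal separator would need a different rule.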
Regex + Scraping
Another common paradigm is to extract structured text into a data frame when html_table fails
species sex weight
1 Adelie female 200
2 Gentoo male 500
3 Chinstrap female 1000
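A sketch of one way to build such a data frame using `str_match`, which returns a character matrix with the full match in column 1 and one column per capture group. The `raw` text format is invented for illustration and is not taken from the slides.

```r
library(stringr)

# Hypothetical scraped lines
raw <- c(
  "Adelie (female): 200 g",
  "Gentoo (male): 500 g",
  "Chinstrap (female): 1000 g"
)

# Three capture groups: species, sex, weight
parts <- str_match(raw, "^([[:alpha:]]+) \\(([[:alpha:]]+)\\): ([[:digit:]]+) g$")

penguins <- data.frame(
  species = parts[, 2],
  sex     = parts[, 3],
  weight  = as.numeric(parts[, 4])
)

penguins
```

Lines that do not fit the pattern come back as NA, which makes malformed rows easy to spot.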
Regex + Scraping
Can also be used to manipulate strings within a data frame:
# A tibble: 2 × 4
enrollment course dept numb
<dbl> <chr> <chr> <chr>
1 50 STA 9750 STA 9750
2 20 STA 9890 STA 9890
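A tibble like the one above could come from code along these lines. This is a sketch: it assumes `str_extract` with anchored patterns, and the original slide may have used a different function.

```r
library(dplyr)
library(stringr)

courses <- tibble(
  enrollment = c(50, 20),
  course     = c("STA 9750", "STA 9890")
) |>
  mutate(
    dept = str_extract(course, "^[[:alpha:]]+"),  # leading letters
    numb = str_extract(course, "[[:digit:]]+$")   # trailing digits, kept as text
  )

courses
```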
Cocktail Scraping
With your breakout group, it’s time to finish the cocktail scraping exercise
Cocktails