STA 9750 Lecture #10 Pre-Class Assignment: Strings and Things
Due Date: 2026-04-30 (Thursday) at 06:00pm (before Class Session #12)
Submission: CUNY Brightspace
This week, we begin to study the world of text data. While numerical data is reasonably straightforward to deal with, text data is remarkably complex. A full discussion of text data requires understanding the vast world of human written language, but we will discuss enough of the major points to hopefully solve 95% of the challenges you will face in your career.
Goals
In our quest to understand text data, we have two major goals:
- Understanding String Encodings and Unicode
- Manipulating Strings with Regular Expressions
Before we get into these, let’s begin with a basic review of the character data type in R.
String Vectors
Recall that R works by default on vectors - ordered collections of the “same sort” of thing. R supports the following vector types:
- Raw: pure access to bytes without any additional meaning. Rarely useful for pure data-analytic work, but commonly used to interact with binary file formats and with non-R software.
- Integer: 32-bit signed integers, ranging from \(-(2^{31}-1)\) to \(2^{31}-1\). (If you have done low-level work before, you might ask where the extra value went: \(-2^{31}\) is used for encoding NA values.)
- Numeric: 64-bit (double precision) floating point values, with magnitudes up to (approximately) \(\pm 10^{308}\). The detailed behavior of numeric (often called double) data is beyond this course, but it is well documented elsewhere.
- Character: the topic of today’s discussion.
R makes no distinction between a character (in the sense of a single letter) and a string: in particular, each element of a character vector is an (arbitrary-length) string. Specialized functions are required to work at the true “single letter” scale. If you come from other languages, this behavior might be surprising, but it allows R to handle much of the complexity associated with characters ‘auto-magically’, which greatly simplifies data analysis.
When speaking, I often refer to R as using strings because of this flexibility, even if R itself calls them character elements for historical reasons.
Encoding
How are strings represented on a computer? The answer has evolved over time, but the current state of the art - used by almost all non-legacy software - is based on the Unicode system and the UTF-8 encoding.
The Unicode system consists of two essential parts:
- A numbered list of “letter-like” elements
- Rules for manipulating those elements
While this seems simple, it is anything but. The history of string representations in computers is a long and painful story of programmers repeatedly underestimating the complexity of the seemingly simple task of listing “all the letters.”
The Unicode consortium makes a long list of characters that computers should be able to represent: the most recent version of the Unicode standard (Version 17.0 released 2025-09-09) includes 159,801 characters divided into 172 scripts. These include everything from the basic (Anglo-American) Latin alphabet to the Greek and Cyrillic alphabets to Chinese and Japanese characters to the undeciphered Linear A script and Tengwar, the fictional script used in the Lord of the Rings novels. The Unicode standard also includes a wide set of Emoji (approximately 4000) and many “modifying” characters.1 Recent additions include a Big Foot emoji, an Avalanche emoji, and a Trombone emoji.
To each of these, the Unicode consortium assigns a code point: a numerical identifier. Even superficially similar characters may be assigned different code points to distinguish them: for example, “H” is code point U+0048 with the official description “Latin Capital Letter H” while “Η” is U+0397, “Greek Capital Letter Eta.”
Visually, these look identical, but the difference between these characters is essential to know how to manipulate them:
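For example, we can inspect the underlying code points with base R's utf8ToInt function, which confirms that the two glyphs really are different characters:

```r
# utf8ToInt() returns the integer code point(s) of a string
sprintf("U+%04X", utf8ToInt("H"))       # Latin Capital Letter H
sprintf("U+%04X", utf8ToInt("\u0397"))  # Greek Capital Letter Eta
```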
Use the tolower function to lower-case each of these:
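A sketch of what this looks like; note that base R's tolower consults your locale, so stringr's ICU-based str_to_lower is shown as a locale-independent alternative:

```r
library(stringr)
tolower(c("H", "\u0397"))      # base R; the Greek result can depend on your locale
str_to_lower(c("H", "\u0397")) # ICU rules: "h" and the Greek lower case eta
```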
The Unicode standard defines the lower case mapping of U+0048 as the Latin lower case h (U+0068), while the lower case mapping of U+0397 is the Greek lower case eta (U+03B7), which looks something like a stretched n.
The Unicode package provides some tools to investigate these:
In general, these sorts of ‘case-fold’ mappings are incredibly complicated and depend not only on the specific code point, but also the set of translation rules being used. (For historical and political reasons, certain languages have different lower/upper mappings for what are otherwise the same letter in Unicode.)
While you don’t need to know all of this complexity, it is essential to know that it’s out there and to rely on battle-tested libraries to perform these mappings.
Unicode is supplemented by the UTF-8 encoding, which controls how sequences of bits are actually translated to code points. (Fonts then map code points to what you see on the screen.) UTF-8 is more or less backwards-compatible with other major encodings, so it’s a good default. Modern websites and public data sources almost always present their contents in a UTF-8 compatible encoding (if not UTF-8 proper), so you should be ok.
A well-formatted website will state its encoding near the top of the page:
library(rvest)
read_html("http://www.baruch.cuny.edu") |>
  html_elements("meta[charset]") |>
  html_attr("charset")
[1] "UTF-8"
Advice: Whenever possible, make sure you are using UTF-8 strings: if your data is not UTF-8, reencode it to UTF-8 as soon as possible. This will save you much pain.
String Manipulation
Once data is in R and encoded as UTF-8 Unicode points, we have several tools for dealing with strings. Your first port of call should be the stringr package.
All the functions of the stringr package start with str_ and take a vector of strings as the first argument, making them well suited for chained analysis.
Let’s start with str_length which simply computes the length of each element. For the basic Latin alphabet, this more or less matches our intuition:
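For example:

```r
library(stringr)
str_length(c("Baruch", "CUNY", ""))  # 6, 4, 0
```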
but it can be tricky for strings that involve Unicode combining characters.
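For instance, “X” followed by the combining overbar (U+0304) renders as a single glyph but counts as two code points:

```r
library(stringr)
x_bar <- "X\u0304"  # "X" plus a combining overbar
str_length(x_bar)   # 2: the base letter and the combining mark
```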
Here the “overbar” is a combining character which we add on to the X. This is commonly (though not always) used for languages with accents (e.g. French) or for languages where vowels are written above and below the main script (Arabic or Hebrew). This same idea is used for certain Emoji constructs:
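For example, the skin-tone emoji below is built from two code points:

```r
library(stringr)
man_dark <- "\U0001F468\U0001F3FF"  # "Man" (U+1F468) + "Dark Skin Tone" (U+1F3FF)
str_length(man_dark)                # 2, even though it may render as one glyph
```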
Here, “Man with Dark Skin Tone” is the combination of “Man” and “Dark Skin Tone.” (Compare how this appears in your browser to how RStudio prints it if you copy this code to your computer.)
While there is complexity in all of Unicode, str_length will behave as you might expect for “regular” text. I’m going to stop showing the “scary side” of Unicode, but you should be aware of it for the remainder of these exercises.
Concatenation
You have already seen the base paste and paste0 functions for combining two string vectors together.
By default, paste combines strings with a space between them, while paste0 omits the space. paste is typically what you want for strings for human reading, while paste0 is a better guess for computer-oriented text (e.g., putting together a URL).
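For example:

```r
paste("cocktail", "recipes")   # "cocktail recipes"
paste0("cocktail", "recipes")  # "cocktailrecipes"
```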
You can change the separator by passing a sep argument to paste:
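For example:

```r
paste("2026", "04", "30", sep = "-")  # "2026-04-30"
```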
You can also combine together multiple elements of a vector using the collapse argument:
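For example:

```r
paste(c("gin", "vermouth", "Campari"), collapse = ", ")  # "gin, vermouth, Campari"
```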
Exercises:
- Using the paste function, make a vector of strings like “John’s favorite color is blue” from the following data:
- Modify your answer to write a (run-on) sentence of favorite colors: “John’s favorite color is blue and Jane’s favorite color is orange and …”
Note that this does not provide terminating punctuation.
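A possible solution sketch; the exercise's actual data is not reproduced here, so the name and color vectors below are hypothetical stand-ins:

```r
name  <- c("John", "Jane")   # hypothetical stand-in for the exercise data
color <- c("blue", "orange")
sentences <- paste0(name, "'s favorite color is ", color)
sentences
paste(sentences, collapse = " and ")  # the run-on sentence
```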
Substring Selection
When cleaning up data for analysis, it is common to need to take substrings from larger text. The str_sub function will do this for a fixed length:
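For example:

```r
library(stringr)
str_sub("Baruch College", 1, 6)  # "Baruch"
```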
This behavior is useful when you are trying to extract a single piece from a longer bit of consistently formatted text, e.g., a computer log file or a set of IDs.
If you want to go all the way to the end, set the end argument to -1:
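For example:

```r
library(stringr)
str_sub("Baruch College", 8, -1)  # "College"
```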
Exercises
Using str_sub, remove the system name (CUNY or UC) and return only the campus name:
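One possible sketch, using a hypothetical vector of schools in place of the exercise's data (str_sub is vectorized over its start argument):

```r
library(stringr)
schools <- c("CUNY Baruch", "CUNY Hunter", "UC Berkeley")  # hypothetical data
start <- ifelse(str_detect(schools, "^CUNY"), 6, 4)        # skip "CUNY " or "UC "
str_sub(schools, start, -1)
```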
Detecting and Matching
Often we only need to know whether a particular substring is present in a larger string. We can use str_detect to do this:
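For example, with a hypothetical vector of schools:

```r
library(stringr)
schools <- c("CUNY Baruch", "UC Berkeley", "Michigan State")  # hypothetical data
str_detect(schools, "CUNY")  # TRUE FALSE FALSE
```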
This is particularly useful inside of dplyr commands:
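A sketch with a hypothetical data frame:

```r
library(dplyr)
library(stringr)
schools_df <- tibble(school = c("CUNY Baruch", "UC Berkeley", "CUNY Hunter"))
schools_df |> filter(str_detect(school, "CUNY"))  # keeps the two CUNY rows
```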
The str_match function will return the text of the match. Here it’s useless, but we’ll see that it becomes more powerful when we allow more flexible pattern specifications.
The str_subset function will return only those strings which match a certain pattern:
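For example:

```r
library(stringr)
str_subset(c("CUNY Baruch", "UC Berkeley", "CUNY Hunter"), "CUNY")
```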
Exercises
Use str_subset to find the CUNY schools:
Specifying Patterns
While working by individual characters is sometimes useful (for very predictably formatted data), we generally need more powerful tools: regular expressions (RE) provide a compact language for specifying patterns in strings. We’ll introduce the basics here to help with string functions and then explore some more advanced RE features.
The most basic pattern is a set of elements in brackets: this means “any of these”.
For example, we want to see which names have an “A” in them:
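With a hypothetical vector of names:

```r
library(stringr)
names_vec <- c("Alice", "Bob", "Hannah")  # hypothetical data
str_detect(names_vec, "[Aa]")  # TRUE FALSE TRUE
```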
Here, we consider a string to be a match if it has an A, an a or both.
Alternatively, we can see which strings contain numbers:
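For example:

```r
library(stringr)
str_detect(c("Room 314", "Lobby"), "[0123456789]")  # TRUE FALSE
```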
If we use str_match we can pull out the matching element:
By default, this only finds one appearance of the pattern:
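For example, str_match pulls out only the first digit here and returns NA where there is no match:

```r
library(stringr)
str_match(c("Room 314", "Lobby"), "[0123456789]")  # first column: "3", NA
```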
We can modify the pattern specifier to include count information. The basic behavior is to add explicit count bounds:
Here a single number is an exact count ({2}), while pairs ({2,3}) specify a range. If one end of the range is left empty, the missing bound is 0 or infinity (depending on which end is omitted).
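For example:

```r
library(stringr)
x <- c("a1", "a12", "a123", "a1234")
str_detect(x, "[0-9]{2}")         # FALSE TRUE TRUE TRUE: two digits in a row somewhere
str_match("a1234", "[0-9]{2,3}")  # matches "123": two-to-three digits, taken greedily
```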
Certain count specifications are sufficiently useful to get their own syntax:
- One or more: + is equivalent to {1,}
- Zero or more: * is equivalent to {0,}
- One or zero: ? is equivalent to {0,1}
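For example, varying the number of “u”s allowed before the final “r”:

```r
library(stringr)
x <- c("color", "colour", "colouur")
str_detect(x, "colou?r")  # TRUE TRUE FALSE
str_detect(x, "colou+r")  # FALSE TRUE TRUE
str_detect(x, "colou*r")  # TRUE TRUE TRUE
```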
Use these specifications for the next set of exercises.
Exercises
- Which strings contain a three digit number?
A more compact way of writing [0123456789] is \\d, so this could also be done with:
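A sketch of both forms, on a hypothetical stand-in for the exercise's data:

```r
library(stringr)
x <- c("Room 314", "Ext. 42", "Suite 1000")  # hypothetical data
str_detect(x, "[0123456789]{3}")  # TRUE FALSE TRUE
str_detect(x, "\\d{3}")           # the same thing, more compactly
```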
Combining patterns
You can combine REs to make more complex patterns:
- (a|b) means a or b. This is like the [] notation, but a and b can be more complex than single characters. For example, we can detect schools that are either CUNY or UC:
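With a hypothetical vector of schools:

```r
library(stringr)
schools <- c("CUNY Baruch", "UC Berkeley", "Michigan State")  # hypothetical data
str_detect(schools, "(CUNY|UC)")  # TRUE TRUE FALSE
```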
- [^abc] means anything other than a, b, or c. You can often achieve a similar effect using the negate argument to str_detect, but you need this specifically for str_match.
Note that this does not require that all elements of the string don’t match, only that at least one character doesn’t fall in the specified range. If you want to have no matches, the negate=TRUE argument can help:2
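For example, the two forms ask different questions:

```r
library(stringr)
x <- c("aaa", "abcx", "xyz")
str_detect(x, "[^abc]")                # FALSE TRUE TRUE: some character outside a, b, c
str_detect(x, "[abc]", negate = TRUE)  # FALSE FALSE TRUE: no a, b, or c anywhere
```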
- ^ outside of a bracket denotes the start of a line:
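For example:

```r
library(stringr)
str_detect(c("whiskey sour", "rye whiskey"), "^whiskey")  # TRUE FALSE
```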
- $ denotes the end of a line:
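For example:

```r
library(stringr)
str_detect(c("rye whiskey", "scotch whisky", "whiskey sour"), "whiske?y$")
# TRUE TRUE FALSE
```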
Note that we made the e optional here, so we matched both the American (whiskey) and the Scottish spelling (whisky) of a distilled grain spirit.
See the stringr RE docs for more examples of regular expressions.
Exercises
- Use a regular expression to find which of these are fish species:
- Use a regular expression to find words with three or more vowels in a row:
- Find the words where “q” is not followed by a “u”
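Possible sketches; the exercises' actual word lists are not reproduced here, so the vectors below are hypothetical stand-ins. (The last sketch uses an ICU negative lookahead, (?!u), which stringr's regex engine supports.)

```r
library(stringr)
species <- c("swordfish", "catfish", "dogwood", "jellyfish")
str_subset(species, "fish$")

words <- c("beautiful", "queueing", "strength")
str_subset(words, "[aeiou]{3,}")

q_words <- c("qatari", "quiet", "iraq")
str_subset(q_words, "q(?!u)")  # "q" not followed by "u" (also matches a final "q")
```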
Replacement
The str_replace function allows us to replace a string with something else. This is particularly useful when cleaning up text:
Note that this replaces only the first match in a string:
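For example:

```r
library(stringr)
str_replace("banana", "an", "AN")  # "bANana": only the first "an" changes
```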
In some circumstances, you may want to replace all matches, and should instead use str_replace_all:
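For example:

```r
library(stringr)
str_replace_all("banana", "an", "AN")  # "bANANa"
```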
Character Classes
Some patterns are so common with regular expressions that ‘shorthand’ is given for them:
- \d is short-hand for the set of digits [0123456789]
- \D is short-hand for non-digits [^0123456789]
- \s is short-hand for any sort of space (space, newline, tab, etc.)
- \S is short-hand for anything that is not a space
- \w is short-hand for the ‘normal word elements’ of basic English: [A-Za-z0-9_], that is, all digits, upper case letters, lower case letters, or an underscore
- \W is short-hand for anything not in \w: [^A-Za-z0-9_]
Use of these is not required, but they are recommended as they will save some typing and be a bit more robust: e.g., you might think you are covering all letters with [A-Za-z] but unexpected accented letters might throw you off:
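A sketch of the failure mode, using an accented word (ICU's \w matches Unicode letters, not just ASCII):

```r
library(stringr)
str_detect("caf\u00e9", "^[A-Za-z]+$")  # FALSE: the accented e is outside A-Za-z
str_detect("caf\u00e9", "^\\w+$")       # TRUE: ICU's \w is Unicode-aware
```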
One word of warning: when using these, you will typically need to put a double slash in front of the character class name, writing \\S rather than \S. The reasons for this are a bit technical, but essentially, \x has a separate (non-regex) meaning in R strings, so we have to ‘escape’ the \ by putting an extra slash in front of it: \\. You should really read \\S as \\ + S, not \ + \S. We will discuss this a bit more in class since it gets tricky.
Capture Groups
Before we end, let’s introduce one more powerful feature of regular expressions: capture groups. This lets us give an identity to the part of a string that matches part of a pattern and reference it again later.
The basic syntax for a capture group is simple: you simply surround the desired part with parentheses: for example,
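the pattern below is one plausible choice for the city/state example discussed next:

```r
library(stringr)
m <- str_match("Albany, NY", "(\\S+), NY")
m[1, 1]  # the full match: "Albany, NY"
m[1, 2]  # the capture group: "Albany"
```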
Here, we are using the short-hand \S for non-space character (i.e., letters, numbers, or punctuation).3 Here, we see that the output of str_match now returns the part of the string that matched the entire regex (Albany, NY) and the part that matched the capture group (Albany). If we want to just get the capture group and nothing else, we can use str_extract and provide the group argument:
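For example (the group argument requires stringr 1.5.0 or later):

```r
library(stringr)
str_extract("Albany, NY", "(\\S+), NY", group = 1)  # "Albany"
```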
The related function str_match_all will return multiple matches:
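For example:

```r
library(stringr)
str_match_all("Albany, NY and Buffalo, NY", "(\\S+), NY")[[1]]
# two rows: "Albany, NY" / "Albany" and "Buffalo, NY" / "Buffalo"
```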
When we combine this with str_replace, we can actually reference the capture group in the output:
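For example:

```r
library(stringr)
str_replace("Albany, NY", "(\\S+), NY", "\\1 (New York)")  # "Albany (New York)"
```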
Here, \\N codes for the result of the \(N^{\text{th}}\) capture group.
This is very useful when dealing with semi-structured text:
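For example, reordering the pieces of a date (a hypothetical illustration):

```r
library(stringr)
str_replace("04/30/2026", "(\\d+)/(\\d+)/(\\d+)", "\\3-\\1-\\2")  # "2026-04-30"
```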
though this usually requires extensive trial-and-error to get the regular expression exactly correct.
Review of Last Week’s Web Parsing Practice
Before class, make sure you are comfortable with Exercise #03 from last class. For your convenience, I’ve copied that material here.
Hadley Wickham is the author of many of the “tidy tools” we use in this course. He is also an excellent bartender and chef. In this exercise, we are going to web scrape his cocktail recipe book which can be found at https://cocktails.hadley.nz/.
Our goal is to create a data frame that records all 150+ recipes on this site (as rows) and the different ingredients (as columns). This week, we are going to pull the different recipes into R: next week we are going to process the text and create our final data frame (so stay tuned!).
Before you ever start scraping it’s essential to have a plan. Let’s first go through the following steps to build a scraping strategy:
- Poke around the website to see how it is organized. Is there a single page listing all of the cocktails? If not, how else can you make sure that you’ve explored the entire site?
- Once you know how you’re going to explore the whole site, use your browser tools to see if you can identify an HTML element that corresponds to a single recipe. (This element will occur several times per page.) Remember that you want to select “as small as possible” but no smaller.
- Once you have found the right HTML element for a recipe, identify an HTML element that corresponds to i) the title; and ii) individual ingredients.
For this task, you will likely see several recipes more than once. Don’t worry about this for now - we can remove the duplicates later in our analysis (e.g., with distinct()). It’s better to be over-inclusive than under-inclusive.
After you have built your plan, it’s time to start putting this all to code.
- Write code to get a list of all pages you will need to process. Construct full URLs for your future requests.
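The scraping itself is left to you, but the final assembly can be sketched as follows; the hrefs vector is a hypothetical stand-in for the relative links you would collect (e.g., with html_attr("href")):

```r
BASE_URL <- "https://cocktails.hadley.nz/"
# Hypothetical stand-in for the relative links scraped from the site's navigation
hrefs <- c("ingredient-absinthe.html", "tag-sour.html")
urls <- paste0(BASE_URL, hrefs)
urls
```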
[1] "https://cocktails.hadley.nz/ingredient-absinthe.html"
[2] "https://cocktails.hadley.nz/ingredient-allspice-liqueur.html"
[3] "https://cocktails.hadley.nz/ingredient-ancho-reyes.html"
[4] "https://cocktails.hadley.nz/ingredient-angostura.html"
[5] "https://cocktails.hadley.nz/ingredient-aperol.html"
[6] "https://cocktails.hadley.nz/ingredient-apricot-liqueur.html"
[7] "https://cocktails.hadley.nz/ingredient-averna.html"
[8] "https://cocktails.hadley.nz/ingredient-banana-liqueur.html"
[9] "https://cocktails.hadley.nz/ingredient-benedictine.html"
[10] "https://cocktails.hadley.nz/ingredient-blackberry-liqueur.html"
[11] "https://cocktails.hadley.nz/ingredient-blanc-vermouth.html"
[12] "https://cocktails.hadley.nz/ingredient-bourbon.html"
[13] "https://cocktails.hadley.nz/ingredient-brandy.html"
[14] "https://cocktails.hadley.nz/ingredient-campari.html"
[15] "https://cocktails.hadley.nz/ingredient-champagne.html"
[16] "https://cocktails.hadley.nz/ingredient-cherry-liqueur.html"
[17] "https://cocktails.hadley.nz/ingredient-coconut-cream.html"
[18] "https://cocktails.hadley.nz/ingredient-coffee-liqueur.html"
[19] "https://cocktails.hadley.nz/ingredient-cream.html"
[20] "https://cocktails.hadley.nz/ingredient-cr-me-de-cacao.html"
[21] "https://cocktails.hadley.nz/ingredient-cr-me-de-cassis.html"
[22] "https://cocktails.hadley.nz/ingredient-cynar.html"
[23] "https://cocktails.hadley.nz/ingredient-dry-vermouth.html"
[24] "https://cocktails.hadley.nz/ingredient-elderflower-liqueur.html"
[25] "https://cocktails.hadley.nz/ingredient-espresso.html"
[26] "https://cocktails.hadley.nz/ingredient-fernet.html"
[27] "https://cocktails.hadley.nz/ingredient-gin.html"
[28] "https://cocktails.hadley.nz/ingredient-ginger-liqueur.html"
[29] "https://cocktails.hadley.nz/ingredient-grapefruit-bitters.html"
[30] "https://cocktails.hadley.nz/ingredient-grapefruit-juice.html"
[31] "https://cocktails.hadley.nz/ingredient-green-chartreuse.html"
[32] "https://cocktails.hadley.nz/ingredient-grenadine.html"
[33] "https://cocktails.hadley.nz/ingredient-herbstura.html"
[34] "https://cocktails.hadley.nz/ingredient-honey-syrup.html"
[35] "https://cocktails.hadley.nz/ingredient-lemon-juice.html"
[36] "https://cocktails.hadley.nz/ingredient-lillet-blanc.html"
[37] "https://cocktails.hadley.nz/ingredient-lime-juice.html"
[38] "https://cocktails.hadley.nz/ingredient-luxardo-bitter-bianco.html"
[39] "https://cocktails.hadley.nz/ingredient-maple-syrup.html"
[40] "https://cocktails.hadley.nz/ingredient-maraschino-liqueur.html"
[41] "https://cocktails.hadley.nz/ingredient-meletti.html"
[42] "https://cocktails.hadley.nz/ingredient-mezcal.html"
[43] "https://cocktails.hadley.nz/ingredient-orange-bitters.html"
[44] "https://cocktails.hadley.nz/ingredient-orange-liqueur.html"
[45] "https://cocktails.hadley.nz/ingredient-orgeat.html"
[46] "https://cocktails.hadley.nz/ingredient-peychauds.html"
[47] "https://cocktails.hadley.nz/ingredient-pimms.html"
[48] "https://cocktails.hadley.nz/ingredient-pineapple.html"
[49] "https://cocktails.hadley.nz/ingredient-pineapple-juice.html"
[50] "https://cocktails.hadley.nz/ingredient-ramazotti.html"
[51] "https://cocktails.hadley.nz/ingredient-rhubarb-bitters.html"
[52] "https://cocktails.hadley.nz/ingredient-rhum-agricole.html"
[53] "https://cocktails.hadley.nz/ingredient-rum.html"
[54] "https://cocktails.hadley.nz/ingredient-rye.html"
[55] "https://cocktails.hadley.nz/ingredient-scotch.html"
[56] "https://cocktails.hadley.nz/ingredient-sherry.html"
[57] "https://cocktails.hadley.nz/ingredient-simple-syrup.html"
[58] "https://cocktails.hadley.nz/ingredient-sloe-gin.html"
[59] "https://cocktails.hadley.nz/ingredient-soda-water.html"
[60] "https://cocktails.hadley.nz/ingredient-swedish-punsch.html"
[61] "https://cocktails.hadley.nz/ingredient-sweet-vermouth.html"
[62] "https://cocktails.hadley.nz/ingredient-tequila.html"
[63] "https://cocktails.hadley.nz/ingredient-velvet-falernum.html"
[64] "https://cocktails.hadley.nz/ingredient-vodka.html"
[65] "https://cocktails.hadley.nz/ingredient-yellow-chartreuse.html"
[66] "https://cocktails.hadley.nz/tag-3-star.html"
[67] "https://cocktails.hadley.nz/tag-adventurous.html"
[68] "https://cocktails.hadley.nz/tag-bubbles.html"
[69] "https://cocktails.hadley.nz/tag-daiquiri.html"
[70] "https://cocktails.hadley.nz/tag-fizz.html"
[71] "https://cocktails.hadley.nz/tag-flip.html"
[72] "https://cocktails.hadley.nz/tag-manhattan.html"
[73] "https://cocktails.hadley.nz/tag-negroni.html"
[74] "https://cocktails.hadley.nz/tag-sazerac.html"
[75] "https://cocktails.hadley.nz/tag-sour.html"
[76] "https://cocktails.hadley.nz/tag-tiki.html"
[77] "https://cocktails.hadley.nz/tag-vieux-carr.html"
- Write a function that takes a single URL and extracts all recipes on that page as HTML elements.
Try this out on a fixed URL and confirm that it gets the right number of recipes
We can test this against this page and see that it works as expected:
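One possible definition, based on the article elements visible in the output below (a sketch, not the only way to write it):

```r
library(rvest)
get_recipes <- function(url) {
  # Each recipe on the site lives inside an <article> element
  read_html(url) |>
    html_elements("article")
}
```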
BUBBLES <- get_recipes("https://cocktails.hadley.nz/tag-bubbles.html")
BUBBLES
{xml_nodeset (7)}
[1] <article class="cocktail" id="air-mail"><div class="title">\n <h2>\n ...
[2] <article class="cocktail" id="campari-spritz"><div class="title">\n ...
[3] <article class="cocktail" id="dandelo"><div class="title">\n <h2>\n ...
[4] <article class="cocktail" id="french-75"><div class="title">\n <h2>\ ...
[5] <article class="cocktail" id="hinky-dinks-fizzy"><div class="title">\n ...
[6] <article class="cocktail" id="kir-royale"><div class="title">\n <h2> ...
[7] <article class="cocktail" id="negroni-sbagliato"><div class="title">\n ...
- Write a function that takes in a single recipe and returns a small data frame with the title and ingredients:
library(tidyverse)
library(httr2)
library(rvest)
process_recipe <- function(art){
title <- art |> html_element(".title h2") |> html_text2()
ingredients <- art |> html_elements("li") |> html_text2()
tibble(title=title, ingredients=ingredients)
}
We can again test this using the recipes we extracted above:
process_recipe(BUBBLES[[1]])
# A tibble: 4 × 2
title ingredients
<chr> <chr>
1 Air mail 3 oz champagne
2 Air mail 1 oz rum, white
3 Air mail ½ oz lime juice
4 Air mail ¼ oz simple syrup
- Write a function that combines your results from the previous two steps to build a data frame with all recipes on a single page.
Try this out on a fixed URL and confirm it works as expected.
We can test this on the page of bubbly recipes:
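One possible definition; this version inlines the recipe-level logic from the previous step so it stands alone:

```r
library(rvest)
library(purrr)
library(tibble)
process_url <- function(url) {
  read_html(url) |>
    html_elements("article") |>
    map(\(art) tibble(
      title       = art |> html_element(".title h2") |> html_text2(),
      ingredients = art |> html_elements("li") |> html_text2()
    )) |>
    list_rbind()
}
```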
process_url("https://cocktails.hadley.nz/tag-bubbles.html")
# A tibble: 29 × 2
title ingredients
<chr> <chr>
1 Air mail 3 oz champagne
2 Air mail 1 oz rum, white
3 Air mail ½ oz lime juice
4 Air mail ¼ oz simple syrup
5 Campari spritz 3 oz champagne
6 Campari spritz 1½ oz Campari
7 Campari spritz 1 oz soda water
8 Campari spritz 2 drops lime acid
9 Dandelo 2 oz luxardo bitter bianco
10 Dandelo ½ oz lemon juice
# ℹ 19 more rows
- Use your function from the previous step with the list of URLs from Step 1 so that you get all of the recipes on the site. Combine your results into one large data frame that looks something like:

| Name | Ingredient |
|---|---|
| Daiquiri | 2 oz Rum |
| Daiquiri | 1 oz Lime Juice |
| Daiquiri | 0.75 oz Simple Syrup |

To be polite, you may want to add req_cache to your requests so you are not requesting the same page many times over.
# A tibble: 3,548 × 2
title ingredients
<chr> <chr>
1 Bachelor 1 oz rum, dark
2 Bachelor 1 oz Meletti
3 Bachelor ¼ oz absinthe
4 Bachelor 1 dash orange bitters
5 Bachelor 1 dash Angostura
6 Banana Manhattan 1½ oz rum, aged
7 Banana Manhattan ½ oz sweet vermouth
8 Banana Manhattan ½ oz banana liqueur
9 Banana Manhattan 1 dash absinthe
10 Industry sour 1 oz absinthe, pernod
# ℹ 3,538 more rows
If you are comfortable with (or want to become comfortable with) the functional programming tools of purrr, you can write this entire analysis in one swoop:
library(tidyverse)
library(httr2)
library(rvest)
process_recipe <- function(recipe){
title <- recipe |> html_element(".title") |> html_text2()
ingredients <- recipe |> html_elements("li") |> html_text2()
tibble(title=title, ingredients=ingredients)
}
BASE_URL <- "https://cocktails.hadley.nz/"
read_html(BASE_URL) |>
html_elements("nav a") |>
html_attr("href") |>
map(\(p) request(BASE_URL) |> req_url_path(p)) |>
map(req_perform) |>
map(resp_body_html) |>
map(html_elements, "article") |>
reduce(c) |>
map(process_recipe) |>
list_rbind()
# A tibble: 3,548 × 2
title ingredients
<chr> <chr>
1 "Bachelor" 1 oz rum, dark
2 "Bachelor" 1 oz Meletti
3 "Bachelor" ¼ oz absinthe
4 "Bachelor" 1 dash orange bitters
5 "Bachelor" 1 dash Angostura
6 "Banana Manhattan" 1½ oz rum, aged
7 "Banana Manhattan" ½ oz sweet vermouth
8 "Banana Manhattan" ½ oz banana liqueur
9 "Banana Manhattan" 1 dash absinthe
10 "Industry sour\n\nadventurous, sour" 1 oz absinthe, pernod
# ℹ 3,538 more rows
This is clearly quite terse and would benefit from some additional comments, but I’m omitting them so you can try to deconstruct this code yourself if you want to practice functional programming idioms.
At this point, we’re almost done, but we would really like to transform a table like:
| Name | Ingredient |
|---|---|
| Daiquiri | 2 oz Rum |
| Daiquiri | 1 oz Lime Juice |
| Daiquiri | 0.75 oz Simple Syrup |
into
| Name | Ingredient | Amount |
|---|---|---|
| Daiquiri | Rum | 2 |
| Daiquiri | Lime Juice | 1 |
| Daiquiri | Simple Syrup | 0.75 |
by splitting the amount (number) from the ingredient (string). This type of manipulation is the major goal of this week’s class. Hopefully you have a sense of how the string processing covered above can be used to separate "2 oz Rum" into 2 and "Rum".
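As a preview, a capture-group pattern can pull the two pieces apart; the simple pattern below handles only plain numbers (fractions like ½ will need more care in class):

```r
library(stringr)
m <- str_match("2 oz Rum", "([0-9.]+) oz (.+)")
m[1, 2]  # "2"
m[1, 3]  # "Rum"
```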
Footnotes
Technically, the Unicode consortium releases two standards, Unicode 17 and Emoji 17, but we’re not going to worry about this level of detail.↩︎
Not all languages or functions have an equivalent to the negate=TRUE argument. We can get a similar effect using markers for beginning and end (see below) along with quantifiers, but the logic is a bit trickier.↩︎

We code this as \\S for somewhat technical reasons that will be discussed in class.↩︎