Warning: This page is for a prior offering of STA 9750.

STA 9750 Week 12 Pre Assignment: Strings and Things

Due Date: 2025-11-24 (Monday) at 11:59pm

Submission: CUNY Brightspace

This week, we begin to study the world of text data. While numerical data is reasonably straightforward to deal with, text data is remarkably complex. A full discussion of text data requires understanding the vast world of human written language, but we will discuss enough of the major points to hopefully solve 95% of the challenges you will face in your career.

Goals

In our quest to understand text data, we have two major goals:

  • Understanding String Encodings and Unicode
  • Manipulating Strings with Regular Expressions

Before we get into these, let’s begin with a basic review of the character data type in R.

String Vectors

Recall that R works by default on vectors - ordered collections of the “same sort” of thing. R supports the following vector types:

  • Raw for pure access to bytes without any additional meaning: rarely useful for pure data-analytic work, but commonly used to interact with binary file formats and with non-R software
  • Integer: 32-bit signed integers, ranging from \(-(2^{31}-1)\) to \(2^{31}-1\). (If you have done low-level work before, you might ask where the extra value went: \(-2^{31}\) is reserved for encoding NA values.)
  • Numeric: 64-bit (double-precision) floating point values, ranging in magnitude from 0 to approximately \(10^{308}\). The detailed behavior of numeric (often called double) data is beyond this course, but it is well documented elsewhere.
  • Character: the topic of today’s discussion.

R makes no distinction between a character (in the sense of a single letter) and a string: in particular, each element of a character vector is an (arbitrary-length) string. Specialized functions are required to work at the true “single letter” scale. If you come from other languages, this behavior might be surprising, but it allows R to handle much of the complexity associated with characters ‘auto-magically’, which greatly simplifies data analysis.

When speaking, I often refer to R as using strings because of this flexibility, even if R itself calls them character elements for historical reasons.

Encoding

How are strings represented on a computer? The answer has evolved over time, but the current state of the art - used by almost all non-legacy software - is based on the Unicode system and the UTF-8 encoding.

The Unicode system consists of two essential parts:

  • A numbered list of “letter-like” elements
  • Rules for manipulating those elements

While this seems simple, it is anything but. The history of string representations in computers is a long and painful story of programmers repeatedly underestimating the complexity of the seemingly simple task of listing “all the letters.”

The Unicode consortium maintains a long list of characters that computers should be able to represent: the most recent version of the Unicode standard (Version 17.0, released 2025-09-09) includes 159,801 characters divided among 172 scripts. These include everything from the basic (Anglo-American) Latin alphabet to the Greek and Cyrillic alphabets to Chinese and Japanese characters to the undeciphered Linear A script. (Tengwar, the fictional script used in the Lord of the Rings novels, has been proposed for Unicode but is not yet encoded.) The Unicode standard also includes a wide set of emoji (approximately 4,000) and many “modifying” characters.1 Recent additions include a Bigfoot emoji, an Avalanche emoji, and a Trombone emoji.

To each of these, the Unicode consortium assigns a code point: a numerical identifier. Even superficially similar characters may be assigned different code points to distinguish them: for example, “H” is code point U+0048, with the official description “Latin Capital Letter H,” while “Η” is U+0397, “Greek Capital Letter Eta.”

Visually, these look identical, but knowing the difference between these characters is essential to manipulating them correctly:

Use the tolower function to lower-case each of these:
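A minimal sketch (the escape sequences below spell out the two code points; the Greek mapping assumes a UTF-8 locale):

```r
latin_h   <- "\u0048"  # "H", Latin Capital Letter H
greek_eta <- "\u0397"  # "Η", Greek Capital Letter Eta
tolower(c(latin_h, greek_eta))
# In a UTF-8 locale this returns "h" and the Greek lower-case eta (U+03B7)
```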

The Unicode standard defines the lower-case mapping of U+0048 as the Latin lower-case h (U+0068), while the lower-case mapping of U+0397 is the Greek lower-case eta (U+03B7), which looks something like a stretched n.

The Unicode package provides some tools to investigate these:
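For example, the u_char_name function (from the Unicode package; install it if needed) reports the official description of a code point:

```r
library(Unicode)
u_char_name(as.u_char(0x0048))  # "LATIN CAPITAL LETTER H"
u_char_name(as.u_char(0x0397))  # "GREEK CAPITAL LETTER ETA"
```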

In general, these sorts of ‘case-fold’ mappings are incredibly complicated and depend not only on the specific code point, but also the set of translation rules being used. (For historical and political reasons, certain languages have different lower/upper mappings for what are otherwise the same letter in Unicode.)

While you don’t need to know all of this complexity, it is essential to know that it’s out there and to rely on battle-tested libraries to perform these mappings.

Unicode is supplemented by the UTF-8 encoding, which controls how strings of bits are actually translated to code points. (Fonts then map code points to what you see on the screen.) UTF-8 is backward-compatible with ASCII, the most important legacy encoding, so it’s a good default. Modern websites and public data sources almost always present their contents in a UTF-8-compatible encoding (if not UTF-8 proper), so you should be OK.

A well-formatted website will state its encoding near the top of the page:

library(rvest)
read_html("http://www.baruch.cuny.edu") |>
    html_elements("meta[charset]") |>
    html_attr("charset")
[1] "UTF-8"

Advice: Whenever possible, make sure you are using UTF-8 strings: if your data is not UTF-8, reencode it to UTF-8 as soon as possible. This will save you much pain.

String Manipulation

Once data is in R and encoded as UTF-8 Unicode points, we have several tools for dealing with strings. Your first port of call should be the stringr package.

All the functions of the stringr package start with str_ and take a vector of strings as the first argument, making them well suited for chained analysis.

Let’s start with str_length which simply computes the length of each element. For the basic Latin alphabet, this more or less matches our intuition:
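For instance, with a few illustrative words:

```r
library(stringr)
str_length(c("apple", "banana", "kiwi"))
# [1] 5 6 4
```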

but it can be tricky for strings that involve Unicode combining characters.
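For example, here an X is followed by the combining overbar U+0305:

```r
library(stringr)
x <- "X\u0305"  # "X" plus a combining overbar
str_length(x)   # 2: the combining mark is a separate code point
```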

Here the “overbar” is a combining character which we add on to the X. This is commonly (though not always) used for languages with accents (e.g. French) or for languages where vowels are written above and below the main script (Arabic or Hebrew). This same idea is used for certain Emoji constructs:
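For example (the code points below are Man, U+1F468, and the Dark Skin Tone modifier, U+1F3FF):

```r
library(stringr)
man_dark <- "\U0001F468\U0001F3FF"
str_length(man_dark)  # 2 code points, though it may render as a single emoji
```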

Here, “Man with Dark Skin Tone” is the combination of “Man” and “Dark Skin Tone.” (Compare how this appears in your browser to how RStudio prints it if you copy this code to your computer.)

While there is complexity in all of Unicode, str_length will behave as you might expect for “regular” text. I’m going to stop showing the “scary side” of Unicode, but you should be aware of it for the remainder of these exercises.

Concatenation

You have already seen the base paste and paste0 functions for combining two string vectors together.

By default, paste combines strings with a space between them, while paste0 omits the space. paste is typically what you want for strings for human reading, while paste0 is a better guess for computer-oriented text (e.g., putting together a URL).

You can change the separator by passing a sep argument to paste:
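A quick comparison, with illustrative names:

```r
first <- c("John", "Jane")
last  <- c("Doe", "Roe")
paste(first, last)             # "John Doe"  "Jane Roe"
paste0(first, last)            # "JohnDoe"   "JaneRoe"
paste(first, last, sep = "_")  # "John_Doe"  "Jane_Roe"
```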

You can also combine together multiple elements of a vector using the collapse argument:
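For example:

```r
colors <- c("blue", "orange", "green")
paste(colors, collapse = " and ")  # "blue and orange and green"
```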

Exercises:

  1. Using the paste function, make a vector of strings like “John’s favorite color is blue” from the following data:
  2. Modify your answer to write a (run-on) sentence of favorite colors: “John’s favorite color is blue and Jane’s favorite color is orange and …”

Note that this does not provide terminating punctuation.

Substring Selection

When cleaning up data for analysis, it is common to need to take substrings from larger text. The str_sub function will do this for a fixed length:
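A sketch with hypothetical, consistently formatted IDs:

```r
library(stringr)
ids <- c("NYC-2021-001", "NYC-2022-017")
str_sub(ids, 5, 8)  # characters 5 through 8 hold the year: "2021" "2022"
```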

This behavior is useful when you are trying to extract a single piece from a longer bit of consistently formatted text, e.g., a computer log file or a set of IDs.

If you want to go all the way to the end, set the end argument to -1:
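For example:

```r
library(stringr)
str_sub("NYC-2021-001", 10, -1)  # from position 10 to the end: "001"
```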

Exercises

Using str_sub, remove the system name (CUNY or UC) and return only the campus name:

Detecting and Matching

Often we only need to know whether a particular substring is present in a larger string. We can use str_detect to do this:
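For example, with an illustrative vector of school names:

```r
library(stringr)
schools <- c("CUNY Baruch", "CUNY Hunter", "UC Berkeley")
str_detect(schools, "Hunter")  # FALSE TRUE FALSE
```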

This is particularly useful inside of dplyr commands:
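A sketch with a hypothetical table (the column name is made up):

```r
library(dplyr)
library(stringr)
schools <- tibble(name = c("CUNY Baruch", "UC Davis", "CUNY Hunter"))
schools |> filter(str_detect(name, "CUNY"))  # keeps the two CUNY rows
```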

The str_match function will return the text of the match. Here it adds little, but we’ll see that it becomes more powerful when we allow more flexible pattern specifications.

The str_subset function will return only those strings which match a certain pattern:

Exercises

Use str_subset to find the CUNY schools:

Specifying Patterns

While working by individual characters is sometimes useful (for very predictably formatted data), we generally need more powerful tools: regular expressions (RE) provide a compact language for specifying patterns in strings. We’ll introduce the basics here to help with string functions and then explore some more advanced RE features.

The most basic pattern is a set of elements in brackets: this means “any of these”.

For example, we want to see which names have an “A” in them:
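With an illustrative vector of names:

```r
library(stringr)
people <- c("Alice", "Bob", "Carla", "Dan")
str_detect(people, "[Aa]")  # TRUE FALSE TRUE TRUE
```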

Here, we consider a string to be a match if it has an A, an a or both.

Alternatively, we can see which strings contain numbers:
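For example, with illustrative strings:

```r
library(stringr)
rooms <- c("Room 101", "Lobby", "Floor 3")
str_detect(rooms, "[0123456789]")  # TRUE FALSE TRUE
```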

If we use str_match we can pull out the matching element:
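For example, with illustrative strings (str_match returns a matrix, with NA for non-matching strings):

```r
library(stringr)
str_match(c("Room 101", "Lobby"), "[0123456789]")
#      [,1]
# [1,] "1"
# [2,] NA
```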

By default, this only finds the first appearance of the pattern.

We can modify the pattern specifier to include count information. The basic behavior is to add explicit count bounds:

Here a single number is an exact count ({2}), while a pair ({2,3}) specifies a range. If one end of the range is left empty, it defaults to 0 (for the lower bound) or infinity (for the upper bound).
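For example:

```r
library(stringr)
x <- c("a1", "a12", "a123", "a1234")
str_detect(x, "[0-9]{3}")   # FALSE FALSE TRUE TRUE: three digits in a row
str_match(x, "[0-9]{2,3}")  # NA "12" "123" "123": two to three digits, matched greedily
```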

Certain count specifications are sufficiently useful to get their own syntax:

  • One or more: + is equivalent to {1,}
  • Zero or more: * is equivalent to {0,}
  • One or zero: ? is equivalent to {0,1}.
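For example, ? makes a single character optional:

```r
library(stringr)
str_detect(c("color", "colour"), "colou?r")  # TRUE TRUE
```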

Use these specifications for the next set of exercises.

Exercises

  1. Which strings contain a three digit number?

A more compact way of writing [0123456789] is \\d, so this could also be done with
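For example (the vector here is a stand-in for the exercise data):

```r
library(stringr)
strings <- c("Area 51", "Route 66", "Room 237")
str_detect(strings, "\\d{3}")  # FALSE FALSE TRUE
```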

Combining patterns

You can combine REs to make more complex patterns:

  • (a|b) means a or b. This is like [] notation but a, b can be more complex than single characters. For example, we can detect schools that are either CUNY or UC:
  • [^abc] means anything other than a, b, or c. You can often achieve a similar effect using the negate argument to str_detect, but you need this form specifically for str_match.
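A sketch of both forms, with illustrative vectors:

```r
library(stringr)
schools <- c("CUNY Baruch", "UC Davis", "SUNY Albany")
str_detect(schools, "(CUNY|UC)")       # TRUE TRUE FALSE
str_detect(c("abc", "abx"), "[^abc]")  # FALSE TRUE: "abx" has a character outside a, b, c
```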

Note that this does not require that all elements of the string don’t match, only that at least one character doesn’t fall in the specified range. If you want to have no matches, the negate=TRUE argument can help:2

  • ^ outside of a bracket denotes the start of a line:
  • $ denotes the end of a line:
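Combining both anchors with an optional character (the vector is illustrative):

```r
library(stringr)
drinks <- c("whiskey", "whisky", "whiskey sour")
str_detect(drinks, "^whiske?y$")  # TRUE TRUE FALSE
```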

Note that we made the e optional here, so we matched both the American (whiskey) and the Scottish spelling (whisky) of a distilled grain spirit.

See the stringr RE docs for more examples of regular expressions.

Exercises

  1. Use a regular expression to find which of these are fish species:
  2. Use a regular expression to find words with three or more vowels in a row:
  3. Find the words where “q” is not followed by a “u”

Replacement

The str_replace function allows us to replace a string with something else. This is particularly useful when cleaning up text:
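A sketch with hypothetical messy dollar amounts (fixed() treats the $ literally rather than as a regex anchor):

```r
library(stringr)
str_replace(c("$1,200", "$85"), fixed("$"), "")  # "1,200" "85"
```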

Note that this replaces only the first match in a string:

In some circumstances, you may want to replace all matches, and should instead use str_replace_all:
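For example:

```r
library(stringr)
x <- "one fish two fish"
str_replace(x, "fish", "cat")      # "one cat two fish"
str_replace_all(x, "fish", "cat")  # "one cat two cat"
```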

Character Classes

Some patterns are so common with regular expressions that ‘shorthand’ is given for them:

  • \d is short-hand for the set of digits [0123456789]
  • \D is short-hand for non-digits [^0123456789]
  • \s is short-hand for any sort of space (space, newline, tab, etc.)
  • \S is short-hand for anything that is not a space
  • \w is short-hand for the ‘normal word elements’ of basic English: [A-Za-z0-9_], that is, all digits, upper case letters, lower case letters, or an underscore
  • \W is short-hand for anything not in \w: [^A-Za-z0-9_]

Use of these is not required, but they are recommended as they will save some typing and be a bit more robust: e.g., you might think you are covering all letters with [A-Za-z] but unexpected accented letters might throw you off:
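For example (note that stringr’s ICU regex engine makes \\w Unicode-aware, so it covers accented letters that [A-Za-z] misses):

```r
library(stringr)
words <- c("cafe", "caf\u00e9")  # "cafe" and "café"
str_detect(words, "^[A-Za-z]+$")  # TRUE FALSE: é falls outside A-Za-z
str_detect(words, "^\\w+$")       # TRUE TRUE
```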

One word of warning that can be a bit tricky: when using these, you will typically need to put a double backslash in front of the character class name, e.g., writing \\S rather than \S. The reasons for this are a bit technical, but essentially, \x has a separate (non-regex) meaning in R string literals, so we have to ‘escape’ the \ by putting an extra backslash in front of it: \\. You should really read \\S as \\ + S, not \ + \S. We will discuss this a bit more in class since it gets tricky.

Capture Groups

Before we end, let’s introduce one more powerful feature of regular expressions: capture groups. These let us give an identity to the part of a string that matches part of a pattern and reference it again later.

The basic syntax for a capture group is simple: you simply surround the desired part with parentheses: for example,
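A reconstruction of the example (the city/state string and the exact pattern are guesses):

```r
library(stringr)
str_match("Albany, NY", "(\\S+), NY")
#      [,1]         [,2]
# [1,] "Albany, NY" "Albany"
```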

Here, we are using the short-hand \S for non-space character (i.e., letters, numbers, or punctuation).3 Here, we see that the output of str_match now returns the part of the string that matched the entire regex (Albany, NY) and the part that matched the capture group (Albany). If we want to just get the capture group and nothing else, we can use str_extract and provide the group argument:
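For example (the group argument requires stringr 1.5.0 or later):

```r
library(stringr)
str_extract("Albany, NY", "(\\S+), NY", group = 1)  # "Albany"
```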

The related function str_match_all will return all matches, not just the first.
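For example:

```r
library(stringr)
str_match_all("Room 101, Floor 3", "[0-9]+")  # a list containing "101" and "3"
```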

When we combine this with str_replace, we can actually reference the capture group in the output:

Here, \\N codes for the result of the \(N^{\text{th}}\) capture group.
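For example, swapping name order with backreferences:

```r
library(stringr)
str_replace("Doe, John", "(\\w+), (\\w+)", "\\2 \\1")  # "John Doe"
```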

This is very useful when dealing with semi-structured text:
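A sketch with hypothetical log lines:

```r
library(stringr)
logs <- c("user=alice status=ok", "user=bob status=fail")
str_replace(logs, "user=(\\w+) status=(\\w+)", "\\1: \\2")
# "alice: ok" "bob: fail"
```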

though this usually requires extensive trial-and-error to get the regular expression exactly correct.

Footnotes

  1. Technically, the Unicode consortium releases two standards, Unicode 17 and Emoji 17, but we’re not going to worry about this level of detail.↩︎

  2. Not all languages or functions have an equivalent to the negate=TRUE argument. We can get a similar effect using markers for beginning and end (see below) along with quantifiers, but the logic is a bit trickier:

    ↩︎
  3. We code this as \\S for somewhat technical reasons that will be discussed in class.↩︎