STA 9750 Week 12 Pre-Assignment: Strings and Things

This week, we begin to study the world of text data. While numerical data is reasonably straight-forward to deal with, text data is remarkably complex. A full discussion of text data requires understanding the vast world of human written language, but we will discuss enough of the major points to hopefully solve 95% of the challenges you will face in your career.

Goals

In our quest to understand text data, we have two major goals:

  • Understanding String Encodings and Unicode
  • Manipulating Strings with Regular Expressions

Before we get into these, let’s begin with a basic review of the character data type in R.

String Vectors

Recall that R works by default on vectors - ordered collections of the “same sort” of thing. R supports the following vector types:

  • Raw for pure access to bytes without any additional meaning: rarely useful for pure data-analytic work
  • Integer: 32-bit signed integers, ranging from \(-2^{30}\) to \(2^{30}-1\). (If you have done low-level work before, you might ask where the extra bit went: it’s used for encoding NA values.)
  • Numeric: 64-bit (double precision) floating point values, ranging from (approximately) \(\pm 10^{308}\). The detailed behavior of numeric (often called double) data is beyond this course, but it is well documented elsewhwere.
  • Character: the topic of today’s discussion.

R makes no difference between a character - in the sense of a single letter - and a string: in particular, each element of a character vector is an (arbitrary length) string. Specialized functions are required for work at the true “single letter” scale. If you come from other languages, this behavior might be surprising, but it allows R to handle much of the complexity associated with characters automagically, which greatly simplifies data analysis.

When speaking, we refer to R as using strings, even if R itself calls them character elements for historical reasons.

Encoding

How are strings represented on a computer? The answer has evolved over time, but the current state of the art - used by almost all non-legacy software - is based on the Unicode system and the UTF-8 encoding.

The Unicode system is comprised of two essential parts: - A numbered list of “letter-like” elements - Rules for manipulating those elements

While this seems simple, it is anything but. The history of string representations in computers is a long and painful story of programmers repeatedly underestimating the complexity of the seemingly simple task of listing “all the letters.”

The Unicode consortium makes a long list of characters that computers should be able to represent: the most recent version of the Unicode standard includes 149,813 characters divided into 161 scripts. These include everything from the basic (Anglo-American) Latin alphabet to the Greek and Cyrillic alphabets to Chinese and Japanese characters to the undeciphered Linear A alphabet and Tengwar, the fictional script used in the Lord of the Rings novels. The Unicode standard also includes a wide set of Emoji (approximately 4000) and many “modifying” characters.

To each of these, the Unicode consortium assigns a code point : a numerical identifier. Even superficially similar characters may be assigned different code points to distinguish them: for example, “H” is code point U+0048 with the official description “Latin Capital Letter H” while “Η” is U+0397, “Greek Capital Letter Eta.”

The difference between these characters is essential to know how to manipulate them:

Use the tolower function to lower-case each of these:

The Unicode standard defines the lower case mapping of U+0048 as the Latin lower case h, while the lower case mapping of U+0397 as the Greek lower case eta, which looks something like a streched n. 

In general, these mappings are incredibly complicated and depend not only on the specific code point, but also the set of translation rules being used. (Certain languages have different lower/upper mappings for what are otherwise the same letter.)

While you don’t need to know all of this complexity, it is essential to know that it’s out there and to rely on battle-tested libraries to perform these mappings.

Unicode is supplemented by the UTF-8 encodings, which controls how 0/1-bit strings are actually translated to code points. (Fonts then map code points to what you see on the screen.) UTF-8 is more-or-less back-compatible with other major encodings, so it’s a good default. When dealing with modern websites or public data sources, they almost always present their contents in a UTF-8 compatible encoding (if not UTF-8 proper) so you should be ok.

A well-formatted website will state its encoding near the top of the page:

library(rvest)
read_html("https://baruch.cuny.edu") |>
    html_elements("meta[charset]") |>
    html_attr("charset")
[1] "UTF-8"

Advice: Whenever possible, make sure you are using UTF-8 strings: if your data is not UTF-8, reencode it to UTF-8 as soon as possible. This will save you much pain.

String Manipulation

Once data is in R and encoded as UTF-8 Unicode points, we have several tools for dealing with strings. Your first port of call should be the stringr package.

All the functions of the stringr package start with str_ and take a vector of strings as the first argument, making them well suited for chained analysis.

Let’s start with str_length which simply computes the length of each element. For the basic Latin alphabet, this more or less matches our intuition:

library(stringr)
x <- c("I", "am", "a", "student", "at", "Baruch.")
str_length(x)
[1] 1 2 1 7 2 7

but it can be tricky for strings that involve Unicode combining characters.

str_length("X̅")
[1] 2

Here the “overbar” is a combining character which we add on to the X. This is commonly (though not always) used for languages with accents (e.g. French) or for languages where vowels are written above and below the main script (Arabic or Hebrew). This same idea is used for certain Emoji constructs:

str_length("👨🏿")
[1] 2

Here, “Man with Dark Skin Tone” is the combination of “Man” and “Dark Skin Tone.” (Compare how this appears in the rendered document to how RStudio prints it.)

While there is complexity in all of Unicode, str_length will behave as you might expect for “regular” text. I’m going to stop showing the “scary case” of Unicode, but you should be aware of it for the remainder of these exercises.

Concatenation

You have already seen the base paste and paste0 functions for combining two string vectors together.

x <- c("Michael", "Mary", "Gus")
y <- c("Son", "Daughter", "Dog")

paste(x, y)
[1] "Michael Son"   "Mary Daughter" "Gus Dog"      

By default, paste combines strings with a space between them, while paste0 omits the space. paste is typically what you want for strings for human reading, while paste0 is a better guess for computer-oriented text (e.g., putting together a URL).

You can change the separator by passing a sep argument to paste:

paste(x, y, sep = " is my ")
[1] "Michael is my Son"   "Mary is my Daughter" "Gus is my Dog"      

You can also combine together multiple elements of a vector using the collapse argument:

paste(x, collapse = " and ")
[1] "Michael and Mary and Gus"

Exercises:

Using the paste function, make a vector of strings like “John’s favorite color is blue”:

people <- c("John", "Jane", "Randy", "Tammi")
colors <- c("blue", "orange", "grey", "chartreuse")

Modify your answer to write a (run-on) sentence of favorite colors: “John’s favorite color is blue and Jane’s favorite color is orange and …”

people <- c("John", "Jane", "Randy", "Tammi")
colors <- c("blue", "orange", "grey", "chartreuse")

Substring Selection

When cleaning up data for analysis, it is common to need to take substrings from larger text. The str_sub function will do this:

x <- c("How", "much", "is", "that", "puppy", "in", "the", "window?")
str_sub(x, 1, 2)
[1] "Ho" "mu" "is" "th" "pu" "in" "th" "wi"

If you want to go all the way to the end, set the end element to -1:

str_sub(x, 2, -1)
[1] "ow"     "uch"    "s"      "hat"    "uppy"   "n"      "he"     "indow?"

Exercises

Using str_sub, remove the system name (CUNY or UC) and return only the campus name:

library(stringr)
uc_schools <- c("UC Berkeley", "UC San Diego", "UC Santa Cruz", "UC Davis")
cuny_schools <- c("Baruch College, CUNY", "City College, CUNY", "La Guardia Community College, CUNY")
str_sub(uc_schools)
[1] "UC Berkeley"   "UC San Diego"  "UC Santa Cruz" "UC Davis"     

Detect and Matching

Often we only need to know whether a particular substring is present in a larger string. We can use str_detect to do this:

library(stringr)
dogs <- c("basset hound", "greyhound", "labrador retreiver", "border collie", "Afgahn hound")
str_detect(dogs, "hound")
[1]  TRUE  TRUE FALSE FALSE  TRUE

The str_match function will return the text of the match. Here it’s useless, but we’ll see that it becomes more powerful when we allow more flexible pattern specifications.

library(stringr)
dogs <- c("basset hound", "greyhound", "labrador retreiver", "border collie", "Afgahn hound")
str_match(dogs, "hound")
     [,1]   
[1,] "hound"
[2,] "hound"
[3,] NA     
[4,] NA     
[5,] "hound"

Exercises

Use str_detect to find the CUNY schools:

library(stringr)
schools <- c("UC Davis", 
             "UC Santa Cruz", 
             "City College, CUNY", 
             "UC Berkeley", 
             "La Guardia Community College, CUNY", 
             "Baruch College, CUNY", 
             "UC San Diego")
str_sub(uc_schools)
[1] "UC Berkeley"   "UC San Diego"  "UC Santa Cruz" "UC Davis"     

Specifying Patterns

While working by individual characters is sometimes useful (for very regular data), we generally need more powerful tools: regular expressions (RE) provide a compact language for specifying patterns in strings. We’ll introduce the basics here to help with string functions and then explore some more advanced RE features.

The most basic pattern is a set of elements in brackets: this means “any of these”.

For example, we want to see which names have an “A” in them:

names <- c("Jane", "Rhonda", "Reggie", "Bernie", "Walter", "Arthur")
str_detect(names, "a") ## Wrong!
[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE
str_detect(names, "A") ## Wrong!
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
str_detect(names, "[Aa]") ## Right!
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

Alternatively, we can see which strings contain numbers:

x <- c("73 cows", "47 chickens", "a dozen eggs")
str_detect(x, "[0123456789]")
[1]  TRUE  TRUE FALSE

If we use str_match we can pull out the matching element:

x <- c("2 burgers", "3 soups", "5 fish")
str_match(x, "[0123456789]")
     [,1]
[1,] "2" 
[2,] "3" 
[3,] "5" 

By default, this only finds one appearance of the pattern:

x <- c("23 burgers", "34 soups", "56 fish")

# Why is this wrong?
str_match(x, "[0123456789]")
     [,1]
[1,] "2" 
[2,] "3" 
[3,] "5" 

We can modify the pattern specifier to include count information. The basic behavior is to add explicit count bounds:

x <- c("2 burgers", "34 soups", "567 fish")
str_match(x, "[0123456789]{2}")
     [,1]
[1,] NA  
[2,] "34"
[3,] "56"
str_match(x, "[0123456789]{3}")
     [,1] 
[1,] NA   
[2,] NA   
[3,] "567"
str_match(x, "[0123456789]{2,3}")
     [,1] 
[1,] NA   
[2,] "34" 
[3,] "567"
str_match(x, "[0123456789]{2,}")
     [,1] 
[1,] NA   
[2,] "34" 
[3,] "567"

Here a single number is an exact count ({2}), while pairs ({2,3}) specify a range. If one end of the range is left empty, it is 0 or infinite (depending on the direction).

Certain count specifications are sufficiently useful to get their own syntax:

  • One or more: + is equivalent to {1,}
  • Zero or more: * is equivalent to {0,}
  • One or zero: ? is equivalent to {0,1}.

Use these specifications for the next set of exercises.

Exercises

Which strings contain a three digit number?

str_detect(x, "\\d{3}")
str_detect(x, "\\d{3}")

Combining patterns

You can combine REs to make more complex patterns:

  • (a|b) means a or b. This is like [] notation but a, b can be more complex than single characters
x <- c("Baruch College, CUNY", "UC Berkeley", "Harvard University")
str_detect(x, "(CUNY|UC)")
[1]  TRUE  TRUE FALSE
  • [^abc] means anything other than a, b, c. You can often achieve a similar effect using the negate argument to str_detect, but you need this specifically for str_match
x <- c("10 blue fish", "three wet goats", "15 otters in hats")
str_detect(x, "[^0123456789]")
[1] TRUE TRUE TRUE
  • ^ outside of a bracket denotes the start of a line:
x <- c("rum", "white rum", "flavored rum")
str_detect(x, "^rum")
[1]  TRUE FALSE FALSE
  • $ denotes the end of a line:
x <- c("bourbon whiskey", "scotch whisky", "whiskey liqueurs")
str_detect(x, "whisk[e]?y$")
[1]  TRUE  TRUE FALSE

See the stringr RE docs for more examples of regular expressions.

Exercises

  1. Use a regular expression to find which of these are fish species:
  1. Use a regular expression to find words with three or more vowels in a row:
  1. Find the words where “q” is not followed by a “u”

Replacement

The str_replace function allows us to replace a string with something else. This is particularly useful when cleaning up text:

x <- c("Manhattan, NY", "Bronx, New York", "Brooklyn, ny", "Queens, nY")
str_replace(x, "([nN][yY]|New York)", "NY")
[1] "Manhattan, NY" "Bronx, NY"     "Brooklyn, NY"  "Queens, NY"