library(tidyverse)STA 9750 Week 4 Pre Assignment: Single-Table dplyr Verbs
Due Date: 2025-02-19 (Wednesday) at 11:45pm
Submission: CUNY Brightspace
This week, we begin manipulating R’s most important structure the data.frame. While base R provides tools for working with data.frame objects, our primary focus will be on the tidyverse family of tools. These provide a unified and consistent set of tools for working with data objects.
What is a data.frame?
Last week, we discussed vectors, one-dimensional collections of the same type of object. While vectors are an important computational tool, real data is rarely quite so simple: we collect multiple features (covariates) from each of our observations and these features may be of different type. For example, when performing a political survey, we may record the:
- Name (
character) - Age (
integer) - Gender (
factor)1 - Date of Contact (
Date) - Level of candidate support (
doubleornumeric)
of each respondant. It is natural to organize this in tabular (“spreadsheet”) form:
| Name | Age | Gender | Date of Contact | Level of Support |
|---|---|---|---|---|
| Timmy | 25 | M | 2024-09-13 | 0.25 |
| Tammy | 50 | F | 2024-06-20 | -1.35 |
| Taylor | 70 | X | 2024-08-15 | 200 |
| Tony | 40 | M | 2024-12-25 | 0 |
| Toni | 65 | F | 2024-11-28 | -4 |
Note several important features of this data:
- Each row corresponds to one, and only one, sample
- Each column corresponds to one, and only one, feature
- The values in each column are all of the same type
- The order doesn’t matter: all important data is reflected in the values, not the presentation
Generally, data in this pattern will be called “tidy”. R represents this type of data as a data.frame object. For more on what it means for data to be “tidy”, read this paper.
Tibbles and the Tidyverse
Many of the tools we will discuss in this class are from a set of related R packages, collectively known as the tidyverse. They are designed to i) read data from outside of R into tidy formats; ii) manipulate data from one tidy format to another; iii) communicate the results of tidy data analyses.
While you can load these packages separately, they are used together so frequently that the tidyverse package exists to load them all simultaneously. I recommend you start most of your analyses with the command:
This will automatically load the following packages:
lubridatefor date and time manipulationforcatsfor factor manipulationstringrfor string manipulationdplyrfor data frame manipulationpurrrfor functional programmingreadrfor tidy data importtidyrfor tidy data manipulationtibblefor data frame enhancementggplot2for data visualization
This week, we are focusing on functionality from the dplyr package.
You may, from time to time, see reference to tibbles in R documentation. A tibble is a “souped-up” data frame with somewhat better default printing. For almost all purposes-and everywhere in this course- you can substitute tibble with data.frame without issue.
Before we get into dplyr, let’s make sure we have a data frame to play with. Let’s bring back our friends, the Palmer penguins:
library(tidyverse)
library(palmerpenguins)
penguins# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Subsetting Data Frames
Our first task will be to subset data frames: that is, to select only some rows and columns and to return a smaller data frame than the one we started with.
Subsetting Columns
The main dplyr function for column subsetting is select(). In its simplest form, we provide a set of column names we want to keep and everything else gets dropped.
select(penguins, species, island)# A tibble: 344 × 2
species island
<fct> <fct>
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
# ℹ 334 more rows
Here, we keep the species and island columns and remove the others.
Pause for a moment to note how the select function is structured: the first argument is the data frame on which we are operating; all following arguments control the resulting behavior. This is a common pattern in dplyr functions, designed to take advantage of another key R functionality, the pipe operator.
R provides an operator |> which “rewrites” code, so
select(penguins, species, island)and
penguins |> select(species, island)are exactly the same thing as far as R is concerned. You may well ask yourself why bother: the second “piped” operator is a bit longer and a bit harder to type. But just hold on - we’ll see this makes for far cleaner code in the long run.
Just like base R, if we include - signs in our select statement, we get everything except a certain column:
penguins |> select(-bill_length_mm, -bill_depth_mm, -flipper_length_mm)# A tibble: 344 × 5
species island body_mass_g sex year
<fct> <fct> <int> <fct> <int>
1 Adelie Torgersen 3750 male 2007
2 Adelie Torgersen 3800 female 2007
3 Adelie Torgersen 3250 female 2007
4 Adelie Torgersen NA <NA> 2007
5 Adelie Torgersen 3450 female 2007
6 Adelie Torgersen 3650 male 2007
7 Adelie Torgersen 3625 female 2007
8 Adelie Torgersen 4675 male 2007
9 Adelie Torgersen 3475 <NA> 2007
10 Adelie Torgersen 4250 <NA> 2007
# ℹ 334 more rows
Here, we dropped three columns.
The select operator provides more advanced functionality which is useful for very complex data structures, but we don’t need to dive into that just yet.
Subsetting Rows
Often in data analysis, we want to focus on a subset of the entire population: e.g., we might want to know the fraction of women supporting a certain political candidate or the rate of a rare cancer among patients 65 or older. In this case, we need to select only those rows of our data that match some criterion. This brings us to the filter operator.
filter takes a logical vector and uses it to select rows of a data frame. Most commonly, this logical vector is created by performing some sort of tests on the values in the data frame. For example, if we want to select only the male penguins in our data set, we may write:
penguins |> filter(sex == "male")# A tibble: 168 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.3 20.6 190 3650
3 Adelie Torgersen 39.2 19.6 195 4675
4 Adelie Torgersen 38.6 21.2 191 3800
5 Adelie Torgersen 34.6 21.1 198 4400
6 Adelie Torgersen 42.5 20.7 197 4500
7 Adelie Torgersen 46 21.5 194 4200
8 Adelie Biscoe 37.7 18.7 180 3600
9 Adelie Biscoe 38.2 18.1 185 3950
10 Adelie Biscoe 38.8 17.2 180 3800
# ℹ 158 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Here, note the use of the == operator for testing equality. A very common mistake is to use the single equals (assignment) operator inside of filter. Thankfully, dplyr will alert us with an error if we make this mistake:
penguins |> filter(sex = "male")Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `sex == "male"`?
If we supply multiple tests to filter, we get the intersection: that is, we get rows that pass all tests.
penguins |> filter(sex == "male", bill_length_mm > 38)# A tibble: 156 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.3 20.6 190 3650
3 Adelie Torgersen 39.2 19.6 195 4675
4 Adelie Torgersen 38.6 21.2 191 3800
5 Adelie Torgersen 42.5 20.7 197 4500
6 Adelie Torgersen 46 21.5 194 4200
7 Adelie Biscoe 38.2 18.1 185 3950
8 Adelie Biscoe 38.8 17.2 180 3800
9 Adelie Biscoe 40.6 18.6 183 3550
10 Adelie Biscoe 40.5 18.9 180 3950
# ℹ 146 more rows
# ℹ 2 more variables: sex <fct>, year <int>
If we want the union-rows that satisfy any of the tests-we have to use the logical operators we previous applied to vectors:
penguins |> filter( (sex == "male") | (bill_length_mm > 38))# A tibble: 292 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 39.3 20.6 190 3650
5 Adelie Torgersen 38.9 17.8 181 3625
6 Adelie Torgersen 39.2 19.6 195 4675
7 Adelie Torgersen 42 20.2 190 4250
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 282 more rows
# ℹ 2 more variables: sex <fct>, year <int>
It’s a bit clunky, but thankfully, this is somewhat less common than checking that all conditions are satisfied.
dplyr provides all sorts of useful helpers for creating test statements, e.g., the
betweennear
functions.
Even more useful than these, however, are the slice_*() functions which can be used to perform “top \(k\)” type operations. If we want the five largest Adelie penguins, we might try something like:
penguins |> filter(species == "Adelie") |> slice_max(body_mass_g, n=5)# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 43.2 19 197 4775
2 Adelie Biscoe 41 20 203 4725
3 Adelie Torgersen 42.9 17.6 196 4700
4 Adelie Torgersen 39.2 19.6 195 4675
5 Adelie Dream 39.8 19.1 184 4650
# ℹ 2 more variables: sex <fct>, year <int>
slice_min() works similarly. slice_head and slice_tail get the first and last rows by sort order - in general, I recommend against using these. A key rule of data analysis is that the row order is not semantically meaningful, but it’s good to keep them in the back of your mind just in case. slice_sample can be used to select random subsets of data.
Another important subseting function is drop_na which will drop any rows with missing (NA) data.2 This is a good and useful tool, but before you apply it, always ask yourself “why is this data missing?” Understanding the abscence of data is often just as important as understanding the non-missing data. For example, is a student’s SAT score missing on a college application because they i) forgot to list it; ii) never took the SAT; or iii) took the test but chose to omit their score because they did poorly? Proper handling of missing data is often very problem-specific and very hard.
Manipulating and Creating Columns
A key task in data analysis is transforming data. As discussed in class, one of the guiding themes of R programming is data integrity and an important way R ensures this is by applying commands to the entire vector. In the data frame context, we apply commands to an entire column. The mutate function is dplyr’s main interface for column creation and manipulation.
In general, each argument to mutate takes a name = value pair: the name is the name of a column to be created from value. value can be an arbitrary function of other columns. If name corresponds to an existing column, that column is silently overwritten.
For example, if we want to convert penguin bill lengths from millimeters to inches, we might operator:
penguins |> mutate(bill_length_in = bill_length_mm / 25.4)# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_in <dbl>
Here, note that the bill_length_mm column is retained - and all columns are kept! mutate only creates new columns; it won’t secretly drop them.
A particularly common operator is changing the name of a column without changing its values. You can use mutate and select(-) for this, but rename provides essentially the same interface and its semantically clearer:
penguins |> rename(mass = body_mass_g)# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm mass sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 female
3 Adelie Torgersen 40.3 18 195 3250 female
4 Adelie Torgersen NA NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 193 3450 female
6 Adelie Torgersen 39.3 20.6 190 3650 male
7 Adelie Torgersen 38.9 17.8 181 3625 female
8 Adelie Torgersen 39.2 19.6 195 4675 male
9 Adelie Torgersen 34.1 18.1 193 3475 <NA>
10 Adelie Torgersen 42 20.2 190 4250 <NA>
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
Operating with Group Structure
We often want to summarize our data in useful ways: these can be simple summaries like mean (average) or standard deviation or more complex operations (trend lines). In the world of dplyr, the summarize function is used whenever we want to reduce multiple rows to a single data point. Note how this differs from filter: filter drops rows, summarize combines and reduces them.
For example, if we want to get the number of male penguins in our data set, we can use the n() function, which counts the number of rows:
penguins |>
filter(sex == "male") |>
summarize(number = n())# A tibble: 1 × 1
number
<int>
1 168
Look at all those male penguins!
This is a relatively simple operation, but we can be a bit more complex: e.g., with the mean function:
penguins |> summarize(body_mass_avg = mean(body_mass_g))# A tibble: 1 × 1
body_mass_avg
<dbl>
1 NA
Wait! Why didn’t that work?
Recall that the penguins data had some missing (NA) values. When we ask R to compute the average, it can’t! Specifically, depending on the missing values, the mean could be anything, so R returns a missing (unknown) value for the mean as well. Many base R functions have this default behavior and have an optional flag for automatically removing NA values:
penguins |> summarize(body_mass_avg = mean(body_mass_g, na.rm=TRUE))# A tibble: 1 × 1
body_mass_avg
<dbl>
1 4202.
Here na.rm=TRUE means to remove all na values before computing the mean.
The summarize function is particularly powerful when applied groupwise: e.g., what is the average body mass by species? In dplyr world, this is a two-step operation:
penguins |>
group_by(species) |>
summarize(body_mass_avg = mean(body_mass_g, na.rm=TRUE))# A tibble: 3 × 2
species body_mass_avg
<fct> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
We added the group_by operator here. Note that, on its own, group_by doesn’t really do anything:
penguins |>
group_by(species)# A tibble: 344 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We have added some grouping metadata but the actual data does not get changed until the summarize step.
We can also group by more than one element:
penguins |>
group_by(species, sex) |>
summarize(body_mass_avg = mean(body_mass_g, na.rm=TRUE))`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
# Groups: species [3]
species sex body_mass_avg
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Adelie <NA> 3540
4 Chinstrap female 3527.
5 Chinstrap male 3939.
6 Gentoo female 4680.
7 Gentoo male 5485.
8 Gentoo <NA> 4588.
Note here that the result is still grouped and that only the last (sex) grouping was removed. That means that any future operations will be automatically grouped by species. If you want to remove all grouping structure, add the ungroup operator at the end:
penguins |>
group_by(species, sex) |>
summarize(body_mass_avg = mean(body_mass_g, na.rm=TRUE)) |>
ungroup()`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 8 × 3
species sex body_mass_avg
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Adelie <NA> 3540
4 Chinstrap female 3527.
5 Chinstrap male 3939.
6 Gentoo female 4680.
7 Gentoo male 5485.
8 Gentoo <NA> 4588.
group_by metadata is also useful when summary statistics are computed implicitly by other functions. E.g., if we want to get all penguins that are above average mass for their species, we might try the following:
penguins |>
group_by(species) |>
filter(body_mass_g >= mean(body_mass_g, na.rm=TRUE)) |>
ungroup()# A tibble: 159 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 39.2 19.6 195 4675
4 Adelie Torgersen 42 20.2 190 4250
5 Adelie Torgersen 38.6 21.2 191 3800
6 Adelie Torgersen 34.6 21.1 198 4400
7 Adelie Torgersen 42.5 20.7 197 4500
8 Adelie Torgersen 46 21.5 194 4200
9 Adelie Biscoe 35.9 19.2 189 3800
10 Adelie Biscoe 38.2 18.1 185 3950
# ℹ 149 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Here, R applies the summarization function mean groupwise for us.
Before we close out, let’s put this all together: what species has the largest difference in average body mass between the sexes?
Answer: Gentoo penguins have the largest sex difference in average body mass.
Before completing the Brightspace submission for this assignment, look up the source for this document on my GitHub (Hint: see the buttons on the sidebar) and see i) how I computed the answer; and ii) how I included it in the rendered text. This will be helpful as you begin preparing your first report.
Footnotes
Recall a
factoris a vector with a fixed set of possible values, often used to represent categorical data. Here, we follow the NY DMV and allowM,F, andXvalues for sex, but, in general, representation of sex and gender in databases is a tricky problem. See this essay for a list of some of the complexity of real people. (This essay follows in a longer tradition of “the world is much more complicated than you would believe” essays: names, time, addresses. People - and the world we create - are infinitely complex.↩︎We will say much more about
R’s missing data model in class this week.↩︎