STA 9750 Week 4 Pre Assignment: Single-Table dplyr
Verbs
Due Date: 2025-09-17 (Wednesday) at 11:59pm
Submission: CUNY Brightspace
This week, we begin manipulating R
’s most important structure the data.frame
. While base R
provides tools for working with data.frame
objects, our primary focus will be on the tidyverse
family of tools. These provide a unified and consistent set of tools for working with data objects.
What is a data.frame
?
Last week, we discussed vectors, one-dimensional collections of the same type of object. While vectors are an important computational tool, real data is rarely quite so simple: we collect multiple features (covariates) from each of our observations and these features may be of different type. For example, when performing a political survey, we may record the:
- Name (
character
) - Age (
integer
) - Gender (
factor
)1 - Date of Contact (
Date
) - Level of candidate support (
double
ornumeric
)
of each respondant. It is natural to organize this in tabular (“spreadsheet”) form:
Name | Age | Gender | Date of Contact | Level of Support |
---|---|---|---|---|
Timmy | 25 | M | 2024-09-13 | 0.25 |
Tammy | 50 | F | 2024-06-20 | -1.35 |
Taylor | 70 | X | 2024-08-15 | 200 |
Tony | 40 | M | 2024-12-25 | 0 |
Toni | 65 | F | 2024-11-28 | -4 |
Note several important features of this data:
- Each row corresponds to one, and only one, sample
- Each column corresponds to one, and only one, feature
- The values in each column are all of the same type
- The order doesn’t matter: all important data is reflected in the values, not the presentation or order of the rows
Generally, data in this pattern will be called “tidy”. R
represents this type of data as a data.frame
object. For more on what it means for data to be “tidy”, read this paper.
Tibbles and the Tidyverse
Many of the tools we will discuss in this class are from a set of related R
packages, collectively known as the tidyverse
. They are designed to i) read data from outside of R
into tidy formats; ii) manipulate data from one tidy format to another; iii) communicate the results of tidy data analyses.
While you can load these packages separately, they are used together so frequently that the tidyverse
package exists to load them all simultaneously. I recommend you start most of your analyses with the command:
library(tidyverse)
This will automatically load the following packages:
lubridate
for date and time manipulationforcats
for factor manipulationstringr
for string manipulationdplyr
for data frame manipulationpurrr
for functional programmingreadr
for tidy data importtidyr
for tidy data manipulationtibble
for data frame enhancementggplot2
for data visualization
When loading these, you will get a message (not error!) if the names used by any of the tidyverse
packages conflict with names used by other packages. The most common issue is a function called filter
from dplyr
conflicting with filter
from stats
. filter
in the stats
package is used to do linear filtering on time series - we’re not doing that in this class, so letting dplyr
“own” the name, which happens by default, is fine for us.
This week, we are focusing on functionality from the dplyr
package.
You may, from time to time, see reference to tibble
s in R
documentation. A tibble
is a “souped-up” data frame with somewhat better default printing. For almost all purposes-and everywhere in this course- you can substitute tibble
with data.frame
without issue.
Before we get into dplyr
, let’s make sure we have a data frame to play with. Let’s bring back our friends, the Palmer penguins:2
So much tuxedo goodness!
Subsetting Data Frames
Our first task will be to subset data frames: that is, to select only some rows and columns and to return a smaller data frame than the one we started with.
Subsetting Columns
The main dplyr
function for column subsetting is select()
. In its simplest form, we provide a set of column names we want to keep and everything else gets dropped.
Here, we keep the species
and island
columns and remove the others.
Pause for a moment to note how the select
function is structured: the first argument is the data frame on which we are operating; all following arguments control the resulting behavior. This is a common pattern in dplyr
functions, designed to take advantage of another key R
functionality, the pipe
operator.
R
provides an operator |>
which “rewrites” code, so
and
are exactly the same thing as far as R
is concerned. You may well ask yourself why bother: the second “piped” operator is a bit longer and a bit harder to type. But just hold on - we’ll see this makes for far cleaner code in the long run.
Just like base R
, if we include -
signs in our select
statement, we get everything except a certain column:
Here, we dropped three columns.
The select
operator provides more advanced functionality which is useful for very complex data structures, but we don’t need to dive into that just yet.
Subsetting Rows
Often in data analysis, we want to focus on a subset of the entire population: e.g., we might want to know the fraction of women supporting a certain political candidate or the rate of a rare cancer among patients 65 or older. In this case, we need to select only those rows of our data that match some criterion. This brings us to the filter
operator.
filter
takes a logical vector and uses it to select rows of a data frame. Most commonly, this logical vector is created by performing some sort of tests on the values in the data frame. For example, if we want to select only the male
penguins in our data set, we may write:
Here, note the use of the ==
operator for testing equality. A very common mistake is to use the single equals (assignment) operator inside of filter
. Thankfully, dplyr
will alert us with an error
if we make this mistake:
If we supply multiple tests to filter
, we get the intersection: that is, we get rows that pass all tests.
If we want the union-rows that satisfy any of the tests-we have to use the logical operators we previous applied to vectors:
It’s a bit clunky, but thankfully, this is somewhat less common than checking that all conditions are satisfied.
dplyr
provides all sorts of useful helpers for creating test statements, e.g., the
between
near
functions.
Even more useful than these, however, are the slice_*()
functions which can be used to perform “top \(k\)” type operations. If we want the five largest Adelie penguins, we might try something like:
slice_min()
works similarly. slice_head
and slice_tail
get the first and last rows by sort order - in general, I recommend against using these. A key rule of data analysis is that the row order is not semantically meaningful, but it’s good to keep them in the back of your mind just in case. slice_sample
can be used to select random subsets of data.
Another important subseting function is drop_na
which will drop any rows with missing (NA
) data.3 This is a good and useful tool, but before you apply it, always ask yourself “why is this data missing?” Understanding the abscence of data is often just as important as understanding the non-missing data. For example, is a student’s SAT score missing on a college application because they i) forgot to list it; ii) never took the SAT; or iii) took the test but chose to omit their score because they did poorly? Proper handling of missing data is often very problem-specific and very hard.
Manipulating and Creating Columns
A key task in data analysis is transforming data. As discussed in class, one of the guiding themes of R
programming is data integrity and an important way R
ensures this is by applying commands to the entire vector. In the data frame context, we apply commands to an entire column. The mutate
function is dplyr
’s main interface for column creation and manipulation.
In general, each argument to mutate
takes a name = value
pair: the name
is the name of a column to be created from value
. value
can be an arbitrary function of other columns. If name
corresponds to an existing column, that column is silently overwritten.
For example, if we want to convert penguin bill lengths from millimeters to inches, we might operator:
Here, note that the original bill_len
column is retained - and all columns are kept! mutate
only creates new columns; it won’t secretly drop them.
A particularly common operator is changing the name of a column without changing its values. You can use mutate
and select(-)
for this, but rename
provides essentially the same interface and its semantically clearer:
Operating with Group Structure
We often want to summarize our data in useful ways: these can be simple summaries like mean (average) or standard deviation or more complex operations (trend lines). In the world of dplyr
, the summarize
function is used whenever we want to reduce multiple rows to a single data point. Note how this differs from filter
: filter
drops rows, summarize
combines and reduces them.
For example, if we want to get the number of male
penguins in our data set, we can use the n()
function, which counts the number of rows:
Look at all those male penguins!
This is a relatively simple operation, but we can be a bit more complex: e.g., with the mean function:
Wait! Why didn’t that work?
Recall that the penguins data had some missing (NA
) values. When we ask R
to compute the average, it can’t! Specifically, depending on the missing values, the mean could be anything, so R
returns a missing (unknown) value for the mean as well. Many base R
functions have this default behavior and have an optional flag for automatically removing NA
values:
Here na.rm=TRUE
means to r
em
ove all na
values before computing the mean.
The summarize
function is particularly powerful when applied groupwise: e.g., what is the average body mass by species? In dplyr
world, this is a two-step operation:
We added the group_by
operator here. Note that, on its own, group_by
doesn’t really do anything:
We have added some grouping metadata but the actual data does not get changed until the summarize
step.
We can also group by more than one element:
Note here that the result is still grouped and that only the last (sex
) grouping was removed. That means that any future operations will be automatically grouped by species
. If you want to remove all grouping structure, add the ungroup
operator at the end:
group_by
metadata is also useful when summary statistics are computed implicitly by other functions. E.g., if we want to get all penguins that are above average mass for their species, we might try the following:
Here, R
applies the summarization function mean
groupwise for us.
Before we close out, let’s put this all together: what species has the largest difference in average body mass between the sexes?
Answer: Gentoo penguins have the largest sex difference in average body mass.
Before completing the Brightspace submission for this assignment, look up the source for this document on my GitHub (Hint: see the buttons on the sidebar) and see i) how I computed the answer; and ii) how I included it in the rendered text. This will be helpful as you begin preparing your first report.
Footnotes
Recall a
factor
is a vector with a fixed set of possible values, often used to represent categorical data. Here, we follow the NY DMV and allowM
,F
, andX
values for sex, but, in general, representation of sex and gender in databases is a tricky problem. See this essay for a list of some of the complexity of real people. (This essay follows in a longer tradition of “the world is much more complicated than you would believe” essays: names, time, addresses. People - and the world we create - are infinitely complex.↩︎The
penguins
data set was originally distributed in a separate package calledpalmerpenguins
. If you have an version ofR
from before R 4.5, you may need toinstall.packages
andlibrary
that package to access this data. Note that that version has slightly different names of certain columns than we have here. The following snippet can be used to check if that package is needed and translate between the two versions:if(getRversion() < "4.5"){ if(!require("palmerpenguins")) install.packages("palmerpenguins") data(penguins, package="palmerpenguins") }colnames(penguins) <- c("species", "island", "bill_len", "bill_dep", "flipper_len", "body_mass", "sex", "year")
We will say much more about
R
’s missing data model in class this week.↩︎