STA 9750 Week 5 Pre Assignment: Multi-Table dplyr Verbs

Due Date: 2025-09-29 (Monday) at 11:59pm

Submission: CUNY Brightspace

Last week, we considered single-table verbs: these are appropriate for asking complex questions of a nicely formatted data frame. Sadly, we are rarely provided data frames suitable for every question we might seek to answer. Instead, we typically need to combine information from multiple sources. For instance, if we want to examine the relationship between demographics and electoral results, we will need to combine information from the US Census Bureau and local elections administrators. Or, if we want to investigate the relationship between a company’s financial status and its stock performance, we might need to combine information from multiple databases.

Basic Joins

The basic operation for combining different tables is the join, patterned after SQL. Each join operates using some notion of “match” between tables. In the best case, this is done using a unique identifier - one universal and consistent name for each entity. Sadly, such perfect identifiers rarely exist. For instance, companies change their names and their ticker symbols somewhat regularly (e.g., Facebook becoming Meta).

The simplest join is the inner_join, which returns rows which match between the two tables:
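As a sketch, dplyr ships with two small example tables, band_members and band_instruments, which match the discussion below:

```r
library(dplyr)

band_members     # name/band:  Mick/Stones, John/Beatles, Paul/Beatles
band_instruments # name/plays: John/guitar, Paul/bass, Keith/guitar

# Keep only the rows that match in both tables (John and Paul)
joined <- inner_join(band_members, band_instruments)
joined
```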

In this case, the name column forms a unique ID, so we can use it for our join.

We see here that R automatically performed the join using the common column (name); if we want to be clearer, we can specify the join column ourselves:
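With an explicit join_by, the same call might read:

```r
library(dplyr)

# Same result as the automatic join, but the join column is stated explicitly
joined <- inner_join(band_members, band_instruments, join_by(name))
joined
```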

Here join_by is a helper function that can be used to specify the form of the join used. In some contexts, the “common” column has different names in the two tables, so we can use a more explicit call to join_by:
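dplyr also ships band_instruments2, where the shared column is instead named artist; a sketch of the more explicit call:

```r
library(dplyr)

band_instruments2 # artist/plays: John/guitar, Paul/bass, Keith/guitar

# Match name (left table) against artist (right table)
joined <- inner_join(band_members, band_instruments2, join_by(name == artist))
joined
```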

Note that this is the same as band_instruments, but with the name column changed to artist.

I like to always use this final - most explicit - form, even when the column names are the same between the two tables (join_by(name == name)).

Let’s look more closely at the result here: we get back a table with three columns and two rows:

The three columns are pretty easy to understand:

  • name is the (shared) name column from each table
  • band comes from the band_members table
  • plays comes from the band_instruments table.

The two rows are a bit trickier: each of our input tables had three rows, but there were only two “overlaps”, so that’s what we get back from an inner_join. Specifically, we drop Mick [Jagger] from band_members because he doesn’t appear in band_instruments, and we drop Keith [Richards] from band_instruments because he doesn’t appear in band_members.

In brief, an inner join is an intersection join. We only get rows back which have a match in both tables.

Other join operators have complementary behaviors: the full join (also sometimes called an outer join) is basically a union join. We get back all rows from both tables, regardless of whether a match has been found. But what happens with those unmatched rows?
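A sketch of a full join on the same tables:

```r
library(dplyr)

# All four people appear; unmatched cells are filled with NA
joined <- full_join(band_members, band_instruments, join_by(name == name))
joined
```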

R fills in the “blanks” with NA values. Here, we can assume Mick [Jagger] plays an instrument, but it is simply unknown to R.

Finally, we have the intermediate left join, which keeps all rows from one table whether or not they have a match:
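As a sketch:

```r
library(dplyr)

# All rows of band_members survive; Keith (right table only) is dropped
joined <- left_join(band_members, band_instruments, join_by(name == name))
joined
```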

Here we keep Mick because he is in band_members, even though he is missing from band_instruments. Conversely, we drop Keith because he isn’t in band_members (even though he is in band_instruments).

R also provides a right_join, but it’s not really different: it’s just a “flipped” left_join: left_join(x, y) == right_join(y, x).
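A quick check of that equivalence - the two calls return the same information, though the column (and possibly row) order differs:

```r
library(dplyr)

r <- right_join(band_members, band_instruments, join_by(name == name))
l <- left_join(band_instruments, band_members, join_by(name == name))

# Same rows and columns, modulo ordering
setequal(names(r), names(l))
```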

The following image¹ summarizes the different types of joins:

The anti_join returns elements that appear in one data set, but not the other. It’s rarer, but occasionally useful.
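As a sketch:

```r
library(dplyr)

# Rows of band_members with no match in band_instruments: just Mick
unmatched <- anti_join(band_members, band_instruments, join_by(name == name))
unmatched
```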

Joins with Repeats

In the previous examples, we have seen joins that have a “one-to-one” (inner) or possibly “one-to-none” (full, left) structure. In many circumstances, we find ourselves with a “one-to-many” type structure, even when both data sets are “tidy”. This typically occurs because different data sets have different models of what a “unit” is. For example, consider a hypothetical instructor who has i) a table with student names and contact information; and ii) a table with grades on different assignments.

tribble

In the following examples, I use the tribble function to quickly create a data frame. The specifics of tribble are beyond the scope of this course.

What happens if we join these?
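A sketch with hypothetical tables (the student names are invented, chosen to match the discussion below):

```r
library(dplyr)

students <- tribble(
  ~id, ~name,
    1, "Baruch",
    2, "Hunter",
    3, "John Jay"
)

grades <- tribble(
  ~student_id, ~assignment, ~grade,
            1,       "HW1",     90,
            1,       "HW2",     85,
            1,       "HW3",     95,
            2,       "HW1",     80,
            2,       "HW2",     90,
            2,       "HW3",     70,
            3,       "HW1",    100,
            3,       "HW2",     60,
            3,       "HW3",     80
)

# One output row per student-assignment pair: 9 rows here
joined <- left_join(students, grades, join_by(id == student_id))
joined
```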

(Note here that we need the explicit join_by since the column names don’t match between the two tables: id in students gets joined to student_id. This pattern of id in table tbl getting joined to tbl_id elsewhere is quite common in database design.)

We get repeats of the student rows: for each valid student-grade pair, we have a row. From here, we can compute final grades:
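With the same hypothetical students and grades tables (re-created here), final grades might be computed as:

```r
library(dplyr)

students <- tribble(~id, ~name, 1, "Baruch", 2, "Hunter", 3, "John Jay")
grades <- tribble(
  ~student_id, ~assignment, ~grade,
  1, "HW1", 90,   1, "HW2", 85,   1, "HW3", 95,
  2, "HW1", 80,   2, "HW2", 90,   2, "HW3", 70,
  3, "HW1", 100,  3, "HW2", 60,   3, "HW3", 80
)

final <- left_join(students, grades, join_by(id == student_id)) |>
  group_by(name) |>
  summarize(final_grade = mean(grade))
final
```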

In this case, everything works well. But let’s try a slightly trickier case, with some students who fail to submit certain assignments.
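As a sketch, suppose Hunter’s HW3 and John Jay’s HW2 were never submitted, so those rows are simply absent from the (hypothetical) grades table:

```r
library(dplyr)

students <- tribble(~id, ~name, 1, "Baruch", 2, "Hunter", 3, "John Jay")
grades <- tribble(
  ~student_id, ~assignment, ~grade,
  1, "HW1", 90,   1, "HW2", 85,   1, "HW3", 95,
  2, "HW1", 80,   2, "HW2", 90,   # Hunter's HW3 row is missing
  3, "HW1", 100,  3, "HW3", 80    # John Jay's HW2 row is missing
)

final <- left_join(students, grades, join_by(id == student_id)) |>
  group_by(name) |>
  summarize(final_grade = mean(grade))
final
# Hunter rises from 80 to 85; John Jay from 80 to 90
```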

Why did the final grades go up after we deleted rows?

The “missing” assignments for Hunter and John Jay aren’t reported as zeros - they are just ignored! And hence R takes an average over the two assignments where these students did well, not all three assignments. We’ll talk about one way to fix this below, but for now I’m just flagging it as a possible issue that can come up with missing data and joins. (Here the rows were missing, so it’s harder to catch than a plain NA; better data management would have included a “0” row instead of deleting them, but we don’t always get to assume super well-organized data.)

So far, our join results have been relatively straightforward because we have had ‘good’ unique identifiers. If we find ourselves in a situation where we lack unique IDs, things can go wrong quickly:
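A hypothetical example where name is the only - and non-unique - key:

```r
library(dplyr)

# Two different students who happen to share the name "Alex"
students <- tribble(
  ~name,  ~email,
  "Alex", "alex.a@example.edu",
  "Alex", "alex.b@example.edu"
)
grades <- tribble(
  ~name,  ~assignment, ~grade,
  "Alex", "HW1",       90,
  "Alex", "HW1",       75
)

# dplyr warns about a many-to-many relationship: 2 x 2 = 4 rows
joined <- inner_join(students, grades, join_by(name == name))
joined
```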

In this case, R is kind enough to warn us that a “many-to-many” join has happened (joining multiple students to one grade and multiple grades to one student). This is a very good warning, and it highlights a true error here. If faced with data like this, you may not be able to address it without fixing the underlying data, but at least you know something has gone awry.

Compound Joins

Often, data lack a unique identifier, but you can piece one together with several columns: that is, taken on its own, no column is unique, but the tuples formed by combining several columns are unique, e.g., data with year, month, and day columns.

In this case, a simple join on any one column goes astray:
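A sketch with hypothetical daily tables, where the date is spread across year, month, and day columns:

```r
library(dplyr)

sales <- tribble(
  ~year, ~month, ~day, ~revenue,
   2025,      1,    1,      100,
   2025,      1,    2,      200,
   2025,      2,    1,      150
)
visits <- tribble(
  ~year, ~month, ~day, ~n_visits,
   2025,      1,    1,        10,
   2025,      1,    2,        20,
   2025,      2,    1,        15
)

# Joining on day alone matches 2025-01-01 with 2025-02-01 (and vice versa):
# a many-to-many warning, and 5 rows where there should be 3
bad <- inner_join(sales, visits, join_by(day == day))
bad
```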

Note the warning!

In this scenario, we should really “tidy” up the data by combining the date information, which is spread across three columns, into a single column, but we have the alternative option of a compound join:
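With the same hypothetical tables, the compound join might read:

```r
library(dplyr)

sales <- tribble(~year, ~month, ~day, ~revenue,
                 2025, 1, 1, 100,  2025, 1, 2, 200,  2025, 2, 1, 150)
visits <- tribble(~year, ~month, ~day, ~n_visits,
                  2025, 1, 1, 10,  2025, 1, 2, 20,  2025, 2, 1, 15)

# All three parts of the date must match: one row per true date
good <- inner_join(sales, visits,
                   join_by(year == year, month == month, day == day))
good
```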

Here, as with filter, the list of conditions looks for an intersection: we want all three parts of the date to match.

Pivots

Finally, we may want to re-arrange the output of a join. Returning to our grades example from above:

This isn’t really how we like to see gradebooks: a “wider” format, with a column for each assignment, may be preferable. In this case, we want to use the pivot_wider function from the tidyr package.

pivot_wider takes a few key arguments:

  • id_cols: which columns (taken together) uniquely identify a row in the final table
  • names_from: where should we get the column names of the new table
  • values_from: where should we get the values of the new table

This is maybe easier by example:
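A sketch with the hypothetical grade book from above:

```r
library(dplyr)
library(tidyr)

long_grades <- tribble(
  ~name,      ~assignment, ~grade,
  "Baruch",   "HW1", 90,  "Baruch",   "HW2", 85,  "Baruch",   "HW3", 95,
  "Hunter",   "HW1", 80,  "Hunter",   "HW2", 90,  "Hunter",   "HW3", 70,
  "John Jay", "HW1", 100, "John Jay", "HW2", 60,  "John Jay", "HW3", 80
)

# One row per student; one column per assignment
wide <- pivot_wider(long_grades,
                    id_cols     = name,
                    names_from  = assignment,
                    values_from = grade)
wide
```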

This pivot trick is particularly useful for finding missing rows, like those that tripped us up earlier:
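With Hunter’s HW3 and John Jay’s HW2 rows absent (as in the earlier hypothetical), the gaps surface as NA cells:

```r
library(dplyr)
library(tidyr)

long_grades <- tribble(
  ~name,      ~assignment, ~grade,
  "Baruch",   "HW1", 90,  "Baruch",   "HW2", 85,  "Baruch", "HW3", 95,
  "Hunter",   "HW1", 80,  "Hunter",   "HW2", 90,
  "John Jay", "HW1", 100, "John Jay", "HW3", 80
)

wide <- pivot_wider(long_grades, id_cols = name,
                    names_from = assignment, values_from = grade)
wide
# Hunter's HW3 and John Jay's HW2 are now visibly NA
```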

Here, our missing values are now explicit!

We can also explicitly fill the NA with a value of our choice, here 0:
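Using pivot_wider’s values_fill argument on the same hypothetical data:

```r
library(dplyr)
library(tidyr)

long_grades <- tribble(
  ~name,      ~assignment, ~grade,
  "Hunter",   "HW1", 80,  "Hunter",   "HW2", 90,
  "John Jay", "HW1", 100, "John Jay", "HW3", 80
)

wide <- pivot_wider(long_grades, id_cols = name,
                    names_from = assignment, values_from = grade,
                    values_fill = 0)
wide
# Missing submissions now appear as 0 rather than NA
```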

There is also an inverse operator pivot_longer which takes a wide table (like this) and makes it longer.

To complete our grade book example, we might want to take the average across the three grade columns:
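A sketch, starting directly from the wide (hypothetical) grade book:

```r
library(dplyr)

wide <- tribble(
  ~name,      ~HW1, ~HW2, ~HW3,
  "Baruch",     90,   85,   95,
  "Hunter",     80,   90,   70,
  "John Jay",  100,   60,   80
)

# For each student (group), average the three grade columns
final <- wide |>
  group_by(name) |>
  mutate(final_grade = mean(c_across(HW1:HW3))) |>
  ungroup()
final
```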

Note here that we need to use a mutate since our final grade book has the same number of rows before and after we add the final average column. The c_across function here is a variant of the standard c function used to combine scalars: here we’re creating a new length-3 vector of the student’s three grades and passing it to the mean function.

What is group_by(name) doing here? See what happens without it:
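Without the grouping, the hypothetical example collapses to a single grand mean:

```r
library(dplyr)

wide <- tribble(
  ~name,      ~HW1, ~HW2, ~HW3,
  "Baruch",     90,   85,   95,
  "Hunter",     80,   90,   70,
  "John Jay",  100,   60,   80
)

# mean() now pools all nine grades: every student gets the same "final"
ungrouped <- wide |> mutate(final_grade = mean(c_across(HW1:HW3)))
ungrouped
```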

Recall that mean is a summarization function - it will combine data from across rows if no grouping structure is present. Since we want separate averages for each student, we need a group_by. In this case, name is a unique identifier for each student, so we can group on it. We also have the rowwise() helper, which automatically creates group structure with each group being a separate row. If you don’t have a clean unique identifier, or just can’t think of one easily, this is sometimes a useful helper.

In class, we will explore joins in more detail by combining the flights data with plane, airport, and weather data.

Please now go fill out the weekly quiz on Brightspace.

Footnotes

  1. Adapted from Data Carpentry