library(dplyr)
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
dplyr
VerbsDue Date: 2024-09-25 (Wednesday) at 11:45pm
Submission: CUNY Brightspace
Last week, we considered single-table verbs: these are appropriate for asking complex questions of a nicely formatted data frame. Sadly, we are rarely provided data frames suitable for every question we might seek to answer. Instead, we typically need to combine information from multiple sources. For instance, if we want to examine the relationship between demographics and electoral results, we will need to combine information from the US Census Bureau and local elections administrators. Or, if we want to investigate the relationship between a company’s financial status and its stock performance, we might need to combine information from multiple databases.
The basic operator of combining different tables is the join
, patterned after SQL. Each join operates using some notion of “match” between tables. In the best case, this is done using a unique identifier - one universal and consistent name for each entity. Sadly, such perfect identifiers rarely exist. For instance, companies change their names and their ticker symbols somewhat regularly (e.g., Facebook becoming Meta)
The simplest join is the inner_join
, which returns rows which match between the two tables:
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
In this case, the name
column forms a unique ID, so we can use it for our join.
Joining with `by = join_by(name)`
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
We see here that R
automatically performed the join using the common column (name
): if we want to be clearer, let’s specify the join element ourselves:
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
Here join_by
is a helper function that can be used to specify the form of the join used. In some contexts, the “common” column has different names in the two tables, so we can use a more explicit call to join_by
:
# A tibble: 3 × 2
artist plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
Note that this is the same as band_instruments
, but with the name
column changed to artist
.
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
I like to always use this final - most explicit - form, even when the column names are the same between the two tables (join_by(name == name)
).
Let’s look more closely at the result here: we return a table with 3 columns and two rows:
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
The three columns are pretty easy to understand:
name
is the (shared) name
column from each tableband
comes from the band_members
tableplays
comes from the band_instruments
table.The two rows are a bit trickier: each of our input tables had three rows, but there were only two “overlaps” so that’s what we get back from an inner_join
. Specifically, we drop Mick [Jagger] from band_members
because he doesn’t appear in band_instruments
and we drop Keith [Richards] from band_instruments
because he doesn’t appear in band_members
.
In brief, an inner join is an intersection join. We only get rows back which have a match in both tables.
Other join operators have complimentary behaviors: the full join (also sometimes called an outer join) is basically a union join. We get back all rows from both tables, regardless of whether a match has been found. But what happens with those unmatched rows?
# A tibble: 4 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
R
fills in the “blanks” with NA
values. Here, we can assume Mick [Jagger] plays an instrument, but it is unknown to R
here.
Finally, we have the intermediate left join, which keeps all rows from one table whether or not they have a match:
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
Here we keep Mick because he is in band_members
, even though he is missing from band_instruments
. Conversely, we drop Keith because he isn’t in band_members
(even though he is in band_instruments
).
R
also provides a right_join
, but it’s not really different: it’s just a “flipped” left_join
: left_join(x, y) == right_join(y, x)
.
The following image1 summarizes the different types of joins:
The anti_join
returns elements that appear in one data set, but not the other. It’s rarer, but occasionally useful.
In the previous examples, we have seen joins that have a “one-to-one” (inner) or possibly “one-to-none” (full, left) structure. In many circumstances, we find ourselves with a “one-to-many” type structure, even when both data sets are “tidy”. This typically occurs because different data sets have different models of what a “unit” is. For example, consider a hypothetical instructor who has i) a table with student names and contact information; and ii) a table with grades on different assignments.
students <- tribble(
~name, ~email, ~id,
"Bernard", "bernard@cuny.edu", 1,
"Hunter", "hunter@cuny.edu", 2,
"John Jay", "john.jay@cuny.edu", 3
)
grades <- tribble(
~student_id, ~assignment_id, ~grade,
1, "A", 100,
2, "A", 95,
3, "A", 80,
1, "B", 95,
2, "B", 80,
3, "B", 50,
1, "C", 95,
2, "C", 50,
3, "C", 80
)
What happens if we join these?
# A tibble: 9 × 5
name email id assignment_id grade
<chr> <chr> <dbl> <chr> <dbl>
1 Bernard bernard@cuny.edu 1 A 100
2 Bernard bernard@cuny.edu 1 B 95
3 Bernard bernard@cuny.edu 1 C 95
4 Hunter hunter@cuny.edu 2 A 95
5 Hunter hunter@cuny.edu 2 B 80
6 Hunter hunter@cuny.edu 2 C 50
7 John Jay john.jay@cuny.edu 3 A 80
8 John Jay john.jay@cuny.edu 3 B 50
9 John Jay john.jay@cuny.edu 3 C 80
(Note here that we need the explicit join_by
since the column names don’t match between the two tables: id
in students gets joined to student_id
. This pattern of id
in table tbl
getting joined to tbl_id
elsewhere is quite common in database design.)
We get repeats of the student rows: for each valid student-grade pair, we have a row. From here, we can compute final grades:
inner_join(students,
grades,
join_by(id == student_id)) |>
group_by(name, email, id) |>
summarize(final_avg = mean(grade)) |>
mutate(final_grade =
case_when(final_avg > 90 ~ "A",
final_avg > 80 ~ "B",
final_avg > 70 ~ "C",
final_avg > 60 ~ "D",
TRUE ~ "F")) # In a case_when, TRUE == "else"
`summarise()` has grouped output by 'name', 'email'. You can override using the
`.groups` argument.
# A tibble: 3 × 5
# Groups: name, email [3]
name email id final_avg final_grade
<chr> <chr> <dbl> <dbl> <chr>
1 Bernard bernard@cuny.edu 1 96.7 A
2 Hunter hunter@cuny.edu 2 75 C
3 John Jay john.jay@cuny.edu 3 70 D
In this case, everything works well. But let’s try a slightly trickier case, with some students who never fail to submit certain assignments.
students <- tribble(
~name, ~email, ~id,
"Bernard", "bernard@cuny.edu", 1,
"Hunter", "hunter@cuny.edu", 2,
"John Jay", "john.jay@cuny.edu", 3
)
grades <- tribble(
~student_id, ~assignment_id, ~grade,
1, "A", 100,
2, "A", 95,
3, "A", 80,
1, "B", 95,
2, "B", 80,
1, "C", 95,
3, "C", 80
)
inner_join(students,
grades,
join_by(id == student_id)) |>
group_by(name, email, id) |>
summarize(final_avg = mean(grade)) |>
mutate(final_grade =
case_when(final_avg > 90 ~ "A",
final_avg > 80 ~ "B",
final_avg > 70 ~ "C",
final_avg > 60 ~ "D",
TRUE ~ "F"))
`summarise()` has grouped output by 'name', 'email'. You can override using the
`.groups` argument.
# A tibble: 3 × 5
# Groups: name, email [3]
name email id final_avg final_grade
<chr> <chr> <dbl> <dbl> <chr>
1 Bernard bernard@cuny.edu 1 96.7 A
2 Hunter hunter@cuny.edu 2 87.5 B
3 John Jay john.jay@cuny.edu 3 80 C
Why did the final grades go up after we deleted rows?
# A tibble: 7 × 5
name email id assignment_id grade
<chr> <chr> <dbl> <chr> <dbl>
1 Bernard bernard@cuny.edu 1 A 100
2 Bernard bernard@cuny.edu 1 B 95
3 Bernard bernard@cuny.edu 1 C 95
4 Hunter hunter@cuny.edu 2 A 95
5 Hunter hunter@cuny.edu 2 B 80
6 John Jay john.jay@cuny.edu 3 A 80
7 John Jay john.jay@cuny.edu 3 C 80
The “missing” assignments for Hunter and John Jay aren’t reported as zeros - they are just ignored! And hence R
takes an average over the two assignments where these students did well, not all three assignments. We’ll talk about one way to fix this below, but for now I’m just flagging it as a possible issue that can come up with missing data and joins. (Here the rows were missing, so it’s harder to catch than a plain NA
; better data management would have included a “0” row instead of deleting them, but we don’t always get to assume super well-organized data.)
So far, our join results have been relatively straightforward because we have had ‘good’ unique identifiers. If we find ourselves in a situation where we lack unique IDs, things can go wrong quickly:
students <- tribble(
~name, ~email, ~id,
"Bernard", "bernard@cuny.edu", 1,
"Hunter", "hunter@cuny.edu", 2,
"John Jay", "john.jay@cuny.edu", 2 # Accidentally repeat an ID
)
grades <- tribble( # Back to the complete data
~student_id, ~assignment_id, ~grade,
1, "A", 100,
2, "A", 95,
3, "A", 80,
1, "B", 95,
2, "B", 80,
3, "B", 50,
1, "C", 95,
2, "C", 50,
3, "C", 80
)
full_join(students,
grades,
join_by(id == student_id))
Warning in full_join(students, grades, join_by(id == student_id)): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 12 × 5
name email id assignment_id grade
<chr> <chr> <dbl> <chr> <dbl>
1 Bernard bernard@cuny.edu 1 A 100
2 Bernard bernard@cuny.edu 1 B 95
3 Bernard bernard@cuny.edu 1 C 95
4 Hunter hunter@cuny.edu 2 A 95
5 Hunter hunter@cuny.edu 2 B 80
6 Hunter hunter@cuny.edu 2 C 50
7 John Jay john.jay@cuny.edu 2 A 95
8 John Jay john.jay@cuny.edu 2 B 80
9 John Jay john.jay@cuny.edu 2 C 50
10 <NA> <NA> 3 A 80
11 <NA> <NA> 3 B 50
12 <NA> <NA> 3 C 80
In this case, R
is kind enough to warn us that a “many-to-many” join has happened (joining multiple students to one grade and multiple grades to one student). This is a very good warning and it highlights a true error here. If faced with data like this, you may not be able to address it with fixing the underlying data, but at least you know something has gone awry.
Often, data lack a unique identifier, but you can piece one together with several columns: that is, taken on its own, no column is unique, but the tuples formed by comining several columns are unique, e.g., data with year, month, and day columns.
revenues <- tribble(
~year, ~month, ~day, ~revenue,
2024, 09, 22, 100,
2024, 09, 23, 200,
2024, 10, 22, 200,
2024, 10, 22, 200,
2025, 09, 22, 500
)
expenses <- tribble(
~year, ~month, ~day, ~expenses,
2024, 09, 22, -200,
2024, 09, 23, -200,
2024, 10, 22, -200,
2024, 10, 23, -200,
2025, 09, 22, -300
)
In this case, a simple join on any one column goes astray:
Warning in inner_join(revenues, expenses, join_by(day == day)): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 14 × 7
year.x month.x day revenue year.y month.y expenses
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2024 9 22 100 2024 9 -200
2 2024 9 22 100 2024 10 -200
3 2024 9 22 100 2025 9 -300
4 2024 9 23 200 2024 9 -200
5 2024 9 23 200 2024 10 -200
6 2024 10 22 200 2024 9 -200
7 2024 10 22 200 2024 10 -200
8 2024 10 22 200 2025 9 -300
9 2024 10 22 200 2024 9 -200
10 2024 10 22 200 2024 10 -200
11 2024 10 22 200 2025 9 -300
12 2025 9 22 500 2024 9 -200
13 2025 9 22 500 2024 10 -200
14 2025 9 22 500 2025 9 -300
Note the warning!
In this scenario, we should really “tidy” up the data by combining the date information, which is spread across three columns, into a single column, but we have the alternative option of a compound join:
# A tibble: 5 × 5
year month day revenue expenses
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2024 9 22 100 -200
2 2024 9 23 200 -200
3 2024 10 22 200 -200
4 2024 10 22 200 -200
5 2025 9 22 500 -300
Here, as with filter
, the list of conditions looks for an intersection: we want all three parts of the date to match.
Finally, we may want to re-arrange the output of a join. Returning to our grades example from above:
students <- tribble(
~name, ~email, ~id,
"Bernard", "bernard@cuny.edu", 1,
"Hunter", "hunter@cuny.edu", 2,
"John Jay", "john.jay@cuny.edu", 3
)
grades <- tribble(
~student_id, ~assignment_id, ~grade,
1, "A", 100,
2, "A", 95,
3, "A", 80,
1, "B", 95,
2, "B", 80,
3, "B", 50,
1, "C", 95,
2, "C", 50,
3, "C", 80
)
grade_book <- inner_join(students,
grades,
join_by(id == student_id))
grade_book
# A tibble: 9 × 5
name email id assignment_id grade
<chr> <chr> <dbl> <chr> <dbl>
1 Bernard bernard@cuny.edu 1 A 100
2 Bernard bernard@cuny.edu 1 B 95
3 Bernard bernard@cuny.edu 1 C 95
4 Hunter hunter@cuny.edu 2 A 95
5 Hunter hunter@cuny.edu 2 B 80
6 Hunter hunter@cuny.edu 2 C 50
7 John Jay john.jay@cuny.edu 3 A 80
8 John Jay john.jay@cuny.edu 3 B 50
9 John Jay john.jay@cuny.edu 3 C 80
This isn’t really how we like to see gradebooks: a “wider” format, with a column for each assignment, may be more preferable. In this case, we want to use the pivot_wider
column from the tidyr
package.
pivot_wider
takes a few key arguments:
id_cols
which columns that (taken together) uniquely identify a row in the final tablenames_from
: where should we get the column names of the new tablevalues_from
: where should we get the values of the new tableThis is maybe easier by example:
grade_book |>
pivot_wider(id_cols = c(name, email, id),
names_from=assignment_id,
values_from=grade)
# A tibble: 3 × 6
name email id A B C
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Bernard bernard@cuny.edu 1 100 95 95
2 Hunter hunter@cuny.edu 2 95 80 50
3 John Jay john.jay@cuny.edu 3 80 50 80
This pivot
trick is particularly useful for finding missing rows, like those that tripped us up earlier:
grades <- tribble( # Implicit missing values
~student_id, ~assignment_id, ~grade,
1, "A", 100,
2, "A", 95,
3, "A", 80,
1, "B", 95,
2, "B", 80,
1, "C", 95,
3, "C", 80
)
inner_join(students,
grades,
join_by(id == student_id)) |>
pivot_wider(id_cols = c(name, email, id),
names_from = assignment_id,
values_from = grade)
# A tibble: 3 × 6
name email id A B C
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Bernard bernard@cuny.edu 1 100 95 95
2 Hunter hunter@cuny.edu 2 95 80 NA
3 John Jay john.jay@cuny.edu 3 80 NA 80
Here, our missing values are now explicit!
We can also explicitly fill the NA
with a value of our choice, here 0:
inner_join(students,
grades,
join_by(id == student_id)) |>
pivot_wider(id_cols = c(name, email, id),
names_from = assignment_id,
values_from = grade,
values_fill = 0)
# A tibble: 3 × 6
name email id A B C
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Bernard bernard@cuny.edu 1 100 95 95
2 Hunter hunter@cuny.edu 2 95 80 0
3 John Jay john.jay@cuny.edu 3 80 0 80
There is also an inverse operator pivot_longer
which takes a wide table (like this) and makes it longer.
To complete our grade book example, we might want to take the average across the three grade columns:
inner_join(students,
grades,
join_by(id == student_id)) |>
pivot_wider(id_cols = c(name, email, id),
names_from = assignment_id,
values_from = grade,
values_fill = 0) |>
group_by(name) |>
mutate(final_avg = mean(c_across(A:C)))
# A tibble: 3 × 7
# Groups: name [3]
name email id A B C final_avg
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bernard bernard@cuny.edu 1 100 95 95 96.7
2 Hunter hunter@cuny.edu 2 95 80 0 58.3
3 John Jay john.jay@cuny.edu 3 80 0 80 53.3
Note here that we need to use a mutate since our final grade book has the same number of rows before and after we add the final average column. The c_across
column here is a variant of the standard c
function used to combine scalars: here we’re creating a new length-3 vector of the student’s three grades and passing it to the mean
function.
What is group_by(name)
doing here? See what happens without it:
inner_join(students,
grades,
join_by(id == student_id)) |>
pivot_wider(id_cols = c(name, email, id),
names_from = assignment_id,
values_from = grade,
values_fill = 0) |>
mutate(final_avg = mean(c_across(A:C)))
# A tibble: 3 × 7
name email id A B C final_avg
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bernard bernard@cuny.edu 1 100 95 95 69.4
2 Hunter hunter@cuny.edu 2 95 80 0 69.4
3 John Jay john.jay@cuny.edu 3 80 0 80 69.4
Recall that mean
is a summarization function - it will combine data from across rows if no grouping structure is present. Since we want seperate averages for each student, we need a group_by
. In this case, name
is a unique identifier for each student, so we can group on it. We also have the rowwise()
helper, which automatically creates group structure with each group a separate row. If you don’t have a clean unique identifier, or just can’t think of one easily, this is sometimes a useful helper.
inner_join(students,
grades,
join_by(id == student_id)) |>
pivot_wider(id_cols = c(name, email, id),
names_from = assignment_id,
values_from = grade,
values_fill = 0) |>
rowwise() |>
mutate(final_avg = mean(c_across(A:C)))
# A tibble: 3 × 7
# Rowwise:
name email id A B C final_avg
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bernard bernard@cuny.edu 1 100 95 95 96.7
2 Hunter hunter@cuny.edu 2 95 80 0 58.3
3 John Jay john.jay@cuny.edu 3 80 0 80 53.3
In class, we will explore joins in more detail by combining the flights
data with plane, airport, and weather factors.
Please now go fill out the weekly quiz on Brightspace.
Adapted from Data Carpentry↩︎