STA/OPR 9750 Week 5 Pre Assignment: Multi-Table dplyr Verbs

Due Date: 2024-09-25 (Wednesday) at 11:45pm

Submission: CUNY Brightspace

Last week, we considered single-table verbs: these are appropriate for asking complex questions of a nicely formatted data frame. Sadly, we are rarely provided data frames suitable for every question we might seek to answer. Instead, we typically need to combine information from multiple sources. For instance, if we want to examine the relationship between demographics and electoral results, we will need to combine information from the US Census Bureau and local elections administrators. Or, if we want to investigate the relationship between a company’s financial status and its stock performance, we might need to combine information from multiple databases.

Basic Joins

The basic operator of combining different tables is the join, patterned after SQL. Each join operates using some notion of “match” between tables. In the best case, this is done using a unique identifier - one universal and consistent name for each entity. Sadly, such perfect identifiers rarely exist. For instance, companies change their names and their ticker symbols somewhat regularly (e.g., Facebook becoming Meta)

The simplest join is the inner_join, which returns rows which match between the two tables:

library(dplyr)
band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

In this case, the name column forms a unique ID, so we can use it for our join.

inner_join(band_members, band_instruments)
Joining with `by = join_by(name)`
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

We see here that R automatically performed the join using the common column (name): if we want to be clearer, let’s specify the join element ourselves:

inner_join(band_members, band_instruments, join_by(name))
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Here join_by is a helper function that can be used to specify the form of the join used. In some contexts, the “common” column has different names in the two tables, so we can use a more explicit call to join_by:

band_instruments2
# A tibble: 3 × 2
  artist plays 
  <chr>  <chr> 
1 John   guitar
2 Paul   bass  
3 Keith  guitar

Note that this is the same as band_instruments, but with the name column changed to artist.

inner_join(band_members, band_instruments2, join_by(name == artist))
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

I like to always use this final - most explicit - form, even when the column names are the same between the two tables (join_by(name == name)).

Let’s look more closely at the result here: we return a table with 3 columns and two rows:

inner_join(band_members, band_instruments, join_by(name == name))
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

The three columns are pretty easy to understand:

  • name is the (shared) name column from each table
  • band comes from the band_members table
  • plays comes from the band_instruments table.

The two rows are a bit trickier: each of our input tables had three rows, but there were only two “overlaps” so that’s what we get back from an inner_join. Specifically, we drop Mick [Jagger] from band_members because he doesn’t appear in band_instrumentsand we drop Keith [Richards] from band_instruments because he doesn’t appear in band_members.

In brief, an inner join is an intersection join. We only get rows back which have a match in both tables.

Other join operators have complimentary behaviors: the full join (also sometimes called an outer join) is basically a union join. We get back all rows from both tables, regardless of whether a match has been found. But what happens with those unmatched rows?

full_join(band_members, band_instruments, join_by(name == name))
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

R fills in the “blanks” with NA values. Here, we can assume Mick [Jagger] plays an instrument, but it is unknown to R here.

Finally, we have the intermediate left join, which keeps all rows from one table whether or not they have a match:

left_join(band_members, band_instruments, join_by(name == name))
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

Here we keep Mick because he is in band_members, even though he is missing from band_instruments. Conversely, we drop Keith because he isn’t in band_members (even though he is in band_instruments).

R also provides a right_join, but it’s not really different: it’s just a “flipped” left_join: left_join(x, y) == right_join(y, x).

The following image1 summarizes the different types of joins:

The anti_join returns elements that appear in one data set, but not the other. It’s rarer, but occasionally useful.

Joins with Repeats

In the previous examples, we have seen joins that have a “one-to-one” (inner) or possibly “one-to-none” (full, left) structure. In many circumstances, we find ourselves with a “one-to-many” type structure, even when both data sets are “tidy”. This typically occurs because different data sets have different models of what a “unit” is. For example, consider a hypothetical instructor who has i) a table with student names and contact information; and ii) a table with grades on different assignments.

students <- tribble(
    ~name, ~email, ~id,
    "Bernard",  "bernard@cuny.edu",  1,
    "Hunter",   "hunter@cuny.edu",   2,
    "John Jay", "john.jay@cuny.edu", 3
)

grades <- tribble(
    ~student_id, ~assignment_id, ~grade,
    1,           "A",            100,
    2,           "A",            95,
    3,           "A",            80,
    1,           "B",            95,
    2,           "B",            80,
    3,           "B",            50,
    1,           "C",            95,
    2,           "C",            50,
    3,           "C",            80
)

What happens if we join these?

inner_join(students, grades, join_by(id == student_id))
# A tibble: 9 × 5
  name     email                id assignment_id grade
  <chr>    <chr>             <dbl> <chr>         <dbl>
1 Bernard  bernard@cuny.edu      1 A               100
2 Bernard  bernard@cuny.edu      1 B                95
3 Bernard  bernard@cuny.edu      1 C                95
4 Hunter   hunter@cuny.edu       2 A                95
5 Hunter   hunter@cuny.edu       2 B                80
6 Hunter   hunter@cuny.edu       2 C                50
7 John Jay john.jay@cuny.edu     3 A                80
8 John Jay john.jay@cuny.edu     3 B                50
9 John Jay john.jay@cuny.edu     3 C                80

(Note here that we need the explicit join_by since the column names don’t match between the two tables: id in students gets joined to student_id. This pattern of id in table tbl getting joined to tbl_id elsewhere is quite common in database design.)

We get repeats of the student rows: for each valid student-grade pair, we have a row. From here, we can compute final grades:

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    group_by(name, email, id) |>
    summarize(final_avg = mean(grade)) |>
    mutate(final_grade = 
               case_when(final_avg > 90 ~ "A", 
                         final_avg > 80 ~ "B", 
                         final_avg > 70 ~ "C", 
                         final_avg > 60 ~ "D", 
                         TRUE ~ "F")) # In a case_when, TRUE == "else"
`summarise()` has grouped output by 'name', 'email'. You can override using the
`.groups` argument.
# A tibble: 3 × 5
# Groups:   name, email [3]
  name     email                id final_avg final_grade
  <chr>    <chr>             <dbl>     <dbl> <chr>      
1 Bernard  bernard@cuny.edu      1      96.7 A          
2 Hunter   hunter@cuny.edu       2      75   C          
3 John Jay john.jay@cuny.edu     3      70   D          

In this case, everything works well. But let’s try a slightly trickier case, with some students who never fail to submit certain assignments.

students <- tribble(
    ~name, ~email, ~id,
    "Bernard",  "bernard@cuny.edu",  1,
    "Hunter",   "hunter@cuny.edu",   2,
    "John Jay", "john.jay@cuny.edu", 3
)

grades <- tribble(
    ~student_id, ~assignment_id, ~grade,
    1,           "A",            100,
    2,           "A",            95,
    3,           "A",            80,
    1,           "B",            95,
    2,           "B",            80,
    1,           "C",            95,
    3,           "C",            80
)

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    group_by(name, email, id) |>
    summarize(final_avg = mean(grade)) |>
    mutate(final_grade = 
               case_when(final_avg > 90 ~ "A", 
                         final_avg > 80 ~ "B", 
                         final_avg > 70 ~ "C", 
                         final_avg > 60 ~ "D", 
                         TRUE ~ "F"))
`summarise()` has grouped output by 'name', 'email'. You can override using the
`.groups` argument.
# A tibble: 3 × 5
# Groups:   name, email [3]
  name     email                id final_avg final_grade
  <chr>    <chr>             <dbl>     <dbl> <chr>      
1 Bernard  bernard@cuny.edu      1      96.7 A          
2 Hunter   hunter@cuny.edu       2      87.5 B          
3 John Jay john.jay@cuny.edu     3      80   C          

Why did the final grades go up after we deleted rows?

inner_join(students, 
           grades, 
           join_by(id == student_id))
# A tibble: 7 × 5
  name     email                id assignment_id grade
  <chr>    <chr>             <dbl> <chr>         <dbl>
1 Bernard  bernard@cuny.edu      1 A               100
2 Bernard  bernard@cuny.edu      1 B                95
3 Bernard  bernard@cuny.edu      1 C                95
4 Hunter   hunter@cuny.edu       2 A                95
5 Hunter   hunter@cuny.edu       2 B                80
6 John Jay john.jay@cuny.edu     3 A                80
7 John Jay john.jay@cuny.edu     3 C                80

The “missing” assignments for Hunter and John Jay aren’t reported as zeros - they are just ignored! And hence R takes an average over the two assignments where these students did well, not all three assignments. We’ll talk about one way to fix this below, but for now I’m just flagging it as a possible issue that can come up with missing data and joins. (Here the rows were missing, so it’s harder to catch than a plain NA; better data management would have included a “0” row instead of deleting them, but we don’t always get to assume super well-organized data.)

So far, our join results have been relatively straightforward because we have had ‘good’ unique identifiers. If we find ourselves in a situation where we lack unique IDs, things can go wrong quickly:

students <- tribble(
    ~name, ~email, ~id,
    "Bernard",  "bernard@cuny.edu",  1,
    "Hunter",   "hunter@cuny.edu",   2,
    "John Jay", "john.jay@cuny.edu", 2  # Accidentally repeat an ID
)
grades <- tribble( # Back to the complete data
    ~student_id, ~assignment_id, ~grade,
    1,           "A",            100,
    2,           "A",            95,
    3,           "A",            80,
    1,           "B",            95,
    2,           "B",            80,
    3,           "B",            50,
    1,           "C",            95,
    2,           "C",            50,
    3,           "C",            80
)

full_join(students, 
          grades, 
          join_by(id == student_id))
Warning in full_join(students, grades, join_by(id == student_id)): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 12 × 5
   name     email                id assignment_id grade
   <chr>    <chr>             <dbl> <chr>         <dbl>
 1 Bernard  bernard@cuny.edu      1 A               100
 2 Bernard  bernard@cuny.edu      1 B                95
 3 Bernard  bernard@cuny.edu      1 C                95
 4 Hunter   hunter@cuny.edu       2 A                95
 5 Hunter   hunter@cuny.edu       2 B                80
 6 Hunter   hunter@cuny.edu       2 C                50
 7 John Jay john.jay@cuny.edu     2 A                95
 8 John Jay john.jay@cuny.edu     2 B                80
 9 John Jay john.jay@cuny.edu     2 C                50
10 <NA>     <NA>                  3 A                80
11 <NA>     <NA>                  3 B                50
12 <NA>     <NA>                  3 C                80

In this case, R is kind enough to warn us that a “many-to-many” join has happened (joining multiple students to one grade and multiple grades to one student). This is a very good warning and it highlights a true error here. If faced with data like this, you may not be able to address it with fixing the underlying data, but at least you know something has gone awry.

Compound Joins

Often, data lack a unique identifier, but you can piece one together with several columns: that is, taken on its own, no column is unique, but the tuples formed by comining several columns are unique, e.g., data with year, month, and day columns.

revenues <- tribble(
    ~year, ~month, ~day, ~revenue,
    2024,  09,     22,   100,
    2024,  09,     23,   200,
    2024,  10,     22,   200,
    2024,  10,     22,   200,
    2025,  09,     22,   500
    )

expenses <- tribble(
    ~year, ~month, ~day, ~expenses,
    2024,  09,     22,   -200,
    2024,  09,     23,   -200,
    2024,  10,     22,   -200,
    2024,  10,     23,   -200,
    2025,  09,     22,   -300
    )

In this case, a simple join on any one column goes astray:

inner_join(revenues, expenses, join_by(day == day))
Warning in inner_join(revenues, expenses, join_by(day == day)): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 14 × 7
   year.x month.x   day revenue year.y month.y expenses
    <dbl>   <dbl> <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
 1   2024       9    22     100   2024       9     -200
 2   2024       9    22     100   2024      10     -200
 3   2024       9    22     100   2025       9     -300
 4   2024       9    23     200   2024       9     -200
 5   2024       9    23     200   2024      10     -200
 6   2024      10    22     200   2024       9     -200
 7   2024      10    22     200   2024      10     -200
 8   2024      10    22     200   2025       9     -300
 9   2024      10    22     200   2024       9     -200
10   2024      10    22     200   2024      10     -200
11   2024      10    22     200   2025       9     -300
12   2025       9    22     500   2024       9     -200
13   2025       9    22     500   2024      10     -200
14   2025       9    22     500   2025       9     -300

Note the warning!

In this scenario, we should really “tidy” up the data by combining the date information, which is spread across three columns, into a single column, but we have the alternative option of a compound join:

inner_join(revenues, expenses, 
           join_by(day == day, 
                   month == month, 
                   year == year))
# A tibble: 5 × 5
   year month   day revenue expenses
  <dbl> <dbl> <dbl>   <dbl>    <dbl>
1  2024     9    22     100     -200
2  2024     9    23     200     -200
3  2024    10    22     200     -200
4  2024    10    22     200     -200
5  2025     9    22     500     -300

Here, as with filter, the list of conditions looks for an intersection: we want all three parts of the date to match.

Pivots

Finally, we may want to re-arrange the output of a join. Returning to our grades example from above:

students <- tribble(
    ~name, ~email, ~id,
    "Bernard",  "bernard@cuny.edu",  1,
    "Hunter",   "hunter@cuny.edu",   2,
    "John Jay", "john.jay@cuny.edu", 3
)

grades <- tribble(
    ~student_id, ~assignment_id, ~grade,
    1,           "A",            100,
    2,           "A",            95,
    3,           "A",            80,
    1,           "B",            95,
    2,           "B",            80,
    3,           "B",            50,
    1,           "C",            95,
    2,           "C",            50,
    3,           "C",            80
)

grade_book <- inner_join(students, 
                         grades, 
                         join_by(id == student_id))

grade_book
# A tibble: 9 × 5
  name     email                id assignment_id grade
  <chr>    <chr>             <dbl> <chr>         <dbl>
1 Bernard  bernard@cuny.edu      1 A               100
2 Bernard  bernard@cuny.edu      1 B                95
3 Bernard  bernard@cuny.edu      1 C                95
4 Hunter   hunter@cuny.edu       2 A                95
5 Hunter   hunter@cuny.edu       2 B                80
6 Hunter   hunter@cuny.edu       2 C                50
7 John Jay john.jay@cuny.edu     3 A                80
8 John Jay john.jay@cuny.edu     3 B                50
9 John Jay john.jay@cuny.edu     3 C                80

This isn’t really how we like to see gradebooks: a “wider” format, with a column for each assignment, may be more preferable. In this case, we want to use the pivot_wider column from the tidyr package.

pivot_wider takes a few key arguments:

  • id_cols which columns that (taken together) uniquely identify a row in the final table
  • names_from: where should we get the column names of the new table
  • values_from: where should we get the values of the new table

This is maybe easier by example:

grade_book |>
    pivot_wider(id_cols = c(name, email, id), 
                names_from=assignment_id,
                values_from=grade)
# A tibble: 3 × 6
  name     email                id     A     B     C
  <chr>    <chr>             <dbl> <dbl> <dbl> <dbl>
1 Bernard  bernard@cuny.edu      1   100    95    95
2 Hunter   hunter@cuny.edu       2    95    80    50
3 John Jay john.jay@cuny.edu     3    80    50    80

This pivot trick is particularly useful for finding missing rows, like those that tripped us up earlier:

grades <- tribble( # Implicit missing values
    ~student_id, ~assignment_id, ~grade,
    1,           "A",            100,
    2,           "A",            95,
    3,           "A",            80,
    1,           "B",            95,
    2,           "B",            80,
    1,           "C",            95,
    3,           "C",            80
)

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    pivot_wider(id_cols = c(name, email, id), 
                names_from = assignment_id,
                values_from = grade)
# A tibble: 3 × 6
  name     email                id     A     B     C
  <chr>    <chr>             <dbl> <dbl> <dbl> <dbl>
1 Bernard  bernard@cuny.edu      1   100    95    95
2 Hunter   hunter@cuny.edu       2    95    80    NA
3 John Jay john.jay@cuny.edu     3    80    NA    80

Here, our missing values are now explicit!

We can also explicitly fill the NA with a value of our choice, here 0:

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    pivot_wider(id_cols = c(name, email, id), 
                names_from = assignment_id,
                values_from = grade, 
                values_fill = 0)
# A tibble: 3 × 6
  name     email                id     A     B     C
  <chr>    <chr>             <dbl> <dbl> <dbl> <dbl>
1 Bernard  bernard@cuny.edu      1   100    95    95
2 Hunter   hunter@cuny.edu       2    95    80     0
3 John Jay john.jay@cuny.edu     3    80     0    80

There is also an inverse operator pivot_longer which takes a wide table (like this) and makes it longer.

To complete our grade book example, we might want to take the average across the three grade columns:

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    pivot_wider(id_cols = c(name, email, id), 
                names_from = assignment_id,
                values_from = grade, 
                values_fill = 0) |>
    group_by(name) |>
    mutate(final_avg = mean(c_across(A:C)))
# A tibble: 3 × 7
# Groups:   name [3]
  name     email                id     A     B     C final_avg
  <chr>    <chr>             <dbl> <dbl> <dbl> <dbl>     <dbl>
1 Bernard  bernard@cuny.edu      1   100    95    95      96.7
2 Hunter   hunter@cuny.edu       2    95    80     0      58.3
3 John Jay john.jay@cuny.edu     3    80     0    80      53.3

Note here that we need to use a mutate since our final grade book has the same number of rows before and after we add the final average column. The c_across column here is a variant of the standard c function used to combine scalars: here we’re creating a new length-3 vector of the student’s three grades and passing it to the mean function.

What is group_by(name) doing here? See what happens without it:

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    pivot_wider(id_cols = c(name, email, id), 
                names_from = assignment_id,
                values_from = grade, 
                values_fill = 0) |>
    mutate(final_avg = mean(c_across(A:C)))
# A tibble: 3 × 7
  name     email                id     A     B     C final_avg
  <chr>    <chr>             <dbl> <dbl> <dbl> <dbl>     <dbl>
1 Bernard  bernard@cuny.edu      1   100    95    95      69.4
2 Hunter   hunter@cuny.edu       2    95    80     0      69.4
3 John Jay john.jay@cuny.edu     3    80     0    80      69.4

Recall that mean is a summarization function - it will combine data from across rows if no grouping structure is present. Since we want seperate averages for each student, we need a group_by. In this case, name is a unique identifier for each student, so we can group on it. We also have the rowwise() helper, which automatically creates group structure with each group a separate row. If you don’t have a clean unique identifier, or just can’t think of one easily, this is sometimes a useful helper.

inner_join(students, 
           grades, 
           join_by(id == student_id)) |>
    pivot_wider(id_cols = c(name, email, id), 
                names_from = assignment_id,
                values_from = grade, 
                values_fill = 0) |>
    rowwise() |>
    mutate(final_avg = mean(c_across(A:C)))
# A tibble: 3 × 7
# Rowwise: 
  name     email                id     A     B     C final_avg
  <chr>    <chr>             <dbl> <dbl> <dbl> <dbl>     <dbl>
1 Bernard  bernard@cuny.edu      1   100    95    95      96.7
2 Hunter   hunter@cuny.edu       2    95    80     0      58.3
3 John Jay john.jay@cuny.edu     3    80     0    80      53.3

In class, we will explore joins in more detail by combining the flights data with plane, airport, and weather factors.

Please now go fill out the weekly quiz on Brightspace.

Footnotes

  1. Adapted from Data Carpentry↩︎