Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 5

STA 9750 Week 5

Today:

  • Tuesday Section: 2025-09-30
  • Thursday Section: 2025-09-25

Lecture #05: Multi-Table dplyr Verbs

Today

  • Special Presentation
  • Course Administration
  • Review
  • Diving Deeper into Multi-Table Verbs
  • PA#05 FAQs
  • Wrap-Up
    • Life Tip of the Day

Special Presentation

Baruch Data Resources

Jason Amey

Newman Library, Baruch College CUNY

Course Administration

MP#00 Peer Feedback

Mini-Project #00 Peer Feedback deadline:

  • Hopefully you benefitted from interaction with peers (giving and receiving comments)
  • As of 10:30am Wednesday:
    • 37 completed properly
    • 46 completed with formatting issues
    • 13 not yet started

Remember you can use mp_feedback_verify to confirm proper formatting:

  • You need the required header structure (##), not just a similar visual arrangement
  • Don’t mix scores and free-text comments

STA 9750 MP #01

Due on 2025-10-03 at 11:59pm ET (+ grace period)

  • Submit early and submit often
    • Less “last minute” tech support going forward
  • Use Piazza and use your peers

You don’t need fancy graphics yet, but I of course love to see work that goes above and beyond

STA 9750 MP #01

Make sure your code is included in your submission

Follow submission instructions

  • You need to have mp01.qmd and docs/mp01.html in your STA9750-2025-FALL repository

  • Helper function can verify submission is properly formatted

Pre-Assignments

Due on Brightspace the day before class at 11:45

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading
  • Two “drops” + otherwise free points

No pre-assignment next week!

Course Support

  • Synchronous
    • Office Hours 2x / week
  • Asynchronous
    • Piazza (≈ 20 minute average response time)

Ask questions! This course is demanding but rewarding.

Social contract: I push you hard, but I also provide lots of support.

Course Project

Roster formation deadline: 2025-09-30

  • 7 teams already formed (38 students)
  • 59 students still need a team

I’ve started setting teams up. If you don’t have a team and want to be randomly assigned, let me know.

Proposal Presentations

Next Week - Project Proposal Presentations (Tuesday Oct 07 and Thursday Oct 09)

Official Description

  • 6 minute presentation
  • Key topics:
    • Animating Question
    • Team Roster
  • Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges

Aims: make sure you’ve started thinking seriously, not to lock you into answers

Review from Last Week

Single-Table Verbs

dplyr single-table verbs

  • select, filter: Selecting columns and rows (respectively)
  • rename, mutate: Renaming, changing, and creating columns
  • summarize, group_by: Combining multiple rows
  • arrange, slice_min/max: Re-ordering
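These verbs chain naturally with the pipe. A minimal sketch on invented course-enrollment data (all names hypothetical):

```r
library(dplyr)

# Hypothetical enrollment data for illustration
courses <- tribble(
  ~course,      ~semester, ~enrollment,
  "Accounting", "Fall",    200,
  "Law",        "Fall",    100,
  "Statistics", "Fall",    200,
  "Statistics", "Spring",  400
)

result <- courses |>
  filter(enrollment >= 100) |>                  # keep qualifying rows
  mutate(large = enrollment > 150) |>           # add a derived column
  group_by(course) |>                           # combine rows per course
  summarize(total = sum(enrollment), .groups = "drop") |>
  arrange(desc(total))                          # re-order by size

result   # Statistics first, with total 600
```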

Review of PA#05

Multi-Table Analysis

Multiple Tables:

  • More insights than from a single table
  • Maintain ‘tidy’ structure throughout

Will create new (compound) rows:

  • Dangers: drops and (over) duplication

Primary Keys

Keys are unique identifiers for individual records

  • Simple (one column) or compound (multiple columns together)

The history of corporate IT is largely one of (failed) primary keys

  • Finance: Tickers, Tickers + Exchange, Tickers + Share Class, CUSIP, ISIN, SEDOL, …

Meaningful true keys are vanishingly rare - cherish them when you find them

Often ‘unique enough’ for an analysis

dplyr::group_by() + dplyr::count() is helpful here
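A sketch of that uniqueness check on a small CUNY-style table (as later in the deck): count() rows per candidate key and look for any count above 1.

```r
library(dplyr)

cunys <- tribble(
  ~college, ~campus_borough,
  "CCNY",   "Manhattan",
  "Baruch", "Manhattan",
  "CSI",    "Staten Island"
)

# college is a true key: no value appears more than once
bad_college <- cunys |> count(college) |> filter(n > 1)
# campus_borough is NOT a key: "Manhattan" repeats
bad_borough <- cunys |> count(campus_borough) |> filter(n > 1)

nrow(bad_college)   # 0 => unique
nrow(bad_borough)   # 1 => duplicated
```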

Joins

Joins combine tables by identity - not simple ‘stacking’

Specify a join key - ideally this is an actual key, but doesn’t have to be

In dplyr, we use the join_by function:

dplyr::join_by(table_1_name == table_2_name)

Here table_1_name and table_2_name are column names from two tables

Join rows where these values are equal (advanced joins possible)

Inner and Outer Joins

When tables are perfectly matched, not an issue:

cunys
# A tibble: 4 × 2
  college campus_borough
  <chr>   <chr>         
1 CCNY    Manhattan     
2 Baruch  Manhattan     
3 CSI     Staten Island 
4 York    Queens        
routes
# A tibble: 3 × 2
  borough_name  bus_code
  <chr>         <chr>   
1 Manhattan     M       
2 Staten Island S       
3 Queens        Q       

Inner and Outer Joins

When tables are perfectly matched, not an issue:

inner_join(cunys, routes, join_by(campus_borough == borough_name))
# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q       

We default to inner_join, but with perfectly matched tables the choice of join type is irrelevant

Note automatic repetition of "M" row

Inner and Outer Joins

How to handle ‘unaligned’ values?

cunys <- tribble(~college, ~campus_borough, 
                 "CCNY", "Manhattan",
                 "Baruch", "Manhattan", 
                 "CSI", "Staten Island",
                 "York", "Queens", 
                 "Medgar Evers", "Brooklyn")

inner_join(cunys, routes, join_by(campus_borough == borough_name))
# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q       

MEC vanished!

Inner and Outer Joins

left_join(cunys, routes, join_by(campus_borough == borough_name))
# A tibble: 5 × 3
  college      campus_borough bus_code
  <chr>        <chr>          <chr>   
1 CCNY         Manhattan      M       
2 Baruch       Manhattan      M       
3 CSI          Staten Island  S       
4 York         Queens         Q       
5 Medgar Evers Brooklyn       <NA>    

MEC stays, but no bus code - NA value

  • inner_join - Keep only matches
  • left_join - Keep all rows in left (first) table even w/o matches
  • right_join - Keep all rows in right (second) table even w/o matches
  • full_join - Keep all rows from both tables, even w/o matches

left_ and right_ are types of ‘outer’ joins
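The four types can be contrasted on the slide’s tables; here routes gains a Bronx row (an invented addition) with no matching campus, so each join keeps a different set of rows:

```r
library(dplyr)

cunys <- tribble(~college, ~campus_borough,
                 "CCNY",         "Manhattan",
                 "Baruch",       "Manhattan",
                 "CSI",          "Staten Island",
                 "York",         "Queens",
                 "Medgar Evers", "Brooklyn")
routes <- tribble(~borough_name,  ~bus_code,
                  "Manhattan",     "M",
                  "Staten Island", "S",
                  "Queens",        "Q",
                  "Bronx",         "Bx")  # invented row with no campus match

nrow(inner_join(cunys, routes, join_by(campus_borough == borough_name)))  # 4: matches only
nrow(left_join(cunys,  routes, join_by(campus_borough == borough_name)))  # 5: + Medgar Evers
nrow(right_join(cunys, routes, join_by(campus_borough == borough_name)))  # 5: + Bronx route
nrow(full_join(cunys,  routes, join_by(campus_borough == borough_name)))  # 6: + both
```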

Pivoting

The pivot_* functions change the shape of data

  • Values are not created or destroyed, just moved around
  • wider data sets combine multiple rows into new columns
  • longer data sets split columns from the same row into new rows

These functions come from the tidyr package - not dplyr

library(tidyr) # included in library(tidyverse)

Pivoting

Untidy example from last week:

# A tibble: 12 × 4
   Semester Course     Number Type      
   <chr>    <chr>       <dbl> <chr>     
 1 Fall     Accounting    200 Enrollment
 2 Fall     Accounting    250 Cap       
 3 Fall     Law           100 Enrollment
 4 Fall     Law           125 Cap       
 5 Fall     Statistics    200 Enrollment
 6 Fall     Statistics    200 Cap       
 7 Spring   Accounting    300 Enrollment
 8 Spring   Accounting    350 Cap       
 9 Spring   Law            50 Enrollment
10 Spring   Law           100 Cap       
11 Spring   Statistics    400 Enrollment
12 Spring   Statistics    400 Cap       

Pivoting

This data was untidy because it split a single unit (course) across multiple rows

pivot_wider to get to the right format

pivot_wider(BARUCH_UNTIDY, names_from=Type, values_from=Number)
# A tibble: 6 × 4
  Semester Course     Enrollment   Cap
  <chr>    <chr>           <dbl> <dbl>
1 Fall     Accounting        200   250
2 Fall     Law               100   125
3 Fall     Statistics        200   200
4 Spring   Accounting        300   350
5 Spring   Law                50   100
6 Spring   Statistics        400   400

Pivots

The pivot_* functions change the shape of a data set. Purposes:

  • Get ready for presentation
  • Prep for a join
  • Combine rows before looking at ‘cross-row’ structure

Pivots

Which penguin species has the largest between-sex mass difference?

library(tidyr)
avg_mass_tbl <- penguins |> drop_na() |> 
    group_by(sex, species) |> 
    summarize(avg_mass = mean(body_mass), .groups="drop")
    # .groups="drop" is equivalent to |> ungroup()
avg_mass_tbl
# A tibble: 6 × 3
  sex    species   avg_mass
  <fct>  <fct>        <dbl>
1 female Adelie       3369.
2 female Chinstrap    3527.
3 female Gentoo       4680.
4 male   Adelie       4043.
5 male   Chinstrap    3939.
6 male   Gentoo       5485.

Pivots

We want data that is wider than our current data:

  species    male_avg  female_avg
  Adelie
  Chinstrap
  Gentoo

Pivots

pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass)
# A tibble: 3 × 3
  species   female  male
  <fct>      <dbl> <dbl>
1 Adelie     3369. 4043.
2 Chinstrap  3527. 3939.
3 Gentoo     4680. 5485.
pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass) |>
    mutate(sex_diff = male - female) |>
    slice_max(sex_diff)
# A tibble: 1 × 4
  species female  male sex_diff
  <fct>    <dbl> <dbl>    <dbl>
1 Gentoo   4680. 5485.     805.

Pivots

pivot_wider Arguments:

  • id_cols: kept as ‘keys’ for new table
  • names_from: existing column ‘spread’ to create new columns names
  • values_from: values in new table

pivot_longer:

  • ‘Inverse’ operation
  • Spread one row + multiple columns => one col + multiple rows

pivot_wider and pivot_longer have many additional arguments for dealing with repeats / missing values. The help page (+ experimenting) is your friend
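A round trip on a slice of the earlier enrollment data shows pivot_longer undoing pivot_wider (a minimal sketch):

```r
library(dplyr)
library(tidyr)

wide <- tribble(
  ~Semester, ~Course,      ~Enrollment, ~Cap,
  "Fall",    "Accounting", 200,         250,
  "Fall",    "Law",        100,         125
)

# One row + two columns become two rows + one column (per course)
long <- wide |>
  pivot_longer(cols = c(Enrollment, Cap),
               names_to = "Type", values_to = "Number")

# pivot_wider recovers the original shape
back <- long |>
  pivot_wider(names_from = Type, values_from = Number)

nrow(long)                    # 4
all(dim(back) == dim(wide))   # TRUE
```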

Legos of Data Analysis

These functions are like Legos:

  • Simple individually
  • Combine for complex structures

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

library(dplyr)
library(nycflights13)
flights |> 
    n_distinct()
[1] 336776

Not quite what we wanted…

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

How many unique combinations of carrier + flight (e.g., United 101)?

flights |>
    select(carrier, flight) |>
    n_distinct()
[1] 5725

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

💡 Did airlines re-use flight numbers for different destinations?

flights |>
    distinct(carrier, flight, dest) |>
    # Find reuse of number across different destinations
    # Shorthand for group_by + summarize(n = n())
    count(carrier, flight) 
# A tibble: 5,725 × 3
   carrier flight     n
   <chr>    <int> <int>
 1 9E        2900     1
 2 9E        2901     1
 3 9E        2902     1
 4 9E        2903     2
 5 9E        2904     2
 6 9E        2905     1
 7 9E        2906     1
 8 9E        2907     1
 9 9E        2908     1
10 9E        2909     2
# ℹ 5,715 more rows

Seems so!

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Find examples of re-use:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n)
# A tibble: 1 × 3
  carrier flight     n
  <chr>    <int> <int>
1 UA        1162    16

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Finding most re-used flight number:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |> 
    left_join(flights, join_by(carrier == carrier, flight == flight)) |>
    pull(dest) |> # pull out column as vector
    table() # frequency table

BOS CLE DEN DFW IAH JAC LAS MIA MSY ORD SAN SAT SEA SFO SNA TPA 
 13   1  19   5   8   2   1   2   8  55   4   1  18   2  27   4 

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Finding most re-used flight number:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n)
# A tibble: 1 × 3
  carrier flight     n
  <chr>    <int> <int>
1 UA        1162    16

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Seeing where our most reused number went:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |>
    inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
    count(flight, carrier, dest)
# A tibble: 16 × 4
   flight carrier dest      n
    <int> <chr>   <chr> <int>
 1   1162 UA      BOS      13
 2   1162 UA      CLE       1
 3   1162 UA      DEN      19
 4   1162 UA      DFW       5
 5   1162 UA      IAH       8
 6   1162 UA      JAC       2
 7   1162 UA      LAS       1
 8   1162 UA      MIA       2
 9   1162 UA      MSY       8
10   1162 UA      ORD      55
11   1162 UA      SAN       4
12   1162 UA      SAT       1
13   1162 UA      SEA      18
14   1162 UA      SFO       2
15   1162 UA      SNA      27
16   1162 UA      TPA       4

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Additional join to get airport information + formatting:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |>
    inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
    count(flight, carrier, dest) |> 
    inner_join(airports, join_by(dest == faa)) |>
    select(name, n, carrier, flight) |>
    arrange(desc(n)) |>
    rename(`Destination Airport` = name, 
           `Number of Times Flown` = n, 
           `Carrier Code` = carrier, 
           `Flight Number`= flight)
# A tibble: 16 × 4
   `Destination Airport`   Number of Times Flow…¹ `Carrier Code` `Flight Number`
   <chr>                                    <int> <chr>                    <int>
 1 Chicago Ohare Intl                          55 UA                        1162
 2 John Wayne Arpt Orange…                     27 UA                        1162
 3 Denver Intl                                 19 UA                        1162
 4 Seattle Tacoma Intl                         18 UA                        1162
 5 General Edward Lawrenc…                     13 UA                        1162
 6 George Bush Interconti…                      8 UA                        1162
 7 Louis Armstrong New Or…                      8 UA                        1162
 8 Dallas Fort Worth Intl                       5 UA                        1162
 9 San Diego Intl                               4 UA                        1162
10 Tampa Intl                                   4 UA                        1162
11 Jackson Hole Airport                         2 UA                        1162
12 Miami Intl                                   2 UA                        1162
13 San Francisco Intl                           2 UA                        1162
14 Cleveland Hopkins Intl                       1 UA                        1162
15 Mc Carran Intl                               1 UA                        1162
16 San Antonio Intl                             1 UA                        1162
# ℹ abbreviated name: ¹​`Number of Times Flown`

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Extra join to match to airlines as well:

head(airlines, 3)
# A tibble: 3 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  

Also has a column named name - need to disambiguate!

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Additional join to get airline names + formatting:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |>
    inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
    count(flight, carrier, dest) |>  
    inner_join(airports, join_by(dest == faa)) |>
    select(name, n, carrier, flight) |>
    rename(dest_name = name) |> 
    inner_join(airlines, join_by(carrier == carrier)) |> 
    arrange(desc(n)) |>
    select(-carrier) |> 
    rename(`Destination Airport` = dest_name, 
           `Number of Times Flown` = n, 
           `Carrier` = name, 
           `Flight Number`= flight)
# A tibble: 16 × 4
   `Destination Airport`          Number of Times Flow…¹ `Flight Number` Carrier
   <chr>                                           <int>           <int> <chr>  
 1 Chicago Ohare Intl                                 55            1162 United…
 2 John Wayne Arpt Orange Co                          27            1162 United…
 3 Denver Intl                                        19            1162 United…
 4 Seattle Tacoma Intl                                18            1162 United…
 5 General Edward Lawrence Logan…                     13            1162 United…
 6 George Bush Intercontinental                        8            1162 United…
 7 Louis Armstrong New Orleans I…                      8            1162 United…
 8 Dallas Fort Worth Intl                              5            1162 United…
 9 San Diego Intl                                      4            1162 United…
10 Tampa Intl                                          4            1162 United…
11 Jackson Hole Airport                                2            1162 United…
12 Miami Intl                                          2            1162 United…
13 San Francisco Intl                                  2            1162 United…
14 Cleveland Hopkins Intl                              1            1162 United…
15 Mc Carran Intl                                      1            1162 United…
16 San Antonio Intl                                    1            1162 United…
# ℹ abbreviated name: ¹​`Number of Times Flown`

Legos of Data Analysis

Question: What does this do that I can’t do in Excel?

Technically, nothing. All programming languages of sufficient complexity are equally powerful (Turing equivalence).


In actuality, quite a lot:

  • filter allows more complex filtering than clicking on values

  • group_by + summarize extend array formulas

  • *_join provides more complex matching than VLOOKUP

  • pivot_* provide general formulation of pivot tables

  • plus everything else you can do in R

Ability to script minimizes “hard-coding” of names and values.

But truthfully

fortunes::fortune(59)

Let's not kid ourselves: the most widely used piece of software for statistics
is Excel.
   -- Brian D. Ripley ('Statistical Methods Need Software: A View of
      Statistical Computing')
      Opening lecture RSS 2002, Plymouth (September 2002)
fortunes::fortune(222)

Some people familiar with R describe it as a supercharged version of
Microsoft's Excel spreadsheet software.
   -- Ashlee Vance (in his article "Data Analysts Captivated by R's Power")
      The New York Times (January 2009)

Diving Deeper into Multi-Table Verbs

Diving Deeper into Joins

Data Set: nycflights13

Exercises: Lab #05

Additional dplyr Tricks

  • Ranking functions
    • row_number, min_rank, dense_rank: differ in ties
    • Use with desc() to flip ordering
    • cume_dist, percent_rank: compute quantiles
  • Cumulative Statistics
    • cummean, cummax, cummin, …
  • Conditional mutation
    • if_else, case_when (and the TRUE ~ default trick)
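A single mutate() can demonstrate all three families on a toy vector (invented data):

```r
library(dplyr)

ranked <- tibble(x = c(10, 20, 20, 30)) |>
  mutate(
    rn      = row_number(x),        # ties broken by position: 1 2 3 4
    minr    = min_rank(x),          # ties share the minimum:  1 2 2 4
    dense   = dense_rank(x),        # no gaps after ties:      1 2 2 3
    top     = min_rank(desc(x)),    # desc() flips ordering:   4 2 2 1
    run_max = cummax(x),            # cumulative maximum
    grade   = case_when(
      x >= 30 ~ "high",
      x >= 20 ~ "mid",
      TRUE    ~ "low"               # the 'default TRUE' fallback
    )
  )
ranked
```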

Pre-Assignment #05 FAQs

Subqueries

[W]ill we be learning how to perform joins within a subquery?

You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.

SELECT first_name, last_name
FROM collectors
WHERE id IN (
    SELECT collector_id
    FROM sales
);
collector_ids <- sales |> pull(collector_id)
collectors |> filter(id %in% collector_ids) |> select(first_name, last_name)

Data Integrity

[H]ow can we ensure that the information [resulting from a join] is accurate and not repeated?

  1. If you have a true unique ID, you’re usually safe
  2. Pay attention to all warnings
  3. Manually examine the result of any joins

Performance

Will joining large data sets […] affect performance?

Somewhat - larger data sets are always slower.

Bigger danger is “bad joins” creating huge data automatically.

Note that R is less “smart” than SQL, so won’t optimize execution order for you automatically.

dplyr joins vs SQL joins

What is the difference between dplyr and SQL joins?

Not too much - the biggest difference is that R has no INDEX or FOREIGN KEY, so fewer guarantees of data integrity.

When to use anti_join()?

Rare: looking for unmatched rows.

  • Useful to find data integrity issues or ‘implicit’ missingness.
  • I use an anti_join to find students who haven’t submitted an assignment.
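A sketch of that use case with invented roster/submission tables: anti_join() keeps the roster rows with no match in submissions.

```r
library(dplyr)

# Invented example tables
roster      <- tibble(student = c("Ana", "Bo", "Cy"))
submissions <- tibble(student = c("Ana", "Cy"), grade = c(95, 88))

# Rows of roster with NO match in submissions (no columns added)
missing <- anti_join(roster, submissions, join_by(student == student))
missing   # just Bo
```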

many-to-many Warning

Tricky to address, but fortunately pretty rare.

  • SQL explicitly forbids many-to-many
  • Usually a sign that a “key” isn’t really unique
    • Check for duplicates in x and y tables
    • Can occur with “fancy” joins (rolling, inequality)
  • Add additional join variables to break “duplication”
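The duplicate check can be sketched on invented tables; note that in dplyr >= 1.1 you must opt in with relationship = "many-to-many" to silence the warning:

```r
library(dplyr)

# 'k' is duplicated in BOTH tables, so it is not really a key
x <- tibble(k = c("a", "a"), x_val = 1:2)
y <- tibble(k = c("a", "a"), y_val = 3:4)

# Check each side for duplicated "keys" before joining
x |> count(k) |> filter(n > 1)   # k = "a" appears twice

# Every 'a' in x pairs with every 'a' in y: 2 x 2 = 4 rows
joined <- inner_join(x, y, join_by(k == k),
                     relationship = "many-to-many")
nrow(joined)   # 4
```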

How to Check Efficiency?

No automatic way. Some rules of thumb:

  • Don’t create large tables just to filter down
    • filter before join when possible
  • full_join (full outer join) is a bit dangerous
  • cross_join is rarely the right answer
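“Filter before join” in a minimal sketch (invented data): both pipelines return the same rows, but the second never builds the larger intermediate table.

```r
library(dplyr)

big <- tibble(id = 1:1000, grp = rep(c("keep", "drop"), 500))
lkp <- tibble(grp = c("keep", "drop"), label = c("K", "D"))

# Join first, then filter: builds a 1000-row intermediate
slow <- big |>
  inner_join(lkp, join_by(grp == grp)) |>
  filter(label == "K")

# Filter first, then join: joins only 500 rows
fast <- big |>
  filter(grp == "keep") |>
  inner_join(lkp, join_by(grp == grp))

identical(sort(slow$id), sort(fast$id))   # TRUE
```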

tidyr vs dplyr

Is tidyr more efficient than dplyr?

Nope - different packages from the same developers.

Designed to work together elegantly.

Rare Joins

What are cross_join, filter joins, and nest_join?

  • cross_join: dangerous.
    • Creates “all pairs” of rows. Useful for ‘design’ problems
  • filter joins (anti_, semi_):
    • Hunting down quietly missing data.
    • Filtering to sub-samples
  • nest_join: beyond this course.
    • left_join with extra structure to output.
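Small invented examples of each (nest_join omitted, as it is beyond this course):

```r
library(dplyr)

students <- tibble(name  = c("Ana", "Bo"))
topics   <- tibble(topic = c("joins", "pivots"))

# cross_join: all pairs -- a 'design' grid of every (student, topic)
grid <- cross_join(students, topics)
nrow(grid)   # 2 x 2 = 4

# semi_join: filter students to those WITH a match (no columns added)
semi_join(students, tibble(name = "Ana"), join_by(name == name))  # just Ana

# anti_join (the other filter join) keeps rows WITHOUT a match
anti_join(students, tibble(name = "Ana"), join_by(name == name))  # just Bo
```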

Wrap-Up

Review

Multi-Table dplyr:

  • inner_join and left_join
  • join_by specifications
  • pivot_longer and pivot_wider to get data into optimal formats (tidyr)

Additional dplyr:

  • Ranking, cumulative, and shift functions
  • Conditional mutation

Upcoming Work

Upcoming work from course calendar

  • Pre-Assignment #07 due 2025-10-20 at 11:59pm ET
  • Mini-Project #01 due on 2025-10-03 at 11:59pm ET
  • Project Teams by 2025-09-30
  • Project Proposals on Tuesday Oct 07 and Thursday Oct 09

Life Tip of the Week

Easy Duplicate Emails with Gmail

If you have a gmail account, you actually have infinite accounts:

email+tag@gmail.com gets forwarded to email@gmail.com

for any (reasonable) tag

Duplicate Emails with Gmail

Use this to:

  • Track who is reselling your email
  • Get access to ‘new account sales’ / ‘signup promotions’
    • Typically based on new email
  • Quickly sort and organize emails

Musical Treat