Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 5

Today:

Tuesday Section: 2025-09-30
Thursday Section: 2025-09-25

Lecture #05: Multi-Table dplyr Verbs

Today

Special Presentation
Course Administration
Review
Diving Deeper into Multi-Table Verbs
PA#05 FAQs
Wrap-Up
- Life Tip of the Day

Special Presentation

Baruch Data Resources

Jason Amey

Newman Library, Baruch College CUNY

Research Guide: Open Data

Course Administration

MP#00 Peer Feedback

Mini-Project #00 Peer Feedback deadline:

Hopefully you benefitted from interaction with peers (giving and receiving comments)
As of 10:30am Wednesday:
- 37 completed properly
- 46 completed with formatting issues
- 13 not yet started

Remember you can use mp_feedback_verify to confirm proper formatting:

Need to get structure (##), not just arrangement
Mixing of scores and text

STA 9750 MP #01

Due on 2025-10-03 at 11:59pm ET (+ grace period)

Submit early and submit often
- Less “last minute” tech support going forward
Use Piazza and use your peers

You don’t need fancy graphics yet, but I of course love to see above and beyond

STA 9750 MP #01

Make sure your code is included in your submission

Code-Folding is very useful

Follow submission instructions

You need to have mp01.qmd and docs/mp01.html in your STA9750-2025-FALL repository
Helper function can verify submission is properly formatted

Pre-Assignments

Brightspace - day before class at 11:45

Reading, typically on course website
Brightspace auto-grades
- I have to manually change to completion grading
Two “drops” + otherwise free points

No pre-assignment next week!

Course Support

Synchronous
- Office Hours 2x / week
Asynchronous
- Piazza (≈ 20 minute average response time)

Ask questions! This course is demanding but rewarding.

Social contract: I push you hard, but I also provide lots of support.

Course Project

Roster formation deadline: 2025-09-30

20 teams already formed (97 students)
0 students still need a team

I’ve started setting teams up. If you don’t have a team and want to be randomly assigned, let me know.

Proposal Presentations

Next Week - Project Proposal Presentations (Tuesday Oct 07 and Thursday Oct 09)

Official Description

6 minute presentation
Key topics:
- Animating Question
- Team Roster
Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges

Aim: make sure you’ve started thinking seriously, not to lock you into answers

Review from Last Week

Single-Table Verbs

dplyr single-table verbs

select, filter: Selecting columns and rows
rename, mutate: Renaming and changing columns
summarize, group_by: Combining multiple rows
arrange, slice_min/max: Re-ordering
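
As a quick refresher, the verbs above can be chained in a single pipeline. The sketch below uses a small made-up table (the `courses` data and its column names are hypothetical, invented only for illustration) and is one possible arrangement of the verbs, not the only one.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for illustration only
courses <- tribble(
  ~dept, ~course,   ~enrolled, ~cap,
  "STA", "STA9750", 40,        45,
  "STA", "STA9700", 30,        30,
  "CIS", "CIS9340", 25,        50
)

dept_summary <- courses |>
  filter(enrolled > 0) |>                     # filter: keep rows
  select(dept, enrolled, cap) |>              # select: keep columns
  mutate(open_seats = cap - enrolled) |>      # mutate: add a column
  group_by(dept) |>                           # group_by: define groups
  summarize(total_open = sum(open_seats)) |>  # summarize: one row per group
  arrange(desc(total_open))                   # arrange: re-order rows

dept_summary
```

Each verb takes a table and returns a table, which is what makes the pipe-based style work.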

Lab #05 Review Activity

Review of PA#05

Multi-Table Analysis

Multiple Tables:

More insights than from a single table
Maintain ‘tidy’ structure throughout

Will create new (compound) rows:

Dangers: drops and (over) duplication

Primary Keys

Keys are unique identifiers for individual records

Primary (one column) or compound (multiple columns together)

The history of corporate IT is largely one of (failed) primary keys

Finance: Tickers, Tickers + Exchange, Tickers + Share Class, CUSIP, ISIN, SEDOL, …

Meaningful true keys are vanishingly rare - cherish them when you find them

Often ‘unique enough’ for an analysis

dplyr::group_by() + dplyr::count() is helpful here
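
One way to check whether a candidate key is 'unique enough' is to count rows per key value and look for counts above one. The sketch below assumes a hypothetical `holdings` table (its tickers and exchanges are invented) and tests first a single-column key, then a compound key.

```r
library(dplyr)
library(tibble)

# Hypothetical holdings table, invented for illustration only
holdings <- tribble(
  ~ticker, ~exchange, ~shares,
  "ABC",   "NYSE",    100,
  "ABC",   "LSE",     50,
  "XYZ",   "NYSE",    75
)

# Is `ticker` alone a key? Any group with n > 1 says no.
dupes_single <- holdings |>
  group_by(ticker) |>
  count() |>
  ungroup() |>
  filter(n > 1)

# Is `ticker` + `exchange` a (compound) key?
dupes_compound <- holdings |>
  group_by(ticker, exchange) |>
  count() |>
  ungroup() |>
  filter(n > 1)

dupes_single    # one row: "ABC" appears twice
dupes_compound  # zero rows: the compound key is unique here
```

An empty result means the candidate key identifies each row uniquely in this data; it says nothing about data you have not seen yet.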

Joins

Joins combine tables by identity - not simple ‘stacking’

Specify a join key - ideally this is an actual key, but doesn’t have to be

In dplyr, we use the join_by function:

dplyr::join_by(table_1_name == table_2_name)

Here table_1_name and table_2_name are column names from two tables

Join rows where these values are equal (advanced joins possible)
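
A minimal sketch of join_by in use, assuming two hypothetical tables (`students` and `majors`, invented here) whose key columns have different names:

```r
library(dplyr)
library(tibble)

# Hypothetical tables; the key column is named differently in each
students <- tribble(
  ~student, ~major_code,
  "Ana",    "STA",
  "Ben",    "CIS",
  "Cy",     "STA"
)
majors <- tribble(
  ~code, ~major_name,
  "STA", "Statistics",
  "CIS", "Information Systems"
)

# Left side of == names a column in the first table,
# right side names a column in the second table
joined <- inner_join(students, majors, join_by(major_code == code))
joined
```

For an equality join like this one, the result keeps the key column under the first table's name (`major_code`). A compound key can be written by listing several conditions, for example join_by(a == b, c == d).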

Inner and Outer Joins

When tables are perfectly matched, not an issue:

# A tibble: 4 × 2
  college campus_borough
  <chr>   <chr>         
1 CCNY    Manhattan     
2 Baruch  Manhattan     
3 CSI     Staten Island 
4 York    Queens

# A tibble: 3 × 2
  borough_name  bus_code
  <chr>         <chr>   
1 Manhattan     M       
2 Staten Island S       
3 Queens        Q

Inner and Outer Joins

When tables are perfectly matched, not an issue. Joining the two tables on borough name gives:

# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q

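inner_join is the usual default choice, but here the join type is irrelevant because every row has a match

Note the automatic repetition of the "M" bus code for both Manhattan colleges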

Inner and Outer Joins

How to handle ‘unaligned’ values?

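library(dplyr)
library(tibble)
routes <- tribble(~borough_name, ~bus_code,  # lookup table printed earlier
                  "Manhattan", "M",
                  "Staten Island", "S",
                  "Queens", "Q")

cunys <- tribble(~college, ~campus_borough, 
                 "CCNY", "Manhattan",
                 "Baruch", "Manhattan", 
                 "CSI", "Staten Island",
                 "York", "Queens", 
                 "Medgar Evers", "Brooklyn")

inner_join(cunys, routes, join_by(campus_borough == borough_name))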

# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q

Medgar Evers College (MEC) vanished! Brooklyn has no match in routes, so inner_join drops the row

Inner and Outer Joins

left_join(cunys, routes, join_by(campus_borough == borough_name))

# A tibble: 5 × 3
  college      campus_borough bus_code
  <chr>        <chr>          <chr>   
1 CCNY         Manhattan      M       
2 Baruch       Manhattan      M       
3 CSI          Staten Island  S       
4 York         Queens         Q       
5 Medgar Evers Brooklyn       <NA>

MEC stays, but no bus code - NA value

inner_join - Keep only matches
left_join - Keep all rows in left (first) table even w/o matches
right_join - Keep all rows in right (second) table even w/o matches
full_join - Keep all rows from both tables, even w/o matches

left_, right_, and full_ joins are all types of ‘outer’ joins
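
The four join types can be compared side by side by counting the rows each one keeps. This is a sketch under one stated assumption: a "Bronx" row is added to routes (it is not in the slides) so that the right table also has an unmatched row. anti_join, a dplyr function not covered above, is used at the end to list the unmatched rows.

```r
library(dplyr)
library(tibble)

cunys <- tribble(
  ~college,       ~campus_borough,
  "CCNY",         "Manhattan",
  "Baruch",       "Manhattan",
  "CSI",          "Staten Island",
  "York",         "Queens",
  "Medgar Evers", "Brooklyn"
)

# The Bronx row is an assumption added here so the right table also has
# an unmatched row; it does not appear in the slides
routes <- tribble(
  ~borough_name,   ~bus_code,
  "Manhattan",     "M",
  "Staten Island", "S",
  "Queens",        "Q",
  "Bronx",         "Bx"
)

by_borough <- join_by(campus_borough == borough_name)

inner <- inner_join(cunys, routes, by_borough)  # matches only
left  <- left_join(cunys, routes, by_borough)   # + unmatched colleges
right <- right_join(cunys, routes, by_borough)  # + unmatched routes
full  <- full_join(cunys, routes, by_borough)   # + both kinds

# Row counts reveal what each join kept
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))

# anti_join() lists the left-table rows that found no match
unmatched <- anti_join(cunys, routes, by_borough)
unmatched
```

Here the counts are 4, 5, 5, and 6. Comparing row counts before and after a join is a cheap way to catch the drops and duplication mentioned earlier.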

Pivoting

The pivot_* functions change the shape of data

Values are not created or destroyed, just moved around
Wider data sets are formed by turning multiple rows into columns
Longer data sets are formed by splitting columns from the same row into new rows

These functions come from the tidyr package - not dplyr

Pivoting

Untidy example from last week:

# A tibble: 12 × 4
   Semester Course     Number Type      
   <chr>    <chr>       <dbl> <chr>     
 1 Fall     Accounting    200 Enrollment
 2 Fall     Accounting    250 Cap       
 3 Fall     Law           100 Enrollment
 4 Fall     Law           125 Cap       
 5 Fall     Statistics    200 Enrollment
 6 Fall     Statistics    200 Cap       
 7 Spring   Accounting    300 Enrollment
 8 Spring   Accounting    350 Cap       
 9 Spring   Law            50 Enrollment
10 Spring   Law           100 Cap       
11 Spring   Statistics    400 Enrollment
12 Spring   Statistics    400 Cap

Pivoting

This data was untidy because it split a single unit (course) across multiple rows

pivot_wider to get to the right format

pivot_wider(BARUCH_UNTIDY, names_from=Type, values_from=Number)

# A tibble: 6 × 4
  Semester Course     Enrollment   Cap
  <chr>    <chr>           <dbl> <dbl>
1 Fall     Accounting        200   250
2 Fall     Law               100   125
3 Fall     Statistics        200   200
4 Spring   Accounting        300   350
5 Spring   Law                50   100
6 Spring   Statistics        400   400
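
For anyone who wants to run this pivot, here is a self-contained sketch. The table is rebuilt by hand from the printed output above (how BARUCH_UNTIDY was originally created is not shown, so the tribble construction is an assumption), and the `open_seats` column is a hypothetical follow-up step showing why the wide shape is convenient.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuilt by hand from the printed 12-row table
baruch_untidy <- tribble(
  ~Semester, ~Course,      ~Number, ~Type,
  "Fall",    "Accounting", 200,     "Enrollment",
  "Fall",    "Accounting", 250,     "Cap",
  "Fall",    "Law",        100,     "Enrollment",
  "Fall",    "Law",        125,     "Cap",
  "Fall",    "Statistics", 200,     "Enrollment",
  "Fall",    "Statistics", 200,     "Cap",
  "Spring",  "Accounting", 300,     "Enrollment",
  "Spring",  "Accounting", 350,     "Cap",
  "Spring",  "Law",        50,      "Enrollment",
  "Spring",  "Law",        100,     "Cap",
  "Spring",  "Statistics", 400,     "Enrollment",
  "Spring",  "Statistics", 400,     "Cap"
)

baruch_tidy <- pivot_wider(baruch_untidy, names_from = Type, values_from = Number)

# With one row per course offering, cross-column arithmetic is direct
baruch_seats <- baruch_tidy |> mutate(open_seats = Cap - Enrollment)
baruch_seats
```

Once Enrollment and Cap sit in the same row, comparing them is a one-line mutate instead of a grouped calculation.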

Pivots

pivot_ changes the shape of a data set. Purposes:

Get ready for presentation
Prep for a join
Combine rows before looking at ‘cross-row’ structure

Pivots

Which penguin species has the largest between-sex mass difference?

library(dplyr)
library(tidyr)
# Assumes the penguins data set built into R 4.5+ (body_mass column)
avg_mass_tbl <- penguins |> drop_na() |> 
    group_by(sex, species) |> 
    # .groups="drop" is equivalent to a trailing |> ungroup()
    summarize(avg_mass = mean(body_mass), .groups="drop")
avg_mass_tbl

# A tibble: 6 × 3
  sex    species   avg_mass
  <fct>  <fct>        <dbl>
1 female Adelie       3369.
2 female Chinstrap    3527.
3 female Gentoo       4680.
4 male   Adelie       4043.
5 male   Chinstrap    3939.
6 male   Gentoo       5485.

Pivots

We want data that is wider than our current data:

species    male_avg  female_avg
Adelie     …         …
Chinstrap  …         …
Gentoo     …         …

Pivots

pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass)

# A tibble: 3 × 3
  species   female  male
  <fct>      <dbl> <dbl>
1 Adelie     3369. 4043.
2 Chinstrap  3527. 3939.
3 Gentoo     4680. 5485.

pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass) |>
    mutate(sex_diff = male - female) |>
    slice_max(sex_diff)

# A tibble: 1 × 4
  species female  male sex_diff
  <fct>    <dbl> <dbl>    <dbl>
1 Gentoo   4680. 5485.     805.

Pivots

pivot_wider Arguments:

id_cols: kept as ‘keys’ for new table
names_from: existing column ‘spread’ to create new column names
values_from: values in new table

pivot_longer:

‘Inverse’ operation
Spread one row + multiple columns => one col + multiple rows
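
A minimal sketch of the inverse operation, taking the wide penguin table back to long form. The values are typed in from the rounded printout above rather than recomputed, so they are approximate.

```r
library(tidyr)
library(tibble)

# Wide table, typed in from the rounded printed output
wide <- tribble(
  ~species,    ~female, ~male,
  "Adelie",    3369,    4043,
  "Chinstrap", 3527,    3939,
  "Gentoo",    4680,    5485
)

long <- pivot_longer(
  wide,
  cols      = c(female, male),  # columns to fold into rows
  names_to  = "sex",            # old column names land here
  values_to = "avg_mass"        # old cell values land here
)
long
```

The result has six rows (one per species and sex), matching the shape of avg_mass_tbl before it was widened.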

pivot_wider and pivot_longer have many additional arguments for dealing with repeats / missing values. The help page (+ experimenting) is your friend
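
Two of those additional pivot_wider arguments are values_fn (how to combine repeated values that land in the same cell) and values_fill (what to put in cells with no value). The sketch below uses a hypothetical `scores` table invented for illustration; taking the mean of repeats and filling gaps with 0 are arbitrary choices here, not recommendations.

```r
library(tidyr)
library(tibble)

# Hypothetical scores: student A has a repeated hw1, student B has no hw2
scores <- tribble(
  ~student, ~assignment, ~score,
  "A",      "hw1",       90,
  "A",      "hw1",       100,
  "A",      "hw2",       80,
  "B",      "hw1",       70
)

wide_scores <- pivot_wider(
  scores,
  names_from  = assignment,
  values_from = score,
  values_fn   = mean,  # repeats: average them into one cell
  values_fill = 0      # missing combinations: fill with 0 instead of NA
)
wide_scores
```

Without values_fn, repeated values produce list-columns and a warning; without values_fill, missing combinations appear as NA.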