Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 5

STA 9750 Week 5

Today:

  • Tuesday Section: 2025-09-30
  • Thursday Section: 2025-09-25

Lecture #05: Multi-Table dplyr Verbs

Today

  • Special Presentation
  • Course Administration
  • Review
  • Diving Deeper into Multi-Table Verbs
  • PA#05 FAQs
  • Wrap-Up
    • Life Tip of the Day

Special Presentation

Baruch Data Resources

Jason Amey

Newman Library, Baruch College CUNY

Course Administration

MP#00 Peer Feedback

Mini-Project #00 Peer Feedback deadline:

  • Hopefully you benefitted from interaction with peers (giving and receiving comments)
  • As of 10:30am Wednesday:
    • 37 completed properly
    • 46 completed with formatting issues
    • 13 not yet started

Remember you can use mp_feedback_verify to confirm proper formatting:

  • You need the required header structure (##), not just a similar visual arrangement
  • Don’t mix scores and free-text comments

STA 9750 MP #01

Due on 2025-10-03 at 11:59pm ET (+ grace period)

  • Submit early and submit often
    • Less “last minute” tech support going forward
  • Use Piazza and use your peers

You don’t need fancy graphics yet, but I of course love to see work that goes above and beyond

STA 9750 MP #01

Make sure your code is included in your submission

Follow submission instructions

  • You need to have mp01.qmd and docs/mp01.html in your STA9750-2025-FALL repository

  • Helper function can verify submission is properly formatted

Pre-Assignments

Due on Brightspace the day before class at 11:45

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading
  • Two “drops” + otherwise free points

No pre-assignment next week!

Course Support

  • Synchronous
    • Office Hours 2x / week
  • Asynchronous
    • Piazza (≈ 20 minute average response time)

Ask questions! This course is demanding but rewarding.

Social contract: I push you hard, but I also provide lots of support.

Course Project

Roster formation deadline: 2025-09-30

  • 7 teams already formed (38 students)
  • 59 students still need a team

I’ve started setting teams up. If you don’t have a team and want to be randomly assigned, let me know.

Proposal Presentations

Next Week - Project Proposal Presentations (Tuesday Oct 07 and Thursday Oct 09)

Official Description

  • 6 minute presentation
  • Key topics:
    • Animating Question
    • Team Roster
  • Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges

Aims: make sure you’ve started thinking seriously, not to lock you into answers

Review from Last Week

Single-Table Verbs

dplyr single-table verbs

  • select, filter: Selecting columns and rows (respectively)
  • rename, mutate: Renaming, changing, and creating columns
  • summarize, group_by: Combining multiple rows
  • arrange, slice_min/max: Re-ordering
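These verbs chain naturally with the pipe. A minimal sketch on invented course-enrollment data (all names hypothetical):

```r
library(dplyr)

# Hypothetical enrollment data for illustration
courses <- tribble(
  ~course,      ~semester, ~enrollment,
  "Accounting", "Fall",    200,
  "Law",        "Fall",    100,
  "Statistics", "Fall",    200,
  "Statistics", "Spring",  400
)

result <- courses |>
  filter(enrollment >= 100) |>                  # keep qualifying rows
  mutate(large = enrollment > 150) |>           # add a derived column
  group_by(course) |>                           # combine rows per course
  summarize(total = sum(enrollment), .groups = "drop") |>
  arrange(desc(total))                          # re-order by size

result   # Statistics first, with total 600
```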

Review of PA#05

Multi-Table Analysis

Multiple Tables:

  • More insights than from a single table
  • Maintain ‘tidy’ structure throughout

Will create new (compound) rows:

  • Dangers: drops and (over) duplication

Primary Keys

Keys are unique identifiers for individual records

  • Simple (one column) or compound (multiple columns together)

The history of corporate IT is largely one of (failed) primary keys

  • Finance: Tickers, Tickers + Exchange, Tickers + Share Class, CUSIP, ISIN, SEDOL, …

Meaningful true keys are vanishingly rare - cherish them when you find them

Often ‘unique enough’ for an analysis

dplyr::group_by() + dplyr::count() is helpful here
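A sketch of that uniqueness check on a small CUNY-style table (as later in the deck): count() rows per candidate key and look for any count above 1.

```r
library(dplyr)

cunys <- tribble(
  ~college, ~campus_borough,
  "CCNY",   "Manhattan",
  "Baruch", "Manhattan",
  "CSI",    "Staten Island"
)

# college is a true key: no value appears more than once
bad_college <- cunys |> count(college) |> filter(n > 1)
# campus_borough is NOT a key: "Manhattan" repeats
bad_borough <- cunys |> count(campus_borough) |> filter(n > 1)

nrow(bad_college)   # 0 => unique
nrow(bad_borough)   # 1 => duplicated
```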

Joins

Joins combine tables by identity - not simple ‘stacking’

Specify a join key - ideally this is an actual key, but doesn’t have to be

In dplyr, we use the join_by function:

dplyr::join_by(table_1_name == table_2_name)

Here table_1_name and table_2_name are column names from two tables

Join rows where these values are equal (advanced joins possible)

Inner and Outer Joins

When tables are perfectly matched, not an issue:

cunys
# A tibble: 4 × 2
  college campus_borough
  <chr>   <chr>         
1 CCNY    Manhattan     
2 Baruch  Manhattan     
3 CSI     Staten Island 
4 York    Queens        
routes
# A tibble: 3 × 2
  borough_name  bus_code
  <chr>         <chr>   
1 Manhattan     M       
2 Staten Island S       
3 Queens        Q       

Inner and Outer Joins

When tables are perfectly matched, not an issue:

inner_join(cunys, routes, join_by(campus_borough == borough_name))
# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q       

We default to inner_join, but with perfectly matched tables the choice of join type is irrelevant

Note automatic repetition of "M" row

Inner and Outer Joins

How to handle ‘unaligned’ values?

cunys <- tribble(~college, ~campus_borough, 
                 "CCNY", "Manhattan",
                 "Baruch", "Manhattan", 
                 "CSI", "Staten Island",
                 "York", "Queens", 
                 "Medgar Evers", "Brooklyn")

inner_join(cunys, routes, join_by(campus_borough == borough_name))
# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q       

MEC vanished!

Inner and Outer Joins

left_join(cunys, routes, join_by(campus_borough == borough_name))
# A tibble: 5 × 3
  college      campus_borough bus_code
  <chr>        <chr>          <chr>   
1 CCNY         Manhattan      M       
2 Baruch       Manhattan      M       
3 CSI          Staten Island  S       
4 York         Queens         Q       
5 Medgar Evers Brooklyn       <NA>    

MEC stays, but no bus code - NA value

  • inner_join - Keep only matches
  • left_join - Keep all rows in left (first) table even w/o matches
  • right_join - Keep all rows in right (second) table even w/o matches
  • full_join - Keep all rows from both tables, even w/o matches

left_ and right_ are types of ‘outer’ joins
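The four types can be contrasted on the slide’s tables; here routes gains a Bronx row (an invented addition) with no matching campus, so each join keeps a different set of rows:

```r
library(dplyr)

cunys <- tribble(~college, ~campus_borough,
                 "CCNY",         "Manhattan",
                 "Baruch",       "Manhattan",
                 "CSI",          "Staten Island",
                 "York",         "Queens",
                 "Medgar Evers", "Brooklyn")
routes <- tribble(~borough_name,  ~bus_code,
                  "Manhattan",     "M",
                  "Staten Island", "S",
                  "Queens",        "Q",
                  "Bronx",         "Bx")  # invented row with no campus match

nrow(inner_join(cunys, routes, join_by(campus_borough == borough_name)))  # 4: matches only
nrow(left_join(cunys,  routes, join_by(campus_borough == borough_name)))  # 5: + Medgar Evers
nrow(right_join(cunys, routes, join_by(campus_borough == borough_name)))  # 5: + Bronx route
nrow(full_join(cunys,  routes, join_by(campus_borough == borough_name)))  # 6: + both
```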

Pivoting

The pivot_* functions change the shape of data

  • Values are not created or destroyed, just moved around
  • wider data sets combine multiple rows into new columns
  • longer data sets split columns from the same row into new rows

These functions come from the tidyr package - not dplyr

library(tidyr) # included in library(tidyverse)

Pivoting

Untidy example from last week:

# A tibble: 12 × 4
   Semester Course     Number Type      
   <chr>    <chr>       <dbl> <chr>     
 1 Fall     Accounting    200 Enrollment
 2 Fall     Accounting    250 Cap       
 3 Fall     Law           100 Enrollment
 4 Fall     Law           125 Cap       
 5 Fall     Statistics    200 Enrollment
 6 Fall     Statistics    200 Cap       
 7 Spring   Accounting    300 Enrollment
 8 Spring   Accounting    350 Cap       
 9 Spring   Law            50 Enrollment
10 Spring   Law           100 Cap       
11 Spring   Statistics    400 Enrollment
12 Spring   Statistics    400 Cap       

Pivoting

This data was untidy because it split a single unit (course) across multiple rows

pivot_wider to get to the right format

pivot_wider(BARUCH_UNTIDY, names_from=Type, values_from=Number)
# A tibble: 6 × 4
  Semester Course     Enrollment   Cap
  <chr>    <chr>           <dbl> <dbl>
1 Fall     Accounting        200   250
2 Fall     Law               100   125
3 Fall     Statistics        200   200
4 Spring   Accounting        300   350
5 Spring   Law                50   100
6 Spring   Statistics        400   400

Pivots

The pivot_* functions change the shape of a data set. Purposes:

  • Get ready for presentation
  • Prep for a join
  • Combine rows before looking at ‘cross-row’ structure

Pivots

Which penguin species has the largest between-sex mass difference?

library(tidyr)
avg_mass_tbl <- penguins |> drop_na() |> 
    group_by(sex, species) |> 
    summarize(avg_mass = mean(body_mass), .groups="drop")
    # .groups="drop" is equivalent to |> ungroup()
avg_mass_tbl
# A tibble: 6 × 3
  sex    species   avg_mass
  <fct>  <fct>        <dbl>
1 female Adelie       3369.
2 female Chinstrap    3527.
3 female Gentoo       4680.
4 male   Adelie       4043.
5 male   Chinstrap    3939.
6 male   Gentoo       5485.

Pivots

We want data that is wider than our current data:

  species    male_avg  female_avg
  Adelie
  Chinstrap
  Gentoo

Pivots

pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass)
# A tibble: 3 × 3
  species   female  male
  <fct>      <dbl> <dbl>
1 Adelie     3369. 4043.
2 Chinstrap  3527. 3939.
3 Gentoo     4680. 5485.
pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass) |>
    mutate(sex_diff = male - female) |>
    slice_max(sex_diff)
# A tibble: 1 × 4
  species female  male sex_diff
  <fct>    <dbl> <dbl>    <dbl>
1 Gentoo   4680. 5485.     805.

Pivots

pivot_wider Arguments:

  • id_cols: kept as ‘keys’ for new table
  • names_from: existing column ‘spread’ to create new columns names
  • values_from: values in new table

pivot_longer:

  • ‘Inverse’ operation
  • Spread one row + multiple columns => one col + multiple rows

pivot_wider and pivot_longer have many additional arguments for dealing with repeats / missing values. The help page (+ experimenting) is your friend
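A round trip on a slice of the earlier enrollment data shows pivot_longer undoing pivot_wider (a minimal sketch):

```r
library(dplyr)
library(tidyr)

wide <- tribble(
  ~Semester, ~Course,      ~Enrollment, ~Cap,
  "Fall",    "Accounting", 200,         250,
  "Fall",    "Law",        100,         125
)

# One row + two columns become two rows + one column (per course)
long <- wide |>
  pivot_longer(cols = c(Enrollment, Cap),
               names_to = "Type", values_to = "Number")

# pivot_wider recovers the original shape
back <- long |>
  pivot_wider(names_from = Type, values_from = Number)

nrow(long)                    # 4
all(dim(back) == dim(wide))   # TRUE
```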

Legos of Data Analysis

These functions are like Legos:

  • Simple individually
  • Combine for complex structures

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

library(dplyr)
library(nycflights13)
flights |> 
    n_distinct()
[1] 336776

Not quite what we wanted…

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

How many unique combinations of carrier + flight (e.g., United 101)?

flights |>
    select(carrier, flight) |>
    n_distinct()
[1] 5725

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

💡 Did airlines re-use flight numbers for different destinations?

flights |>
    distinct(carrier, flight, dest) |>
    # Find reuse of number across different destinations
    # Shorthand for group_by + summarize(n = n())
    count(carrier, flight) 
# A tibble: 5,725 × 3
   carrier flight     n
   <chr>    <int> <int>
 1 9E        2900     1
 2 9E        2901     1
 3 9E        2902     1
 4 9E        2903     2
 5 9E        2904     2
 6 9E        2905     1
 7 9E        2906     1
 8 9E        2907     1
 9 9E        2908     1
10 9E        2909     2
# ℹ 5,715 more rows

Seems so!

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Find examples of re-use:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n)
# A tibble: 1 × 3
  carrier flight     n
  <chr>    <int> <int>
1 UA        1162    16

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Finding most re-used flight number:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |> 
    left_join(flights, join_by(carrier == carrier, flight == flight)) |>
    pull(dest) |> # pull out column as vector
    table() # frequency table

BOS CLE DEN DFW IAH JAC LAS MIA MSY ORD SAN SAT SEA SFO SNA TPA 
 13   1  19   5   8   2   1   2   8  55   4   1  18   2  27   4 

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Finding most re-used flight number:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n)
# A tibble: 1 × 3
  carrier flight     n
  <chr>    <int> <int>
1 UA        1162    16

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Seeing where our most reused number went:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |>
    inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
    count(flight, carrier, dest)
# A tibble: 16 × 4
   flight carrier dest      n
    <int> <chr>   <chr> <int>
 1   1162 UA      BOS      13
 2   1162 UA      CLE       1
 3   1162 UA      DEN      19
 4   1162 UA      DFW       5
 5   1162 UA      IAH       8
 6   1162 UA      JAC       2
 7   1162 UA      LAS       1
 8   1162 UA      MIA       2
 9   1162 UA      MSY       8
10   1162 UA      ORD      55
11   1162 UA      SAN       4
12   1162 UA      SAT       1
13   1162 UA      SEA      18
14   1162 UA      SFO       2
15   1162 UA      SNA      27
16   1162 UA      TPA       4

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Additional join to get airport information + formatting:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |>
    inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
    count(flight, carrier, dest) |> 
    inner_join(airports, join_by(dest == faa)) |>
    select(name, n, carrier, flight) |>
    arrange(desc(n)) |>
    rename(`Destination Airport` = name, 
           `Number of Times Flown` = n, 
           `Carrier Code` = carrier, 
           `Flight Number`= flight)
# A tibble: 16 × 4
   `Destination Airport`   Number of Times Flow…¹ `Carrier Code` `Flight Number`
   <chr>                                    <int> <chr>                    <int>
 1 Chicago Ohare Intl                          55 UA                        1162
 2 John Wayne Arpt Orange…                     27 UA                        1162
 3 Denver Intl                                 19 UA                        1162
 4 Seattle Tacoma Intl                         18 UA                        1162
 5 General Edward Lawrenc…                     13 UA                        1162
 6 George Bush Interconti…                      8 UA                        1162
 7 Louis Armstrong New Or…                      8 UA                        1162
 8 Dallas Fort Worth Intl                       5 UA                        1162
 9 San Diego Intl                               4 UA                        1162
10 Tampa Intl                                   4 UA                        1162
11 Jackson Hole Airport                         2 UA                        1162
12 Miami Intl                                   2 UA                        1162
13 San Francisco Intl                           2 UA                        1162
14 Cleveland Hopkins Intl                       1 UA                        1162
15 Mc Carran Intl                               1 UA                        1162
16 San Antonio Intl                             1 UA                        1162
# ℹ abbreviated name: ¹​`Number of Times Flown`

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Extra join to match to airlines as well:

head(airlines, 3)
# A tibble: 3 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  

Also has a column named name - need to disambiguate!

Legos of Data Analysis

Q: How many distinct flights left NYC in 2013?

Additional join to get airline names + formatting:

flights |>
    distinct(carrier, flight, dest) |>
    count(carrier, flight) |> 
    slice_max(n) |>
    inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
    count(flight, carrier, dest) |>  
    inner_join(airports, join_by(dest == faa)) |>
    select(name, n, carrier, flight) |>
    rename(dest_name = name) |> 
    inner_join(airlines, join_by(carrier == carrier)) |> 
    arrange(desc(n)) |>
    select(-carrier) |> 
    rename(`Destination Airport` = dest_name, 
           `Number of Times Flown` = n, 
           `Carrier` = name, 
           `Flight Number`= flight)
# A tibble: 16 × 4
   `Destination Airport`          Number of Times Flow…¹ `Flight Number` Carrier
   <chr>                                           <int>           <int> <chr>  
 1 Chicago Ohare Intl                                 55            1162 United…
 2 John Wayne Arpt Orange Co                          27            1162 United…
 3 Denver Intl                                        19            1162 United…
 4 Seattle Tacoma Intl                                18            1162 United…
 5 General Edward Lawrence Logan…                     13            1162 United…
 6 George Bush Intercontinental                        8            1162 United…
 7 Louis Armstrong New Orleans I…                      8            1162 United…
 8 Dallas Fort Worth Intl                              5            1162 United…
 9 San Diego Intl                                      4            1162 United…
10 Tampa Intl                                          4            1162 United…
11 Jackson Hole Airport                                2            1162 United…
12 Miami Intl                                          2            1162 United…
13 San Francisco Intl                                  2            1162 United…
14 Cleveland Hopkins Intl                              1            1162 United…
15 Mc Carran Intl                                      1            1162 United…
16 San Antonio Intl                                    1            1162 United…
# ℹ abbreviated name: ¹​`Number of Times Flown`

Legos of Data Analysis

Question: What does this do that I can’t do in Excel?

Technically, nothing. All programming languages of sufficient complexity are equally powerful (Turing equivalence).


In actuality, quite a lot:

  • filter allows more complex filtering than clicking on values

  • group_by + summarize extend array formulas

  • *_join provides more complex matching than VLOOKUP

  • pivot_* provide general formulation of pivot tables

  • plus everything else you can do in R

Ability to script minimizes “hard-coding” of names and values.

But truthfully

fortunes::fortune(59)

Let's not kid ourselves: the most widely used piece of software for statistics
is Excel.
   -- Brian D. Ripley ('Statistical Methods Need Software: A View of
      Statistical Computing')
      Opening lecture RSS 2002, Plymouth (September 2002)
fortunes::fortune(222)

Some people familiar with R describe it as a supercharged version of
Microsoft's Excel spreadsheet software.
   -- Ashlee Vance (in his article "Data Analysts Captivated by R's Power")
      The New York Times (January 2009)

Diving Deeper into Multi-Table Verbs

Diving Deeper into Joins

Data Set: nycflights13

Exercises: Lab #05

Additional dplyr Tricks

  • Ranking functions
    • row_number, min_rank, dense_rank: differ in ties
    • Use with desc() to flip ordering
    • cume_dist, percent_rank: compute quantiles
  • Cumulative Statistics
    • cummean, cummax, cummin, …
  • Conditional mutation
    • if_else, case_when (and the TRUE ~ default trick)
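A single mutate() can demonstrate all three families on a toy vector (invented data):

```r
library(dplyr)

ranked <- tibble(x = c(10, 20, 20, 30)) |>
  mutate(
    rn      = row_number(x),        # ties broken by position: 1 2 3 4
    minr    = min_rank(x),          # ties share the minimum:  1 2 2 4
    dense   = dense_rank(x),        # no gaps after ties:      1 2 2 3
    top     = min_rank(desc(x)),    # desc() flips ordering:   4 2 2 1
    run_max = cummax(x),            # cumulative maximum
    grade   = case_when(
      x >= 30 ~ "high",
      x >= 20 ~ "mid",
      TRUE    ~ "low"               # the 'default TRUE' fallback
    )
  )
ranked
```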

Pre-Assignment #05 FAQs

Subqueries

[W]ill we be learning how to perform joins within a subquery?

You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.

SELECT first_name, last_name
FROM collectors
WHERE id IN (
    SELECT collector_id
    FROM sales
);
collector_ids <- sales |> pull(collector_id)
collectors |> filter(id %in% collector_ids) |> select(first_name, last_name)

Data Integrity

[H]ow can we ensure that the information [resulting from a join] is accurate and not repeated?

  1. If you have a true unique ID, you’re usually safe
  2. Pay attention to all warnings
  3. Manually examine the result of any joins

Performance

Will joining large data sets […] affect performance?

Somewhat - larger data sets are always slower.

Bigger danger is “bad joins” creating huge data automatically.

Note that R is less “smart” than SQL, so won’t optimize execution order for you automatically.

dplyr joins vs SQL joins

What is the difference between dplyr and SQL joins?

Not too much - the biggest difference is that R has no INDEX or FOREIGN KEY, so fewer guarantees of data integrity.

When to use anti_join()?

Rare: looking for unmatched rows.

  • Useful to find data integrity issues or ‘implicit’ missingness.
  • I use an anti_join to find students who haven’t submitted an assignment.
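A sketch of that use case with invented roster/submission tables: anti_join() keeps the roster rows with no match in submissions.

```r
library(dplyr)

# Invented example tables
roster      <- tibble(student = c("Ana", "Bo", "Cy"))
submissions <- tibble(student = c("Ana", "Cy"), grade = c(95, 88))

# Rows of roster with NO match in submissions (no columns added)
missing <- anti_join(roster, submissions, join_by(student == student))
missing   # just Bo
```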

many-to-many Warning

Tricky to address, but fortunately pretty rare.

  • SQL explicitly forbids many-to-many
  • Usually a sign that a “key” isn’t really unique
    • Check for duplicates in x and y tables
    • Can occur with “fancy” joins (rolling, inequality)
  • Add additional join variables to break “duplication”
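The duplicate check can be sketched on invented tables; note that in dplyr >= 1.1 you must opt in with relationship = "many-to-many" to silence the warning:

```r
library(dplyr)

# 'k' is duplicated in BOTH tables, so it is not really a key
x <- tibble(k = c("a", "a"), x_val = 1:2)
y <- tibble(k = c("a", "a"), y_val = 3:4)

# Check each side for duplicated "keys" before joining
x |> count(k) |> filter(n > 1)   # k = "a" appears twice

# Every 'a' in x pairs with every 'a' in y: 2 x 2 = 4 rows
joined <- inner_join(x, y, join_by(k == k),
                     relationship = "many-to-many")
nrow(joined)   # 4
```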

How to Check Efficiency?

No automatic way. Some rules of thumb:

  • Don’t create large tables just to filter down
    • filter before join when possible
  • full_join (full outer join) is a bit dangerous
  • cross_join is rarely the right answer
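“Filter before join” in a minimal sketch (invented data): both pipelines return the same rows, but the second never builds the larger intermediate table.

```r
library(dplyr)

big <- tibble(id = 1:1000, grp = rep(c("keep", "drop"), 500))
lkp <- tibble(grp = c("keep", "drop"), label = c("K", "D"))

# Join first, then filter: builds a 1000-row intermediate
slow <- big |>
  inner_join(lkp, join_by(grp == grp)) |>
  filter(label == "K")

# Filter first, then join: joins only 500 rows
fast <- big |>
  filter(grp == "keep") |>
  inner_join(lkp, join_by(grp == grp))

identical(sort(slow$id), sort(fast$id))   # TRUE
```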

tidyr vs dplyr

Is tidyr more efficient than dplyr?

Nope - different packages from the same developers.

Designed to work together elegantly.

Rare Joins

What are cross_join, filter joins, and nest_join?

  • cross_join: dangerous.
    • Creates “all pairs” of rows. Useful for ‘design’ problems
  • filter joins (anti_, semi_):
    • Hunting down quietly missing data.
    • Filtering to sub-samples
  • nest_join: beyond this course.
    • left_join with extra structure to output.
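Small invented examples of each (nest_join omitted, as it is beyond this course):

```r
library(dplyr)

students <- tibble(name  = c("Ana", "Bo"))
topics   <- tibble(topic = c("joins", "pivots"))

# cross_join: all pairs -- a 'design' grid of every (student, topic)
grid <- cross_join(students, topics)
nrow(grid)   # 2 x 2 = 4

# semi_join: filter students to those WITH a match (no columns added)
semi_join(students, tibble(name = "Ana"), join_by(name == name))  # just Ana

# anti_join (the other filter join) keeps rows WITHOUT a match
anti_join(students, tibble(name = "Ana"), join_by(name == name))  # just Bo
```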

Wrap-Up

Review

Multi-Table dplyr:

  • inner_join and left_join
  • join_by specifications
  • pivot_longer and pivot_wider to get data into optimal formats (tidyr)

Additional dplyr:

  • Ranking, cumulative, and shift functions
  • Conditional mutation

Upcoming Work

Upcoming work from course calendar

  • Pre-Assignment #07 due 2025-10-20 at 11:59pm ET
  • Mini-Project #01 due on 2025-10-03 at 11:59pm ET
  • Project Teams by 2025-09-30
  • Project Proposals on Tuesday Oct 07 and Thursday Oct 09

Life Tip of the Week

Easy Duplicate Emails with Gmail

If you have a gmail account, you actually have infinite accounts:

email+tag@gmail.com gets forwarded to email@gmail.com

for any (reasonable) tag

Duplicate Emails with Gmail

Use this to:

  • Track who is reselling your email
  • Get access to ‘new account sales’ / ‘signup promotions’
    • Typically based on new email
  • Quickly sort and organize emails

Musical Treat