STA 9750
Week 5 Update
2025-02-27

Michael Weylandt

Today

Today

  • Course Administration
  • PA#05 Review
  • PA#05 FAQs
  • Diving Deeper into Multi-Table Verbs
  • Wrap-Up
    • Life Tip of the Day

STA 9750 Mini-Project #01

Due in 6 Days Four already submitted - fantastic!

(Hopefully an excess of preparedness, not a last minute panic)

  • Submit early and submit often
    • Less “last minute” tech support going forward
  • Use Piazza and use your peers

Very impressed by Detailed Analyses, Fancy Tables, Graphics

You don’t need fancy graphics yet, but I love to see above and beyond

STA 9750 Mini-Project #01

Make sure your code is included in your submission

Follow submission instructions

  • You need to have mp01.qmd and docs/mp01.html in your STA9750-2025-SPRING repository

  • Helper function can verify submission is properly formatted

STA 9750 Mini-Project #01

Your classmates have asked several excellent questions on Piazza.

Reminders:

  • This data is messy because NYC is messy
  • Can handle by either (or a mix):
    1. By fixing as many issues as possible
    2. By using ‘robust’ methods
  • Document your choices and do your best

STA 9750 Mini-Project #01 - Peer Feedback

To be assigned on GitHub in one week (March 5th):

  • \(\approx 4\) feedbacks each
  • Take this seriously: around 20% of this assignment is “meta-review”
  • Goal: rigorous constructive critique

Use helper functions to find submissions assigned to you. Ask on Piazza if still having trouble.

STA 9750 Mini-Project #01 - Peer Feedback

Submissions may not map perfectly to rubric - use your best judgement

Learn from this! What can you adapt for MP#02?

STA 9750 Mini-Project #02

MP#02 released - Public Transit

Due 2025-03-26 at 11:45pm ET

  • GitHub post (used for peer feedback) AND Brightspace
  • One Month: don’t wait until the very end

Pay attention to the rubric

  • Writing and presentation are about 50% of your grade
  • Evaluated on rigor and thoughtfulness, not necessarily correctness

Upcoming Mini-Projects

Topics

  • MP#03: Spotify Analytics - 2025-04-23 at 11:45pm ET
  • MP#04: Political Analysis - 2025-05-07 at 11:45pm ET

Pre-Assignments

Brightspace - Wednesdays at 11:45

  • Reading, typically on course website
  • Brightspace auto-grades
    • I have to manually change to completion grading

Next pre-assignment is March 20th - few weeks off

Apologies for misleading question this week. I forgot to change it when giving the extension

Course Support

  • Synchronous
    • Office Hours 2x / week
  • Asynchronous
    • Piazza (<30 minute average response time)

Long list of helpful resources added on Piazza, esp. https://learnr.numbat.space/

Ask questions! This course is demanding but rewarding.

Social contract: I push you hard, but I also provide lots of support.

Course Project

2 teams already formed

  • Team 1: DL + DS + SG + VH
  • Team 2: VG
  • More to come

All team commitments due via email 2025-03-05 at 11:45pm ET

March 13th - Project Proposal Presentations

Official Description

  • 6 minute presentation
  • Key topics:
    • Animating Question
    • Team Roster
  • Also discuss: Possible specific questions, data sources, analytical plan, anticipated challenges

Pre-Assignment #05

Multi-Table Analysis

Multiple Tables:

  • More insights than from a single table
  • Maintain ‘tidy’ structure throughout

Will create new (compound) rows - be ware of accidental over-duplication

Primary Keys

Keys are unique identifiers for individual records

  • Primary (one column) or compound (multiple columns together)

The history of corporate IT is largely one of (failed) primary keys

  • Finance: Tickers, Tickers + Exchange, Tickers + Share Class, CUSIP, ISIN, SEDOL, …

Meaningful true keys are vanishingly rare - cherish them when you find them

Often ‘unique enough’ for an analysis

dplyr::count is helpful here

Joins

Joins combine tables by identity - not simple ‘stacking’

Specify a join key - ideally this is an actual key, but doesn’t have to be

In dplyr, we use the join_by function:

join_by(table_1_name == table_2_name)

Here table_1_name and table_2_name are column names from two tables

Join rows where these values are equal (advanced joins possible)

Inner and Outer Joins

When tables are perfectly matched, not an issue:

library(dplyr)
x <- tribble(~college, ~campus_borough, 
             "CCNY", "Manhattan",
             "Baruch", "Manhattan", 
             "CSI", "Staten Island",
             "York", "Queens")

y <- tribble(~borough_name, ~bus_code,
             "Manhattan", "M",
             "Staten Island", "S", 
             "Queens", "Q")

inner_join(x, y, join_by(campus_borough == borough_name))
# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q       

Default to inner but irrelevant

Note automatic repetition of "M" row

Inner and Outer Joins

How to handle ‘unaligned’ values?

x <- tribble(~college, ~campus_borough, 
             "CCNY", "Manhattan",
             "Baruch", "Manhattan", 
             "CSI", "Staten Island",
             "York", "Queens", 
             "Medgar Evers", "Brooklyn")

inner_join(x, y, join_by(campus_borough == borough_name))
# A tibble: 4 × 3
  college campus_borough bus_code
  <chr>   <chr>          <chr>   
1 CCNY    Manhattan      M       
2 Baruch  Manhattan      M       
3 CSI     Staten Island  S       
4 York    Queens         Q       

MEC vanished!

Inner and Outer Joins

left_join(x, y, join_by(campus_borough == borough_name))
# A tibble: 5 × 3
  college      campus_borough bus_code
  <chr>        <chr>          <chr>   
1 CCNY         Manhattan      M       
2 Baruch       Manhattan      M       
3 CSI          Staten Island  S       
4 York         Queens         Q       
5 Medgar Evers Brooklyn       <NA>    

MEC stays, but no bus code - NA value

  • inner_join - Keep only matches
  • left_join - Keep all rows in left (first) table even w/o matches
  • right_join - Keep all rows in right (second) table even w/o matches
  • full_join - Keep all rows from both tables, even w/o matches

left_ and right_ are types of ‘outer’ joins

Pre-Assignment #05 FAQs

FAQ: Subqueries

[W]ill we be learning how to perform joins within a subquery?

You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.

SELECT first_name, last_name
FROM collectors
WHERE id IN (
    SELECT collector_id
    FROM sales
);
collector_ids <- sales |> pull(collector_id)
collectors |> filter(id %in% collector_ids) |> select(first_name, last_name)

FAQ: Data Integrity

[H]ow can we ensure that the information [resulting from a join] is accurate and not repeated?

  1. If you have a true unique ID, you’re usually safe
  2. Pay attention to all warnings
  3. Manually examine the result of any joins

FAQ: Performance on Large Data Sets

Will joining large data sets […] affect performance?

Somewhat - larger data sets are always slower.

Bigger danger is “bad joins” creating huge data automatically.

Note that R is less “smart” than SQL, so won’t optimize execution order for you automatically.

FAQ: What is the Role of pivot_wider?

Is [pivot_wider] just for formatting?

library(dplyr); library(tidyr); library(palmerpenguins)
penguins |> drop_na() |> 
    group_by(sex, species) |> 
    summarize(weight = mean(body_mass_g)) |>
    pivot_wider(id_cols=species, 
                names_from=sex,
                values_from=weight) |>
    mutate(gender_diff = male - female)
# A tibble: 3 × 4
  species   female  male gender_diff
  <fct>      <dbl> <dbl>       <dbl>
1 Adelie     3369. 4043.        675.
2 Chinstrap  3527. 3939.        412.
3 Gentoo     4680. 5485.        805.

FAQ: dplyr joins vs SQL joins

What is the difference between dplyr and SQL joins?

Not too much - biggest difference is no INDEX or FOREIGN KEY in R so less guarantees of data integrity.

FAQ: When to use anti_join?

Rare: looking for unmatched rows. - Useful to find data integrity issues or ‘implicit’ missingness. - I use an anti_join to find students who haven’t submitted an assignment.

FAQ: many-to-many Warning

Tricky to address, but fortunately pretty rare.

  • SQL explicitly forbids many-to-many
  • Usually a sign that a “key” isn’t really unique
    • Check for duplicates in x and y tables
    • Can occur with “fancy” joins (rolling, inequality)
  • Add additional join variables to break “duplication”

FAQ: How to Check Efficiency?

No automatic way. Some rules of thumb:

  • Don’t create large tables just to filter down
    • filter before join when possible
  • full_outer join is a bit dangerous
  • cross_join is rarely the right answer

FAQ: tidyr vs dplyr

Is tidyr more efficient than dplyr?

Nope - different packages from the same developers.

Designed to work together elegantly.

FAQ: Rare Joins

Please explain what cross_join, filter joins, and nest_join are?

  • cross_join: dangerous.
    • Creates “all pairs” of rows. Useful for ‘design’ problems
  • filter joins (anti_, semi_):
    • Hunting down quietly missing data.
    • Filtering to sub-samples
  • nest_join: beyond this course.
    • left_join with extra structure to output.

FAQ: How to Pick a Join

How do I decide which type of join is most approriate for a given analysis?

Topic of today’s work.

Today

Diving Deeper with dplyr - Joins and Pivots

Data Set: nycflights13

Exercise: Lab #05

Pivots

pivot_ changes the shape of a data set

  • Get ready for presentation
  • Prep for a join
  • Combine rows before looking at ‘cross-row’ structure

Pivots

Q: Which penguin species has the largest between-sex mass difference?

library(tidyr)
library(palmerpenguins)
avg_mass_tbl <- penguins |> drop_na() |> 
    group_by(sex, species) |> 
    summarize(avg_mass = mean(body_mass_g)) |> 
    ungroup()
avg_mass_tbl
# A tibble: 6 × 3
  sex    species   avg_mass
  <fct>  <fct>        <dbl>
1 female Adelie       3369.
2 female Chinstrap    3527.
3 female Gentoo       4680.
4 male   Adelie       4043.
5 male   Chinstrap    3939.
6 male   Gentoo       5485.

Pivots

We want data that is wider than our current data:

species male_avg female_avg
Adelie
Chinstrap
Gentoo

Pivots

pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass)
# A tibble: 3 × 3
  species   female  male
  <fct>      <dbl> <dbl>
1 Adelie     3369. 4043.
2 Chinstrap  3527. 3939.
3 Gentoo     4680. 5485.
pivot_wider(avg_mass_tbl, 
            id_cols = species, 
            names_from=sex, 
            values_from=avg_mass) |>
    mutate(sex_diff = male - female) |>
    slice_max(sex_diff)
# A tibble: 1 × 4
  species female  male sex_diff
  <fct>    <dbl> <dbl>    <dbl>
1 Gentoo   4680. 5485.     805.

Pivots

pivot_wider Aguments:

  • id_cols: kept as ‘keys’ for new table
  • names_from: existing column ‘spread’ to create new columns names
  • values_from: values in new table

pivot_longer:

  • ‘Inverse’ operation
  • Spread one row + multiple columns => one col + multiple rows

Wrap-Up

Upcoming Two Weeks

Next Wednesday (March 5th) at 11:45pm:

  • MP#01 Initial Submission due
  • Team membership due

No class on March 6th

Following Wednesday (March 12th) at 11:45pm:

  • MP#01 Peer Feedback Due
  • No Pre-Assignment

March 13th: Proposal Presentations

Life Tip of the Week

West Virginia State Board of Education v. Barnette (March 11, 1943)

Students cannot be compelled to recite the Pleldge of Allegiance, even during a period of war

Iconic First Amendment Victory

Barnette

Justice Jackson’s Opinion (6-3):

If there is any fixed star in our constitutional constellation, it is that no official, high or petty, can prescribe what shall be orthodox in politics, nationalism, religion, or other matters of opinion or force citizens to confess by word or act their faith therein.

Story Behind the Case

1940 Case Minersville School District v. Gobitis

  • JW Students in PA refused to recite Pledge of Allegience

Justice Frankfurter (8-1 Majority Opinion):

National Unity is the basis of National Security

Students could be forced to pledge


After the decision, waves of violence against JW students and adults accused of “treason” against the war effort

Story Behind the Case

Justice Stone (Dissent):

[T]he guarantees of civil liberty are but guarantees of freedom of the human mind and spirit and of reasonable freedom and opportunity to express them. [… T]he very essence of the liberty which they guarantee is the freedom of the individual from compulsion as to what he shall think and what he shall say.

A few years later, changed court wanted to revisit, leading to Barnette

More from J. Jackson

As governmental pressure toward unity becomes greater, so strife becomes more bitter as to whose unity it shall be.[…] Those who begin coercive elimination of dissent soon find themselves exterminating dissenters. Compulsory unification of opinion achieves only the unanimity of the graveyard.

Authority [in the United States] is to be controlled by public opinion, not public opinion by authority.

More from J. Jackson

[F]reedom to differ is not limited to things that do not matter much. That would be a mere shadow of freedom. The test of its substance is the right to differ as to things that touch the heart of the existing order.

Lessons

  • We get things wrong, often very wrong, in times of public fear
  • Law of Free Speech is necessary but not sufficient for a Culture of Free Speech
  • Freedom to Dissent is at the core of a pluralistic society
  • Rules and norms exist for the hard cases, not the easy ones

Baruch Connects - Civil Discourse Initiative