STA 9750
Week 4 Update
2025-02-20

Michael Weylandt

Today

Hello from New Mexico!

I’m on hotel Wifi, so if I drop, hold on a bit and I’ll rejoin.

  • Course Administration
  • PA#04 FAQs
  • Single-Table Verbs
  • Wrap-Up
    • Life Tip of the Week

Course Administration

STA 9750 Mini-Project #00

Thank you to those of you who provided peer feedback!

A few of you still haven’t completed MP#00.

Too late for peer feedback, but you need to get this done in order to submit MP#01.

No late work accepted on graded MPs.

Mini-Project #00 - Pay Attention to the Details

At least one submission had the title “YOUR TITLE GOES HERE”.

Mini-Project #00 - Peer Feedback

Over 75% of the class reported receiving useful peer feedback.

Instructor’s Note: For graded MPs #01-04, be a bit more direct in peer feedback. The goal is to help your peers improve: constructive criticism.

When submitting peer feedback on graded MPs, use the comment template.

If you didn’t get useful peer feedback on MP#00, please post a follow-up comment in your thread and I’ll take a look.

Mini-Project Helper Scripts

Remember the course helper functions:

  • mp_submission_create - Open an issue for your submission
  • mp_submission_verify - Check that your issue is formatted and page is available for review
  • mp_feedback_locate - Find issues on which you’re being asked to comment
  • mp_feedback_verify - Check that your peer feedback comments are formatted

STA 9750 Mini-Project #01

MP#01 released - Welcome to the Commission to Analyze Taxpayer Spending (CATS)

Due 2025-03-05 at 11:45pm ET

  • GitHub post (used for peer feedback) AND Brightspace
  • Significant penalties for only submitting one

Pay attention to the rubric

  • Writing and presentation are about 50% of your grade
  • Evaluated on rigor and thoughtfulness, not necessarily correctness

MP #01

Happy to see folks already getting started!

  • A bit of debugging of network connection issues (possibly transient)
  • Treatment of OT for per annum and per diem employees
    • Great questions on this (HZ😎) - Piazza pinned

Not everything has a single right answer - be reasonable, justify, and document

MP #01

How to deal with messy / incorrect data?

  • Process it intensely
  • Go ‘robust’

Course Project

Roster due at 2025-03-05 at 11:45pm ET by email to me.

All teammates need to agree, so this takes a bit of time.

Once you set a team, start thinking about a team name!

Upcoming Mini-Projects

MP#02 assigned next week:

  • Identifying Environmentally Responsible US Public Transit Systems due at 2025-03-26 at 11:45pm ET

    With the revised MP #01 deadline, MP #02 will be released before MP #01 is due

Later:

  • MP#03 due at 2025-04-23 at 11:45pm ET
  • MP#04 due at 2025-05-07 at 11:45pm ET

Pre-Assignments

Brightspace - Wednesdays at 11:45pm

  • Reading, typically on course website
  • Brightspace auto-grades.
    • I have to manually change to completion grading.

Pre-Assignment #04 FAQs

FAQ: select(-)

data |> select(colname) keeps colname, dropping everything else

data |> select(-colname) drops colname, keeping everything else

Dropping is mainly useful for

  • Presentation (removing unwanted columns)
  • Advanced:
    • Operations across columns
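
A minimal sketch of both uses of dropping, assuming dplyr is installed. The scores table and its column names are hypothetical toy data, not part of the course materials.

```r
library(dplyr)

# Hypothetical toy data (not a course data set)
scores <- data.frame(
  id    = c(1, 2, 3),
  quiz1 = c(8, 9, 7),
  quiz2 = c(10, 6, 9),
  notes = c("late", "", "regrade")
)

kept    <- scores |> select(id)      # keeps id, drops everything else
dropped <- scores |> select(-notes)  # drops notes, keeps everything else

# "Advanced" use: drop the non-numeric columns, then apply one
# operation across every remaining column
quiz_means <- scores |>
  select(-id, -notes) |>
  summarize(across(everything(), mean))

print(names(kept))
print(names(dropped))
print(quiz_means)
```

Dropping first keeps the across(everything(), ...) call short: there is no need to list the columns you do want.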

FAQ: filter vs group_by

group_by is an adverb. On its own, it does not change the data; it changes how later verbs (summarize, mutate, filter) behave.

penguins |> drop_na() |> print(n=2)
# A tibble: 333 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |> drop_na() |> group_by(species) |> print(n=2)
# A tibble: 333 × 8
# Groups:   species [3]
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>

FAQ: filter vs group_by

No group_by - full summarization:

penguins |> drop_na() |> summarize(mean(body_mass_g))
# A tibble: 1 × 1
  `mean(body_mass_g)`
                <dbl>
1               4207.

With group_by - summary within groups.

penguins |> drop_na() |> group_by(species) |> summarize(mean(body_mass_g))
# A tibble: 3 × 2
  species   `mean(body_mass_g)`
  <fct>                   <dbl>
1 Adelie                  3706.
2 Chinstrap               3733.
3 Gentoo                  5092.

FAQ: filter vs group_by

With multiple grouping variables - “cross-tabs” of results:

penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    `mean(body_mass_g)`
  <fct>     <fct>                <dbl>
1 Adelie    female               3369.
2 Adelie    male                 4043.
3 Chinstrap female               3527.
4 Chinstrap male                 3939.
5 Gentoo    female               4680.
6 Gentoo    male                 5485.

Note that the result of a multi-column group_by is still grouped (see the `# Groups:` line in the output):

penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    `mean(body_mass_g)`
  <fct>     <fct>                <dbl>
1 Adelie    female               3369.
2 Adelie    male                 4043.
3 Chinstrap female               3527.
4 Chinstrap male                 3939.
5 Gentoo    female               4680.
6 Gentoo    male                 5485.

FAQ: filter vs group_by

Changes next call to summarize:

penguins |> drop_na() |> group_by(species) |> 
    summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 1 × 1
  `mean(mbmg)`
         <dbl>
1        4177.
penguins |> drop_na() |> group_by(species, sex) |> 
    summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 3 × 2
  species   `mean(mbmg)`
  <fct>            <dbl>
1 Adelie           3706.
2 Chinstrap        3733.
3 Gentoo           5082.
penguins |> drop_na() |> group_by(sex, species) |> 
    summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 2 × 2
  sex    `mean(mbmg)`
  <fct>         <dbl>
1 female        3859.
2 male          4489.

FAQ: Order of group_by

  • No change to first “grouped” operations
  • Change in grouping structure of result
  • Last group “removed” by summarize
  • No impact on grouped operations performed by mutate or filter
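
A small sketch of these four points, assuming dplyr is installed. The data frame and its columns (g1, g2, v) are hypothetical; .groups = "drop_last" is written out only to make explicit what summarize does here by default.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  g1 = c("a", "a", "a", "b", "b", "b"),
  g2 = c("x", "x", "y", "x", "y", "y"),
  v  = c(1, 3, 5, 2, 4, 8)
)

# Same group means either way; only the leftover grouping differs
by_g1_g2 <- df |> group_by(g1, g2) |> summarize(m = mean(v), .groups = "drop_last")
by_g2_g1 <- df |> group_by(g2, g1) |> summarize(m = mean(v), .groups = "drop_last")

print(group_vars(by_g1_g2))  # last group (g2) was removed
print(group_vars(by_g2_g1))  # last group (g1) was removed

# Grouped mutate: order of grouping variables makes no difference
m1 <- df |> group_by(g1, g2) |> mutate(centered = v - mean(v)) |> ungroup()
m2 <- df |> group_by(g2, g1) |> mutate(centered = v - mean(v)) |> ungroup()
print(all.equal(m1$centered, m2$centered))
```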

FAQ: ungroup

  • Remove all grouping structure
  • Defensive: keeps group structure from “propagating” unwantedly

sum_penguins <- penguins |> 
    group_by(sex, species) |> 
    summarize(mbmg = mean(body_mass_g))

# ... lots of code ...

# Still grouped by sex, so max(mbmg) is computed within each sex, not overall!
sum_penguins |> filter(mbmg == max(mbmg))
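
To see the difference ungroup makes, here is a sketch on hypothetical toy data (the sex, species, and mass values are made up), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  sex     = c("f", "f", "m", "m"),
  species = c("A", "B", "A", "B"),
  mass    = c(10, 20, 30, 40)
)

sum_df <- df |>
  group_by(sex, species) |>
  summarize(mbmg = mean(mass), .groups = "drop_last")  # still grouped by sex

# Grouped filter: the max is taken within each sex
still_grouped <- sum_df |> filter(mbmg == max(mbmg))

# Ungrouped filter: the max is taken over the whole table
overall <- sum_df |> ungroup() |> filter(mbmg == max(mbmg))

print(nrow(still_grouped))
print(nrow(overall))
```

An alternative is to pass .groups = "drop" to summarize so the result is never grouped in the first place.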

FAQ: Named Arguments in mutate and summarize

mutate and summarize create new columns:

  • mutate creates “one-to-one”
  • summarize creates “one-per-group”

If you want to name them (so you can use them later), use a named argument:

penguins |> group_by(species) |> summarize(n())
# A tibble: 3 × 2
  species   `n()`
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

vs

penguins |> group_by(species) |> summarize(n_species = n())
# A tibble: 3 × 2
  species   n_species
  <fct>         <int>
1 Adelie          152
2 Chinstrap        68
3 Gentoo          124
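
The “one-to-one” vs “one-per-group” distinction can be sketched the same way. The toy data below is hypothetical, and dplyr is assumed to be installed:

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  species = c("A", "A", "B"),
  mass    = c(10, 20, 30)
)

# mutate: one output value per input row (3 rows in, 3 rows out)
with_mutate <- df |>
  group_by(species) |>
  mutate(mean_mass = mean(mass)) |>
  ungroup()

# summarize: one output row per group (3 rows in, 2 rows out)
with_summarize <- df |>
  group_by(species) |>
  summarize(mean_mass = mean(mass))

# Because the column has a name, later steps can refer to it
heavy <- with_summarize |> filter(mean_mass > 20)

print(with_mutate)
print(with_summarize)
print(heavy)
```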

FAQ: Pipe Syntax

Pipe syntax (|>) is “syntactic sugar”

Just makes code easier to read:

penguins |> group_by(species) |> summarize(n_species = n())
# vs
summarize(group_by(penguins, species), n_species=n())

Exactly the same execution: improved UX

%>% (from the magrittr package) is an older way of doing essentially the same thing
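
One way to convince yourself that the three spellings agree is to compare their results directly. This is a sketch on a hypothetical three-row data frame, assuming dplyr is installed and R is version 4.1 or later (needed for |>):

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(species = c("A", "A", "B"))

piped  <- df |> group_by(species) |> summarize(n_species = n())
nested <- summarize(group_by(df, species), n_species = n())
old    <- df %>% group_by(species) %>% summarize(n_species = n())

# Compare the results as plain data frames
print(all.equal(as.data.frame(piped), as.data.frame(nested)))
print(all.equal(as.data.frame(piped), as.data.frame(old)))
```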

FAQ: Assignment of Pipeline Results

When to start a pipeline with NAME <-? Creating a new variable:

  • Data you intend to reuse
  • Assignment operator ‘up front’ signals that the result is important
  • My rules of thumb for names:
    • New names for “new complete thoughts” - whole summary in one pipeline
    • Overwrite existing names for “like-for-like improvements” (USAGE <- USAGE |> code(...))
      • Recoding variable names, fixing typos, etc.
      • Use name repeatedly so downstream code picks up effects ‘for free’
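
A sketch of the two rules of thumb, assuming dplyr is installed. The contents of USAGE and the name TRIPS_BY_AGENCY are hypothetical stand-ins chosen for illustration:

```r
library(dplyr)

# Hypothetical stand-in for a data set called USAGE
USAGE <- data.frame(
  Agency = c("mta", "MTA", "njt"),
  trips  = c(100, 50, 30)
)

# Like-for-like improvement: overwrite the existing name,
# so all downstream code picks up the fix 'for free'
USAGE <- USAGE |> mutate(Agency = toupper(Agency))

# New complete thought: new name, whole summary in one pipeline
TRIPS_BY_AGENCY <- USAGE |>
  group_by(Agency) |>
  summarize(total_trips = sum(trips))

print(TRIPS_BY_AGENCY)
```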

FAQ: Comparison with SQL and Pandas (Python)

dplyr is heavily inspired by SQL (Structured Query Language, the standard query language for databases)

  • MW (2014): “Why bother? Can’t folks just use SQL?”

pandas (in Python) inspired by R data.frame and SQL:

  • A bit older than dplyr (cousins?)
  • “New hotness” (polars) directly inspired by dplyr

FAQ: Performance

dplyr is fast, but there are advanced options if you need more:

  • dbplyr: translates dplyr syntax to SQL and executes in DB
  • dtplyr: uses alternate data.table back-end (HFT)

Hard to have bad performance in single-table analysis

  • Danger of accidentally creating ‘extra’ data in multi-table context
  • Will discuss more next week

Tools for slow code:

Don’t worry about improving code performance until:

  1. You’re sure it’s right
  2. You’re sure it’s slow

Incorrect code is infinitely slow.
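
The two-step check can be sketched in base R. Here slow_cumsum is a hypothetical hand-written loop used as a stand-in for "your code"; it is checked against the built-in cumsum first and timed second:

```r
# Base R only; no packages needed
set.seed(9750)
x <- runif(1e5)

# Hypothetical "your code": a hand-written running total
slow_cumsum <- function(v) {
  out <- numeric(length(v))
  total <- 0
  for (i in seq_along(v)) {
    total <- total + v[i]
    out[i] <- total
  }
  out
}

# 1. Be sure it is right: compare against a trusted reference
stopifnot(isTRUE(all.equal(slow_cumsum(x), cumsum(x))))

# 2. Be sure it is slow: measure before optimizing
print(system.time(slow_cumsum(x)))
print(system.time(cumsum(x)))
```

Timings vary by machine, so treat the numbers as a rough guide rather than a benchmark.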

New Material - Single Table Verbs

Diving Deeper with group_by, filter, and summarize

Data Set: nycflights13

Exercise: Lab #04

Wrap-Up

Looking Ahead

Life Tip of the Week

ZSB / Baruch / CUNY Benefits

As a student, you have many free and discounted benefits.

I have collected some of these on the course page, but there are many more if you look.

CUNY-Wide

  • Free New York Times and Wall Street Journal
  • Free and Discounted Museum Access via CUNY Arts
  • Discounted Broadway and Off-Broadway via TDF

Baruch / ZSB

  • Free Barron’s Subscription
  • Newman Library Databases

Any Student

  • Free Trial and Discounted Rate Amazon Prime
  • Discounted Spotify + Free Hulu Subscription
  • GitHub Student Developer Pack