STA 9750
Week 4 Update
Tue 2025-09-16
Thu 2025-09-18

Michael Weylandt

Today

Hello from New Mexico!

I’m on hotel Wi-Fi, so if I drop, hold on a bit and I’ll rejoin.

Course Administration
PA#04 FAQs
Single-Table Verbs
Wrap-Up
- Life Tip of the Week

Course Administration

STA 9750 Mini-Project #00

Thank you to those of you who provided peer feedback!

A few of you still haven’t completed MP#00.

Too late for peer feedback, but you need to get this done in order to submit MP#01.

No late work accepted on graded MPs.

Mini-Project #00 - Pay Attention to the Details

At least one submission had title “YOUR TITLE GOES HERE”

Mini-Project #00 - Peer Feedback

Over 75% of the class reported receiving useful peer feedback.

Instructor’s Note: For graded MPs #01-04, be a bit more direct in peer feedback. Goal is to help your peers improve: constructive criticism.

When submitting peer feedback on graded MPs, use comment template.

If you didn’t get useful peer feedback on MP#00, please post a follow-up comment in your thread and I’ll take a look.

Mini-Project Helper Scripts

Remember course helper functions

mp_submission_create - Open an issue for your submission
mp_submission_verify - Check that your issue is formatted and page is available for review
mp_feedback_locate - Find issues on which you’re being asked to comment
mp_feedback_verify - Check that your peer feedback comments are formatted

STA 9750 Mini-Project #01

MP#01 released - TBD

Due 2025-10-03 at 11:59pm ET

GitHub post (used for peer feedback) AND Brightspace
Significant penalties for only submitting one

Pay attention to the rubric

Writing and presentation are about 50% of your grade
Evaluated on rigor and thoughtfulness, not necessarily correctness

MP #01

Happy to see folks already getting started!

A bit of debugging of network connection issues (possibly transient)
Treatment of OT for per annum and per diem employees
- Great questions on this (HZ😎) - Piazza pinned

Not everything has a single right answer - be reasonable, justify, and document

MP #01

How to deal with messy / incorrect data?

Process it intensely
Go ‘robust’

Course Project

Roster due 2025-09-30 at 11:59pm ET by email to me.

All teammates need to agree, so takes a bit of time.

Once you set a team, start thinking about a team name!

Upcoming Mini-Projects

MP#02 assigned next week:

TBD due 2025-10-24 at 11:59pm ET

With revised MP #01 deadline, MP #02 released before MP #01 due

Later:

MP#03 due 2025-11-07 at 11:59pm ET
MP#04 due 2025-11-21 at 11:59pm ET

Pre-Assignments

Brightspace - Wednesdays at 11:45pm

Reading, typically on course website
Brightspace auto-grades.
- I have to manually change to completion grading.

Pre-Assignment #04 FAQs

FAQ: `select(-)`

data |> select(colname) keeps colname, dropping everything else

data |> select(-colname) drops colname, keeping everything else

Dropping is mainly useful for

Presentation (removing unwanted columns)
Advanced:
- Operations across columns
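
A minimal sketch of both forms, assuming the dplyr and palmerpenguins packages are installed (palmerpenguins is assumed to be the source of the `penguins` data used in these slides). The `across()` call is just one illustration of "operations across columns"; the names `kept`, `dropped`, and `numeric_means` are made up for this example.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the `penguins` data set

# Keep only `species`; every other column is dropped
kept <- penguins |> select(species)

# Drop only `year`; every other column is kept
dropped <- penguins |> select(-year)

# Advanced: drop the non-measurement columns, then apply one function
# to every column that remains
numeric_means <- penguins |>
    drop_na() |>
    select(-species, -island, -sex, -year) |>
    summarize(across(everything(), mean))

names(kept)
names(dropped)
numeric_means
```

Dropping first and then using `everything()` is often shorter than listing each measurement column by name.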

FAQ: `filter` vs `group_by`

group_by is an adverb. On its own, it does nothing visible; it changes the behavior of the verbs that follow (summarize, mutate, filter).

penguins |> drop_na() |> print(n=2)

# A tibble: 333 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>

penguins |> drop_na() |> group_by(species) |> print(n=2)

# A tibble: 333 × 8
# Groups:   species [3]
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>
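
One way to confirm that only the grouping metadata changed is to inspect it directly. This is a small sketch assuming dplyr and palmerpenguins are installed; `ungrouped` and `grouped` are illustrative names.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

ungrouped <- penguins |> drop_na()
grouped   <- ungrouped |> group_by(species)

# Only the grouping metadata differs
is_grouped_df(ungrouped)   # FALSE
is_grouped_df(grouped)     # TRUE
group_vars(grouped)        # "species"
n_groups(grouped)          # 3

# Rows and columns are untouched: same shape, same values, same order
identical(dim(grouped), dim(ungrouped))
identical(grouped$body_mass_g, ungrouped$body_mass_g)
```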

FAQ: `filter` vs `group_by`

No group_by - full summarization:

penguins |> drop_na() |> summarize(mean(body_mass_g))

# A tibble: 1 × 1
  `mean(body_mass_g)`
                <dbl>
1               4207.

With group_by - summary within groups.

penguins |> drop_na() |> group_by(species) |> summarize(mean(body_mass_g))

# A tibble: 3 × 2
  species   `mean(body_mass_g)`
  <fct>                   <dbl>
1 Adelie                  3706.
2 Chinstrap               3733.
3 Gentoo                  5092.

FAQ: `filter` vs `group_by`

With multiple grouping - “cross-tabs” of results:

penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))

# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    `mean(body_mass_g)`
  <fct>     <fct>                <dbl>
1 Adelie    female               3369.
2 Adelie    male                 4043.
3 Chinstrap female               3527.
4 Chinstrap male                 3939.
5 Gentoo    female               4680.
6 Gentoo    male                 5485.

Note that the result of a multi-column group_by is still grouped (see the `# Groups: species [3]` line in the output):

penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))

# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    `mean(body_mass_g)`
  <fct>     <fct>                <dbl>
1 Adelie    female               3369.
2 Adelie    male                 4043.
3 Chinstrap female               3527.
4 Chinstrap male                 3939.
5 Gentoo    female               4680.
6 Gentoo    male                 5485.

FAQ: `filter` vs `group_by`

Changes next call to summarize:

penguins |> drop_na() |> group_by(species) |> 
    summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))

# A tibble: 1 × 1
  `mean(mbmg)`
         <dbl>
1        4177.

penguins |> drop_na() |> group_by(species, sex) |> 
    summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))

# A tibble: 3 × 2
  species   `mean(mbmg)`
  <fct>            <dbl>
1 Adelie           3706.
2 Chinstrap        3733.
3 Gentoo           5082.

penguins |> drop_na() |> group_by(sex, species) |> 
    summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))

# A tibble: 2 × 2
  sex    `mean(mbmg)`
  <fct>         <dbl>
1 female        3859.
2 male          4489.

FAQ: Order of `group_by`

No change to first “grouped” operations
Change in grouping structure of result
Last group “removed” by summarize
No impact on grouped operations performed by mutate or filter
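
The four points above can be checked with a short sketch. It assumes dplyr and palmerpenguins are installed; all object names (`clean`, `by_species_sex`, `m1`, and so on) are illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

clean <- penguins |> drop_na()

by_species_sex <- clean |> group_by(species, sex) |> summarize(mbmg = mean(body_mass_g))
by_sex_species <- clean |> group_by(sex, species) |> summarize(mbmg = mean(body_mass_g))

# Same six group means either way (row order may differ) ...
nrow(by_species_sex)
nrow(by_sex_species)

# ... but summarize() peeled off a different "last" grouping variable
group_vars(by_species_sex)   # "species"
group_vars(by_sex_species)   # "sex"

# mutate() keeps the full grouping and the original row order,
# so the order of the grouping variables does not change its results
m1 <- clean |> group_by(species, sex) |> mutate(dev = body_mass_g - mean(body_mass_g))
m2 <- clean |> group_by(sex, species) |> mutate(dev = body_mass_g - mean(body_mass_g))
all.equal(m1$dev, m2$dev)
```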

FAQ: `ungroup`

Remove all grouping structure
Defensive to keep group structure from “propagating” unwantedly

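sum_penguins <- penguins |> 
    drop_na() |>                          # Remove rows with missing values first
    group_by(sex, species) |> 
    summarize(mbmg = mean(body_mass_g))   # Result is still grouped by sex

# ... Lots of code ...

# Still grouped!! Returns the heaviest species within each sex, not overall
sum_penguins |> filter(mbmg == max(mbmg))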
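
Two defensive patterns, sketched under the assumption that dplyr and palmerpenguins are installed and that missing values are dropped first, as on the earlier slides. The names `sum_penguins_2` and `heaviest` are illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

# Option 1: drop all grouping at the point of summarization
sum_penguins <- penguins |>
    drop_na() |>
    group_by(sex, species) |>
    summarize(mbmg = mean(body_mass_g), .groups = "drop")

# Option 2: call ungroup() explicitly before saving the result
sum_penguins_2 <- penguins |>
    drop_na() |>
    group_by(sex, species) |>
    summarize(mbmg = mean(body_mass_g)) |>
    ungroup()

group_vars(sum_penguins)   # character(0): no grouping left

# With no grouping, max() is taken over all six rows
heaviest <- sum_penguins |> filter(mbmg == max(mbmg))
heaviest
```

Either option works; `.groups = "drop"` keeps the intent next to the `summarize()` call, while `ungroup()` also works after `mutate()` or `filter()`.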

FAQ: Named Arguments in `mutate` and `summarize`

mutate and summarize create new columns:

mutate creates “one-to-one”
summarize creates “one-per-group”

If you want to name them (so you can use them later), use named argument

penguins |> group_by(species) |> summarize(n())

# A tibble: 3 × 2
  species   `n()`
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

penguins |> group_by(species) |> summarize(n_species = n())

# A tibble: 3 × 2
  species   n_species
  <fct>         <int>
1 Adelie          152
2 Chinstrap        68
3 Gentoo          124
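
A short sketch contrasting the two verbs and showing why the name matters. It assumes dplyr and palmerpenguins are installed; `counts`, `with_counts`, and `common` are illustrative names.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

# summarize(): one row per group, so the result has 3 rows
counts <- penguins |> group_by(species) |> summarize(n_species = n())

# mutate(): one value per original row, so all 344 rows are kept
with_counts <- penguins |>
    group_by(species) |>
    mutate(n_species = n()) |>
    ungroup()

# Because the column has a name, later steps can refer to it
common <- counts |>
    filter(n_species > 100) |>
    arrange(desc(n_species))

nrow(counts)
nrow(with_counts)
common
```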

FAQ: Pipe Syntax

Pipe syntax (|>) is “syntactic sugar”

Just makes code easier to read:

penguins |> group_by(species) |> summarize(n_species = n())
# vs
summarize(group_by(penguins, species), n_species=n())

Exactly the same execution: improved UX

%>% (from the magrittr package) is an older way of doing essentially the same thing
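
To see that the pipe is only a rewriting of the call, one can compare results and inspect the parsed expression. This sketch assumes R 4.1 or later (for `|>`), plus dplyr and palmerpenguins; the object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

piped  <- penguins |> group_by(species) |> summarize(n_species = n())
nested <- summarize(group_by(penguins, species), n_species = n())
identical(piped, nested)

# The parser rewrites the pipe into a nested call before anything runs
quote(penguins |> group_by(species))   # group_by(penguins, species)

# The older magrittr pipe (re-exported by dplyr) gives the same result here
magrittr_piped <- penguins %>% group_by(species) %>% summarize(n_species = n())
identical(piped, magrittr_piped)
```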

FAQ: Assignment of Pipeline Results

When to start a pipeline with NAME <-? Creating a new variable:

Data you intend to reuse
Assignment operator ‘up front’ signals that the result is important
My rules of thumb for names:
- New names for “new complete thoughts” - whole summary in one pipeline
- Overwrite existing names for “like-for-like improvements” (USAGE <- USAGE |> code(...))
  - Recoding variable names, fixing typos, etc.
  - Use name repeatedly so downstream code picks up effects ‘for free’
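
A small sketch of these rules of thumb, assuming dplyr and palmerpenguins are installed. The names `penguins_clean` and `mass_by_species` and the renamed column `mass_g` are hypothetical, chosen only for illustration.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

# Data we intend to reuse gets a name
penguins_clean <- penguins |> drop_na()

# Like-for-like improvement: overwrite the existing name,
# so downstream code picks up the change for free
penguins_clean <- penguins_clean |> rename(mass_g = body_mass_g)

# New complete thought: a new name, whole summary in one pipeline
mass_by_species <- penguins_clean |>
    group_by(species) |>
    summarize(mean_mass_g = mean(mass_g))

mass_by_species
```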

FAQ: Comparison with SQL and Pandas (Python)

dplyr is heavily inspired by SQL (Structured Query Language, the standard query language for databases)

MW (2014): “Why bother? Can’t folks just use SQL”

pandas (in Python) inspired by R data.frame and SQL:

A bit older than dplyr (cousins?)
“New hotness” (polars) directly inspired by dplyr
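
The family resemblance is easiest to see side by side. Below is a sketch that annotates a dplyr pipeline with the roughly corresponding SQL clauses. It assumes dplyr and palmerpenguins are installed; the SQL in the comment is an approximate hand translation against a hypothetical database table also named `penguins`, not output from any tool.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

# Rough SQL equivalent (hypothetical table `penguins`):
#   SELECT species, AVG(body_mass_g) AS mbmg
#   FROM penguins
#   WHERE sex = 'female'
#   GROUP BY species
#   ORDER BY mbmg DESC;
female_mass <- penguins |>
    filter(sex == "female") |>                 # WHERE
    group_by(species) |>                       # GROUP BY
    summarize(mbmg = mean(body_mass_g)) |>     # SELECT ... AVG(...)
    arrange(desc(mbmg))                        # ORDER BY

female_mass
```

Note that the dplyr steps are written in execution order, while SQL puts SELECT first even though it is evaluated late.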

FAQ: Performance

dplyr is fast, but there are advanced options if you need more:

dbplyr: translates dplyr syntax to SQL and executes in DB
dtplyr: uses alternate data.table back-end (HFT)
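
A hedged sketch of both back-ends, assuming the dbplyr, DBI, RSQLite, dtplyr, data.table, and palmerpenguins packages are installed. The in-memory SQLite database stands in for a real database server, and the table name "penguins" and all object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

# --- dbplyr: same dplyr code, executed as SQL inside a database ---
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Convert factors to character before copying into the database
penguins_chr <- penguins |> mutate(across(where(is.factor), as.character))
db_penguins  <- copy_to(con, penguins_chr, name = "penguins")

db_query <- db_penguins |>
    group_by(species) |>
    summarize(mbmg = mean(body_mass_g, na.rm = TRUE))

show_query(db_query)             # Print the SQL that dbplyr generated
db_result <- collect(db_query)   # Run it in the database, bring results into R
DBI::dbDisconnect(con)

# --- dtplyr: same dplyr code, executed by data.table ---
dt_result <- dtplyr::lazy_dt(penguins) |>
    group_by(species) |>
    summarize(mbmg = mean(body_mass_g, na.rm = TRUE)) |>
    as_tibble()                  # Forces evaluation
```

In both cases nothing is computed until the result is requested (`collect()` or `as_tibble()`), which lets the back-end optimize the whole pipeline.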

Hard to have bad performance in single-table analysis

Danger of accidentally creating ‘extra’ data in multi-table context
Will discuss more next week

Tools for slow code:

Profiler: profvis
Benchmarking: bench
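
A minimal sketch of both tools, assuming the bench, profvis, dplyr, and palmerpenguins packages are installed. The two expressions being compared are arbitrary examples, not a recommendation of one over the other.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of `penguins`

clean <- penguins |> drop_na()

# Benchmark two ways of computing the same group means.
# check = FALSE because the results have different shapes
# (a tibble vs. a named array), so they cannot be compared directly.
timings <- bench::mark(
    dplyr = clean |> group_by(species) |> summarize(mbmg = mean(body_mass_g)),
    base  = tapply(clean$body_mass_g, clean$species, mean),
    check = FALSE
)
timings[, c("expression", "median", "mem_alloc")]

# Profiling opens an interactive viewer, so only run it interactively
if (interactive()) {
    profvis::profvis({
        clean |> group_by(species, sex) |> summarize(mbmg = mean(body_mass_g))
    })
}
```

On a data set this small, any timing differences are unlikely to matter in practice.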

Don’t worry about improving code performance until:

You’re sure it’s right
You’re sure it’s slow

Incorrect code is infinitely slow.

New Material - Single Table Verbs

Diving Deeper with `group_by`, `filter`, and `summarize`

Data Set: nycflights13
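
As a warm-up, here is a sketch that combines the three verbs on the `flights` table. It assumes dplyr and nycflights13 are installed; the questions asked and the object names are illustrative and are not taken from the lab.

```r
library(dplyr)
library(nycflights13)  # provides the `flights` table

# Average departure delay by origin airport,
# skipping flights with no recorded departure delay
delay_by_origin <- flights |>
    filter(!is.na(dep_delay)) |>
    group_by(origin) |>
    summarize(n_flights = n(), mean_dep_delay = mean(dep_delay)) |>
    arrange(desc(mean_dep_delay))

# Grouped filter: each carrier's most-delayed departure(s)
worst_by_carrier <- flights |>
    filter(!is.na(dep_delay)) |>
    group_by(carrier) |>
    filter(dep_delay == max(dep_delay)) |>
    ungroup()

delay_by_origin
worst_by_carrier |> select(carrier, origin, dest, dep_delay)
```

The first `filter()` runs on the whole table; the second runs within each carrier because it comes after `group_by()`.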

Exercise: Lab #04

Wrap-Up

Looking Ahead

Life Tip of the Week

ZSB / Baruch / CUNY Benefits

As a student, you have many free and discounted benefits.

I have collected some of these on the course page, but there are many more if you look.

CUNY-Wide

Free New York Times and Wall Street Journal
Free and Discounted Museum Access via CUNY Arts
Discounted Broadway and Off-Broadway via TDF

Baruch / ZSB

Free Barron’s Subscription
Newman Library Databases

Any Student

Free Trial and Discounted Rate Amazon Prime
Discounted Spotify + Free Hulu Subscription
GitHub Student Developer Pack
