Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 4

Today

Hello from New Mexico!

I’m on hotel Wifi, so if I drop, hold on a bit and I’ll rejoin.

  • Course Administration
  • PA#04 FAQs
  • Single-Table Verbs
  • Wrap-Up
    • Life Tip of the Week

Course Administration

STA 9750 Mini-Project #00

Thank you to those of you who provided peer feedback!

A few of you still haven’t completed MP#00.

Too late for peer feedback, but you need to get this done in order to submit MP#01.

No late work accepted on graded MPs.

Mini-Project #00 - Pay Attention to the Details

At least one submission had title “YOUR TITLE GOES HERE”

Mini-Project #00 - Peer Feedback

Over 75% of the class reported receiving useful peer feedback.

Instructor’s Note: For graded MPs #01-04, be a bit more direct in peer feedback. Goal is to help your peers improve: constructive criticism.

When submitting peer feedback on graded MPs, use comment template.

If you didn’t get useful peer feedback on MP#00, please post a follow-up comment in your thread and I’ll take a look.

Mini-Project Helper Scripts

Remember course helper functions

  • mp_submission_create - Open an issue for your submission
  • mp_submission_verify - Check that your issue is formatted and page is available for review
  • mp_feedback_locate - Find issues on which you’re being asked to comment
  • mp_feedback_verify - Check that your peer feedback comments are formatted

STA 9750 Mini-Project #01

MP#01 released - Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix

Due 2025-10-03 at 11:59pm ET

  • GitHub post (used for peer feedback) AND Brightspace
  • Significant penalties for only submitting one

Pay attention to the rubric

  • Writing and presentation are about 50% of your grade
  • Evaluated on rigor and thoughtfulness, not necessarily correctness

MP #01

Happy to see folks already getting started!

  • A bit of debugging of network connection issues (possibly transient)
  • Treatment of OT for per Annum and per diem employees
    • Great questions on this (HZ😎) - Piazza pinned

Not everything has a single right answer - be reasonable, justify, and document

MP #01

How to deal with messy / incorrect data?

  • Process it intensely
  • Go ‘robust’

Course Project

Roster due 2025-09-30 at 11:59pm ET by email to me.

All teammates need to agree, so takes a bit of time.

Once you set a team, start thinking about a team name!

Upcoming Mini-Projects

MP#02 assigned next week:

  • TBD due 2025-10-24 at 11:59pm ET

    With revised MP #01 deadline, MP #02 released before MP #01 due

Later:

  • MP#03 due 2025-11-07 at 11:59pm ET
  • MP#04 due 2025-11-21 at 11:59pm ET

Pre-Assignments

Brightspace - Wednesdays at 11:45pm

  • Reading, typically on course website
  • Brightspace auto-grades.
    • I have to manually change to completion grading.

Pre-Assignment #04 FAQs

FAQ: select(-)

data |> select(colname) keeps colname, dropping everything else

data |> select(-colname) drops colname, keeping everything else

Dropping is mainly useful for

  • Presentation (removing unwanted columns)
  • Advanced:
    • Operations across columns
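
A minimal sketch of both uses of dropping. The scores table and its column names are made up for illustration; only select() with a minus sign comes from the slide, and rowSums(across(...)) is one common way (not the only one) to operate across the remaining columns.

```r
library(dplyr)

# Hypothetical toy data (not from the course materials)
scores <- tibble(
  student = c("A", "B", "C"),
  quiz1   = c(8, 6, 10),
  quiz2   = c(7, 9, 10),
  notes   = c("late", "", "")
)

# Presentation: drop a column we do not want to show
for_display <- scores |> select(-notes)

# Advanced: sum across every column except the identifier
totals <- scores |>
  select(-notes) |>
  mutate(total = rowSums(across(-student)))

for_display
totals
```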

FAQ: filter vs group_by

group_by is an adverb. On its own, it does nothing visible; it changes the behavior of the verbs that follow (summarize, mutate, filter, ...).

penguins |> drop_na() |> head(2)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
penguins |> drop_na() |> group_by(species) |> head(2)
# A tibble: 2 × 8
# Groups:   species [1]
  species island    bill_len bill_dep flipper_len body_mass sex     year
  <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007

FAQ: filter vs group_by

No group_by - full summarization:

penguins |> drop_na() |> summarize(mean(body_mass))
  mean(body_mass)
1        4207.057

With group_by - summary within groups.

penguins |> drop_na() |> group_by(species) |> summarize(mean(body_mass))
# A tibble: 3 × 2
  species   `mean(body_mass)`
  <fct>                 <dbl>
1 Adelie                3706.
2 Chinstrap             3733.
3 Gentoo                5092.

FAQ: filter vs group_by

With multiple grouping - “cross-tabs” of results:

penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    `mean(body_mass)`
  <fct>     <fct>              <dbl>
1 Adelie    female             3369.
2 Adelie    male               4043.
3 Chinstrap female             3527.
4 Chinstrap male               3939.
5 Gentoo    female             4680.
6 Gentoo    male               5485.

Note that the result of a multi-column group_by is still grouped (see the “# Groups: species [3]” line):

penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    `mean(body_mass)`
  <fct>     <fct>              <dbl>
1 Adelie    female             3369.
2 Adelie    male               4043.
3 Chinstrap female             3527.
4 Chinstrap male               3939.
5 Gentoo    female             4680.
6 Gentoo    male               5485.

FAQ: filter vs group_by

Changes next call to summarize:

penguins |> drop_na() |> group_by(species) |> 
    summarize(mbmg = mean(body_mass)) |> summarize(mean(mbmg))
# A tibble: 1 × 1
  `mean(mbmg)`
         <dbl>
1        4177.
penguins |> drop_na() |> group_by(species, sex) |> 
    summarize(mbmg = mean(body_mass)) |> summarize(mean(mbmg))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 3 × 2
  species   `mean(mbmg)`
  <fct>            <dbl>
1 Adelie           3706.
2 Chinstrap        3733.
3 Gentoo           5082.
penguins |> drop_na() |> group_by(sex, species) |> 
    summarize(mbmg = mean(body_mass)) |> summarize(mean(mbmg))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 2 × 2
  sex    `mean(mbmg)`
  <fct>         <dbl>
1 female        3859.
2 male          4489.

FAQ: Order of group_by

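  • Order does not change the first grouped operation (same groups, same values)
  • Order does change the grouping structure of the result
  • summarize “removes” the last grouping variable, so the result stays grouped by the earlier ones
  • No impact on grouped operations performed by mutate or filter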
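
A small sketch of these points, assuming a made-up birds table in place of penguins so that it runs on any R version. group_vars() reports which grouping variables are still attached; .groups = "drop_last" just spells out the default behavior and silences the message.

```r
library(dplyr)

# Hypothetical stand-in for penguins
birds <- tibble(
  species = c("a", "a", "a", "b", "b", "b"),
  sex     = c("f", "m", "m", "f", "f", "m"),
  mass    = c(10, 20, 30, 40, 50, 60)
)

by_species_sex <- birds |>
  group_by(species, sex) |>
  summarize(m = mean(mass), .groups = "drop_last")

by_sex_species <- birds |>
  group_by(sex, species) |>
  summarize(m = mean(mass), .groups = "drop_last")

# Same group means either way; only the leftover grouping differs
group_vars(by_species_sex)
group_vars(by_sex_species)

# Grouped mutate gives the same values regardless of order
dev_1 <- birds |> group_by(species, sex) |> mutate(dev = mass - mean(mass)) |> ungroup()
dev_2 <- birds |> group_by(sex, species) |> mutate(dev = mass - mean(mass)) |> ungroup()
identical(dev_1$dev, dev_2$dev)
```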

FAQ: ungroup

  • Remove all grouping structure
  • Defensive: keeps group structure from “propagating” where you don’t want it

sum_penguins <- penguins |> 
    drop_na() |> 
    group_by(sex, species) |> 
    summarize(mbmg = mean(body_mass))

# ... lots of code ...

sum_penguins |> filter(mbmg == max(mbmg)) # Still grouped by sex!!
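
To see what the leftover grouping does, here is a sketch on a made-up birds table (a stand-in for penguins, so the numbers are not real). The same filter returns one row per sex while the data is still grouped, and a single row after ungroup().

```r
library(dplyr)

# Hypothetical stand-in for penguins
birds <- tibble(
  species = c("a", "a", "a", "b", "b", "b"),
  sex     = c("f", "m", "m", "f", "f", "m"),
  mass    = c(10, 20, 30, 40, 50, 60)
)

sum_birds <- birds |>
  group_by(sex, species) |>
  summarize(mbmg = mean(mass), .groups = "drop_last")  # still grouped by sex

# Grouped: the heaviest species within each sex
per_sex_max <- sum_birds |> filter(mbmg == max(mbmg))

# Ungrouped: the single heaviest sex/species combination
overall_max <- sum_birds |> ungroup() |> filter(mbmg == max(mbmg))

per_sex_max
overall_max
```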

FAQ: Named Arguments in mutate and summarize

mutate and summarize create new columns:

  • mutate creates “one-to-one”
  • summarize creates “one-per-group”

If you want to name them (so you can use them later), use named argument

penguins |> group_by(species) |> summarize(n())
# A tibble: 3 × 2
  species   `n()`
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

vs

penguins |> group_by(species) |> summarize(n_species = n())
# A tibble: 3 × 2
  species   n_species
  <fct>         <int>
1 Adelie          152
2 Chinstrap        68
3 Gentoo          124
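
Why naming matters: later steps can only refer to a column by its name. A sketch with a made-up birds table (the column names share, n_birds, and mean_mass are illustrative choices):

```r
library(dplyr)

# Hypothetical stand-in for penguins
birds <- tibble(
  species = c("a", "a", "a", "b", "b", "b"),
  mass    = c(10, 20, 30, 40, 50, 60)
)

# mutate: one new value per row ("one-to-one")
with_share <- birds |>
  group_by(species) |>
  mutate(share = mass / sum(mass)) |>
  ungroup()

# summarize: one row per group; the names are reused by filter and arrange
sizes <- birds |>
  group_by(species) |>
  summarize(n_birds = n(), mean_mass = mean(mass)) |>
  filter(n_birds >= 3) |>
  arrange(desc(mean_mass))

with_share
sizes
```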

FAQ: Pipe Syntax

Pipe syntax (|>) is “syntactic sugar”

Just makes code easier to read:

penguins |> group_by(species) |> summarize(n_species = n())
# vs
summarize(group_by(penguins, species), n_species=n())

Exactly the same execution: improved UX

%>% is an older way of doing essentially the same thing
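
A quick way to convince yourself, sketched on a made-up birds table: build the same result three ways and compare. Both comparisons should come back TRUE.

```r
library(dplyr)

# Hypothetical stand-in for penguins
birds <- tibble(
  species = c("a", "a", "a", "b", "b", "b"),
  mass    = c(10, 20, 30, 40, 50, 60)
)

piped    <- birds |> group_by(species) |> summarize(n_species = n())
nested   <- summarize(group_by(birds, species), n_species = n())
old_pipe <- birds %>% group_by(species) %>% summarize(n_species = n())

identical(piped, nested)
identical(piped, old_pipe)
```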

FAQ: Assignment of Pipeline Results

When to start a pipeline with NAME <-? Creating a new variable:

  • Data you intend to reuse
  • Assignment operator ‘up front’ indicates important
  • My rules of thumb for names:
    • New names for “new complete thoughts” - whole summary in one pipeline
    • Overwrite existing names for “like-for-like improvements” (USAGE <- USAGE |> code(...))
      • Recoding variable names, fixing typos, etc.
      • Use name repeatedly so downstream code picks up effects ‘for free’
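
A sketch of the two rules of thumb. The USAGE name comes from the slide; its columns, its values, and the RIDERS_BY_AGENCY name are made up for illustration.

```r
library(dplyr)

# Hypothetical data with some messy labels
USAGE <- tibble(
  Agency = c("mta", "MTA ", "NJT"),
  riders = c(100, 50, 30)
)

# Like-for-like improvement: overwrite the existing name,
# so all downstream code sees the cleaned labels "for free"
USAGE <- USAGE |> mutate(Agency = toupper(trimws(Agency)))

# New complete thought: give the summary its own name
RIDERS_BY_AGENCY <- USAGE |>
  group_by(Agency) |>
  summarize(total_riders = sum(riders))

RIDERS_BY_AGENCY
```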

FAQ: Comparison with SQL and Pandas (Python)

dplyr is heavily inspired by SQL (Structured Query Language, the standard query language for databases)

  • MW (2014): “Why bother? Can’t folks just use SQL?”

pandas (in Python) inspired by R data.frame and SQL:

  • A bit older than dplyr (cousins?)
  • “New hotness” (polars) directly inspired by dplyr

FAQ: Performance

dplyr is fast, but there are advanced options if you need more:

  • dbplyr: translates dplyr syntax to SQL and executes in DB
  • dtplyr: uses alternate data.table back-end (HFT)
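
A minimal sketch of the dtplyr route, assuming the dtplyr and data.table packages are installed. The flights_small table is made up; lazy_dt() wraps the data, the usual dplyr verbs are translated to data.table code, and as_tibble() triggers the actual computation.

```r
library(dplyr)
library(dtplyr)

# Hypothetical toy data
flights_small <- tibble(
  carrier   = c("AA", "AA", "UA"),
  dep_delay = c(10, 20, 5)
)

delays <- lazy_dt(flights_small) |>
  group_by(carrier) |>
  summarize(mean_delay = mean(dep_delay)) |>
  as_tibble()  # nothing is computed until this step

delays
```

The dplyr code itself is unchanged; only the first and last steps differ.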

Hard to have bad performance in single-table analysis

  • Danger of accidentally creating ‘extra’ data in multi-table context
  • Will discuss more next week

Tools for slow code:

Don’t worry about improving code performance until:

  1. You’re sure it’s right
  2. You’re sure it’s slow

Incorrect code is infinitely slow.
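
One way to follow that order in practice, sketched on simulated data (the table, seed, and sizes are arbitrary choices): first check the pipeline against an independent base-R calculation, then measure it with system.time() before deciding whether anything needs speeding up.

```r
library(dplyr)

set.seed(9750)
big <- tibble(
  g = sample(letters, 1e5, replace = TRUE),
  x = rnorm(1e5)
)

# 1. Is it right? Compare against an independent calculation
group_means <- big |> group_by(g) |> summarize(m = mean(x))
check <- tapply(big$x, big$g, mean)
stopifnot(isTRUE(all.equal(group_means$m, as.numeric(check[group_means$g]))))

# 2. Is it slow? Measure before optimizing
timing <- system.time(
  big |> group_by(g) |> summarize(m = mean(x))
)
timing[["elapsed"]]  # seconds; varies by machine
```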

New Material - Single Table Verbs

Diving Deeper with group_by, filter, and summarize

Data Set: nycflights13

Exercise: Lab #04

Wrap-Up

Looking Ahead

Life Tip of the Week

ZSB / Baruch / CUNY Benefits

As a student, you have many free and discounted benefits.

I have collected some of these on the course page, but there are many more if you look around.

Places love to give discounts to students - use them!

CUNY-Wide

  • Free New York Times and Wall Street Journal
  • Free and Discounted Museum Access via CUNY Arts
  • Discounted Broadway and Off-Broadway via TDF

Baruch / ZSB

  • Free Barron’s Subscription
  • Newman Library Databases

Any Student

  • Free Trial and Discounted Rate Amazon Prime
  • Discounted Spotify + Streaming Subscriptions
  • GitHub Student Developer Pack