STA 9750 Mini-Project #00
Thank you to those of you who provided peer feedback!
Mini-Project #00 - Pay Attention to the Details
At least one submission had title “YOUR TITLE GOES HERE”
Mini-Project #00 - Peer Feedback
Over 75% of the class reported receiving useful peer feedback.
Instructor’s Note: For graded MPs #01-04, be a bit more direct in peer feedback. Goal is to help your peers improve: constructive criticism.
If you didn’t get useful peer feedback on MP#00, please post a follow-up comment in your thread and I’ll take a look.
Mini-Project Helper Scripts
Remember course helper functions
mp_submission_create
- Open an issue for your submission
mp_submission_verify
- Check that your issue is formatted and page is available for review
mp_feedback_locate
- Find issues on which you’re being asked to comment
mp_feedback_verify
- Check that your peer feedback comments are formatted
STA 9750 Mini-Project #01
MP#01 released - Welcome to the Commission to Analyze Taxpayer Spending (CATS)
Due 2025-03-05 at 11:45pm ET
- GitHub post (used for peer feedback) AND Brightspace
- Significant penalties for only submitting one
Pay attention to the rubric
- Writing and presentation are about 50% of your grade
- Evaluated on rigor and thoughtfulness, not necessarily correctness
MP #01
Happy to see folks already getting started!
- A bit of debugging of network connection issues (possibly transient)
- Treatment of overtime (OT) for per annum and per diem employees
- Great questions on this (HZ😎) - Piazza pinned
Not everything has a single right answer - be reasonable, justify, and document
MP #01
How to deal with messy / incorrect data?
- Process it intensely
- Go ‘robust’
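One reading of "go robust" is to prefer summaries that a few bad records cannot dominate. The sketch below is only an illustration: the salary values are invented (they are not from the MP #01 data), and the 10x-median cutoff is an arbitrary assumption that you would need to justify and document.

```r
# Hypothetical salaries; the last value looks like a data-entry error
salaries <- c(52000, 55000, 58000, 61000, 6100000)

# The mean is dragged far from the typical value by one bad record
mean(salaries)

# "Go robust": the median does not care how extreme the bad record is
median(salaries)

# "Process it intensely": flag suspicious records explicitly, then summarize
# (the 10x-median cutoff is an assumption made for this sketch)
suspicious <- salaries > 10 * median(salaries)
mean(salaries[!suspicious])
```

Either route can be reasonable; what matters is stating which one you chose and why.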
Course Project
Roster due 2025-03-05 at 11:45pm ET, by email to me.
All teammates need to agree, so this takes a bit of time.
Once you set a team, start thinking about a team name!
Upcoming Mini-Projects
MP#02 assigned next week:
Identifying Environmentally Responsible US Public Transit Systems due at 2025-03-26 at 11:45pm ET
With the revised MP #01 deadline, MP #02 will be released before MP #01 is due
Later:
- MP#03 due at 2025-04-23 at 11:45pm ET
- MP#04 due at 2025-05-07 at 11:45pm ET
Pre-Assignments
Brightspace - Wednesdays at 11:45pm
- Reading, typically on course website
- Brightspace auto-grades.
- I have to manually change to completion grading.
FAQ: select(-)
data |> select(colname) keeps colname, dropping everything else
data |> select(-colname) drops colname, keeping everything else
Dropping is mainly useful for
- Presentation (removing unwanted columns)
- Advanced: operations across columns
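A small sketch of both uses of dropping, assuming a made-up `scores` table (the table and its column names are hypothetical and only serve the illustration):

```r
library(dplyr)

# Hypothetical toy data; names are illustrative only
scores <- tibble(
  student = c("A", "B", "C"),
  quiz_1  = c(8, 9, 7),
  quiz_2  = c(10, 6, 9)
)

# Keep only `student`, dropping everything else
only_student <- scores |> select(student)

# Drop `student`, keeping everything else (handy for presentation)
no_student <- scores |> select(-student)

# Operations across columns: average every column except `student`
quiz_means <- scores |> summarize(across(-student, mean))
quiz_means
```

The same `-colname` idea works inside `across()`, which is where dropping tends to pay off in more advanced code.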
FAQ: filter vs group_by
group_by is an adverb. On its own, it does nothing; it changes the behavior of later functionality.
penguins |> drop_na() |> print(n=2)
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |> drop_na() |> group_by(species) |> print(n=2)
# A tibble: 333 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>
FAQ: filter vs group_by
No group_by - full summarization:
penguins |> drop_na() |> summarize(mean(body_mass_g))
# A tibble: 1 × 1
`mean(body_mass_g)`
<dbl>
1 4207.
With group_by - summary within groups:
penguins |> drop_na() |> group_by(species) |> summarize(mean(body_mass_g))
# A tibble: 3 × 2
species `mean(body_mass_g)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5092.
FAQ: filter vs group_by
With multiple grouping - “cross-tabs” of results:
penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))
# A tibble: 6 × 3
# Groups: species [3]
species sex `mean(body_mass_g)`
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
Note that the result of a multi-group_by is still grouped: see the “# Groups: species [3]” line in the output above.
FAQ: filter vs group_by
The remaining grouping changes the next call to summarize:
penguins |> drop_na() |> group_by(species) |>
  summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 1 × 1
`mean(mbmg)`
<dbl>
1 4177.
penguins |> drop_na() |> group_by(species, sex) |>
  summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 3 × 2
species `mean(mbmg)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5082.
penguins |> drop_na() |> group_by(sex, species) |>
  summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 2 × 2
sex `mean(mbmg)`
<fct> <dbl>
1 female 3859.
2 male 4489.
FAQ: Order of group_by
- No change to first “grouped” operations
- Change in grouping structure of result
- Last group “removed” by summarize
- No impact on grouped operations performed by mutate or filter
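The points above can be checked on a small example. This is a sketch using a hypothetical `toy` table in place of penguins; `.groups = "drop_last"` is written out only to make summarize's default behavior explicit and to silence its informational message.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  sex     = c("f", "f", "m", "m", "m"),
  species = c("a", "b", "a", "b", "b"),
  mass    = c(1, 2, 3, 4, 6)
)

# First grouped operation: the same group means in either order
by_species_sex <- toy |>
  group_by(species, sex) |>
  summarize(m = mean(mass), .groups = "drop_last")
by_sex_species <- toy |>
  group_by(sex, species) |>
  summarize(m = mean(mass), .groups = "drop_last")

# Grouping structure of the result differs: the last variable was removed
group_vars(by_species_sex)
group_vars(by_sex_species)

# Grouped mutate gives the same values regardless of the order
share_1 <- toy |> group_by(species, sex) |> mutate(share = mass / sum(mass))
share_2 <- toy |> group_by(sex, species) |> mutate(share = mass / sum(mass))
all.equal(share_1$share, share_2$share)
```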
FAQ: ungroup
- Remove all grouping structure
- Defensive to keep group structure from “propagating” unwantedly
sum_penguins <- penguins |>
  group_by(sex, species) |>
  summarize(mbmg = mean(body_mass_g))
# ... lots of code ...
sum_penguins |> filter(mbmg == max(mbmg)) # Still grouped!!
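To see why the leftover grouping matters, here is a minimal sketch on a hypothetical `toy` table (invented values, not the penguins data). The same filter returns one row per remaining group unless the result is ungrouped first.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  sex     = c("f", "f", "m", "m"),
  species = c("a", "b", "a", "b"),
  mass    = c(1, 2, 3, 4)
)

sum_toy <- toy |>
  group_by(sex, species) |>
  summarize(mbmg = mean(mass), .groups = "drop_last")  # still grouped by sex

# Grouped filter: the maximum is taken within each sex
grouped_max <- sum_toy |> filter(mbmg == max(mbmg))

# Defensive ungroup: the maximum is taken over the whole table
overall_max <- sum_toy |> ungroup() |> filter(mbmg == max(mbmg))

nrow(grouped_max)
nrow(overall_max)
```

Ending a summarizing pipeline with `ungroup()` (or `.groups = "drop"`) is one way to avoid this surprise in code far downstream.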
FAQ: Named Arguments in mutate and summarize
mutate and summarize create new columns:
- mutate creates “one-to-one”
- summarize creates “one-per-group”
If you want to name them (so you can use them later), use a named argument:
penguins |> group_by(species) |> summarize(n())
# A tibble: 3 × 2
species `n()`
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
vs
penguins |> group_by(species) |> summarize(n_species = n())
# A tibble: 3 × 2
species n_species
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
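The "one-to-one" versus "one-per-group" distinction, and the payoff of naming the new column, can be sketched on a hypothetical `toy` table (values invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("a", "a", "b"),
  mass    = c(1, 3, 5)
)

# mutate: one output row per input row, group mean repeated within groups
with_mean <- toy |> group_by(species) |> mutate(mean_mass = mean(mass))

# summarize: one output row per group
per_group <- toy |> group_by(species) |> summarize(mean_mass = mean(mass))

nrow(with_mean)
nrow(per_group)

# Because the column has a name, later steps can refer to it directly
heavy <- per_group |> filter(mean_mass > 2)
heavy
```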
FAQ: Pipe Syntax
Pipe syntax (|>) is “syntactic sugar”
Just makes code easier to read:
penguins |> group_by(species) |> summarize(n_species = n())
# vs
summarize(group_by(penguins, species), n_species=n())
Exactly the same execution: improved UX
%>% is an older way of doing essentially the same thing
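A quick way to convince yourself the three spellings agree is to compare their results. This sketch assumes R 4.1 or later (for `|>`) and uses a hypothetical `toy` table; `%>%` is available here because dplyr re-exports it from magrittr.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("a", "a", "b"),
  mass    = c(1, 3, 5)
)

piped  <- toy |> group_by(species) |> summarize(n_species = n())
nested <- summarize(group_by(toy, species), n_species = n())
older  <- toy %>% group_by(species) %>% summarize(n_species = n())

all.equal(piped, nested)
all.equal(piped, older)

# The parser rewrites |> into an ordinary nested call before anything runs
quote(toy |> group_by(species))
```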
FAQ: Assignment of Pipeline Results
When to start a pipeline with NAME <- ? Creating a new variable:
- Data you intend to reuse
- Assignment operator ‘up front’ signals that the result is important
- My rules of thumb for names:
- New names for “new complete thoughts” - whole summary in one pipeline
- Overwrite existing names for “like-for-like improvements” (USAGE <- USAGE |> code(...))
- Recoding variable names, fixing typos, etc.
- Use name repeatedly so downstream code picks up effects ‘for free’
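These naming rules of thumb might look like the following in practice. The `USAGE` table here is a hypothetical stand-in with invented columns and values, and `AGENCY_TOTALS` is a name made up for this sketch.

```r
library(dplyr)

# Hypothetical stand-in for a USAGE table; columns and values are invented
USAGE <- tibble(
  Agency = c("X", "Y", "X"),
  `Unlinked Passenger Trips` = c(10, 20, 5)
)

# Like-for-like improvement: overwrite the existing name
USAGE <- USAGE |> rename(UPT = `Unlinked Passenger Trips`)

# New complete thought: a new name, with the whole summary in one pipeline
AGENCY_TOTALS <- USAGE |>
  group_by(Agency) |>
  summarize(total_UPT = sum(UPT))

AGENCY_TOTALS
```

Any later code that reads `USAGE` picks up the cleaner column name without further changes.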
FAQ: Comparison with SQL and Pandas (Python)
dplyr is heavily inspired by SQL (Structured Query Language, the standard query language for databases)
- MW (2014): “Why bother? Can’t folks just use SQL”
pandas (in Python) inspired by R data.frame and SQL:
- A bit older than dplyr (cousins?)
- “New hotness” (polars) directly inspired by dplyr
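To make the family resemblance concrete, the sketch below pairs a dplyr pipeline with a rough SQL equivalent written as comments. The SQL is not executed and is only an approximate correspondence; the `toy` table is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("a", "a", "b"),
  mass    = c(1, 3, 5)
)

# Rough SQL counterpart (shown for comparison only, not run):
#   SELECT species, AVG(mass) AS mean_mass
#   FROM toy
#   WHERE mass > 0
#   GROUP BY species
#   ORDER BY mean_mass DESC;
sql_like <- toy |>
  filter(mass > 0) |>                    # WHERE
  group_by(species) |>                   # GROUP BY
  summarize(mean_mass = mean(mass)) |>   # SELECT ... AVG(...)
  arrange(desc(mean_mass))               # ORDER BY

sql_like
```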
Tools for slow code:
Don’t worry about improving code performance until:
- You’re sure it’s right
- You’re sure it’s slow
Incorrect code is infinitely slow.
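The two checks above can be sketched with base R alone: confirm correctness against a trusted reference first, then measure before optimizing. `slow_cumsum` is a hypothetical, deliberately naive function written for this illustration.

```r
# Hypothetical, deliberately naive running total
slow_cumsum <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- sum(x[1:i])  # re-adds the whole prefix at every step
  }
  out
}

x <- as.numeric(1:2000)

# Step 1: be sure it's right (compare against the built-in cumsum)
stopifnot(isTRUE(all.equal(slow_cumsum(x), cumsum(x))))

# Step 2: be sure it's slow (measure, don't guess)
system.time(slow_cumsum(x))
system.time(cumsum(x))
```

Only if the measured time actually matters for your analysis is it worth rewriting the code.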