Hello from New Mexico!
I’m on hotel Wi-Fi, so if I drop, hold on a bit and I’ll rejoin.
Thank you to those of you who provided peer feedback!
A few of you still haven’t completed MP#00.
Too late for peer feedback, but you need to get this done in order to submit MP#01.
At least one submission had title “YOUR TITLE GOES HERE”
Over 75% of the class reported receiving useful peer feedback.
Instructor’s Note: For graded MPs #01-04, be a bit more direct in peer feedback. Goal is to help your peers improve: constructive criticism.
When submitting peer feedback on graded MPs, use the comment template.
If you didn’t get useful peer feedback on MP#00, please post a follow-up comment in your thread and I’ll take a look.
Remember course helper functions
mp_submission_create
- Open an issue for your submission
mp_submission_verify
- Check that your issue is formatted and page is available for review
mp_feedback_locate
- Find issues on which you’re being asked to comment
mp_feedback_verify
- Check that your peer feedback comments are formatted
MP#01 released - Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix
Due 2025-10-03 at 11:59pm ET
Pay attention to the rubric
Happy to see folks already getting started!
Not everything has a single right answer - be reasonable, justify, and document
How to deal with messy / incorrect data?
Roster due 2025-09-30 at 11:59pm ET by email to me.
All teammates need to agree, so takes a bit of time.
Once you set a team, start thinking about a team name!
MP#02 assigned next week:
TBD due 2025-10-24 at 11:59pm ET
With revised MP #01 deadline, MP #02 released before MP #01 due
Later:
Brightspace - Wednesdays at 11:45pm
select(-)
data |> select(colname) keeps colname, dropping everything else
data |> select(-colname) drops colname, keeping everything else
Dropping is mainly useful for
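A minimal sketch of the two select() forms, assuming dplyr is installed. The small table is rebuilt by hand from the two penguin rows shown in these slides, using only a subset of the columns; the names kept and dropped are illustrative.

```r
library(dplyr)

# Two rows copied from the penguins printout, with a subset of columns
two_rows <- tibble(
  species   = c("Adelie", "Adelie"),
  island    = c("Torgersen", "Torgersen"),
  body_mass = c(3750L, 3800L)
)

kept    <- two_rows |> select(species)   # keeps species, drops the rest
dropped <- two_rows |> select(-island)   # drops island, keeps the rest

names(kept)
names(dropped)
```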
filter vs group_by
group_by is an adverb. On its own, it does nothing; it changes the behavior of later functionality.
species island bill_len bill_dep flipper_len body_mass sex year
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
# A tibble: 2 × 8
# Groups: species [1]
species island bill_len bill_dep flipper_len body_mass sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
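The two printouts above differ only in the "# Groups" line. A small sketch of that point, assuming dplyr is installed; the table is rebuilt by hand from the two rows shown, with a subset of columns.

```r
library(dplyr)

# Two rows copied from the printout above, with a subset of columns
two_rows <- tibble(
  species   = c("Adelie", "Adelie"),
  bill_len  = c(39.1, 39.5),
  body_mass = c(3750L, 3800L),
  sex       = c("male", "female")
)

grouped <- two_rows |> group_by(species)

# Same shape and same columns: group_by() did not touch the data
dim(two_rows)
dim(grouped)

# Only the grouping metadata changed
is_grouped_df(two_rows)
is_grouped_df(grouped)
group_vars(grouped)
```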
filter vs group_by
No group_by - full summarization:
With group_by - summary within groups.
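A sketch of the contrast, assuming dplyr is installed. The table `toy` is a hypothetical stand-in for the penguins data; its values are made up for illustration and are not the course data.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the real penguins)
toy <- tibble(
  species   = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex       = c("female", "male", "male", "female", "female", "male"),
  body_mass = c(3400, 4000, 4000, 4600, 4700, 5400)
)

# No group_by: one row summarizing the whole table
overall <- toy |> summarize(mean_mass = mean(body_mass))

# With group_by: one row per species
by_species <- toy |>
  group_by(species) |>
  summarize(mean_mass = mean(body_mass))

overall
by_species
```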
filter vs group_by
With multiple grouping - “cross-tabs” of results:
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups: species [3]
species sex `mean(body_mass)`
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
Note that the result of a multi-variable group_by is still grouped:
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups: species [3]
species sex `mean(body_mass)`
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
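The "# Groups: species [3]" line can also be checked programmatically. A sketch, assuming dplyr is installed and using a hypothetical toy table with made-up values: by default summarize() drops only the last grouping variable, and the `.groups` argument named in the message overrides that.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the real penguins)
toy <- tibble(
  species   = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex       = c("female", "male", "male", "female", "female", "male"),
  body_mass = c(3400, 4000, 4000, 4600, 4700, 5400)
)

# Default behavior: emits the "grouped output by 'species'" message
by_both <- toy |>
  group_by(species, sex) |>
  summarize(mean_mass = mean(body_mass))

group_vars(by_both)   # sex was dropped, species remains

# Explicitly drop all grouping (also silences the message)
flat <- toy |>
  group_by(species, sex) |>
  summarize(mean_mass = mean(body_mass), .groups = "drop")

group_vars(flat)      # nothing left
```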
filter vs group_by
Changes the next call to summarize:
penguins |> drop_na() |> group_by(species) |>
summarize(mbmg = mean(body_mass)) |> summarize(mean(mbmg))
# A tibble: 1 × 1
`mean(mbmg)`
<dbl>
1 4177.
penguins |> drop_na() |> group_by(species, sex) |>
summarize(mbmg = mean(body_mass)) |> summarize(mean(mbmg))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 3 × 2
species `mean(mbmg)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5082.
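The difference between the two results above comes from the grouping left over after the first summarize. A sketch on a hypothetical toy table (made-up values), assuming dplyr is installed: the second summarize works within whatever groups remain, and ungroup() restores a whole-table summary.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the real penguins)
toy <- tibble(
  species   = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex       = c("female", "male", "male", "female", "female", "male"),
  body_mass = c(3400, 4000, 4000, 4600, 4700, 5400)
)

# After this step the result is still grouped by species
by_both <- toy |>
  group_by(species, sex) |>
  summarize(mbmg = mean(body_mass), .groups = "drop_last")

# Second summarize runs within species: one row per species
per_species <- by_both |> summarize(mean_mbmg = mean(mbmg))

# After ungroup(), the same call collapses everything to one row
overall <- by_both |> ungroup() |> summarize(mean_mbmg = mean(mbmg))

per_species
overall
```

Note that a mean of group means is generally not the same as the mean over all rows when group sizes differ.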
group_by
summarize
mutate
or filter
ungroup
mutate and summarize
mutate and summarize create new columns:
mutate creates “one-to-one”
summarize creates “one-per-group”
If you want to name them (so you can use them later), use a named argument:
# A tibble: 3 × 2
species `n()`
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
vs
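A sketch of the unnamed and named forms side by side, assuming dplyr is installed. The toy table and the column name `n_obs` are hypothetical and chosen for illustration. It also shows the row-count difference between summarize (one row per group) and mutate (one row per input row).

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the real penguins)
toy <- tibble(
  species   = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass = c(3400, 4000, 4000, 4600, 5400)
)

# Unnamed: the column is called `n()` and needs backticks to use later
unnamed <- toy |> group_by(species) |> summarize(n())

# Named: the column is called n_obs and is easy to reuse
named <- toy |> group_by(species) |> summarize(n_obs = n())
big_groups <- named |> filter(n_obs >= 3)

# mutate with the same expression keeps every input row
per_row <- toy |> group_by(species) |> mutate(n_obs = n()) |> ungroup()

names(unnamed)
names(named)
nrow(named)
nrow(per_row)
```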
Pipe syntax (|>) is “syntactic sugar”
Just makes code easier to read:
Exactly the same execution: improved UX
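To make the "syntactic sugar" claim concrete, here is a sketch comparing a nested call with its piped equivalent. It assumes dplyr is installed and R 4.1 or later (for the native pipe); the toy table uses made-up values.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the real penguins)
toy <- tibble(
  species   = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass = c(3400, 4000, 4600, 5400)
)

# Nested form: read from the inside out
nested <- summarize(group_by(toy, species), mean_mass = mean(body_mass))

# Piped form: read top to bottom
piped <- toy |>
  group_by(species) |>
  summarize(mean_mass = mean(body_mass))

# Same values either way
all.equal(nested$mean_mass, piped$mean_mass)

# The parser rewrites the pipe into an ordinary call before anything runs
identical(quote(toy |> nrow()), quote(nrow(toy)))
```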
%>% is an older way of doing essentially the same thing
When to start a pipeline with NAME <-? Creating a new variable:
USAGE <- USAGE |> code(...)
dplyr is heavily inspired by SQL (Structured Query Language, the standard query language for databases)
pandas (in Python) is inspired by R data.frame and SQL:
dplyr (cousins?)
polars directly inspired by dplyr
dplyr is fast, but advanced options:
dbplyr: translates dplyr syntax to SQL and executes in DB
dtplyr: uses alternate data.table back-end (HFT)
Hard to have bad performance in single-table analysis
Tools for slow code:
Don’t worry about improving code performance until:
Incorrect code is infinitely slow.
group_by, filter, and summarize
Data Set: nycflights13
Exercise: Lab #04
As a student, you have many free and discounted benefits.
I have collected some of these on the course page, but there are many more if you look around.
Places love to give discounts to students - use them!