FAQ: select(-)
data |> select(colname) keeps colname, dropping everything else
data |> select(-colname) drops colname, keeping everything else
Dropping is mainly useful for:
- Presentation (removing unwanted columns)
- Advanced: operations across columns
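A minimal sketch of keeping vs dropping, assuming dplyr is installed. The toy table and its column names (id, height, notes) are made up for illustration and are not part of the course data.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative only
toy <- tibble(id = 1:3, height = c(1.5, 1.7, 1.6), notes = c("a", "b", "c"))

kept    <- toy |> select(id)      # keeps id, drops everything else
dropped <- toy |> select(-notes)  # drops notes, keeps everything else

# Operations across columns: drop the non-numeric helpers, then average the rest
col_means <- toy |>
  select(-id, -notes) |>
  summarize(across(everything(), mean))
```

Dropping first means across(everything(), ...) only sees the columns you actually want to operate on.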
FAQ: filter vs group_by
group_by is an adverb. On its own, it does nothing; it changes the behavior of the functions called after it.
penguins |> drop_na() |> print(n=2)
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |> drop_na() |> group_by(species) |> print(n=2)
# A tibble: 333 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 331 more rows
# ℹ 2 more variables: sex <fct>, year <int>
No group_by - full summarization:
penguins |> drop_na() |> summarize(mean(body_mass_g))
# A tibble: 1 × 1
`mean(body_mass_g)`
<dbl>
1 4207.
With group_by - summary within groups:
penguins |> drop_na() |> group_by(species) |> summarize(mean(body_mass_g))
# A tibble: 3 × 2
species `mean(body_mass_g)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5092.
With multiple grouping variables - “cross-tabs” of results:
penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))
# A tibble: 6 × 3
# Groups: species [3]
species sex `mean(body_mass_g)`
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
Note that the result of a multi-variable group_by is still grouped:
penguins |> drop_na() |> group_by(species, sex) |> summarize(mean(body_mass_g))
# A tibble: 6 × 3
# Groups: species [3]
species sex `mean(body_mass_g)`
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
This changes the next call to summarize:
penguins |> drop_na() |> group_by(species) |>
summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 1 × 1
`mean(mbmg)`
<dbl>
1 4177.
penguins |> drop_na() |> group_by(species, sex) |>
summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 3 × 2
species `mean(mbmg)`
<fct> <dbl>
1 Adelie 3706.
2 Chinstrap 3733.
3 Gentoo 5082.
penguins |> drop_na() |> group_by(sex, species) |>
summarize(mbmg = mean(body_mass_g)) |> summarize(mean(mbmg))
# A tibble: 2 × 2
sex `mean(mbmg)`
<fct> <dbl>
1 female 3859.
2 male 4489.
FAQ: Order of group_by
- No change to first “grouped” operations
- Change in grouping structure of result
- Last group “removed” by summarize
- No impact on grouped operations performed by mutate or filter
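A small sketch of the points above, assuming dplyr 1.0 or later (for the .groups argument). The toy table is hypothetical. .groups = "drop_last" is spelled out only to make the default behavior explicit and silence the message.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  sex     = c("f", "f", "m", "m"),
  species = c("A", "B", "A", "B"),
  mass    = c(10, 20, 30, 40)
)

by_species_sex <- toy |>
  group_by(species, sex) |>
  summarize(m = mean(mass), .groups = "drop_last")

by_sex_species <- toy |>
  group_by(sex, species) |>
  summarize(m = mean(mass), .groups = "drop_last")

# Same per-group means either way; only the leftover grouping differs
group_vars(by_species_sex)  # last group (sex) removed, species remains
group_vars(by_sex_species)  # last group (species) removed, sex remains
```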
FAQ: ungroup
- Remove all grouping structure
- Defensive: keeps group structure from “propagating” where it is not wanted
sum_penguins <- penguins |>
group_by(sex, species) |>
summarize(mbmg = mean(body_mass_g))
# ... lots of code ...
sum_penguins |> filter(mbmg == max(mbmg)) # Still grouped!!
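A sketch of what the leftover grouping does to filter, and how ungroup changes it. This assumes dplyr is installed and uses a made-up toy table rather than penguins.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  sex     = c("f", "f", "m", "m"),
  species = c("A", "B", "A", "B"),
  mass    = c(10, 20, 30, 40)
)

sums <- toy |>
  group_by(sex, species) |>
  summarize(m = mean(mass), .groups = "drop_last")  # still grouped by sex

# Still grouped: max(m) is computed within each sex, so one row per sex survives
still_grouped <- sums |> filter(m == max(m))

# Ungrouped: max(m) is computed over the whole table, so one row survives
overall <- sums |> ungroup() |> filter(m == max(m))
```

Ending the summarizing pipeline with ungroup() (or passing .groups = "drop") avoids the surprise entirely.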
FAQ: Named Arguments in mutate and summarize
mutate and summarize create new columns:
- mutate creates “one-to-one”
- summarize creates “one-per-group”
If you want to name them (so you can use them later), use a named argument:
penguins |> group_by(species) |> summarize(n())
# A tibble: 3 × 2
species `n()`
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
vs
penguins |> group_by(species) |> summarize(n_species = n())
# A tibble: 3 × 2
species n_species
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
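A sketch of why the name matters: a named column can be used directly in the next step. The toy table and the names mass_kg and n_species are illustrative assumptions; dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data; mass is in grams
toy <- tibble(species = c("A", "A", "A", "B"), mass = c(1000, 2000, 3000, 4000))

# mutate: one new value per row, reusable under its name
with_kg <- toy |> mutate(mass_kg = mass / 1000)

# summarize: one new value per group, then filter on it by name
common <- toy |>
  group_by(species) |>
  summarize(n_species = n()) |>
  filter(n_species > 1)
```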
FAQ: Pipe Syntax
Pipe syntax (|>) is “syntactic sugar”.
It just makes code easier to read:
penguins |> group_by(species) |> summarize(n_species = n())
# vs
summarize(group_by(penguins, species), n_species=n())
Exactly the same execution: improved UX
FAQ: Assignment of Pipeline Results
When to start a pipeline with NAME <- ? When creating a new variable:
- Data you intend to reuse
- Assignment operator ‘up front’ indicates importance
- My rules of thumb for names:
  - New names for “new complete thoughts” - whole summary in one pipeline
  - Overwrite existing names for “like-for-like improvements” (USAGE <- USAGE |> code(...))
    - Recoding variable names, fixing typos, etc.
    - Use name repeatedly so downstream code picks up effects ‘for free’
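A sketch of the two naming rules, assuming dplyr is installed. The usage table, its columns, and the misspelled column name are all hypothetical.

```r
library(dplyr)

# Hypothetical raw data with a badly named column
usage <- tibble(Stn = c("a", "b"), rides = c(10, 20))

# Like-for-like improvement: overwrite the same name
usage <- usage |> rename(station = Stn)

# New complete thought: give the summary its own name
usage_summary <- usage |> summarize(total_rides = sum(rides))
```

Because the cleaned table keeps the name usage, any later code that reads usage picks up the fix without changes.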
FAQ: Comparison with SQL and Pandas (Python)
dplyr is heavily inspired by SQL (Structured Query Language, the standard query language for databases)
- MW (2014): “Why bother? Can’t folks just use SQL?”
pandas (in Python) is inspired by the R data.frame and SQL:
- A bit older than dplyr (cousins?)
- “New hotness” (polars) directly inspired by dplyr
Tools for slow code:
Don’t worry about improving code performance until:
- You’re sure it’s right
- You’re sure it’s slow
Incorrect code is infinitely slow.
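One way to follow that order in practice, sketched with base R only: check the answer first, then measure. slow_mean is a deliberately naive, made-up function, and system.time is just one simple timing tool, not necessarily the one intended above.

```r
# Hypothetical hand-rolled mean, written with an explicit loop
slow_mean <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total / length(x)
}

# Step 1: are you sure it's right? Compare against a trusted reference.
x <- c(1, 2, 3, 4)
stopifnot(isTRUE(all.equal(slow_mean(x), mean(x))))

# Step 2: are you sure it's slow? Measure before optimizing.
timing <- system.time(slow_mean(runif(1e5)))
elapsed_seconds <- timing[["elapsed"]]
```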