# A tibble: 5 × 3
college campus_borough bus_code
<chr> <chr> <chr>
1 CCNY Manhattan M
2 Baruch Manhattan M
3 CSI Staten Island S
4 York Queens Q
5 Medgar Evers Brooklyn <NA>
MEC stays, but no bus code - NA value
inner_join - Keep only matches
left_join - Keep all rows in left (first) table even w/o matches
right_join - Keep all rows in right (second) table even w/o matches
full_join - Keep all rows from both tables, even w/o matches
left_ and right_ are types of ‘outer’ joins
Pivoting
The pivot_* functions change the shape of data
Values are not created or destroyed, just moved around
wider data sets are formed by forming multiple rows into columns
longer data sets are splitting columns from the same row into new rows
These functions come from the tidyr package - not dplyr
Pivoting
Untidy example from last week:
# A tibble: 12 × 4
Semester Course Number Type
<chr> <chr> <dbl> <chr>
1 Fall Accounting 200 Enrollment
2 Fall Accounting 250 Cap
3 Fall Law 100 Enrollment
4 Fall Law 125 Cap
5 Fall Statistics 200 Enrollment
6 Fall Statistics 200 Cap
7 Spring Accounting 300 Enrollment
8 Spring Accounting 350 Cap
9 Spring Law 50 Enrollment
10 Spring Law 100 Cap
11 Spring Statistics 400 Enrollment
12 Spring Statistics 400 Cap
Pivoting
This data was untidy because it split a single unit (course) across multiple rows
# A tibble: 6 × 4
Semester Course Enrollment Cap
<chr> <chr> <dbl> <dbl>
1 Fall Accounting 200 250
2 Fall Law 100 125
3 Fall Statistics 200 200
4 Spring Accounting 300 350
5 Spring Law 50 100
6 Spring Statistics 400 400
Pivots
pivot_ changes the shape of a data set. Purposes:
Get ready for presentation
Prep for a join
Combine rows before looking at ‘cross-row’ structure
Pivots
Which penguin species has the largest between-sex mass difference?
library(tidyr)avg_mass_tbl <- penguins |>drop_na() |>group_by(sex, species) |>summarize(avg_mass =mean(body_mass), .groups="drop")# .groups="drop" is equivalent to |> ungroup()avg_mass_tbl
# A tibble: 6 × 3
sex species avg_mass
<fct> <fct> <dbl>
1 female Adelie 3369.
2 female Chinstrap 3527.
3 female Gentoo 4680.
4 male Adelie 4043.
5 male Chinstrap 3939.
6 male Gentoo 5485.
# A tibble: 1 × 4
species female male sex_diff
<fct> <dbl> <dbl> <dbl>
1 Gentoo 4680. 5485. 805.
Pivots
pivot_wider Arguments:
id_cols: kept as ‘keys’ for new table
names_from: existing column ‘spread’ to create new columns names
values_from: values in new table
pivot_longer:
‘Inverse’ operation
Spread one row + multiple columns => one col + multiple rows
pivot_wider and pivot_longer have many additional arguments for dealing with repeats / missing values. The help page (+ experimenting) is your friend
Legos of Data Analysis
These functions are like Legos:
Simple individually
Combine for complex structures
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
[1] 336776
Not quite what we wanted…
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
How many unique combinations of carrier + flight (e.g., United 101)?
[1] 5725
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
💡 Did airlines re-use flight numbers for different destinations?
# A tibble: 5,725 × 3
carrier flight n
<chr> <int> <int>
1 9E 2900 1
2 9E 2901 1
3 9E 2902 1
4 9E 2903 2
5 9E 2904 2
6 9E 2905 1
7 9E 2906 1
8 9E 2907 1
9 9E 2908 1
10 9E 2909 2
# ℹ 5,715 more rows
Seems so!
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Find examples of re-use:
# A tibble: 1 × 3
carrier flight n
<chr> <int> <int>
1 UA 1162 16
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Finding most re-used flight number:
BOS CLE DEN DFW IAH JAC LAS MIA MSY ORD SAN SAT SEA SFO SNA TPA
13 1 19 5 8 2 1 2 8 55 4 1 18 2 27 4
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Finding most re-used flight number:
# A tibble: 1 × 3
carrier flight n
<chr> <int> <int>
1 UA 1162 16
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Seeing where our most reused number went:
# A tibble: 16 × 4
flight carrier dest n
<int> <chr> <chr> <int>
1 1162 UA BOS 13
2 1162 UA CLE 1
3 1162 UA DEN 19
4 1162 UA DFW 5
5 1162 UA IAH 8
6 1162 UA JAC 2
7 1162 UA LAS 1
8 1162 UA MIA 2
9 1162 UA MSY 8
10 1162 UA ORD 55
11 1162 UA SAN 4
12 1162 UA SAT 1
13 1162 UA SEA 18
14 1162 UA SFO 2
15 1162 UA SNA 27
16 1162 UA TPA 4
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Additional join to get airport information + formatting:
# A tibble: 16 × 4
`Destination Airport` Number of Times Flow…¹ `Carrier Code` `Flight Number`
<chr> <int> <chr> <int>
1 Chicago Ohare Intl 55 UA 1162
2 John Wayne Arpt Orange… 27 UA 1162
3 Denver Intl 19 UA 1162
4 Seattle Tacoma Intl 18 UA 1162
5 General Edward Lawrenc… 13 UA 1162
6 George Bush Interconti… 8 UA 1162
7 Louis Armstrong New Or… 8 UA 1162
8 Dallas Fort Worth Intl 5 UA 1162
9 San Diego Intl 4 UA 1162
10 Tampa Intl 4 UA 1162
11 Jackson Hole Airport 2 UA 1162
12 Miami Intl 2 UA 1162
13 San Francisco Intl 2 UA 1162
14 Cleveland Hopkins Intl 1 UA 1162
15 Mc Carran Intl 1 UA 1162
16 San Antonio Intl 1 UA 1162
# ℹ abbreviated name: ¹`Number of Times Flown`
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Extra join to match to airlines as well:
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
Also has a column named name - need to disambiguate!
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Additional join to get airport information + formatting:
# A tibble: 16 × 4
`Destination Airport` Number of Times Flow…¹ `Flight Number` Carrier
<chr> <int> <int> <chr>
1 Chicago Ohare Intl 55 1162 United…
2 John Wayne Arpt Orange Co 27 1162 United…
3 Denver Intl 19 1162 United…
4 Seattle Tacoma Intl 18 1162 United…
5 General Edward Lawrence Logan… 13 1162 United…
6 George Bush Intercontinental 8 1162 United…
7 Louis Armstrong New Orleans I… 8 1162 United…
8 Dallas Fort Worth Intl 5 1162 United…
9 San Diego Intl 4 1162 United…
10 Tampa Intl 4 1162 United…
11 Jackson Hole Airport 2 1162 United…
12 Miami Intl 2 1162 United…
13 San Francisco Intl 2 1162 United…
14 Cleveland Hopkins Intl 1 1162 United…
15 Mc Carran Intl 1 1162 United…
16 San Antonio Intl 1 1162 United…
# ℹ abbreviated name: ¹`Number of Times Flown`
Legos of Data Analysis
Question: What does this do I can’t do in Excel?
Technically, nothing. All programming languages of sufficient complexity are equally powerful (Turing equivalence).
In actuality, quite a lot:
filter allows more complex filtering than clicking on values
group_by + summarize extend array formulas
*_join provides more complex matching than VLOOKUP
pivot_* provide general formulation of pivot tables
everything else you can do in R.
Ability to script minimizes “hard-coding” of names and values.
But truthfully
Let's not kid ourselves: the most widely used piece of software for statistics
is Excel.
-- Brian D. Ripley ('Statistical Methods Need Software: A View of
Statistical Computing')
Opening lecture RSS 2002, Plymouth (September 2002)
Some people familiar with R describe it as a supercharged version of
Microsoft's Excel spreadsheet software.
-- Ashlee Vance (in his article "Data Analysts Captivated by R's Power")
The New York Times (January 2009)
[W]ill we be learning how to perform joins within a subquery?
You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.
SELECT first_name, last_name
FROM collectors
WHERE id IN (
SELECT collector_id
FROM sales
);