# A tibble: 5 × 3
college campus_borough bus_code
<chr> <chr> <chr>
1 CCNY Manhattan M
2 Baruch Manhattan M
3 CSI Staten Island S
4 York Queens Q
5 Medgar Evers Brooklyn <NA>
MEC stays, but no bus code - NA value
inner_join - Keep only matches
left_join - Keep all rows in left (first) table even w/o matches
right_join - Keep all rows in right (second) table even w/o matches
full_join - Keep all rows from both tables, even w/o matches
left_ and right_ are types of ‘outer’ joins
Pivoting
The pivot_* functions change the shape of data
Values are not created or destroyed, just moved around
wider data sets are formed by forming multiple rows into columns
longer data sets are splitting columns from the same row into new rows
These functions come from the tidyr package - not dplyr
library(tidyr) # included in library(tidyverse)
Pivoting
Untidy example from last week:
# A tibble: 12 × 4
Semester Course Number Type
<chr> <chr> <dbl> <chr>
1 Fall Accounting 200 Enrollment
2 Fall Accounting 250 Cap
3 Fall Law 100 Enrollment
4 Fall Law 125 Cap
5 Fall Statistics 200 Enrollment
6 Fall Statistics 200 Cap
7 Spring Accounting 300 Enrollment
8 Spring Accounting 350 Cap
9 Spring Law 50 Enrollment
10 Spring Law 100 Cap
11 Spring Statistics 400 Enrollment
12 Spring Statistics 400 Cap
Pivoting
This data was untidy because it split a single unit (course) across multiple rows
# A tibble: 6 × 4
Semester Course Enrollment Cap
<chr> <chr> <dbl> <dbl>
1 Fall Accounting 200 250
2 Fall Law 100 125
3 Fall Statistics 200 200
4 Spring Accounting 300 350
5 Spring Law 50 100
6 Spring Statistics 400 400
Pivots
pivot_ changes the shape of a data set. Purposes:
Get ready for presentation
Prep for a join
Combine rows before looking at ‘cross-row’ structure
Pivots
Which penguin species has the largest between-sex mass difference?
library(tidyr)avg_mass_tbl <- penguins |>drop_na() |>group_by(sex, species) |>summarize(avg_mass =mean(body_mass), .groups="drop")# .groups="drop" is equivalent to |> ungroup()avg_mass_tbl
# A tibble: 6 × 3
sex species avg_mass
<fct> <fct> <dbl>
1 female Adelie 3369.
2 female Chinstrap 3527.
3 female Gentoo 4680.
4 male Adelie 4043.
5 male Chinstrap 3939.
6 male Gentoo 5485.
How many unique combinations of carrier + flight (e.g., United 101)?
flights |>select(carrier, flight) |>n_distinct()
[1] 5725
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
💡 Did airlines re-use flight numbers for different destinations?
flights |>distinct(carrier, flight, dest) |># Find reuse of number across different destinations# Shorthand for group_by + summarize(n = n())count(carrier, flight)
# A tibble: 5,725 × 3
carrier flight n
<chr> <int> <int>
1 9E 2900 1
2 9E 2901 1
3 9E 2902 1
4 9E 2903 2
5 9E 2904 2
6 9E 2905 1
7 9E 2906 1
8 9E 2907 1
9 9E 2908 1
10 9E 2909 2
# ℹ 5,715 more rows
# A tibble: 16 × 4
flight carrier dest n
<int> <chr> <chr> <int>
1 1162 UA BOS 13
2 1162 UA CLE 1
3 1162 UA DEN 19
4 1162 UA DFW 5
5 1162 UA IAH 8
6 1162 UA JAC 2
7 1162 UA LAS 1
8 1162 UA MIA 2
9 1162 UA MSY 8
10 1162 UA ORD 55
11 1162 UA SAN 4
12 1162 UA SAT 1
13 1162 UA SEA 18
14 1162 UA SFO 2
15 1162 UA SNA 27
16 1162 UA TPA 4
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Additional join to get airport information + formatting:
flights |>distinct(carrier, flight, dest) |>count(carrier, flight) |>slice_max(n) |>inner_join(flights, join_by(carrier == carrier, flight == flight)) |>count(flight, carrier, dest) |>inner_join(airports, join_by(dest == faa)) |>select(name, n, carrier, flight) |>arrange(desc(n)) |>rename(`Destination Airport`= name, `Number of Times Flown`= n, `Carrier Code`= carrier, `Flight Number`= flight)
# A tibble: 16 × 4
`Destination Airport` Number of Times Flow…¹ `Carrier Code` `Flight Number`
<chr> <int> <chr> <int>
1 Chicago Ohare Intl 55 UA 1162
2 John Wayne Arpt Orange… 27 UA 1162
3 Denver Intl 19 UA 1162
4 Seattle Tacoma Intl 18 UA 1162
5 General Edward Lawrenc… 13 UA 1162
6 George Bush Interconti… 8 UA 1162
7 Louis Armstrong New Or… 8 UA 1162
8 Dallas Fort Worth Intl 5 UA 1162
9 San Diego Intl 4 UA 1162
10 Tampa Intl 4 UA 1162
11 Jackson Hole Airport 2 UA 1162
12 Miami Intl 2 UA 1162
13 San Francisco Intl 2 UA 1162
14 Cleveland Hopkins Intl 1 UA 1162
15 Mc Carran Intl 1 UA 1162
16 San Antonio Intl 1 UA 1162
# ℹ abbreviated name: ¹`Number of Times Flown`
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Extra join to match to airlines as well:
head(airlines, 3)
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
Also has a column named name - need to disambiguate!
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Additional join to get airport information + formatting:
# A tibble: 16 × 4
`Destination Airport` Number of Times Flow…¹ `Flight Number` Carrier
<chr> <int> <int> <chr>
1 Chicago Ohare Intl 55 1162 United…
2 John Wayne Arpt Orange Co 27 1162 United…
3 Denver Intl 19 1162 United…
4 Seattle Tacoma Intl 18 1162 United…
5 General Edward Lawrence Logan… 13 1162 United…
6 George Bush Intercontinental 8 1162 United…
7 Louis Armstrong New Orleans I… 8 1162 United…
8 Dallas Fort Worth Intl 5 1162 United…
9 San Diego Intl 4 1162 United…
10 Tampa Intl 4 1162 United…
11 Jackson Hole Airport 2 1162 United…
12 Miami Intl 2 1162 United…
13 San Francisco Intl 2 1162 United…
14 Cleveland Hopkins Intl 1 1162 United…
15 Mc Carran Intl 1 1162 United…
16 San Antonio Intl 1 1162 United…
# ℹ abbreviated name: ¹`Number of Times Flown`
Legos of Data Analysis
Question: What does this do I can’t do in Excel?
Technically, nothing. All programming languages of sufficient complexity are equally powerful (Turing equivalence).
In actuality, quite a lot:
filter allows more complex filtering than clicking on values
group_by + summarize extend array formulas
*_join provides more complex matching than VLOOKUP
pivot_* provide general formulation of pivot tables
everything else you can do in R.
Ability to script minimizes “hard-coding” of names and values.
But truthfully
fortunes::fortune(59)
Let's not kid ourselves: the most widely used piece of software for statistics
is Excel.
-- Brian D. Ripley ('Statistical Methods Need Software: A View of
Statistical Computing')
Opening lecture RSS 2002, Plymouth (September 2002)
fortunes::fortune(222)
Some people familiar with R describe it as a supercharged version of
Microsoft's Excel spreadsheet software.
-- Ashlee Vance (in his article "Data Analysts Captivated by R's Power")
The New York Times (January 2009)
[W]ill we be learning how to perform joins within a subquery?
You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.
SELECT first_name, last_name
FROM collectors
WHERE id IN (
SELECT collector_id
FROM sales
);