Make sure to complete all instructor-provided EDA questions and writing prompts
Be reasonable, justify, and document your analysis.
Do your answers pass the sniff test?
At least one sentence of text per short question is a good baseline
MP#01 - Peer Feedback
Assigned on GitHub - due on 2026-03-22
\(\approx 4\) reviews each
Take this seriously: 20% of this assignment is “meta-review”
Goal: rigorous, constructive critique
Use the helper function to complete the peer feedback assigned to you. Ask on Piazza if you're still having trouble.
MP#01 - Peer Feedback
Submissions may not map perfectly to rubric - use your best judgement
Be generous but serious:
Goal is improvement, so “everything is great, no comments” is unhelpful
Nothing is completely right or completely wrong
Remember, meta-review (instructor scores of your feedback) to follow
Learn from this! What can you adapt for MP#02?
MP#01 - Peer Feedback
Example of poor feedback:
Comments
Website looks really good, nice and clean!
Written Communication
Excellent and straight to the point.
Project Skeleton
Solid skeleton, well-organized .
Formatting & Display
Nicely formatted and balanced across the website.
Code Quality
Code runs like Forrest Gump in a Slump.
Data Preparation
Very great.
Extra Credit
Added graphes for even better understanding.
Superficial comments / no sign of actually reading work
Red Flag: Repeated verbatim on several posts
Reminder: Poor feedback \(\neq\) poor work.
MP#01 - Peer Feedback
Example of medium feedback:
## Comments
Love the visuals for Press Releases.
### Written Communication
A short summary or a description of this project would be great to add.
### Project Skeleton
Code completes all instructor-provided tasks correctly.
### Formatting & Display
Tables have well-formatted column names; caption would be great.
### Code Quality
Code is clear and well written.
### Data Preparation
Automatic (10/10). Out of scope for this mini-project.
### Extra Credit
I found all the Press Releases very interesting. The visuals were a great touch.
Gave actionable suggestions
Directionally correct, but a bit vague
MP#01 - Peer Feedback
Example of great feedback:
### Written Communication
Overall, your writing is clear and easy to follow. I noticed a few small typos,
but nothing major — using the built-in spell check in RStudio should catch those quickly. Everything else looks solid, and I didn’t have any major concerns based
on my review.
### Project Skeleton
All tasks were completed satisfactorily.
### Formatting & Display
Overall, your tables and figures are well-organized and clear. There is one
table in the "Data" section with column titles that could be formatted a little
more cleanly for easier reading. I also noticed a few small typos in some of the
table captions, but they should be easy to fix.
### Code Quality
The code quality is generally good, but there are a few minor linter issues. The
comments could be more frequent and clearer in some places. For example, the
comment `#checking dupolicate so that github will not block it` should be
corrected to "duplicate," and the explanation could be clarified to avoid
confusion–what duplicate are you referring to, and how does it block GitHub?
It might confuse someone else reading the code.
Also, I would recommend moving all library imports to the top of the script, as that is typically considered standard practice.
### Data Preparation
The data preparation looks solid overall. I like how you handle missing files
and JSON parsing failures–it demonstrates strong defensive programming.
### Extra Credit
Additional two points for using quarto's video support!
Positive tone
Detailed comments
Noted issues & gave suggestions on how to fix
Noted unclear sections for improvement
MP#01 - Peer Feedback
Lack of prior experience is not a hindrance here:
If something is unclear to you, that’s a problem!
Nothing required super-complex code, so anything overly complex probably could have been done a simpler way (except for some above-and-beyond stuff)
You don’t have to be definitive in comments - impressions and questions are just as helpful.
MP FAQs
Q: Why doesn’t my site look the same on GitHub as it does on my laptop?
A: Missing CSS and JS files. You need to upload everything inside the docs directory, not just the HTML.
# A tibble: 5 × 3
college campus_borough bus_code
<chr> <chr> <chr>
1 CCNY Manhattan M
2 Baruch Manhattan M
3 CSI Staten Island S
4 York Queens Q
5 Medgar Evers Brooklyn <NA>
MEC stays, but no bus code - NA value
inner_join - Keep only matches
left_join - Keep all rows in left (first) table even w/o matches
right_join - Keep all rows in right (second) table even w/o matches
full_join - Keep all rows from both tables, even w/o matches
left_ and right_ are types of ‘outer’ joins
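The four join types above can be sketched with two toy tibbles. This is hypothetical data loosely mirroring the college/borough table shown earlier; only the row counts matter here.

```r
library(dplyr)

# Hypothetical lookup tables (not the real CUNY data)
colleges <- tibble(
  college = c("CCNY", "CSI", "Medgar Evers"),
  borough = c("Manhattan", "Staten Island", "Brooklyn")
)
bus_codes <- tibble(
  borough  = c("Manhattan", "Staten Island", "Queens"),
  bus_code = c("M", "S", "Q")
)

inner_join(colleges, bus_codes, join_by(borough)) # 2 rows: matches only
left_join(colleges, bus_codes, join_by(borough))  # 3 rows: Medgar Evers kept, bus_code NA
right_join(colleges, bus_codes, join_by(borough)) # 3 rows: Queens kept, college NA
full_join(colleges, bus_codes, join_by(borough))  # 4 rows: all rows from both sides
```

Note how the unmatched rows (Brooklyn on the left, Queens on the right) appear or disappear depending on the join type.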
Pivoting
The pivot_* functions change the shape of data
Values are not created or destroyed, just moved around
wider data sets are formed by combining values from multiple rows into new columns of a single row
longer data sets are formed by splitting columns from the same row into new rows
These functions come from the tidyr package - not dplyr
library(tidyr) # included in library(tidyverse)
Pivoting
Untidy example from last week:
# A tibble: 12 × 4
Semester Course Number Type
<chr> <chr> <dbl> <chr>
1 Fall Accounting 200 Enrollment
2 Fall Accounting 250 Cap
3 Fall Law 100 Enrollment
4 Fall Law 125 Cap
5 Fall Statistics 200 Enrollment
6 Fall Statistics 200 Cap
7 Spring Accounting 300 Enrollment
8 Spring Accounting 350 Cap
9 Spring Law 50 Enrollment
10 Spring Law 100 Cap
11 Spring Statistics 400 Enrollment
12 Spring Statistics 400 Cap
Pivoting
This data was untidy because it split a single unit (course) across multiple rows
# A tibble: 6 × 4
Semester Course Enrollment Cap
<chr> <chr> <dbl> <dbl>
1 Fall Accounting 200 250
2 Fall Law 100 125
3 Fall Statistics 200 200
4 Spring Accounting 300 350
5 Spring Law 50 100
6 Spring Statistics 400 400
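The tidy table above can be produced from the untidy one with a single pivot_wider call; the untidy tibble is reconstructed here from the printed output so the example is self-contained.

```r
library(dplyr)
library(tidyr)

# Rebuilt from the untidy table printed above
untidy_courses <- tribble(
  ~Semester, ~Course,      ~Number, ~Type,
  "Fall",    "Accounting", 200,     "Enrollment",
  "Fall",    "Accounting", 250,     "Cap",
  "Fall",    "Law",        100,     "Enrollment",
  "Fall",    "Law",        125,     "Cap",
  "Fall",    "Statistics", 200,     "Enrollment",
  "Fall",    "Statistics", 200,     "Cap",
  "Spring",  "Accounting", 300,     "Enrollment",
  "Spring",  "Accounting", 350,     "Cap",
  "Spring",  "Law",         50,     "Enrollment",
  "Spring",  "Law",        100,     "Cap",
  "Spring",  "Statistics", 400,     "Enrollment",
  "Spring",  "Statistics", 400,     "Cap"
)

# Each (Semester, Course) pair becomes one row; the values of Type
# ("Enrollment", "Cap") become the names of two new columns
untidy_courses |>
  pivot_wider(names_from = Type, values_from = Number)
```

Going the other direction (tidy to untidy) would be `pivot_longer(cols = c(Enrollment, Cap), names_to = "Type", values_to = "Number")`.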
Pivots
pivot_ changes the shape of a data set. Purposes:
Get ready for presentation
Prep for a join
Combine rows before looking at ‘cross-row’ structure
Pivots
Which penguin species has the largest between-sex mass difference?
library(tidyr)
avg_mass_tbl <- penguins |>
  drop_na() |>
  group_by(sex, species) |>
  summarize(avg_mass = mean(body_mass), .groups = "drop")
# .groups = "drop" is equivalent to |> ungroup()
avg_mass_tbl
# A tibble: 6 × 3
sex species avg_mass
<fct> <fct> <dbl>
1 female Adelie 3369.
2 female Chinstrap 3527.
3 female Gentoo 4680.
4 male Adelie 4043.
5 male Chinstrap 3939.
6 male Gentoo 5485.
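Answering the question requires comparing male and female averages within each species, which is easiest after pivoting sex into columns. To keep the example self-contained, the (rounded) averages from the table above are hard-coded here rather than recomputed from `penguins`.

```r
library(dplyr)
library(tidyr)

# avg_mass values hard-coded from the table above (rounded to whole grams)
avg_mass_tbl <- tribble(
  ~sex,     ~species,    ~avg_mass,
  "female", "Adelie",    3369,
  "female", "Chinstrap", 3527,
  "female", "Gentoo",    4680,
  "male",   "Adelie",    4043,
  "male",   "Chinstrap", 3939,
  "male",   "Gentoo",    5485
)

# Pivot sex into columns so male and female sit on the same row,
# then compute the difference and keep the largest
avg_mass_tbl |>
  pivot_wider(names_from = sex, values_from = avg_mass) |>
  mutate(mass_diff = male - female) |>
  slice_max(mass_diff)
# Gentoo has the largest between-sex difference (about 805 g)
```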
Pivots
We want data that is wider (or at least not longer) than our current data:
How many unique combinations of carrier + flight (e.g., United 101)?
flights |>
  select(carrier, flight) |>
  n_distinct()
[1] 5725
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
💡 Did airlines re-use flight numbers for different destinations?
flights |>
  # Find reuse of number across different destinations
  distinct(carrier, flight, dest) |>
  # Shorthand for group_by + summarize(n = n())
  count(carrier, flight)
# A tibble: 5,725 × 3
carrier flight n
<chr> <int> <int>
1 9E 2900 1
2 9E 2901 1
3 9E 2902 1
4 9E 2903 2
5 9E 2904 2
6 9E 2905 1
7 9E 2906 1
8 9E 2907 1
9 9E 2908 1
10 9E 2909 2
# ℹ 5,715 more rows
# A tibble: 16 × 4
flight carrier dest n
<int> <chr> <chr> <int>
1 1162 UA BOS 13
2 1162 UA CLE 1
3 1162 UA DEN 19
4 1162 UA DFW 5
5 1162 UA IAH 8
6 1162 UA JAC 2
7 1162 UA LAS 1
8 1162 UA MIA 2
9 1162 UA MSY 8
10 1162 UA ORD 55
11 1162 UA SAN 4
12 1162 UA SAT 1
13 1162 UA SEA 18
14 1162 UA SFO 2
15 1162 UA SNA 27
16 1162 UA TPA 4
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Additional join to get airport information + formatting:
flights |>
  distinct(carrier, flight, dest) |>
  count(carrier, flight) |>
  slice_max(n) |>
  inner_join(flights, join_by(carrier == carrier, flight == flight)) |>
  count(flight, carrier, dest) |>
  inner_join(airports, join_by(dest == faa)) |>
  select(name, n, carrier, flight) |>
  arrange(desc(n)) |>
  rename(
    `Destination Airport` = name,
    `Number of Times Flown` = n,
    `Carrier Code` = carrier,
    `Flight Number` = flight
  )
# A tibble: 16 × 4
`Destination Airport` Number of Times Flow…¹ `Carrier Code` `Flight Number`
<chr> <int> <chr> <int>
1 Chicago Ohare Intl 55 UA 1162
2 John Wayne Arpt Orange… 27 UA 1162
3 Denver Intl 19 UA 1162
4 Seattle Tacoma Intl 18 UA 1162
5 General Edward Lawrenc… 13 UA 1162
6 George Bush Interconti… 8 UA 1162
7 Louis Armstrong New Or… 8 UA 1162
8 Dallas Fort Worth Intl 5 UA 1162
9 San Diego Intl 4 UA 1162
10 Tampa Intl 4 UA 1162
11 Jackson Hole Airport 2 UA 1162
12 Miami Intl 2 UA 1162
13 San Francisco Intl 2 UA 1162
14 Cleveland Hopkins Intl 1 UA 1162
15 Mc Carran Intl 1 UA 1162
16 San Antonio Intl 1 UA 1162
# ℹ abbreviated name: ¹`Number of Times Flown`
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Extra join to match to airlines as well:
head(airlines, 3)
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
Also has a column named name - need to disambiguate!
Legos of Data Analysis
Q: How many distinct flights left NYC in 2013?
Additional join to get airline names + formatting:
# A tibble: 16 × 4
`Destination Airport` Number of Times Flow…¹ `Flight Number` Carrier
<chr> <int> <int> <chr>
1 Chicago Ohare Intl 55 1162 United…
2 John Wayne Arpt Orange Co 27 1162 United…
3 Denver Intl 19 1162 United…
4 Seattle Tacoma Intl 18 1162 United…
5 General Edward Lawrence Logan… 13 1162 United…
6 George Bush Intercontinental 8 1162 United…
7 Louis Armstrong New Orleans I… 8 1162 United…
8 Dallas Fort Worth Intl 5 1162 United…
9 San Diego Intl 4 1162 United…
10 Tampa Intl 4 1162 United…
11 Jackson Hole Airport 2 1162 United…
12 Miami Intl 2 1162 United…
13 San Francisco Intl 2 1162 United…
14 Cleveland Hopkins Intl 1 1162 United…
15 Mc Carran Intl 1 1162 United…
16 San Antonio Intl 1 1162 United…
# ℹ abbreviated name: ¹`Number of Times Flown`
Legos of Data Analysis
Question: What does this do that I can't do in Excel?
Technically, nothing. All programming languages of sufficient complexity are equally powerful (Turing equivalence).
In actuality, quite a lot:
filter allows more complex filtering than clicking on values
group_by + summarize extend array formulas
*_join provides more complex matching than VLOOKUP
pivot_* provide general formulation of pivot tables
everything else Excel can do, you can also do in R
Ability to script minimizes “hard-coding” of names and values.
But truthfully
fortunes::fortune(59)
Let's not kid ourselves: the most widely used piece of software for statistics
is Excel.
-- Brian D. Ripley ('Statistical Methods Need Software: A View of
Statistical Computing')
Opening lecture RSS 2002, Plymouth (September 2002)
fortunes::fortune(222)
Some people familiar with R describe it as a supercharged version of
Microsoft's Excel spreadsheet software.
-- Ashlee Vance (in his article "Data Analysts Captivated by R's Power")
The New York Times (January 2009)
[W]ill we be learning how to perform joins within a subquery?
You don’t need subqueries in R since it’s an imperative language. Just create a new variable to represent the result of the subquery and use that in the next command.
SELECT first_name, last_name
FROM collectors
WHERE id IN (
SELECT collector_id
FROM sales
);
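The SQL subquery above translates directly to dplyr, either imperatively (name the intermediate result, then use it) or with a filtering join. The `collectors` and `sales` tables here are hypothetical toy data matching the columns the query names.

```r
library(dplyr)

# Hypothetical tables matching the SQL example
collectors <- tibble(
  id = 1:3,
  first_name = c("Ada", "Ben", "Cy"),
  last_name  = c("Lovelace", "Franklin", "Turing")
)
sales <- tibble(collector_id = c(1, 3, 3))

# Imperative style: materialize the "subquery" as a variable first
buyers <- distinct(sales, collector_id)
collectors |>
  filter(id %in% buyers$collector_id) |>
  select(first_name, last_name)

# More idiomatically, a filtering join does the same thing in one step
collectors |>
  semi_join(sales, join_by(id == collector_id)) |>
  select(first_name, last_name)
```

`semi_join` keeps rows of the left table with at least one match on the right, without duplicating rows or adding columns, which is exactly the semantics of `WHERE id IN (subquery)`.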
[H]ow can we ensure that the information [resulting from a join] is accurate and not repeated?
If you have a true unique ID, you’re usually safe
Pay attention to all warnings
Manually examine the result of any joins
Performance
Will joining large data sets […] affect performance?
Somewhat - larger data sets are always slower.
Bigger danger is “bad joins” creating huge data automatically.
Note that R is less “smart” than SQL, so won’t optimize execution order for you automatically.
dplyr joins vs SQL joins
What is the difference between dplyr and SQL joins?
Not too much - the biggest difference is that R has no INDEX or FOREIGN KEY, so there are fewer guarantees of data integrity.
When to use anti_join()?
Rare: looking for unmatched rows.
Useful to find data integrity issues or ‘implicit’ missingness.
I use an anti_join to find students who haven’t submitted an assignment.
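The "missing submissions" use case can be sketched with toy roster and submission tables (hypothetical names): `anti_join` keeps only rows of the first table with no match in the second.

```r
library(dplyr)

# Hypothetical roster and submission log
roster      <- tibble(student = c("amy", "bo", "cai", "dev"))
submissions <- tibble(student = c("amy", "cai"))

# Rows of roster with NO match in submissions: the missing students
anti_join(roster, submissions, join_by(student))
# students "bo" and "dev" have not submitted
```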
many-to-many Warning
Tricky to address, but fortunately pretty rare.
SQL explicitly forbids many-to-many
Usually a sign that a “key” isn’t really unique
Check for duplicates in x and y tables
Can occur with “fancy” joins (rolling, inequality)
Add additional join variables to break “duplication”
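A quick way to check whether a supposed key is actually unique before joining is to count each key value and keep only the duplicated ones (toy data for illustration):

```r
library(dplyr)

# Toy table where "key" is supposed to be unique but is not
x <- tibble(key = c("a", "b", "b", "c"), val = 1:4)

# Any rows in this output mean joining on `key` alone
# risks a many-to-many warning
x |>
  count(key) |>
  filter(n > 1)
```

Running the same check on both tables of a join tells you which side is causing the duplication.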
How to Check Efficiency?
No automatic way. Some rules of thumb:
Don’t create large tables just to filter down
filter before join when possible
full_join (a full outer join) is a bit dangerous
cross_join is rarely the right answer
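The "filter before join" rule can be demonstrated directly: both pipelines below give the same result, but the first builds a large intermediate table only to throw most of it away.

```r
library(dplyr)

# Synthetic data for illustration
big   <- tibble(id = 1:1e5, grp = rep(c("keep", "drop"), 5e4))
small <- tibble(id = 1:1e5, val = rnorm(1e5))

# Less efficient: join everything, then filter the large result
res1 <- big |>
  inner_join(small, join_by(id)) |>
  filter(grp == "keep")

# Better: shrink first, then join the smaller table
res2 <- big |>
  filter(grp == "keep") |>
  inner_join(small, join_by(id))

identical(res1, res2)  # TRUE - same answer, less intermediate work
```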
tidyr vs dplyr
Is tidyr more efficient than dplyr?
Nope - different packages from the same developers.
Designed to work together elegantly.
Rare Joins
What are cross_join, filter joins, and nest_join?
cross_join: dangerous.
Creates “all pairs” of rows. Useful for ‘design’ problems
filter joins (anti_, semi_):
Hunting down quietly missing data.
Filtering to sub-samples
nest_join: beyond this course.
left_join with extra structure to output.
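A small cross_join sketch of the 'design' use case: every row of one table is paired with every row of the other, which is why the output size grows multiplicatively.

```r
library(dplyr)

# Hypothetical design grid: all size/color combinations
sizes  <- tibble(size  = c("S", "M", "L"))
colors <- tibble(color = c("red", "blue"))

cross_join(sizes, colors)  # 3 x 2 = 6 rows, one per combination
```

On real tables this multiplication is exactly what makes cross_join dangerous: two 100,000-row tables would produce 10 billion rows.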
Wrap-Up
Review
Multi-Table dplyr:
inner_join and left_join
join_by specifications
pivot_longer and pivot_wider to get data into optimal formats (tidyr)
Additional dplyr:
Ranking, cumulative, and shift functions
Orientation
Communicating Results (quarto) ✅
R Basics ✅
Data Manipulation in R ✅
Data Visualization in R ⬅️
Getting Data into R
Statistical Modeling in R
Life Tip of the Week
West Virginia State Board of Education v. Barnette (March 11, 1943)
Students cannot be compelled to recite the Pledge of Allegiance, even during a period of war
Iconic First Amendment Victory
Barnette
Justice Jackson’s Opinion (6-3):
If there is any fixed star in our constitutional constellation, it is that no official, high or petty, can prescribe what shall be orthodox in politics, nationalism, religion, or other matters of opinion or force citizens to confess by word or act their faith therein.
Story Behind the Case
1940 Case Minersville School District v. Gobitis
JW Students in PA refused to recite the Pledge of Allegiance
Justice Frankfurter (8-1 Majority Opinion):
National Unity is the basis of National Security
Students could be forced to pledge
After the decision, waves of violence against JW students and adults accused of “treason” against the war effort
Story Behind the Case
Justice Stone (Dissent):
[T]he guarantees of civil liberty are but guarantees of freedom of the human mind and spirit and of reasonable freedom and opportunity to express them. [… T]he very essence of the liberty which they guarantee is the freedom of the individual from compulsion as to what he shall think and what he shall say.
A few years later, a changed Court wanted to revisit the issue, leading to Barnette
More from J. Jackson
As governmental pressure toward unity becomes greater, so strife becomes more bitter as to whose unity it shall be.[…] Those who begin coercive elimination of dissent soon find themselves exterminating dissenters. Compulsory unification of opinion achieves only the unanimity of the graveyard.
Authority [in the United States] is to be controlled by public opinion, not public opinion by authority.
More from J. Jackson
[F]reedom to differ is not limited to things that do not matter much. That would be a mere shadow of freedom. The test of its substance is the right to differ as to things that touch the heart of the existing order.
Lessons
We get things wrong, often very wrong, in times of public fear
Law of Free Speech is necessary but not sufficient for a Culture of Free Speech
Freedom to Dissent is at the core of a pluralistic society
Rules and norms exist for the hard cases, not the easy ones