Grades returned this afternoon.
Review regrade policy and late work policy if you have questions.
MP#02 - Identifying Environmentally Responsible US Public Transit Systems
Due 2025-03-26 at 11:45pm ET
Pay attention to the rubric
A few students have reported issues with rate-limiting on the EIA SEP site. Re-running the code seems to resolve issues.
My download code is written to save the files locally, so it only needs to succeed once.
One student reported NA conversion warnings in processing the NTD data. These were harmless, but I’ve modified the provided code to suppress these warnings.
In brief: the updated NTD data now uses dashes in a few places for missing data, and the NA warning came up when trying to convert these to numeric values
Remember that “non-standard” column names require use of backticks:
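A minimal sketch of the backtick rule. The data frame and its column name are hypothetical stand-ins, not the actual NTD or EIA columns:

```r
library(dplyr)

# Hypothetical data frame with a "non-standard" column name (it contains spaces)
transit <- tibble::tibble(
  Agency = c("A", "B"),
  `Unlinked Passenger Trips` = c(1200, 3400)
)

# Backticks let dplyr verbs refer to the column by its exact name
transit_renamed <- transit |>
  mutate(trips_thousands = `Unlinked Passenger Trips` / 1000) |>
  rename(UPT = `Unlinked Passenger Trips`)
```

Renaming to a syntactic name early on (as with `UPT` here) avoids needing backticks in every later step.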
I’ve started to return Proposal Feedback (via email) - will finish tomorrow
MP#01 grades to follow
Brightspace - Wednesdays at 11:45pm
Next pre-assignment is 2025-03-26 at 11:45pm ET:
Thank you for FAQs and (honest) team feedback. Keep it coming!
Thanks to DS for helping peers on Piazza!
Due Wednesday at 11:45pm:
On April 3rd, Prof. Brandwein will sit in and observe class
ggplot2: Elegant Visualizations for Data Analysis
ggplot2 vs Tableau
ggplot2 vs matplotlib
Matlab plotting
Closest Python analogue to ggplot2 is seaborn
+ instead of |>
ggplot2 is older than |>
If ggplot3 ever gets made, will use |>
I tried an interactive plot with \(n=132,000\) points, but it brought my computer to a halt. [Ed. Paraphrased]
That’s a lot of plots!!
ggplot2 is itself pretty fast, but it depends on (possibly slow) graphics backends
Large data sets can lead to overplotting:
Ways to address:
geom_jitter: add a bit of random noise so points don’t step on each other
geom_hex: little “heatmaps” of counts. Hexagons to avoid weird rounding artifacts
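Both fixes can be sketched on the `diamonds` data that ships with ggplot2. The choice of columns, the jitter width, the transparency, and the bin count are illustrative assumptions rather than recommended settings:

```r
library(ggplot2)

# diamonds has roughly 54,000 rows, so a plain scatterplot overplots heavily

# Jitter: spread points horizontally within each cut so they overlap less
p_jitter <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.1)

# Hex bins: count points per hexagon and map the count to fill color
# (drawing this plot requires the hexbin package to be installed)
p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex(bins = 50)
```

Lowering `alpha` is a third option that combines well with jittering when the data are only moderately large.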
aes()
aes() maps data columns to visual properties. Outside of aes(), an argument sets a constant value
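A short sketch of the difference, using the `penguins` data from palmerpenguins; the color `"steelblue"` is an arbitrary choice:

```r
library(ggplot2)
library(palmerpenguins)

# Inside aes(): color is mapped from the species column, so a legend is drawn
p_mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Outside aes(): every point gets the same constant color and no legend
p_constant <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "steelblue")
```

A common mistake is writing `aes(color = "steelblue")`, which maps a one-value "variable" and produces a legend entry instead of the intended color.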
geom_-specific aes()
aes() in ggplot() apply to the entire plot
aes() in a geom apply there only
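The same mapping placed at the two levels gives visibly different plots. This is a sketch on the `penguins` data; the choice of `geom_smooth(method = "lm")` as the second layer is an assumption made for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# color mapped in ggplot(): inherited by both layers, one fitted line per species
p_global <- ggplot(penguins,
                   aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color mapped in geom_point() only: colored points, a single overall line
p_local <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE)
```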
Two “modes”
Three types of color palettes:
When mapping quantitative variables to palettes (sequential/diverging), two approaches:
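library(dplyr)
library(ggplot2)

# Small example data set with two groups of unequal size
data <- data.frame(x = rnorm(5),
                   y = rnorm(5),
                   group = c("a", "a", "b", "b", "b"))

data |>
  group_by(group) |>
  mutate(n_count = n()) |>   # size of each point's group
  ungroup() |>
  # Color the largest group red and everything else black
  mutate(color = ifelse(n_count == max(n_count), "red", "black")) |>
  ggplot(aes(x=x, y=y, shape=group, color=color)) +
  geom_point() +
  scale_color_identity()   # use the color names stored in the data as-is

Built-in themes + ggthemes package: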
library(ggplot2); library(ggthemes);
library(palmerpenguins); library(ggpmisc)
p <- ggplot(penguins,
            aes(x=flipper_length_mm,
                y=body_mass_g,
                color=species)) +
  geom_point() +
  stat_poly_line(se=FALSE,
                 color="black") +
  stat_poly_eq() +
  xlab("Flipper Length (mm)") +
  ylab("Body Mass (g)") +
  scale_color_brewer(type="qual",
                     palette=2,
                     name="Species") +
  facet_wrap(~species)

Default theme (ggplot2::theme_grey()):
Black and White theme (ggplot2::theme_bw()):
Minimal theme (ggplot2::theme_minimal()):
Light theme (ggplot2::theme_light()):
Dark theme (ggplot2::theme_dark()):
Excel theme (ggthemes::theme_excel()):
Google Docs theme (ggthemes::theme_gdocs()):
The Economist theme (ggthemes::theme_economist()):
Solarized theme (ggthemes::theme_solarized()):
Solarized2 theme (ggthemes::theme_solarized_2()):
Stata theme (ggthemes::theme_stata()):
Tufte theme (ggthemes::theme_tufte()):
Wall Street Journal theme (ggthemes::theme_wsj()):
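Each theme in the list is applied by adding it to a plot. The sketch below uses a simplified stand-in plot (`p_demo`, a name introduced here) rather than the full plot defined earlier, and guards the ggthemes call in case that package is not installed:

```r
library(ggplot2)
library(palmerpenguins)

# Simplified stand-in for the penguin plot used in the theme gallery
p_demo <- ggplot(penguins,
                 aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point()

# Built-in themes are added like any other plot component
p_bw      <- p_demo + theme_bw()
p_minimal <- p_demo + theme_minimal()

# ggthemes themes work the same way once the package is installed
if (requireNamespace("ggthemes", quietly = TRUE)) {
  p_economist <- p_demo + ggthemes::theme_economist()
}
```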
Many more online:
Order of layers technically matters, but the effect is small
library(patchwork)   # provides `+` for placing two plots side by side

# Points drawn first, fitted line on top
p1 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) +
  geom_point(color="black") +
  geom_smooth(color="blue", method="lm") + ggtitle("Line on points")

# Fitted line drawn first, points on top
p2 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) +
  geom_smooth(color="blue", method="lm") +
  geom_point(color="black") + ggtitle("Points on line")

p1 + p2

Order matters more with theme. Adding a theme_*() will override any theme() customization you did:
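A minimal sketch of that ordering effect, using a stand-in plot (`p_demo`) and the legend position as the customization; both are choices made here for illustration:

```r
library(ggplot2)
library(palmerpenguins)

p_demo <- ggplot(penguins,
                 aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point()

# theme() tweak added after the complete theme: the tweak survives
p_kept <- p_demo + theme_bw() + theme(legend.position = "bottom")

# Complete theme added after the tweak: theme_bw() replaces the whole theme,
# so the legend goes back to its default position
p_lost <- p_demo + theme(legend.position = "bottom") + theme_bw()
```

The safe habit is to add the `theme_*()` call first and any `theme()` adjustments after it.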
stat_poly_{line,eq} vs geom_smooth
By default geom_smooth picks its smoother from the data size: loess for fewer than 1,000 observations, a generalized additive model (GAM) otherwise
ggpmisc::stat_poly_{line,eq} fit linear models, so they can expose more machinery.
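The three options can be compared side by side. This is a sketch on the `penguins` data; the ggpmisc layer is guarded in case that package is not installed:

```r
library(ggplot2)
library(palmerpenguins)

p_base <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Default smoother: with 344 rows (fewer than 1,000), geom_smooth() uses loess
p_default <- p_base + geom_smooth()

# Explicit linear model: draws the line, but does not report the fitted equation
p_lm <- p_base + geom_smooth(method = "lm")

# ggpmisc: linear fit plus an annotation derived from the fitted model
if (requireNamespace("ggpmisc", quietly = TRUE)) {
  p_poly <- p_base +
    ggpmisc::stat_poly_line() +
    ggpmisc::stat_poly_eq()
}
```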
What is a GAM?
+ggtitle("text") is just shorthand for +labs(title="text")
Perceptually:
Humans are better at:
Facets are group_by for plots. Useful for
1973: UC Berkeley was concerned about bias in Grad School Admissions
This phenomenon occurs throughout the social sciences: the best doctors have the worst patient outcomes
BHO note:
Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.
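The Berkeley pattern can be reproduced with the `UCBAdmissions` table that ships with base R, which covers the six largest departments. This is a sketch, not the original analysis; the faceted bar chart is one possible way to show the by-department view:

```r
library(dplyr)
library(ggplot2)

# UCBAdmissions is a 3-way table; as a data frame it has Admit, Gender, Dept, Freq
admissions <- as.data.frame(UCBAdmissions)

# Aggregate view: admission rate by gender across all departments
overall <- admissions |>
  group_by(Gender) |>
  summarize(admit_rate = sum(Freq[Admit == "Admitted"]) / sum(Freq))

# Disaggregated view: admission rate by gender within each department
by_dept <- admissions |>
  group_by(Dept, Gender) |>
  summarize(admit_rate = sum(Freq[Admit == "Admitted"]) / sum(Freq),
            .groups = "drop")

# Facets act as the plotting counterpart of group_by(Dept)
p_admit <- ggplot(by_dept, aes(x = Gender, y = admit_rate, fill = Gender)) +
  geom_col() +
  facet_wrap(~Dept) +
  ylab("Admission rate")
```

In the aggregate, men are admitted at a higher rate than women, yet women have the higher rate in four of the six departments; the difference comes from which departments each group applied to.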
How can I implement a dual (twin) axis plot in ggplot2?
Disfavored. But if you must …
Doesn’t allow arbitrary secondary axes; allows transformed axes (e.g., Celsius and Fahrenheit)
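A sketch of the transformed-axis case using `sec_axis()` and the `airquality` data from base R, whose `Temp` column is in degrees Fahrenheit. The facet by month is only there to keep the lines readable:

```r
library(ggplot2)

# Primary axis in Fahrenheit; secondary axis is a one-to-one transformation of it
p_temp <- ggplot(airquality, aes(x = Day, y = Temp)) +
  geom_line() +
  facet_wrap(~Month) +
  scale_y_continuous(
    name = "Temperature (°F)",
    sec.axis = sec_axis(~ (. - 32) * 5 / 9, name = "Temperature (°C)")
  )
```

Because the secondary axis is always a function of the primary one, both axes describe the same quantity, which is why this form is allowed while two unrelated scales are not.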
See the ggimage or ggflags package for images as “points”:
See cowplot::draw_image() for image background:
library(ggplot2)
library(cowplot)

# Density plot that will sit on top of the image
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.7) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_half_open(12)

# Example image bundled with cowplot
logo_file <- system.file("extdata", "logo.png", package = "cowplot")

# Draw the image first, then the plot over it
ggdraw() +
  draw_image(
    logo_file, scale = .7
  ) +
  draw_plot(p)

Data Sets:
diamonds from the ggplot2 package
cdiac from the CVXR package
gapminder from the gapminder package
You need to install CVXR and gapminder now.
Exercise: Lab #07
Today:
geoms
Next Week:
| Room | Team | Room | Team |
|---|---|---|---|
| 1 | VH + SG + DS + DL | 5 | GB + RJ + FS + MH |
| 2 | HZ + JLL + CA | 6 | EM + AK + GMdS + JL |
| 3 | MT + CW + VG + GS + CWo. | 7 | SJB + JC + SB + ZS |
| 4 | SD + GO + CFGF | | |
Course Structure:
R
Course Projects:
My Learning Goals: