Grades returned this afternoon.
Review regrade policy and late work policy if you have questions.
MP#02 - Identifying Environmentally Responsible US Public Transit Systems
Due 2025-03-26 at 11:45pm ET
Pay attention to the rubric
A few students have reported issues with rate-limiting on the EIA SEP site. Re-running the code seems to resolve issues.
My download code saves the files locally, so it only needs to succeed once.
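A minimal sketch of the download-once pattern, in case it helps when re-running after a rate-limit failure. The helper name download_once and its arguments are hypothetical and are not the provided course code:

```r
# Hypothetical helper: fetch `url` into `dest` only when `dest` is not on disk yet.
download_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the target folder if needed, then download in binary mode.
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  invisible(dest)  # return the local path either way
}
```

Once the file exists locally, later calls skip the network entirely, so a rate-limited run can simply be repeated until every file is present.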
One student reported NA conversion warnings in processing the NTD data. These were harmless, but I’ve modified the provided code to suppress these warnings.
In brief: an update to NTD now uses dashes in a few places for missing data, and an NA warning came up when trying to convert these to numeric values.
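A small illustration of where the warning comes from and two ways to avoid it. The vector below is made-up example data, and the plain "-" placeholder is an assumption about how the dashes appear in the file:

```r
# Made-up values: "-" stands in for a missing entry (assumed placeholder).
raw <- c("12.5", "-", "7")

# Option 1: convert directly and silence the coercion warning.
quiet <- suppressWarnings(as.numeric(raw))

# Option 2: turn the placeholder into NA first, so no warning is raised at all.
clean <- as.numeric(ifelse(raw == "-", NA, raw))
```

Both options give the same numeric vector. The second makes the missing-data rule explicit, so a genuinely malformed value would still trigger a warning.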
Remember that “non-standard” column names require use of backticks:
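A minimal sketch of backtick quoting. The data frame and its column names are hypothetical stand-ins, not the actual NTD headers:

```r
library(dplyr)

# check.names = FALSE keeps the spaces in the (hypothetical) column names.
transit <- data.frame(`Agency Name` = c("A", "B"),
                      `Unlinked Passenger Trips` = c(100, 250),
                      check.names = FALSE)

# Names containing spaces must be wrapped in backticks wherever they are used.
result <- transit |>
  mutate(trips_thousands = `Unlinked Passenger Trips` / 1000) |>
  select(`Agency Name`, trips_thousands)
```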
I’ve started to return Proposal Feedback (via email) - will finish tomorrow
MP#01 grades to follow
Brightspace - Wednesdays at 11:45
Next pre-assignment is 2025-03-26 at 11:45pm ET:
Thank you for FAQs and (honest) team feedback. Keep it coming!
Thanks to DS for helping peers on Piazza!
Due Wednesday at 11:45pm:
On April 3rd, Prof. Brandwein will sit in and observe class
ggplot2: Elegant Visualizations for Data Analysis
ggplot2 vs Tableau
ggplot2 vs matplotlib
ggplot2: built on the grammar of graphics (gg)
matplotlib: modeled on Matlab plotting
Closest Python analogue to ggplot2 is seaborn
+ instead of |>
ggplot2 is older than |>
If ggplot3 ever gets made, it will use |>
I tried an interactive plot with \(n=132,000\) points, but it brought my computer to a halt. [Ed. Paraphrased]
That’s a lot of points!!
ggplot2 is itself pretty fast, but it depends on (possibly slow) graphics backends
Large data sets can lead to overplotting:
Ways to address:
geom_jitter
geom_hex
Jitter: add a bit of random noise so points don’t step on each other
Hex: little “heatmaps” of counts. Hexagons avoid weird rounding artifacts
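A rough sketch of both fixes, using the mpg and diamonds data sets that ship with ggplot2. The jitter widths, alpha, and bin count are arbitrary choices, and drawing the hex plot assumes the hexbin package is installed:

```r
library(ggplot2)

# cty and hwy are integers, so many points sit exactly on top of each other.
p_overplotted <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Jitter: add small random offsets (width/height here are arbitrary).
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# Hex bins: color each hexagon by the number of points inside it.
p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex(bins = 40)
```

Jitter suits small, discrete data. Hex binning scales better to tens of thousands of rows because it draws one shape per bin rather than one per point.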
aes()
aes maps data to values. Outside of aes, set a constant value
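A short sketch of the difference, using the built-in iris data (chosen only for illustration):

```r
library(ggplot2)

# Inside aes(): color is mapped from the data, one color per Species.
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species))

# Outside aes(): color is a constant, so every point is blue and no legend is drawn.
p_constant <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "blue")
```

A common slip is writing aes(color = "blue"), which maps a one-level variable named "blue" through the default palette instead of coloring the points blue.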
geom_ specific aes()
ggplot(): apply to entire plot
geom: apply there only
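One way to see the difference is to count how many trend lines get drawn. This is a sketch on the built-in iris data, with method = "lm" picked only to keep the fit simple:

```r
library(ggplot2)

# color in ggplot(): inherited by every layer, so geom_smooth fits one line per Species.
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color only in geom_point(): the smooth layer sees no grouping and fits a single line.
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```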
Two “modes”
Three types of color palettes:
When mapping quantitative variables to palettes (sequential/diverging), two approaches:
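A hedged sketch of the palette types, assuming the three types are the ColorBrewer families (qualitative, sequential, diverging). It does not try to reproduce the two approaches for quantitative variables. The iris data and palette numbers are arbitrary choices:

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# Qualitative: unordered categories, visually distinct colors.
p_qual <- base + geom_point(aes(color = Species)) +
  scale_color_brewer(type = "qual", palette = 2)

# Sequential: a quantitative variable running from low to high.
p_seq <- base + geom_point(aes(color = Petal.Width)) +
  scale_color_distiller(type = "seq", palette = 1)

# Diverging: a quantitative variable with a meaningful midpoint (here, deviation from the mean).
p_div <- base + geom_point(aes(color = Sepal.Width - mean(Sepal.Width))) +
  scale_color_distiller(type = "div", palette = 1)
```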
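library(dplyr)
library(ggplot2)  # needed for ggplot(), geom_point(), scale_color_identity()
data <- data.frame(x = rnorm(5),
                   y = rnorm(5),
                   group = c("a", "a", "b", "b", "b"))
data |>
  group_by(group) |>
  mutate(n_count = n()) |>  # size of each group
  ungroup() |>
  # mark rows in the largest group red, everything else black
  mutate(color = ifelse(n_count == max(n_count), "red", "black")) |>
  ggplot(aes(x=x, y=y, shape=group, color=color)) +
  geom_point() +
  scale_color_identity()  # use the color names stored in the column as-is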
Built-in themes + ggthemes package:
library(ggplot2); library(ggthemes);
library(palmerpenguins); library(ggpmisc)
p <- ggplot(penguins,
            aes(x=flipper_length_mm,
                y=body_mass_g,
                color=species)) +
  geom_point() +
  stat_poly_line(se=FALSE,
                 color="black") +
  stat_poly_eq() +
  xlab("Flipper Length (mm)") +
  ylab("Body Mass (g)") +
  scale_color_brewer(type="qual",
                     palette=2,
                     name="Species") +
  facet_wrap(~species)
Default theme (ggplot2::theme_grey()):
Black and White theme (ggplot2::theme_bw()):
Minimal theme (ggplot2::theme_minimal()):
Light theme (ggplot2::theme_light()):
Dark theme (ggplot2::theme_dark()):
Excel theme (ggthemes::theme_excel()):
Google Docs theme (ggthemes::theme_gdocs()):
The Economist theme (ggthemes::theme_economist()):
Solarized theme (ggthemes::theme_solarized()):
Solarized2 theme (ggthemes::theme_solarized_2()):
Stata theme (ggthemes::theme_stata()):
Tufte theme (ggthemes::theme_tufte()):
Wall Street Journal theme (ggthemes::theme_wsj()):
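Each theme is applied by adding it to a finished plot. A minimal sketch using only built-in ggplot2 themes and the iris data (the plot here is a simplified stand-in, not the penguins figure). The ggthemes variants work the same way once that package is loaded:

```r
library(ggplot2)

# A simple base plot to restyle.
p_base <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# Swap the whole look by adding a complete theme.
p_bw      <- p_base + theme_bw()
p_minimal <- p_base + theme_minimal()
p_dark    <- p_base + theme_dark()
```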
Many more online:
Order of layers technically matters, but the effect is small
library(patchwork)  # provides p1 + p2 for side-by-side plots
p1 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) +
  geom_point(color="black") +
  geom_smooth(color="blue", method="lm") + ggtitle("Line on points")
p2 <- ggplot(penguins, aes(x=bill_length_mm, y=flipper_length_mm)) +
  geom_smooth(color="blue", method="lm") +
  geom_point(color="black") + ggtitle("Points on line")
p1 + p2
Order matters more with theme. Adding a theme_*() will override any theme() customization you did:
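A small sketch of the ordering issue, using the built-in iris data and the legend position as the customization (both arbitrary choices):

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# theme() first, then theme_bw(): the complete theme replaces the earlier tweak.
p_lost <- p + theme(legend.position = "bottom") + theme_bw()

# theme_bw() first, then theme(): the tweak is applied on top and survives.
p_kept <- p + theme_bw() + theme(legend.position = "bottom")
```

Rule of thumb: add the theme_*() call first and put theme() adjustments last.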
stat_poly_{line,eq} vs geom_smooth
By default geom_smooth fits a LOESS smoother for fewer than 1,000 observations and a generalized additive model (GAM) for larger data sets
ggpmisc::stat_poly_{line,eq} fit linear models, so they can expose more machinery.
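A sketch contrasting the default smoother with an explicit linear fit. It uses only ggplot2 and the built-in mtcars data. geom_smooth(method = "lm") stands in for the linear-model case here and is not the ggpmisc API:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Default method: a flexible smoother chosen by ggplot2 (LOESS for a data set this small).
p_default <- base + geom_smooth()

# Explicit linear model: a straight line, comparable to what stat_poly_line draws.
p_lm <- base + geom_smooth(method = "lm", formula = y ~ x)
```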
What is a GAM?
+ggtitle("text") is just shorthand for +labs(title="text")
Perceptually:
Humans are better at:
Facets are group_by for plots. Useful for
1973: UC Berkeley was concerned about bias in Grad School Admissions
This phenomenon occurs throughout the social sciences: the best doctors have the worst patient outcomes
BHO note:
Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.
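The reversal can be reproduced with R's built-in UCBAdmissions table, which covers the six largest departments only. This is a sketch under that assumption, with facets doing the per-department split:

```r
library(dplyr)
library(ggplot2)

# Columns after conversion: Admit, Gender, Dept, Freq
admissions <- as.data.frame(UCBAdmissions)

# Pooled over departments: men are admitted at a higher rate.
overall <- admissions |>
  group_by(Gender) |>
  summarize(rate = sum(Freq[Admit == "Admitted"]) / sum(Freq))

# Within departments: the gap shrinks or reverses.
by_dept <- admissions |>
  group_by(Dept, Gender) |>
  summarize(rate = sum(Freq[Admit == "Admitted"]) / sum(Freq), .groups = "drop")

# One panel per department, the plotting analogue of group_by(Dept).
p_berkeley <- ggplot(by_dept, aes(x = Gender, y = rate, fill = Gender)) +
  geom_col() +
  facet_wrap(~Dept)
```

The pooled rates hide the fact that women applied disproportionately to departments with low admission rates for everyone, which is the point of the BHO note.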
How can I implement a dual (twin) axis plot in ggplot2?
Disfavored. But if you must …
Doesn’t allow arbitrary secondary axes; allows transformed axes (e.g., Celsius and Fahrenheit)
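A minimal sketch of a transformed secondary axis, using the built-in pressure data (temperature recorded in Celsius) purely as an example:

```r
library(ggplot2)

# Primary axis in Celsius; secondary axis is a one-to-one transform to Fahrenheit.
p_temp <- ggplot(pressure, aes(x = pressure, y = temperature)) +
  geom_line() +
  scale_y_continuous(
    name = "Temperature (deg C)",
    sec.axis = sec_axis(~ . * 9 / 5 + 32, name = "Temperature (deg F)")
  )
```

Because the second axis is only a relabeling of the first, it cannot carry an unrelated variable. Plotting one would require rescaling that variable onto the primary axis by hand.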
ggplot
See the ggimage or ggflags package for images as “points”:
See cowplot::draw_image() for image background:
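library(ggplot2)  # cowplot does not attach ggplot2 on its own
library(cowplot)
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.7) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_half_open(12)
# example image shipped with cowplot
logo_file <- system.file("extdata", "logo.png", package = "cowplot")
# draw the image first so the plot is layered on top of it
ggdraw() +
  draw_image(
    logo_file, scale = .7
  ) +
  draw_plot(p)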
ggplot2
Data Sets:
diamonds from the ggplot2 package
cdiac from the CVXR package
gapminder from the gapminder package
You need to install CVXR and gapminder now.
Exercise: Lab #07
Today:
geoms
Next Week:
Room | Team | Room | Team
---|---|---|---
1 | VH + SG + DS + DL | 5 | GB + RJ + FS + MH
2 | HZ + JLL + CA | 6 | EM + AK + GMdS + JL
3 | MT + CW + VG + GS + CWo. | 7 | SJB + JC + SB + ZS
4 | SD + GO + CFGF | |
Course Structure:
R
Course Projects:
My Learning Goals: