ggplot2
ggplot2: Grammar of Graphics Plotting, Version 2
Structured Plotting
Plotting to express statistical visualization
Not raw shapes and colors (“graphics primitives”)
Make it easy to make good visualizations
Why Visualization?
Why do we visualize data?
Data exploration and understanding
Hypothesis generation
Data communication
Humans are better at visuals than numbers
Allow the data to surprise you
Why Visualization?
Same \(\mu_X, \mu_Y, \sigma_X, \sigma_Y, \rho_{XY}, \beta_{Y|X}, \dots\) across very different data sets: summary statistics and OLS can’t distinguish them
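The figure from this slide is not reproduced here. As a stand-in, Anscombe's quartet (the anscombe data set shipped with base R) is one classic example with this property; whether the original slide used it is an assumption. A minimal sketch that computes the statistics listed above for each of the four data sets:

```r
# Anscombe's quartet: four (x, y) pairs stored as columns x1..x4, y1..y4
summaries <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # OLS fit of y on x
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
})
colnames(summaries) <- paste0("set", 1:4)

# Each row is nearly constant across the four data sets
round(summaries, 2)
```

Plotting each pair (for example, plot(anscombe$x2, anscombe$y2)) shows four very different shapes behind these matching numbers.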
Why Visualization?
Modeling and visualizing are not sequential:
Build a model, where does it fail?
See a pattern, does it hold up in a model / test?
ggplot2
ggplot2 provides a system (“grammar”) for visualizations:
geom_*: the actual thing to be plotted (points, lines, etc.)
aes (aesthetics): mapping of aspects of data to aspects of the plot
scale_*: control mapping from ‘data space’ to ‘graphics space’
theme: basic non-data-dependent plot elements
guides: legends
stat_*: transformations of data used to plot (CDF, histogram counts)
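To make the stat_* idea concrete, here is a small sketch using the faithful data set from base R as a stand-in (an assumption; it is not used elsewhere in these slides). It uses layer_data() to show the transformed data that ggplot2 actually draws:

```r
library(ggplot2)

# stat_bin (used by geom_histogram) turns raw values into per-bin counts
p_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 20)

# stat_ecdf turns raw values into an empirical CDF
p_ecdf <- ggplot(faithful, aes(x = waiting)) +
  stat_ecdf()

# layer_data() returns the data after the stat has been applied
hist_data <- layer_data(p_hist)
head(hist_data[, c("x", "count")])

# Every observation lands in exactly one bin
sum(hist_data$count) == nrow(faithful)
```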
ggplot2 - Worked Example
Let’s plot the penguins data. To avoid warnings, use a no-NA version:
library (tidyverse)
penguins_ok <- penguins |> drop_na ()
ggplot (penguins_ok)
ggplot2 - Worked Example
Need to map specific variables to aspects of plots: aes mapping
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass))
Map flipper_len to \(x\)-axis
Map body_mass to \(y\)-axis
Basic labels and ticks auto-generated
Still nothing in the plot!
ggplot2 - Worked Example
Add a geom_* to draw plot elements
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass)) +
geom_point ()
\((x, y)\) coordinates inherited from aes
Default simple points
Plot elements are added with +
Combining elements into a single plot, not sequencing (no |>)
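A short sketch of what "adding" means in practice. It uses the built-in mtcars data as a stand-in so that it runs on any R version (an assumption; the slides use penguins_ok):

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))  # no layers yet
point_layer <- geom_point()                         # a layer is an ordinary R object

with_points <- base_plot + point_layer              # `+` returns a new plot object

length(base_plot$layers)    # 0: the original plot is unchanged
length(with_points$layers)  # 1
```

Because each piece is an object, layers and plots can be stored in variables and combined later; nothing is drawn until the plot is printed.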
ggplot2 - Worked Example
Replace default labels:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass)) +
geom_point () +
xlab ("Flipper Length (mm)" ) +
ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" )
Helper functions xlab, ylab, ggtitle
Can also use labs() directly with named arguments:
+ labs (x = "Flipper Length (mm)" ,
y = "Body Mass (g)" ,
title= "Body Mass vs Flipper Length for 333 Penguins" )
ggplot2 - Worked Example
Map species to the color aesthetic:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point () +
xlab ("Flipper Length (mm)" ) +
ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" )
Additional aesthetic (color) inherited by geom_point
Automatic identification of categorical (factor) data
ggplot2 - Worked Example
Replace default color scale:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point () +
xlab ("Flipper Length (mm)" ) +
ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
scale_color_brewer (name= "Species" , type= "qual" , palette= 2 )
Override default color scale with scale_color_brewer
Colors taken from work of Cynthia Brewer (PSU)
Using a qualitative (qual) palette here because species have no natural order
ggplot2 - Worked Example
Change theme for non-data elements:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point () +
xlab ("Flipper Length (mm)" ) +
ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
scale_color_brewer (name= "Species" , type= "qual" , palette= 2 ) +
theme_bw ()
Default is theme_grey()
Replaced here by theme_bw() (Black & White)
Many more themes available
ggplot2 - Worked Example
Override default aesthetic to change shape of points:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point (shape= 15 ) +
xlab ("Flipper Length (mm)" ) +
ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
scale_color_brewer (name= "Species" , type= "qual" , palette= 2 ) +
theme_bw ()
Override default shape aesthetic
Provide directly to geom_point, not via aes, since not data dependent
See ?scale_shape_discrete for table of values
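For reference, the shape codes 0 through 25 can also be drawn directly. This is a minimal sketch; the grid layout and the shape_table name are illustrative choices, not part of the original slides:

```r
library(ggplot2)

shape_table <- data.frame(
  shape = 0:25,
  col = 0:25 %% 7,       # grid position: 7 shapes per row
  row = -(0:25 %/% 7)
)

p_shapes <- ggplot(shape_table, aes(x = col, y = row)) +
  geom_point(aes(shape = shape), size = 5, fill = "grey70") +  # fill only affects shapes 21-25
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +   # print the code under each shape
  scale_shape_identity() +                                     # use the integer codes as-is
  theme_void()

# Print p_shapes to draw the table
```

Shape 15, used in the example above, is the filled square.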
ggplot2 - Worked Example
Add trend lines with stat_smooth:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point (shape= 15 ) +
stat_smooth (method= "lm" , se= FALSE ) +
xlab ("Flipper Length (mm)" ) + ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
scale_color_brewer (name= "Species" , type= "qual" , palette= 2 ) +
theme_bw ()
stat_* implement transformations
stat_smooth marks a trend
Specify use of OLS (lm = linear model) + disable SE shading
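To see that stat_smooth(method = "lm") really is OLS within each color group, the drawn lines can be compared with lm() fits done by hand. This sketch uses mtcars as a stand-in data set (an assumption) and passes formula = y ~ x explicitly to silence the default-formula message:

```r
library(ggplot2)

p_trend <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Slope of each drawn line, recovered from the data stat_smooth() computed
smooth_data <- layer_data(p_trend, 2)
drawn_slopes <- sapply(split(smooth_data, smooth_data$group), function(d) {
  unname(coef(lm(y ~ x, data = d))[2])
})

# Slope from fitting lm() by hand within each group
manual_slopes <- sapply(split(mtcars, mtcars$cyl), function(d) {
  unname(coef(lm(mpg ~ wt, data = d))[2])
})

# The two sets of slopes agree up to floating-point error
all.equal(unname(drawn_slopes), unname(manual_slopes))
```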
ggplot2 - Worked Example
Break data into subplots (“facets”) to avoid over-plotting:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point (shape= 15 ) + stat_smooth (method= "lm" , se= FALSE ) +
xlab ("Flipper Length (mm)" ) + ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
scale_color_brewer (name= "Species" , type= "qual" , palette= 2 ) +
theme_bw () + facet_wrap (~ species)
facet_wrap (split by one grouping) or facet_grid (show all pairs of groups)
The group_by of plotting
Called “small multiples”
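A side-by-side sketch of the two faceting functions, using the mpg data set that ships with ggplot2 as a stand-in (an assumption; it has two natural grouping variables):

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_wrap: one panel per level of a single variable, wrapped into rows
p_wrap <- base + facet_wrap(~ class)

# facet_grid: rows from one variable, columns from another
p_grid <- base + facet_grid(drv ~ cyl)

# One panel per vehicle class in the wrapped version
length(unique(layer_data(p_wrap)$PANEL)) == length(unique(mpg$class))
```

facet_grid lays out every row-by-column combination, so a panel stays empty when a combination has no data.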
ggplot2 - Worked Example
Remove redundant legend:
ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point (shape= 15 ) + stat_smooth (method= "lm" , se= FALSE ) +
xlab ("Flipper Length (mm)" ) + ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
scale_color_brewer (name= "Species" , type= "qual" , palette= 2 ) +
theme_bw () + facet_wrap (~ species) +
guides (color= "none" )
guides controls legends (also via scale_*)
Here redundant with facet labels
ggplot2 Workflow
Take a sad plot and make it better
Start with exploratory graphics:
Quick and easy
Find the story you want to tell
Let the data drive you
More use of raw data
Iterate to publication-quality graphics:
Repeat to improve quality
Tell a story to your reader
More use of transformations
ggplot2 Workflow
Many ways to get the same result
Practice Exercise
ggplot2 provides the diamonds data set of ~54K diamonds
Size, quality, price
4 C’s of Diamonds: Color, Cut, Clarity, Carat (Weight)
Return to breakout rooms for Practice Activity #01
ggplot2 Customization - Themes
Best practice:
Pick a starting theme you like and customize
theme_STARTER() + theme(things.i.change="to.new.values")
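One way this pattern might look in practice. The name theme_house and the specific element choices are hypothetical, and mtcars is a stand-in data set:

```r
library(ggplot2)

# Hypothetical house style: start from theme_bw() and change a few elements
theme_house <- theme_bw() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

p_house <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  ggtitle("Fuel Economy vs Weight") +
  theme_house

theme_house$legend.position  # "bottom"
```

Storing the customized theme in a variable lets the same style be reused across plots.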
ggplot2 Customization - Themes
ggplot2 has 8 built-in themes. The ggthemes and ThemePark packages have many more!
Let’s define a basic plot and see how it is rendered in different themes:
my_plot <- ggplot (penguins_ok,
aes (x= flipper_len, y= body_mass, color= species)) +
geom_point (shape= 15 ) + stat_smooth (method= "lm" , se= FALSE ) +
xlab ("Flipper Length (mm)" ) + ylab ("Body Mass (g)" ) +
ggtitle ("Body Mass vs Flipper Length for 333 Penguins" ) +
facet_wrap (~ species) + guides (color= "none" )
Reminder: ggplot2 only displays a plot when it is printed, so assign the plot to a variable if you want to keep modifying it
ggplot2 Customization - Themes
Default theme (theme_grey())
my_plot + theme_grey () # Default: can omit
ggplot2 Customization - Themes
Black-and-White theme (ggplot2::theme_bw()) - MW’s favorite
my_plot + theme_bw ()
ggplot2 Customization - Themes
Minimal theme (ggplot2::theme_minimal())
my_plot + theme_minimal ()
ggplot2 Customization - Themes
Light theme (ggplot2::theme_light())
my_plot + theme_light ()
ggplot2 Customization - Themes
Dark theme (ggplot2::theme_dark())
my_plot + theme_dark ()
ggplot2 Customization - Themes
Old-school MS Excel theme (ggthemes::theme_excel())
library (ggthemes)
my_plot + theme_excel ()
ggplot2 Customization - Themes
Google Docs theme (ggthemes::theme_gdocs())
library (ggthemes)
my_plot + theme_gdocs ()
ggplot2 Customization - Themes
Economist theme (ggthemes::theme_economist())
library (ggthemes)
my_plot + theme_economist ()
ggplot2 Customization - Themes
Wall St Journal theme (ggthemes::theme_wsj())
library (ggthemes)
my_plot + theme_wsj ()
ggplot2 Customization - Themes
Edward Tufte theme (ggthemes::theme_tufte())
library (ggthemes)
my_plot + theme_tufte ()
ggplot2 Customization - Themes
Barbie theme (ThemePark::theme_barbie())
library (ThemePark)
my_plot + theme_barbie ()
ggplot2 Customization - Themes
Oppenheimer theme (ThemePark::theme_oppenheimer())
library (ThemePark)
my_plot + theme_oppenheimer ()
ggplot2 Customization - Themes
Avatar theme (ThemePark::theme_avatar())
library (ThemePark)
my_plot + theme_avatar ()
ggplot2 Customization - Themes
Spiderman theme (ThemePark::theme_spiderman())
library (ThemePark)
my_plot + theme_spiderman ()
ggplot2 Customization - Themes
Game of Thrones theme (ThemePark::theme_gameofthrones())
library (ThemePark)
my_plot + theme_gameofthrones ()
ggplot2 Customization - Themes
Most of these are silly
Adapt and extend!
Color Palettes
Three types of color palettes:
Sequential: ordered from 0 to “high”
Example: rain forecast in different areas
Diverging: ordered from -X to +Y with a meaningful 0 in the middle
Example: political leaning
Qualitative: no ordering
Two ways to make a color scale for quantitative variables:
Binned: \([0, 1)\) light green; \([1, 3)\) medium green; \([3, 5]\) dark green
Continuous
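The binned example above can be reproduced with cut(). This is a sketch on made-up data: the rain data frame, its area labels, and its amounts are all hypothetical.

```r
library(ggplot2)

rain <- data.frame(
  area = c("A", "B", "C", "D"),  # hypothetical forecast areas
  inches = c(0.5, 1, 2.9, 5)     # hypothetical rainfall amounts
)

# Binned: intervals closed on the left, with the top bin also closed on the right
rain$bin <- cut(rain$inches, breaks = c(0, 1, 3, 5),
                right = FALSE, include.lowest = TRUE)
levels(rain$bin)  # "[0,1)" "[1,3)" "[3,5]"

# Binned scale: one discrete green per interval
p_binned <- ggplot(rain, aes(x = area, y = inches, fill = bin)) +
  geom_col() +
  scale_fill_brewer(palette = "Greens")

# Continuous scale: a smooth gradient, light for low values
p_continuous <- ggplot(rain, aes(x = area, y = inches, fill = inches)) +
  geom_col() +
  scale_fill_distiller(palette = "Greens", direction = 1)
```

Binning by hand with cut() gives full control over the break points; the binned scales shown on later slides choose breaks automatically.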
Color Palettes
I often rely on the work of Cynthia Brewer
https://colorbrewer2.org
Officially for cartography, but generally useful
Different (punny) ggplot2 names for different derived scales
Color Palettes
scale_color_brewer() for discrete scales
library (ggplot2)
ggplot (penguins_ok, aes (x= bill_len, y= bill_dep, color= species)) +
geom_point () + theme_bw () +
scale_color_brewer (type= "qual" ) # Qualitative
Color Palettes
scale_color_distiller() for continuous scales
library (ggplot2)
ggplot (penguins_ok, aes (x= bill_len, y= bill_dep, color= body_mass)) +
geom_point () + theme_bw () +
scale_color_distiller (type= "seq" ) # Continuous
Color Palettes
scale_color_fermenter() for binned scales
library (ggplot2)
ggplot (penguins_ok, aes (x= bill_len, y= bill_dep, color= body_mass)) +
geom_point () + theme_bw () +
scale_color_fermenter (type= "seq" ) # Binned
Color Palettes
library (ggplot2)
ggplot (penguins_ok, aes (x= bill_len, y= bill_dep, color= body_mass)) +
geom_point () + theme_bw () +
scale_color_fermenter (type= "seq" ) # Binned + Sequential
Color Palettes
library (ggplot2)
ggplot (penguins_ok, aes (x= bill_len, y= bill_dep, color= body_mass)) +
geom_point () + theme_bw () +
scale_color_fermenter (type= "qual" ) # Binned + Qualitative
Warning: Using a discrete colour palette in a binned scale
ℹ Consider using `type = "seq"` or `type = "div"` instead
Color Palettes
library (ggplot2)
ggplot (penguins_ok, aes (x= bill_len, y= bill_dep, color= body_mass)) +
geom_point () + theme_bw () +
scale_color_fermenter (type= "div" ) # Binned + Diverging
“Hard-coding” Colors
scale_color_identity will take color names from a column:
penguins_ok |>
mutate (color_column = if_else (species == "Adelie" , "blue" , "red" )) |>
ggplot (aes (x= bill_len, y= bill_dep, color= color_column)) +
geom_point () + theme_bw () +
scale_color_identity (name= "Species" )
Practice Exercise
Return to the cdiac data from the Warm-Up
Visualize trends in time series
Use geom_line to create line plots
Return to breakout rooms for Practice Activity #02
Simpson’s Paradox
Data Visualization can be used to find counterintuitive trends in data:
ggplot (simpsons_paradox, aes (x= x, y= y)) +
geom_point () + stat_smooth (method= "lm" )
Simpson’s Paradox
Overall trend does not need to match trend within groups
ggplot (simpsons_paradox, aes (x= x, y= y, color= group)) +
geom_point () + stat_smooth (method= "lm" ) + facet_grid (~ group)
Modeling: ANCOVA or Mixed-Effects Regression
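The simpsons_paradox data used on these slides is not shown, so the sketch below simulates a comparable data set (simpsons_sim, with hypothetical group centers and noise levels) and contrasts a pooled regression with an ANCOVA fit:

```r
set.seed(1)  # simulated data; all numbers below are illustrative assumptions

centers <- c(a = 2, b = 5, c = 8)
simpsons_sim <- do.call(rbind, lapply(names(centers), function(g) {
  x <- centers[[g]] + rnorm(50, sd = 0.7)
  # Within a group y falls as x rises; across groups the level rises with the center
  y <- 2 * centers[[g]] - (x - centers[[g]]) + rnorm(50, sd = 0.3)
  data.frame(x = x, y = y, group = g)
}))

pooled <- lm(y ~ x, data = simpsons_sim)          # ignores the groups
ancova <- lm(y ~ x + group, data = simpsons_sim)  # common slope, one intercept per group

coef(pooled)[["x"]]  # positive: the overall trend
coef(ancova)[["x"]]  # negative: the within-group trend
```

A mixed-effects version would treat the group intercepts as random effects (for example with the lme4 package); it is omitted here to keep the sketch free of extra dependencies.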
UCB Graduate Admissions
1973: UC Berkeley was concerned about bias in Grad School Admissions
Higher fraction of men admitted than women
Bickel, Hammel, and O’Connell were asked to study
When they looked for the source of this bias department by department, they found none!
There was little evidence of bias against women within departments; several admitted women at a higher rate than men
Women applied to more selective programs at a higher rate
This phenomenon occurs throughout the social sciences: the best doctors can have the worst patient outcomes because they take the hardest cases
BHO note:
Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.
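Base R ships a subset of these data as UCBAdmissions (counts for the six largest departments only), so the reversal can be checked directly. This sketch illustrates the pattern on that subset rather than reproducing the full study:

```r
# UCBAdmissions: Admit x Gender x Dept counts for the six largest departments
overall <- margin.table(UCBAdmissions, c(1, 2))  # collapse over departments
overall_rate <- prop.table(overall, margin = 2)["Admitted", ]
round(overall_rate, 3)  # men are admitted at a higher overall rate

# Admission rate within each Gender x Dept cell
by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(by_dept, 3)

# Departments where the admission rate for women exceeds the rate for men
names(which(by_dept["Female", ] > by_dept["Male", ]))
```

In this subset, women have the higher admission rate in four of the six departments (A, B, D, and F), even though the pooled rate favors men.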
Red State Blue State
Facts about US politics (c. 2008):
Rich people vote Republican at higher rates
Rich states vote Republican at lower rates
Rich states are rich because they have richer people
How can we reconcile this?
ggplot2 vs Tableau
Tableau
$$$
IT department automatically integrates with data sources
Easy, if it does what you want
ggplot2
Free
Can use arbitrary data sources, with effort
Flexible / customizable
ggplot2 vs matplotlib
ggplot2
Data visualizations
Enforces “good practice” via the grammar of graphics (gg)
matplotlib
Scientific visualizations
More flexible for good or for ill
Inspired by Matlab plotting
Closest Python analogue to ggplot2 is seaborn
Why use + instead of |>
ggplot2 is older than |>
Per H. Wickham: if ggplot3 ever gets made, it will use |>
Unlikely to change: too much code depends on it
Overplotting
Large data sets can lead to overplotting:
Points “on top of” each other
Can also occur with “designed” experiments / rounded data
Ways to address:
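The list of remedies did not survive on this slide. Three commonly used options, offered as a sketch rather than as the author's own list, are transparency, 2D binning, and jittering. The examples use the diamonds and mpg data sets that ship with ggplot2:

```r
library(ggplot2)

big <- ggplot(diamonds, aes(x = carat, y = price))

p_alpha <- big + geom_point(alpha = 0.05)  # transparency: dense regions look darker
p_bins  <- big + geom_bin_2d(bins = 60)    # 2D binning: fill encodes the count per cell

# Rounded data: many cars share the same integer (cty, hwy) pair
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)   # small random offsets separate the points
```

Faceting, covered earlier, is a fourth option when the overlap comes from several groups sharing one panel.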
Inside vs. Outside aes()
aes maps data to values. Outside of aes, set a constant value
library (ggplot2)
ggplot (penguins_ok,
aes (x= bill_len, y= bill_dep, color= species)) +
geom_point ()
Inside vs. Outside aes()
aes maps data to values. Outside of aes, set a constant value
library (ggplot2)
ggplot (penguins_ok,
aes (x= bill_len, y= bill_dep))+ geom_point (color= "blue" )
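A common slip is to put a constant inside aes(). The sketch below, on mtcars as a stand-in data set (an assumption), shows the difference by inspecting the colors that are actually drawn:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

p_constant <- base + geom_point(color = "blue")       # outside aes(): a literal color
p_mapped   <- base + geom_point(aes(color = "blue"))  # inside aes(): a one-level variable

unique(layer_data(p_constant)$colour)  # "blue"
unique(layer_data(p_mapped)$colour)    # a default palette color, not blue
```

In the second plot the string "blue" is treated as data, so ggplot2 assigns it the first color of the default discrete palette and adds a legend.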
Global vs geom_ specific aes()
Elements set in ggplot() apply to the entire plot
Elements set in a specific geom apply there only
library (ggplot2)
ggplot (penguins_ok,
aes (x= bill_len, y= bill_dep, color= species))+
stat_smooth () + geom_point ()
Global vs geom_ specific aes()
Elements set in ggplot() apply to the entire plot
Elements set in a specific geom apply there only
library (ggplot2)
ggplot (penguins_ok,
aes (x= bill_len, y= bill_dep)) +
stat_smooth () + geom_point (aes (color= species))
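The practical consequence is the number of trend lines. This sketch uses mtcars as a stand-in data set, with method = "lm" and an explicit formula to keep the output quiet (both are assumptions, not the slides' settings):

```r
library(ggplot2)

# Color mapped globally: the smoother inherits it and fits one line per group
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  stat_smooth(method = "lm", formula = y ~ x) + geom_point()

# Color mapped only in geom_point: the smoother sees a single group
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  stat_smooth(method = "lm", formula = y ~ x) +
  geom_point(aes(color = factor(cyl)))

length(unique(layer_data(p_global, 1)$group))  # 3: one trend line per group
length(unique(layer_data(p_local, 1)$group))   # 1: a single overall trend line
```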
How to Select geoms
Two “modes”
Exploratory data analysis: Quick, rapid iteration, for your eyes only
Let the data tell you a story
Low pre-processing: scatter plots, lines, histograms
“Publication quality”: Polished, for someone else
You tell the reader a story
More processing, more modeling: trends, line segments, ribbons
Order of Layers
Order of layers technically matters, but the effect is small
p1 <- ggplot (penguins_ok, aes (x= bill_len, y= flipper_len)) +
geom_point (color= "black" ) +
stat_smooth (color= "blue" , method= "lm" ) + ggtitle ("Line on points" )
p2 <- ggplot (penguins_ok, aes (x= bill_len, y= flipper_len)) +
stat_smooth (color= "blue" , method= "lm" ) +
geom_point (color= "black" ) + ggtitle ("Points on line" )
library (patchwork) # provides + for placing two ggplot objects side by side
p1 + p2
Order of Layers
Order matters more with themes. Adding a complete theme_*() after theme() discards the earlier theme() customization:
p1 <- p1 + theme_bw () + theme (legend.position= "bottom" )
p2 <- p2 + theme (legend.position= "bottom" ) + theme_bw ()
p1 + p2
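The same effect can be checked on the theme objects themselves, without drawing anything. A minimal sketch (the variable names are illustrative):

```r
library(ggplot2)

customized_last  <- theme_bw() + theme(legend.position = "bottom")
customized_first <- theme(legend.position = "bottom") + theme_bw()

customized_last$legend.position   # "bottom": the tweak survives
customized_first$legend.position  # "right": theme_bw() restored its default
```

The rule of thumb is to add the complete theme first and the theme() tweaks afterwards.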
stat_poly_line vs stat_smooth
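By default stat_smooth picks its method from the data size: loess for fewer than 1,000 observations, a generalized additive model (GAM) for larger data
ggpmisc::stat_poly_line and stat_poly_eq fit linear models, so they can expose more machinery (for example, the fitted equation as a plot label).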
What is a GAM?
Take 9890 with me (typically Spring semester) to find out!
Free Course: “GAMs in R” from Noam Ross
Titles and Captions
ggplot () +
labs (title= "Title" , subtitle= "Subtitle" , caption= "Caption" ,
tag= "Tag" , alt= "Alt-Text" , alt_insight= "Alt-Insight" )
+ggtitle("text") is just shorthand for +labs(title="text")
Importance of Aesthetics
Perceptually:
Location > Color > Size > Shape
Humans are better at:
FAQ: When to Use Facets?
Facets are group_by for plots. Useful for:
Distinguishing intra- vs inter-group trends
Avoiding overplotting
Twin Axes Plots
How can I implement a dual (twin) axis plot in ggplot2?
Disfavored. But if you must …
sec.axis: doesn’t allow arbitrary secondary axes; only allows transformations of the primary axis (e.g., Celsius and Fahrenheit)
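A sketch of the Celsius and Fahrenheit case, using the airquality data set from base R (daily temperatures in Fahrenheit) as a stand-in; the axis labels are illustrative:

```r
library(ggplot2)

p_temp <- ggplot(airquality, aes(x = seq_along(Temp), y = Temp)) +
  geom_line() +
  scale_y_continuous(
    name = "Temperature (F)",
    # The secondary axis is a one-to-one transformation of the primary axis
    sec.axis = sec_axis(~ (. - 32) * 5 / 9, name = "Temperature (C)")
  ) +
  xlab("Day of record")

# Print p_temp to draw the plot
```

Both axes describe the same series, which is why this form is allowed: the right-hand axis only relabels the left-hand one.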