STA/OPR 9750 Week 7 Pre Assignment: Lots of Plots

Due Date: 2024-10-16 (Wednesday) at 11:45pm

Submission: CUNY Brightspace

This week we begin to explore statistical visualizations. Visualizations play several interrelated roles in statistical practice: we use visualization to explore new data sets, to see how well models fit to data, and to communicate the results of analyses to new audiences. Compared with the ‘point summary’ tools we have discussed to date, visualizations are far more flexible and more powerful: we can extract novel insights from data visualizations, but we can also deceive ourselves and others.

As you review this document, also watch how I iterate and refine each figure until I have something I’m finally happy with. This is quite typical of how working data scientists produce plots: you rarely know exactly what you want, particularly before you begin to explore your data. You should adapt a similar pattern of “take a sad plot and make it better” as you create plots for your mini-projects and, ultimately, your final project.

Grammar of Graphics

Computer graphics is a field of essentially infinite possibility - anything you dream can be represented in the digital domain. At one limit, we have the option to work on a “per pixel” basis, telling the computer exactly what to draw and where. Because this is clearly overwhelmingly monotonous, very little work is actually done at such fine detail and abstraction layers are provided to automate the low-level detail work. We will use an abstraction model known as the Grammar of Graphics.

Designed by Leland Wilkinson and popularized by Hadley Wickham, the Grammar of Graphics poses a set of rules for visualizing (tidy) data. It draws its name from the concept of linguistic grammar - the rules that dictate how basic elements (nouns, verbs, and adjectives) may and may not be combined into clear and meaningful sentences The Grammar of Graphics provides specifications for combining different plot elements (legends, data, axes, etc.) into clear and meaningful statistical graphics. Taking the metaphor too far, we might say that the Grammar of Graphics is the Strunk and White to the Ska that are “Infographics.”

The grammar of graphics has several interconnected components, which are combined to form a meaningful graphic. We assume that we have a “tidy” data set we want to visualize; recall that, by “tidy”, we mean that our data is

  • Organized in a rectangular array
  • Homogenously-typed within a column
  • One observation per row
  • One value per cell

Given this form of tidy data, the grammar of graphics provides us the “parts of speach” necessary to convert tidy data to a visual representation. We’ll only cover the basic components here, leaving more advanced tools for class session:

  • The aesthetics are mappings between columns of the data and aspects of the visualization. For instance, “put the grade column on the \(y\) axis” or “use color to represent the course ID”.
  • Scales convert data values to the aesthetics: scales may be quite trivial, e.g. placing continuous values on the \(x\) axis in proper order, or more advanced, e.g., binning values and converting them to a sequence of perceptually-ordered colors. )
  • Geometric elements or geoms specify how the data are represented on the page through the scales. geoms include basic representations, like points for a scatter plot or lines for a trend plot, as well as more complex objects like boundaries on a map.
  • Guides provide interpretational assistance to the viewer. Most guides take the form of legends.

That’s all a bit abstract, so let’s put it into practice. For now, you shouldn’t worry so much about what each of these really mean; it’s just useful to have a rough sense of what “knob” you want to turn to modify plots.

Getting started with ggplot2

The leading implementation of the grammar of graphics is the ggplot2 package in R (gg = Grammar of Graphics). It comes to us from Hadley Wickham and the tidyverse team, who also developed the dplyr and tidyr tools we have been using for the past several weeks.

Let’s begin by using ggplot2 to explore our penguins data:

library(ggplot2)
library(palmerpenguins)
ggplot(penguins)

Somewhat underwhelming…

ggplot2, like many of the tools in this course, do exactly what we ask, and no more. Because we have not specified any of the Grammar of Graphics elements, we only get a blank canvas. Let’s now begin by adding an aesthetic to map some of the elements of our data to aspects of our plot.

Specifically, suppose we want to see how flipper length correlates with body mass. Let’s make a scatter plot with flipper length on the \(x\)-axis and body mass on the \(y\)-axis.1

ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g))

Ok - this is perhaps a bit better. We can see that the columns flipper_lenth_mm and body_mass_g have been placed on the \(x\)- and \(y\)-axes as we wanted, but we still don’t see anything.

We need a geom to actually put “ink to paper”. The simplest geom is a point, useful for making scatter plots.

ggplot(penguins, 
       aes(x=flipper_length_mm, y=body_mass_g)) + 
    geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Pretty nifty! Before we go forward, note that ggplot2 adds elements to create ever more complex plots. This is different than dplyr where we “piped” data from one step to the next, refining it along the way.

How can we improve the plot above? Before anything else, let’s clean up the \(x\) and \(y\) axis labels. While the default behavior of showing variable names is helpful for exploratory data analysis, we never want to let variable names “leak” in plots we intend to share with others. We should instead use meaningful (and attractive) axis labels.

ggplot(penguins, 
       aes(x=flipper_length_mm, y=body_mass_g)) + 
    geom_point() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

We can also take advantage of the theme mechanisms of ggplot2 to change the color of the “infrastructure” of our plot. The theme mechanism doesn’t change how the data itself is visualized, but it controls how things like the background, font sizing, etc behave. I tend to prefer the black and white theme over the grey default:

ggplot(penguins, 
       aes(x=flipper_length_mm, y=body_mass_g)) + 
    geom_point() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding Color to Depict Species

Our data set three distinct species and the correlation between flipper and body size may vary across species. Let’s add some color to our plot: since color maps a data element (species) to a graphical aspect (color) we add it to our aesthetic mapping (aes).

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

We see here that a few things happened automatically for us:

  1. The color element was automatically propagated into geom_point. By default, any “top-level” aesthetics are automatically applied to any geom that can handle them.
  2. A legend was created.
  3. A color scale was chosen.

The geom_point help page tells us which aesthetics are required (\(x, y\)) and which are optional for the point geom. We couldn’t have gotten away without providing \(x, y\) coordinates, but until this point, we were just using the default (black) color.

To improve the look of the colors, we can choose a different color scale. I tend to like the colors of the Color Brewer project, though strictly speaking these are designed for use in maps, not scatter plots. You can access these in ggplot2 as follows:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", palette=2)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, I’m using a type="qual" (qualitative) palette because there is no inherent ordering to the penguin types. I like the “bolder” colors of the second palette in this set, but you can adjust the number to try different schemes.

Next, let’s improve the look of the legend. As before, we see that it is by default titled with the variable name (species). We can provide a proper title instead:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

I still don’t like this legend on the side - it takes too much room, so let’s move it below the image instead. This involves changing a “non-data” element of the plot, so we go through the theme machinery. theme() allows an enormous number of possible changes, but here we want legend.position:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Not so bad!

Visualizing Statistical Relationships

Recall that our goal was to measure the correlation between Body Mass and Flipper Length. We can visualize this correlation on the plot by adding a regression line (recall that for univariate regression like this, the slope of the regression line is \(\hat{\beta} = \rho_{XY}\frac{\sigma_Y}{\sigma_X}\)).

This is a new geometric element, called a smoother. ggplot2 allows many possible smoothers, but let’s use the lm (linear model) version, which we specify by setting the method argument:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    geom_smooth(method="lm") + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

By default, ggplot2 is giving us confidence intervals around the linear trend. These are sometimes useful, but perhaps a bit crowded for now, so let’s turn them off:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    geom_smooth(method="lm", 
                se=FALSE) + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

This is nicer, but maybe still a bit crowded. It would be nicer if we could avoid the “overlaps” of the different species. Here, let’s break out a “small multiples” plot: this is, in essence, a group_by for plotting.

In ggplot2 speak, this is called a faceted plot:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    geom_smooth(method="lm", 
                se=FALSE) + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Not too shabby. But now it’s a bit too hard to tell the lines from the points. Let’s override the color used:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    geom_smooth(method="lm", 
                se=FALSE, 
                color="black") + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

These plots give us some idea of the correlation, but what if we want actual numbers? We can’t do this in “plain” ggplot2, but now we can take advantage of the enormous number of ggplot2 extension packages. It turns out that the ggpmisc package supports what we need, so let’s download and install it:

if(!require("ggpmisc")) install.packages("ggpmisc")
Loading required package: ggpmisc
Loading required package: ggpp
Registered S3 methods overwritten by 'ggpp':
  method                  from   
  heightDetails.titleGrob ggplot2
  widthDetails.titleGrob  ggplot2

Attaching package: 'ggpp'
The following object is masked from 'package:ggplot2':

    annotate
library(ggpmisc)

Now we have access to the various geom_, scale_, etc objects from that package. We can now introduce a new category, stat_, that represents statistical transformations or modeling. Generally, these are applied “automagically” for us, as in geom_smooth, but here we need to build our regression models explicitly:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    stat_poly_line(se=FALSE, 
                   color="black") +
    stat_poly_eq() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_line()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_eq()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

See how, even though we’re using functionality from outside ggplot2, the structure of the “grammar” makes it easy for all these tools to work well together.

Final Polish

We are almost done, but every figure needs a bit of final polish. Firstly, we should add a title, using the ggtitle function. (R has a built-in title function but that won’t help us here)

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    stat_poly_line(se=FALSE, 
                   color="black") +
    stat_poly_eq() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species) + 
    ggtitle("Correlation of Flipper Length and Body Mass across Penguin Species")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_line()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_eq()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

In good academic practice, we should always add a footnote citing the source of our data. The palmerpenguins site has appropriate source information:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    stat_poly_line(se=FALSE, 
                   color="black") +
    stat_poly_eq() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species) + 
    ggtitle("Correlation of Flipper Length and Body Mass across Penguin Species") + 
    labs(caption="Data provided by Dr. K. Gorman and the Palmer Station, Antarctica LTER")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_line()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_eq()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

At this point, the bottom of our figure looks a bit crowded. To clear out some space, let’s remove the legend from the bottom, since it simply repeats the facet labels:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    stat_poly_line(se=FALSE, 
                   color="black") +
    stat_poly_eq() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species) + 
    ggtitle("Correlation of Flipper Length and Body Mass across Penguin Species") + 
    labs(caption="Data provided by Dr. K. Gorman and the Palmer Station, Antarctica LTER") + 
    guides(color="none")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_line()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_eq()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Cleaner and no loss of information. In designing good scientific graphics, the concept of a “ink-to-information” ratio is useful: if you can remove some ink without removing any (relevant) information, you should generally do so. This makes it easier for the reader to identify the important elements of the plot.

To make our point even clearer, it is sometimes useful to add a short “summary” to a plot like this:

ggplot(penguins, 
       aes(x=flipper_length_mm, 
           y=body_mass_g, 
           color=species)) + 
    geom_point() + 
    stat_poly_line(se=FALSE, 
                   color="black") +
    stat_poly_eq() + 
    xlab("Flipper Length (mm)") + 
    ylab("Body Mass (g)") + 
    theme_bw() + 
    scale_color_brewer(type="qual", 
                       palette=2, 
                       name="Species") + 
    theme(legend.position="bottom") + 
    facet_wrap(~species) + 
    ggtitle("Correlation of Flipper Length and Body Mass across Penguin Species", 
            subtitle="Flipper Length and Body Mass are positively correlated across species\nGentoo penguins exhibit the strongest relationship at 70% correlation") + 
    labs(caption="Data provided by Dr. K. Gorman and the Palmer Station, Antarctica LTER")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_line()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_poly_eq()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

This is not always a good idea - it requires “hard-coding” the insights for your reader and takes up some space. In scientific writing, I generally prefer to put this sort of summary in a figure caption, while I tend to “say the point” verbally if the figure is destined for a presentation.

Personally, I only use this sort of “here is the point” text if I expect a figure to “escape beyond” my presentation and need it to stand fully on its own.

Conclusions

ggplot2 provides an exceptionally powerful and flexible set of tools for creating statistical visualizations. We will explore it in more depth in class. For now, review the examples above and make sure you see how each plot is created “piecewise” from its various components.

To see more about what ggplot2 can do, check out the R Graphics Gallery. If you want to see the specifics of each ggplot2 function, check out the package reference page. To go further with ggplot2, you should also explore its extension gallery.

Footnotes

  1. I once had a boss who refused to use the terms \(x\) and \(y\) axis unless quantities called \(x\) and \(y\) were actually being plotted. Being Cantabrigian, he insisted I use the terms abscissa and ordinate. I will not inflict such pedantry in this course.↩︎