STA 9750 Week 7 In-Class Activity: More Thoughts on Plots

Slides

Review Practice

The Carbon Dioxide Information and Analysis Center studies the effect of carbon dioxide on global and local temperature trends. A key tool in their analysis is the temperature “anomaly”. An anomaly is the difference between observed temperature (in a world with anthropogenic atmospheric CO2) and ‘natural’ temperature (from a world without anthropogenic gases). Note that these anomalies require significant analysis to compute and are not “simple observational” data.

Politicians have adopted the tools of temperature anomaly to set national and international emissions targets, e.g., the 2 Degree Target. Note that 2 degrees is calculated as a global average: in practice, some regions will experience a much larger change in temperature, while others may experience a smaller change or even a negative change.

The CVXR package includes the cdiac dataset, capturing CDIAC’s estimated global temperature anomalies from 1850 to 2015. In this question, you will explore these estimated anomalies. Note that you may want to install the CVXR package before attempting this question on your personal computer.[^1] Because CVXR does not work reliably through this online interface, I have also posted a copy of this data on the course website.

As you can see, this data has 14 columns, with 12 monthly anomalies and 1 annual average anomaly, for a period covering 166 years from 1850 to 2015. Use this data to answer the following questions:

  1. In what year was the highest annual anomaly observed? The lowest?

We can select the row with the highest and lowest values using the slice_max and slice_min functions from the dplyr package. If we want to specifically get the year column after selecting a single row, we can use the pull function to pull out that one column into a vector:

  1. For how many months was 2015 the highest anomaly recorded?

    Hint: It will be helpful to pivot_longer this data into a longer format. Note that you can use the : operator when specifying a set of columns to pivot

    library(tidyverse)
    library(CVXR)
    data(cdiac)
    cdiac |> 
      select(-annual) |> 
      pivot_longer(cols=jan:dec, names_to="month", values_to="anomaly")
  1. For how many years did July have the largest anomaly of that year?

    Hint: You should first pivot the data as in the previous question.

  1. What is the probability that a given July exhibits a greater (more positive) anomaly than the annual average anomaly for that year?

    Hint: Remember that probabilities can be estimated by constructing a Boolean (TRUE/FALSE) vector and taking its mean.

  1. Which month has the highest probability of exceeding the annual anomaly in a given year?

Plotting with ggplot2

This week, we’re going to break into project groups and do three ggplot2 exercises of increasing difficulty. As you work through these with your teammates, be sure to reflect on what plots and what tools you will need to best present your mini-project and course project findings.

Exercise 1: Basic ggplot2

In this exercise, you will create ggplot2 graphics to analyze the diamonds data from the ggplot2 package. This data contains pricing and measurements for over 50,000 diamonds sold in the US. (Note that these prices are rather out of date and pre-date the popularity of lab-grown diamonds.) Before beginning this exercise, you might want to read about the “4 C’s of Diamonds” commonly used to measure quality.

  1. Make a scatter plot of price vs carat.
  1. Use stat_smooth to see how the price-carat relationship changes by color.
  1. Take your scatter plot from the previous question and iterate it to “publication quality”.

If you want to require your trend-line to be monotone increasing, see this StackOverflow answer.

  1. Further refine your plot by adding a two-way small-multiples design for color and cut.
  1. Create a frequency polygon plot of price, broken out by different diamond cuts.

    Hint: Before trying this new geom_, make sure to read the documentation and identify required aesthetics.

  1. Create a scatter plot of color by clarity. Why is this plot not useful?

This plot is not useful because both the \(x\)- and \(y\)-axis are discrete (categorical) values, resulting in significant overplotting. Alternative visualizations should be used when considering data of this type.

Exercise 2: Trend Analysis with ggplot2

Let’s return to the cdiac data discussed above. In this set of exercises, you will use geom_line to create line plots visualizing trends in this data.

  1. Use geom_line to plot the estimated annual global mean temperature (GMT) anomaly from 1850 to 2015.
  1. Plot the GMT anomaly for each month on the same plot as different lines.

    Hint: You will need to use a pivot_longer to get the data into the appropriate shape. ggplot2 expects one “data point” per row.

This solution has a problem, however: the months appear in alphabetical order rather than their actual order. To address this, we need to transform month into an factor type with the levels in a specific order. This is a hair beyond the scope of this course, but I’ll show you how to do so using the forcats package (tidyverse tools for factors) here:

When dealing with this sort of ‘spaghetti’ (overlapping lines) plot, one of my favorite strategies is to plot some sort of average in bolder line and to plot the individual entries that go into that average semi-transparently. For instance:

This lets us see both the overall trend and use the month-to-month variability as a proxy for sampling variability.

  1. Plot the monthly GMT anomaly series as one long line (with a point for each month).

    Hint: Use the ym function from the lubridate package to combine the year and month into a single Date object. Something like time=lubridate::ym(paste(year, month)) should suffice.

  1. Now let’s focus on July: use the runmed() function to add a second series to the data frame containing the running median July GMT anomaly over the previous 5 years. Plot this and the raw July data on the same plot. Is there evidence of an increasing warming trend?

Since the overlay of the individual years over the trend is a bit noisy, we can adjust the alpha aesthetic (transparency) to make the individual years a bit less visible.

  1. For each year, identify the warmest month (as measured by GMT anomaly). Create a bar char showing the probability a given month was the hottest (greatest anomaly) in its year.

    Hint: Make sure your \(x\)-axis is in reasonable (chronological) order, not alphabetical. The following code snippet may help:

    mutate(month = factor(month, 
                           levels=str_to_lower(month.abb), 
                           labels=month.name,
                           ordered=TRUE))

    This takes a column called month and marks it as an ordered categorical variable (factor), with order given by str_to_lower(month.abb) (month.abb is a built-in variable in R, but it is capitalized, unlike the cdiac variables) and names to be printed from month.name.