Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 8

STA 9750 Week 8

Today:

  • Tuesday Section: 2025-10-28
  • Thursday Section: 2025-10-23
  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R ⬅️
    • Static Plots ✅
    • Interactivity, Maps, Animated Plots ⬅️
  • Getting Data into R
  • Statistical Modeling in R

Today

  • Course Administration
  • Warm-Up Exercises
  • Advanced ggplot2
  • PA#08 FAQs
  • Wrap-Up
    • Life Tip of the Day

Course Administration

GTA

Charles Ramirez is our GTA

  • Wednesday Office Hours at 6pm
  • Working on Meta-Review

Mini-Project #02

MP#02 - Making Backyards Affordable for All

Due 2025-10-31 at 11:59pm ET

  • GitHub post (used for peer feedback) AND Brightspace
  • Start early to avoid Git issues

Pay attention to the rubric

  • Writing and presentation are about 50% of your grade
  • Evaluated on rigor and thoughtfulness
  • Use what you learned from MP#01

Rubric updated to clarify opportunities for and limits to extra credit

Mini-Project #03

MP#03 now posted. Due 2025-11-14 at 11:59pm ET

Visualizing and Maintaining the Green Canopy of NYC

Topics covered:

  • Data Import
    • One static file
    • One API call
  • Spatial Data
    • Very basic spatial joins
    • Spatial visualizations (maps!)

Compare to NYC Tree Map

Grading in Progress

I owe you:

  • Project Proposal Feedback
  • MP#01 Meta-Review Grades
  • (Selected) MP#01 Regrades

Pre-Assignments

Brightspace:

  • Reading, typically on course website
  • Brightspace auto-grades for completion

Next PA is due 2025-11-03 at 11:59pm ET:

  • Quick: what are files? How do we read them into R?

Course Support

  • Synchronous
    • MW Office Hours 2x / week: Tuesdays + Thursdays 5pm
      • Rest of Semester except Thanksgiving (Nov 27th)
    • GTA Office Hours: Wednesdays at 6pm
  • Asynchronous: Piazza (\(<20\) minute average response time)

Continual Improvement

I maintain a TODO file with ideas of things I want to improve for next cohort.

Suggestions welcome.

Every semester, I create new mini-projects. Ideas and suggestions very welcome

  • Topics and data sets are both great

Future Mini-Projects

  • MP#04:
    • Deadline: 2025-12-05 at 11:59pm ET
    • Topic: BLS Monthly Employment Reports

Course Project

Course Project should be your main focus for rest of course

  • But you still need to do mini-projects and pre-assignments(!)

Proud of You!

A personal note, if you allow me:

I’m teaching an alternate version of STA 9750 this semester (different degree program, different curricular goals). The difference in the quality of your analyses and presentation (even on MP#01) is night and day. I hope you are proud of the work you are doing - I know I am.

Your effort has not gone unnoticed - I know this class starts “pedal-to-the-metal” but hopefully you’ve seen just how powerful these tools are.

More than that - I appreciate your good attitude and willingness to share your frustrations and triumphs. Reading comments, even if I don’t respond, is uplifting.

Review Exercise

SSA Registered Births

Time series data of babies born and registered with the Social Security Administration (previously in Lab #05)

Rows: 7305 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): year, month, day, births, day_of_year, day_of_week, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 7,305
Columns: 7
$ year        <dbl> 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969…
$ month       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ births      <dbl> 8486, 9002, 9542, 8960, 8390, 9560, 9738, 9734, 9434, 1004…
$ day_of_year <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ day_of_week <dbl> 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1…
$ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…

SSA Registered Births

Practice visualization:

  1. Are there months with more births?
  2. Are there days of the week with more births?
  3. Is there a long-term trend in births?

See Lab #08 for details.
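
One possible starting point for questions 1 and 3, sketched on simulated data. The `births_demo` data frame is a made-up stand-in that reuses the column names shown above; its `births` values are random noise, not the real SSA counts, and the 1 = Monday coding of `day_of_week` is an assumption. Swap in the real data from the lab before drawing any conclusions.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in (assumption): same column names as the SSA data, random births
set.seed(9750)
dates <- seq(as.Date("1969-01-01"), as.Date("1988-12-31"), by = "day")
births_demo <- data.frame(
  year        = as.integer(format(dates, "%Y")),
  month       = as.integer(format(dates, "%m")),
  day_of_week = as.integer(format(dates, "%u")),  # %u: 1 = Monday, ..., 7 = Sunday
  births      = round(rnorm(length(dates), mean = 9500, sd = 500))
)

# Q1: average daily births in each calendar month
by_month <- births_demo |>
  group_by(month) |>
  summarize(avg_births = mean(births))

ggplot(by_month, aes(x = factor(month), y = avg_births)) +
  geom_col() +
  labs(x = "Month", y = "Average daily births")

# Q3: long-term trend, using one average per year
by_year <- births_demo |>
  group_by(year) |>
  summarize(avg_births = mean(births))

ggplot(by_year, aes(x = year, y = avg_births)) +
  geom_line() +
  labs(x = "Year", y = "Average daily births")
```

Question 2 follows the same pattern with `group_by(day_of_week)`.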

Breakout Rooms

Breakout Teams
1 Data Miners (T) + Standard Deviants (R)
2 Inspector Clouseau (T) + Green Apple (R)
3 The Mean, Green, Data-Analyzing Team (T) + Gridion Regression (R)
4 Master Splinter (T) + House Busters (R)
5 Nightshift Analysts (T) + Wellness Warriors (R)
6 Weight Watchers (T) + Irish Mafia (R)
7 Sounds Good (T) + Urban Health Insight Group (R)
8 Point of Interest (T) + Stats & The City (R)
9 Cycle Paths (T) + Restaurant Nightmares (R)
10 Happy Hour (T)
11 How We Met Your Landlord (T)

Advanced ggplot2

Spatial Data

Maps are more interesting than you think!

  • The world isn’t flat!

This can get intense, but we will get by with simple features (sf)

Spatial Data in Lab #08

WGS 84

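WGS 84 (EPSG:4326) is a robust and widely used coordinate reference system: it locates points on the 3D Earth by longitude and latitude, which a map then projects onto a flat page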

Not universal!
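
A minimal sketch of what "not universal" means in practice, assuming the sf package is installed. The two landmark coordinates are approximate and only for illustration, and EPSG:3857 (Web Mercator) is picked here simply as a familiar second coordinate reference system; your data may call for a different one.

```r
library(sf)

# Approximate longitude / latitude values, for illustration only
landmarks <- data.frame(
  name = c("Baruch College", "Central Park"),
  lon  = c(-73.9835, -73.9654),
  lat  = c(40.7402, 40.7829)
)

# Declare that the numbers are WGS 84 longitude / latitude (EPSG:4326)
pts_wgs84 <- st_as_sf(landmarks, coords = c("lon", "lat"), crs = 4326)

# Re-express the same points in Web Mercator (EPSG:3857), measured in meters
pts_mercator <- st_transform(pts_wgs84, 3857)

st_coordinates(pts_wgs84)     # degrees
st_coordinates(pts_mercator)  # same places, very different numbers
```

Layers must share a CRS before they are joined or plotted together, so check `st_crs()` on each layer first.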

Interactive Tooling
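
A small sketch of one interactive option, assuming the plotly package is installed: `ggplotly()` converts an existing ggplot2 object into an HTML widget with hover, zoom and pan. The plot itself is an arbitrary example on the built-in `iris` data.

```r
library(ggplot2)
library(plotly)

# Build an ordinary static plot first
static_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Convert it; printing the result in RStudio or a quarto HTML document shows the widget
interactive_plot <- ggplotly(static_plot)
interactive_plot
```

Not every ggplot2 feature survives the conversion, so check the widget against the static version.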

Animated Graphics

gganimate is the de facto standard for animated ggplot2:

  • Make a bunch of png files
  • Combine into a gif

Most commonly: + transition_time(VARIABLE)

See Getting Started for more
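
A minimal sketch of the `transition_time()` pattern, assuming the gganimate package is installed. The `growth` data frame is made up for illustration. Rendering also needs a working renderer (for example the gifski package), so the rendering calls are left commented out.

```r
library(ggplot2)
library(gganimate)

# Made-up data: one value per group per year
growth <- expand.grid(year = 2000:2009, group = c("A", "B", "C"))
growth$value <- (growth$year - 2000) + 2 * as.integer(growth$group)

# An ordinary ggplot plus one extra layer saying which variable drives time
anim <- ggplot(growth, aes(x = group, y = value, fill = group)) +
  geom_col() +
  transition_time(year) +
  labs(title = "Year: {frame_time}")

# Render the frames and stitch them together (requires a renderer such as gifski)
# animate(anim, nframes = 50, fps = 10)
# anim_save("growth.gif")
```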

Animated Graphics

When it works, gganimate is great

  • PITA when external software is busted

Alternatives:

  • Facet plots
  • Split facets over pages and scroll quickly: ggforce::facet_wrap_paginate
  • Interactivity with autoplay
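
A sketch of the paginated-facet alternative, assuming the ggforce package is installed. It uses the `mpg` data that ships with ggplot2; faceting by `manufacturer` and the 2 x 2 page layout are arbitrary choices for illustration.

```r
library(ggplot2)
library(ggforce)

# Too many manufacturers for one readable grid, so show four panels per page
page_one <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap_paginate(~ manufacturer, nrow = 2, ncol = 2, page = 1)

page_one

# Number of pages needed to show every panel; loop over `page` to draw them all
n_pages(page_one)
```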

PA#08 FAQs

ggplot2 & Pie Charts

Why do Pie Charts have a bad reputation?

  • Use of area and angle over length: less accurate perception
  • Depends on fill to convey category - limited categories

But honestly - “insider smugness” and hate of Excel
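
To see the perception argument for yourself, here is a sketch with made-up counts that are deliberately close together. In ggplot2 a pie chart is a stacked bar drawn in polar coordinates, so the same data can be shown both ways.

```r
library(ggplot2)

# Made-up category counts, for illustration only
shares <- data.frame(
  category = c("A", "B", "C", "D"),
  count    = c(26, 24, 27, 23)
)

# Pie chart: one stacked bar, with the y axis wrapped around a circle
pie <- ggplot(shares, aes(x = "", y = count, fill = category)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Bar chart: the same counts encoded as lengths on a common baseline
bars <- ggplot(shares, aes(x = category, y = count)) +
  geom_col()

pie
bars
```

Ranking the four categories is usually much quicker from `bars` than from `pie`.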

ggplot2 Plot Type Choice

For me:

  • Exploratory mode:
    • Simple: line, scatter, bar, frequency
  • Publication mode:
    • Very context specific

ggplot2 Font Sizing

Theme machinery!

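library(ggplot2)
# axis.text controls the tick labels; axis.title and plot.title work the same way
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() + theme(axis.text = element_text(size = 24))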

Overplotting / ScatterBlobs

Student asked about “scatterblobs” - typo(?) but I love it!

  • Density based plotting: hexbins, histograms, rugplots
  • Data reduction: summarization or sub-sampling
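
A sketch of both remedies on the `diamonds` data that ships with ggplot2. The bin count, sample size and transparency are arbitrary choices; `geom_bin_2d()` uses rectangular bins, and `geom_hex()` is the hexagonal analogue if the hexbin package is installed.

```r
library(ggplot2)
library(dplyr)

# The problem: tens of thousands of points piled on top of each other
blob <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Density-based plotting: color each bin by how many points fall in it
binned <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin_2d(bins = 60)

# Data reduction: plot a random sub-sample, with some transparency
set.seed(9750)
diamonds_small <- slice_sample(diamonds, n = 2000)
sampled <- ggplot(diamonds_small, aes(x = carat, y = price)) +
  geom_point(alpha = 0.4)
```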

Optimizing ggplot2 Performance

Active project of ggplot2 team - not much you can do

Practical advice: plot less (see previous slide)

ggplot2 Beyond Scatter and Line

Some favorite semi-advanced plot types:

  • Violin plots: combination of boxplot and density plot (a smoothed histogram)
  • Ridgelines
  • Beeswarms

Deep rabbit hole
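
A quick violin plot sketch using only ggplot2 and the built-in `iris` data; overlaying a narrow boxplot is a common convention, not a requirement. Ridgelines and beeswarms come from extension packages (for example ggridges and ggbeeswarm) and are not shown here.

```r
library(ggplot2)

# Violin shows the shape of each distribution; the thin boxplot adds quartiles
violins <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.1)

violins
```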

ggplot2 + High-Dimensional Data

High-dimensional data: measure many variables per observation (“wide”)

High-dimensional data is hard to visualize

Approaches:

  • Pair plots for “moderate” HDD
  • PCA (or similar dimension reduction). Take 9890!
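
A sketch of both approaches on the four numeric columns of the built-in `iris` data, using base R's `pairs()` and `prcomp()`. Scaling the variables before PCA is a choice made here for illustration, not the only reasonable one.

```r
library(ggplot2)

# Pair plot: every variable against every other (fine for a handful of columns)
pairs(iris[, 1:4], col = iris$Species)

# PCA: compress the four measurements into a few principal components
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Keep the first two components and plot them like any other scatter plot
scores <- data.frame(pca$x[, 1:2], Species = iris$Species)

ggplot(scores, aes(x = PC1, y = PC2, color = Species)) +
  geom_point()
```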

Custom ggplot2 Theme

library(ggplot2)
# Store a reusable theme: theme_bw() with a light blue panel background
my_theme <- theme_bw() + theme(panel.background = element_rect(fill = "lightblue"))
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() + my_theme

Advanced:

  • theme_set() - change ggplot2 defaults
  • .Rprofile - set code to run every time you start R
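
A short sketch of `theme_set()`: it swaps the session-wide default theme and returns the previous one, so the change is easy to undo. The `my_theme` definition is repeated here so the snippet runs on its own.

```r
library(ggplot2)

my_theme <- theme_bw() +
  theme(panel.background = element_rect(fill = "lightblue"))

# Make my_theme the default for every later plot; keep the old default around
old_theme <- theme_set(my_theme)

# No "+ my_theme" needed any more
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# Restore the previous default when done
theme_set(old_theme)
```

Putting the `library(ggplot2)` and `theme_set()` lines in your `.Rprofile` would apply the theme in every new session, at the cost of plots that look different on other people's machines.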

ggplot2 - When Not to Use

ggplot2 is designed to make good statistical graphics. Sub-par for:

  • Advanced interactivity
  • Really big data
  • Hardcore customization / “infographics”

git WTF

Reference: Happy Git with R

Wrap-Up

Review

Advanced ggplot2:

  • ggplot2 as a platform for powerful extensions
  • Spatial data: sf
  • Interactivity: ggplotly, shiny
  • Animation: gganimate

Upcoming Work

Upcoming work from course calendar

Topics for the next three weeks:

  • Reading ‘clean data’ into R
  • Reading and parsing HTML
  • Parsing messy (text) data

Life Tip of the Week

Date Formatting

Write your dates as:

YYYY-MM-DD

e.g., Sys.Date()

YYYY-MM-DDTHH:MM:SS for date + time

Default in analytics-world:

  • Unambiguous (DD-MM vs MM-DD)
  • Alphabetical: sort by name => sort by date!
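
A base R sketch of both points. The example dates are arbitrary; the time zone is set explicitly only to make the output reproducible.

```r
# Alphabetical order is chronological order for YYYY-MM-DD strings
stamps <- c("2025-10-28", "2025-02-03", "2024-12-31")
sort(stamps)
# "2024-12-31" "2025-02-03" "2025-10-28"

# R's Date class already prints in this format
format(as.Date("2025-10-31"))
# "2025-10-31"

# Date + time in the YYYY-MM-DDTHH:MM:SS form
deadline <- as.POSIXct("2025-10-31 23:59:00", tz = "America/New_York")
format(deadline, "%Y-%m-%dT%H:%M:%S")
# "2025-10-31T23:59:00"

# The ambiguity the standard avoids: one string, two possible dates
ambiguous <- "03/02/2025"
us_style <- as.Date(ambiguous, format = "%m/%d/%Y")  # March 2
eu_style <- as.Date(ambiguous, format = "%d/%m/%Y")  # February 3
```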

Musical Treat