Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 9

Today:

  • Tuesday Section: 2025-11-04
  • Thursday Section: 2025-10-30

Lecture #08: Data Import and File System Usage; APIs

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R ⬅️
    • Files and APIs ⬅️
    • Web Scraping
    • Cleaning and Processing Text
  • Statistical Modeling in R

Today

  • Course Administration
  • Warm-Up Exercise
  • New Material
    • Files and the File System
    • HTTP and Web Access
    • API Usage
  • Wrap-Up
    • Life Tip of the Day

Course Administration

GTA

Charles Ramirez is our GTA

  • Wednesday Office Hours moved to 5:15-7:15 for greater access
    • Give a bit of flexibility on the front end for CR to get off work
  • Working on Meta-Review #01
    • I won’t assign peer-feedback #02 until these grades are returned

Mini-Project #03

MP#03 - Visualizing and Maintaining the Green Canopy of NYC

Due 2025-11-14 at 11:59pm ET

Topics covered:

  • Data Import
    • One static file (⬅️ Today)
    • One API call (⬅️ Today)
  • Spatial Data
    • Very basic spatial joins
    • Spatial visualizations (maps!)

Grading in Progress

We owe you:

  • MP#01 Meta-Review Grades
  • (Selected) MP#01 Regrades

Course Support

  • Synchronous
    • MW Office Hours 2x / week: Tuesdays + Thursdays 5pm
      • Rest of Semester except Thanksgiving (Nov 27th)
    • GTA Office Hours: Wednesdays at 6pm
  • Asynchronous: Piazza (<20 minute average response time)

Future Mini-Projects

  • MP#04:
    • Deadline: 2025-12-05 at 11:59pm ET
    • Topic: BLS Monthly Employment Reports

Posted 2025-11-01

Course Project

Course Project should be your main focus for rest of course

  • But you still need to do mini-projects and pre-assignments(!)

Mid-Semester Check-Ins

Week #10 - Mid-Semester Check-In Presentations:

  • Overarching Question
  • Data Sources
    • Quality
    • Suitability
  • Specific Questions
  • Prior Art
  • Challenges

Data Sources

When using ‘found’ data, two important questions to ask:

  • Quality: Does the data do what it claims to?
    • Exhaustiveness, sampling error, sampling bias, missingness
  • Suitability: Does the data do what you need it to?
    • Right ‘unit of analysis’, construct alignment

Prior Art

Context and Novelty:

  • What else have people said on your topic?
    • What is missing?
  • What do you have to add to this conversation? (Novelty)
    • New data set, new way of measuring, new style of analysis

A research project is not just summarization of other work: how can you contribute something new?

Project Advice

General Advice:

  • Work on as small a scale as possible
  • Leave room to demonstrate your coding skill: if you can’t demonstrate the skills of this class, your specific question (SQ) may be too small
  • Plan how to integrate your findings: if you find 5 factors are all correlated with response, how can you identify which ones are most important?

Project Advice

I’ll try to write up some informal advice on:

  • Estimating causal effects
  • Identifying key factors

Review Exercise

Flat / Plain Text Files

‘Plain text’ files:

  • Simple human readable and human writeable file formats
  • Not specific to one piece of software
  • Examples: CSV, txt, TSV
  • Anti-Examples: docx, pdf, jpg

Read into R with readr functions (e.g., read_csv)
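
A minimal sketch of the readr workflow, assuming the readr package is installed. The tiny CSV below is made up for illustration; wrapping the string in I() tells read_csv to treat it as literal data rather than as a file path.

```r
library(readr)

# Hypothetical CSV content, written inline so the example is self-contained
csv_text <- "name,score\nalpha,1.5\nbeta,2.5\n"

# I() marks the string as literal data instead of a path or URL
scores <- read_csv(I(csv_text), show_col_types = FALSE)

# Plain text is not tied to one tool: write it back out and read the raw lines
out_file <- tempfile(fileext = ".csv")
write_csv(scores, out_file)
raw_lines <- readLines(out_file)
```

In practice the first argument to read_csv is usually a file path or a URL, as in the exercise that follows.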

Best Halloween Candy

From FiveThirtyEight

Best Halloween Candy

Data can be found at https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/candy-power-ranking/candy-data.csv

Read into R (readr::read_csv) and make 3 plots:

  • Do people prefer more sugary candies? (Think OLS)
  • Do people prefer more expensive candies? (Think OLS)
  • Do people prefer chocolate candies? (Think ANOVA)

Breakout Rooms

Breakout Teams
1 Data Miners (T) + Gridion Regression (R)
2 Master Splinter (T) + Restaurant Nightmares (R)
3 Inspector Clouseau (T) + Urban Health Insight Group (R)
4 Nightshift Analysts (T) + Standard Deviants (R)
5 The Mean, Green, Data-Analyzing Team (T) + Green Apple (R)
6 Cycle Paths (T) + Irish Mafia (R)
7 Point of Interest (T) + House Busters (R)
8 Weight Watchers (T) + Stats & The City (R)
9 Happy Hour (T) + Wellness Warriors (R)
10 Sounds Good (T)
11 How We Met Your Landlord (T)

Files and the File System

The file system is the way your computer organizes and provides access to files:

  • Tree-like structure:
    • Files in folders in folders in folders …
      • Separated by /
      • e.g. STA9750-2025-FALL/data/mp01/data_file.csv
    • End point (or starting point) is the root:
      • Called / on Mac/Linux
      • Drive name on Windows (C:/)

Home Directory

Typically, all user files are stored in a “home directory”:

  • /Users/YOURNAME on Mac (/home/YOURNAME on most Linux systems)
  • C:/Users/YOURNAME on Windows
  • Subfolders include Downloads, Desktop, Documents, etc
  • Commonly abbreviated as ~
    • My desktop is ~/Desktop
    • My course material is in ~/STA9750-2025-FALL
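
A short sketch of how the ~ abbreviation resolves, assuming the fs package is installed. The Desktop folder is only an example and does not need to exist for these calls to work.

```r
library(fs)

# Absolute path of the current user's home directory
home <- path_home()

# path_expand() replaces a leading ~ with the home directory
desktop <- path_expand("~/Desktop")

# path_home() also accepts sub-paths to append
desktop_again <- path_home("Desktop")
```

Both forms should give the same absolute path, so which one to use is mostly a matter of taste.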

Paths

Two ways to specify a file:

  • Absolute path:
    • Starts from root and gives “full name”
    • /Users/michaelweylandt/STA9750/docs/index.html
    • GPS coordinates
  • Relative path:
    • Starts from working directory (getwd()) and gives directions
    • If I am in STA9750, path is just docs/index.html
    • ./ means “this directory”: could also write ./docs/index.html
    • ../ means “up one level”
      • If I were in STA9750/docs, source at ../index.qmd
    • Driving directions
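
The two kinds of path can be converted into each other. The sketch below assumes the fs package is installed and uses a hypothetical STA9750 folder inside R's temporary directory; these functions work on the text of the path, so the files do not need to exist.

```r
library(fs)

# Hypothetical project location, mirroring the STA9750/docs example above
project <- path(path_temp(), "STA9750")

# Absolute path: the "full name" of the file
abs_path <- path(project, "docs", "index.html")

# Relative path: directions starting from the project directory
rel_path <- path_rel(abs_path, start = project)

# And back again: resolve a relative path against a starting directory
abs_from_rel <- path_abs("docs/index.html", start = project)

# ../ climbs one level: seen from docs/, index.qmd sits in the project root
source_path <- path_norm(path(project, "docs", "..", "index.qmd"))
```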

Using the File System in R

Use the fs package to interact with the file system:

  • dir_ls(), dir_create(), dir_exists(), dir_delete()
  • path(), path_home(), path_abs(), path_rel()
  • file_create(), file_exists(), file_delete(), file_info()
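
A minimal sketch of these functions working together, assuming the fs package is installed. Everything happens in a scratch folder (the name fs-demo is arbitrary) under R's temporary directory, so no real files are touched.

```r
library(fs)

# Work inside a scratch directory so nothing permanent is touched
scratch <- path(path_temp(), "fs-demo")
dir_create(path(scratch, "data"))

# Create two files: one empty, one with some content
file_create(path(scratch, "data", "empty.txt"))
writeLines(rep("some text", 100), path(scratch, "data", "notes.txt"))

# List every file below the scratch directory
files <- dir_ls(scratch, recurse = TRUE, type = "file")

# file_info() returns one row of metadata per file
info <- file_info(files)
info[, c("path", "size", "modification_time")]

# Check for a file before using it, then clean up
notes_present <- file_exists(path(scratch, "data", "notes.txt"))
dir_delete(scratch)
```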

Activity #01

Return to breakout rooms to practice file system usage:

  • Convert relative paths to absolute paths
  • List files in a directory
  • Examine metadata of the largest file(s) in your STA9750-2025-FALL directory

Accessing Data from the Web: URLs, HTTP, JSON

URLs

URLs are an extension of file paths for the internet:

  • Protocol / Scheme: How data should be transferred
    • HTTP(S), SMS (Texting), POP3/IMAP (Email)
  • Domain: Name of the other computer
    • In practice, often a ‘placeholder’ for something more complex
  • Path: Files on the other computer
    • In practice, hide functionality behind file-like paths
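
One way to see these pieces is to split a URL in R. This sketch assumes the httr2 package (used later in this lecture) is installed and reuses the candy data URL from the warm-up exercise.

```r
library(httr2)

# The candy data URL used earlier in these slides
candy_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/candy-power-ranking/candy-data.csv"

# url_parse() splits a URL into its named components
parts <- url_parse(candy_url)

parts$scheme    # "https"
parts$hostname  # "raw.githubusercontent.com"
parts$path      # everything after the domain
```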

URLs

From abstrax.io

Static Data Transfer

R’s basic download.file can be used for downloading simple files:

args(download.file)
function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE, 
    extra = getOption("download.file.extra"), headers = NULL, 
    ...) 
NULL

Basic file download capabilities:

  • url: source
  • destfile: where on your computer to store it

Customizable behavior, but defaults often work well:

  • method: what software to use in the background to download
  • mode: is this a text or binary file
  • cacheOK: are you ok with a cached version of the file
  • headers: do you need to send any additional info in your request

download.file()

download.file("https://raw.githubusercontent.com/michaelweylandt/STA9750/refs/heads/main/births.csv", 
              destfile="births.csv")

Note the relative path in destfile: the file is saved in the current working directory

Be polite:

  • Try to avoid unnecessarily downloading files
  • Save file and only download if !file_exists(destfile)

Data Transfer - download.file()

library(fs)     # provides file_exists()
library(readr)  # provides read_csv()

if(!file_exists("births.csv")){
    download.file("https://raw.githubusercontent.com/michaelweylandt/STA9750/refs/heads/main/births.csv", 
                  destfile="births.csv")
}

read_csv("births.csv")
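
The download-only-if-missing pattern can be wrapped in a small helper. This is a sketch only: download_once is a hypothetical name, not part of any package, and the demonstration uses a file:// URL pointing at a temporary local file so that it runs without internet access (file:// handling can differ slightly across platforms).

```r
# Hypothetical helper: download `url` to `destfile` only if it is not there yet
download_once <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    # mode = "wb" copies bytes exactly, which is safe for text and binary files
    download.file(url, destfile = destfile, mode = "wb", quiet = TRUE, ...)
  }
  invisible(destfile)
}

# Offline stand-in for a web address: a file:// URL for a local temporary file
src <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), src)
src_url <- paste0("file://", normalizePath(src, winslash = "/"))

dest <- tempfile(fileext = ".csv")
download_once(src_url, dest)  # downloads
download_once(src_url, dest)  # file already exists, so nothing is fetched
```

Using base R's file.exists() keeps the helper free of package dependencies; fs::file_exists() would work equally well.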

JSON

JSON:

  • Short for JavaScript Object Notation
  • Popular plain-text representation for hierarchical data.
  • Closer to Python objects (dicts of dicts of dicts) than R data.frames
  • Widely used for web-based data transfer

JSON

Example:

{
    "data": {
        "id": 27992,
        "title": "A Sunday on La Grande Jatte — 1884",
        "image_id": "1adf2696-8489-499b-cad2-821d7fde4b33"
    },
    "config": {
        "iiif_url": "https://www.artic.edu/iiif/2"
    }
}

JSON

Read JSON in R with jsonlite package (alternatives exist)

library(jsonlite)
# A JSON array of primitives
json <- '["Mario", "Peach", null, "Bowser"]'

# Simplifies into an atomic vector
fromJSON(json)
[1] "Mario"  "Peach"  NA       "Bowser"

JSON

json <-
'[
  {"Name" : "Mario", "Age" : 32, "Occupation" : "Plumber"}, 
  {"Name" : "Peach", "Age" : 21, "Occupation" : "Princess"},
  {},
  {"Name" : "Bowser", "Occupation" : "Koopa"}
]'
mydf <- fromJSON(json)
mydf
    Name Age Occupation
1  Mario  32    Plumber
2  Peach  21   Princess
3   <NA>  NA       <NA>
4 Bowser  NA      Koopa
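
Nested JSON objects, like the artwork record shown earlier, do not fit into a data frame. A minimal sketch, assuming jsonlite is installed and using a shortened copy of that record:

```r
library(jsonlite)

# Shortened copy of the artwork record from the earlier JSON example
json <- '{
  "data": {
    "id": 27992,
    "title": "A Sunday on La Grande Jatte",
    "image_id": "1adf2696-8489-499b-cad2-821d7fde4b33"
  },
  "config": {
    "iiif_url": "https://www.artic.edu/iiif/2"
  }
}'

# JSON objects become named lists, and the nesting is preserved
art <- fromJSON(json)

# Drill down with $, one level at a time
art$data$title
art$config$iiif_url

# toJSON() goes the other way; auto_unbox writes length-1 vectors as scalars
art_json <- toJSON(art$data, auto_unbox = TRUE, pretty = TRUE)
```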

JSON - A Web Standard

fromJSON("https://official-joke-api.appspot.com/random_joke")
$type
[1] "general"

$setup
[1] "Why did the mushroom get invited to the party?"

$punchline
[1] "Because he was a fungi."

$id
[1] 35

Compare to browser access

Data Transfer: HTTP

HTTP

  • HyperText Transfer Protocol
  • Most common (but not only) internet protocol
  • Also ftp, smtp, ssh, …

“Low-level” mechanism of internet transfer

  • Many R packages add a friendly UX
  • httr2 for low-level work (today)

HTTP

HTTP has two stages:

  • Request
    • URL (Host + Path)
    • Method (VERB)
    • Headers
    • Content
    • Cookies
  • Response
    • Status Code
    • Headers
    • Content

Modern (easy) APIs put most of the behavior in the URL

HTTP in the Browser

In Firefox: Right-Click + Inspect

In Chrome: Right-Click + Developer Tools

HTTP with httr2

httr2 (pronounced “hitter-2”) provides low-level tools for building and sending HTTP requests.

library(httr2)
request(example_url())
<httr2_request>
GET http://127.0.0.1:49992/
Body: empty

Pretty simple so far:

  • example_url() starts a tiny local web server
  • 127.0.0.1 is localhost

httr2 Requests

Build a request:

  • request
  • req_method
  • req_body_*
  • req_cookies_set
  • req_auth_basic / req_oauth

httr2 Requests

Behaviors:

  • req_cache
  • req_timeout

Execution:

  • req_perform
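
A sketch of how these pieces chain together, assuming httr2 is installed and R is version 4.1 or later (for the |> pipe). The endpoint https://example.com/api and its path and query values are hypothetical; the request is only built here, never sent.

```r
library(httr2)

# Hypothetical endpoint, used only to show how a request is assembled
req <- request("https://example.com/api") |>
  req_url_path_append("jokes", "random") |>   # add to the path
  req_url_query(type = "general") |>          # add ?type=general
  req_method("GET") |>                        # the HTTP verb
  req_headers(Accept = "application/json") |> # ask for JSON back
  req_timeout(10)                             # give up after 10 seconds

# Nothing has been sent yet: the request is just a description
req$url
req$method

# req_perform(req) would send it and return a response object
```

Separating construction from execution makes it easy to inspect a request before anything touches the network.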

httr2 Responses

Request status

  • resp_status / resp_status_desc

Content:

  • resp_header*
  • resp_body_*
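
A sketch of the response helpers, assuming httr2 and jsonlite are installed. To stay offline it uses httr2's response() constructor, which builds a fake response for testing; the body is made-up data shaped like a joke API reply.

```r
library(httr2)

# Made-up JSON body, shaped like a reply from a joke API
body_json <- '{"setup": "Example setup", "punchline": "Example punchline"}'

# response() creates a fake response object without any network traffic
resp <- response(
  status_code = 200,
  headers = list("Content-Type" = "application/json"),
  body = charToRaw(body_json)
)

resp_status(resp)        # 200
resp_status_desc(resp)   # "OK"
resp_content_type(resp)  # "application/json"

# Parse the JSON body into an R list
joke <- resp_body_json(resp)
joke$punchline
```

With a real API, resp would instead come from req_perform(), and the same resp_* functions would apply.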

Live Demo

Demo: Using httr2 to get a random joke from

https://official-joke-api.appspot.com/

API Usage

Exercise - CRAN Logs API

See Lab #09

Wrap-Up

Review

Web Data Access

  • Local Files
  • Static Files & download.file
  • HTTP and JSON
  • API Usage

Upcoming Work

Upcoming work from course calendar

Topics for after presentations:

  • Reading and parsing HTML
  • Parsing messy (text) data

Musical Treat