Software Tools for Data Analysis
STA 9750
Michael Weylandt
Week 11

Today:

  • Tuesday Section: 2025-11-18
  • Thursday Section: 2025-11-13

Lecture #09: Introduction to Web Technologies & Web Scraping

  • Communicating Results (quarto) ✅
  • R Basics ✅
  • Data Manipulation in R
  • Data Visualization in R
  • Getting Data into R ⬅️
    • Files and APIs ✅
    • Web Scraping ⬅️
    • Cleaning and Processing Text
  • Statistical Modeling in R

Today

  • Course Administration
  • Warm-Up Exercise
  • New Material
    • HTML Structure
    • Selectors
    • Parsing HTML with rvest
  • Wrap-Up
    • Life Tip of the Week

Course Administration

GTA

Charles Ramirez is our GTA

  • Wednesday Office Hours moved to 5:15-7:15pm for greater access
    • Give a bit of flexibility on the front end for CR to get off work
  • Will start on Meta-Review #02 after Peer Feedback due Tuesday November 18th

Mini-Project #03

MP#03 - Visualizing and Maintaining the Green Canopy of NYC

Due 2025-11-14 at 11:59pm ET

Topics covered:

  • Data Import
    • One static file
    • One API call
  • Spatial Data
    • Very basic spatial joins
    • Spatial visualizations (maps!)

Submissions so far look great!

Mini-Project #04

MP#04 - Just the Fact(-Check)s, Ma’am!

Due 2025-12-05 at 11:59pm ET

Topics covered:

  • Data Import
    • HTTP Request Construction (Week 9)
    • HTML Scraping (Tabular) ⬅️ (Today)
  • \(t\)-tests
  • Putting Everything Together

Grading in Progress

We owe you:

  • Selected Regrades
  • Mid-Term Check-In Presentation Feedback

Course Support

  • Synchronous
    • MW Office Hours 2x / week: Tuesdays + Thursdays 5pm
      • Rest of Semester except Thanksgiving (Nov 27th)
    • GTA Office Hours: Wednesdays at 5:15-7:15pm
  • Asynchronous: Piazza (\(<20\) minute average response time)

Large Files

Several of you have reported issues with git complaining about large files

git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
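
The git command above lists the largest files already recorded in the repository. As a rough complement, the sketch below lists the largest files in a directory from inside R, which can help spot a problem before committing. The helper name largest_files and the demo directory are hypothetical and exist only for illustration; this looks at files on disk, not at git history.

```r
# Hypothetical helper: list the n largest files under `path`.
# This inspects files on disk only; it knows nothing about git history.
largest_files <- function(path = ".", n = 10) {
    files <- list.files(path, recursive = TRUE, full.names = TRUE)
    out <- data.frame(file = files, bytes = file.size(files))
    out <- out[order(out$bytes, decreasing = TRUE), ]
    head(out, n)
}

# Small demonstration in a temporary directory
demo_dir <- file.path(tempdir(), "large_file_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("small", file.path(demo_dir, "small.txt"))
writeLines(rep("x", 10000), file.path(demo_dir, "big.txt"))

biggest <- largest_files(demo_dir, n = 1)
basename(biggest$file)
```

Running largest_files() with no arguments from the project root lists the ten largest files in the project.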

Stack Overflow on removing large files:

git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch data/**' HEAD

⚠️This is dangerous! I can help with it after class.⚠️

Review Exercise

API Exercise

The Open Trivia Database

  • Use the API to build the question bank for a trivia night
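
A minimal sketch of how the request might be constructed with httr2, as a starting point for the exercise. The endpoint URL and the query parameters amount and type are assumptions about the Open Trivia Database API and should be checked against its documentation. The request is only built here, not sent.

```r
library(httr2)

# Assumed endpoint and query parameters; verify against the API documentation
req <- request("https://opentdb.com/api.php") |>
    req_url_query(amount = 10, type = "multiple")

# Inspect the URL that would be requested
req$url

# Sending the request needs network access, so it is left commented out:
# questions <- req |> req_perform() |> resp_body_json()
```

Building the request separately from performing it makes it easy to check the URL before any network traffic happens.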

Lab #11

Breakout Rooms

Breakout Teams
1 How We Met Your Landlord (T) + House Busters (R)
2 Happy Hour (T) + Green Apple (R)
3 Standard Deviants (T) + Stats & The City (R)
4 Nightshift Analysts (T) + Irish Mafia (R)
5 Sounds Good (T) + Wellness Warriors (R)
6 Weight Watchers (T) + Restaurant Nightmares (R)
7 Master Splinter (T) + Urban Health Insight Group (R)
8 Cycle Paths (T) + Gridion Regression (R)
9 Point of Interest (T)
10 The Mean, Green, Data-Analyzing Team (T)
11 Data Miners (T)
12 Inspector Clouseau (T)

Working with HTML

HTML

HTML - HyperText Markup Language - is the language used to write web pages

  • We have avoided writing HTML directly, in favor of Markdown

But you will have to read HTML

  • In a web browser, right click and “View Source” on a page

HTML Elements

HTML is written as a nested series of elements:
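
As a small illustration of nesting, the sketch below parses a hand-written page with rvest and lists which elements sit directly inside which. The page content is made up for this example.

```r
library(rvest)

# A tiny hand-written page: the outer <div> contains <h1>, <p> and <ul>;
# the <p> contains an <a>, and the <ul> contains two <li> elements
page <- minimal_html('
  <div id="outer">
    <h1>Title</h1>
    <p>Some text with a <a href="https://baruch.cuny.edu">link</a>.</p>
    <ul><li>First</li><li>Second</li></ul>
  </div>
')

# Names of the elements directly inside the outer <div>
outer_children <- page |> html_element("#outer") |> html_children() |> html_name()

# Names of the elements directly inside the <ul>
list_children <- page |> html_element("ul") |> html_children() |> html_name()

outer_children
list_children
```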

Important Elements

There are many HTML elements; some important ones are:

  • body: Body (the main content of a page)
  • h1, h2, …: Headers
  • div: Division (generic container)
  • p: Paragraph (most text goes in these)
  • table: Table (often how data is displayed)
  • ul, ol: Lists (unordered and ordered)
  • a: Anchors (used for linking)

Important Elements

Other useful elements:

  • script: Javascript - the code that implements interactivity
  • style: CSS - formatting and appearance

You won’t use these directly, but they are everywhere

HTML Element Selection

The SelectorGadget can be used to practice selecting elements on web pages

  • #id will select by ID (“name”)
  • type will select all elements of that type
  • .class will select all elements with that class
  • sel1 sel2 will select elements matching sel2 that are inside a sel1
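
The four selector forms above can be tried in R on a small hand-written page. This is a sketch; the ids, classes, and text are made up for illustration.

```r
library(rvest)

page <- minimal_html('
  <div id="intro"><p class="note">Intro note</p></div>
  <div id="data">
    <p class="note">Data note</p>
    <p>Plain paragraph</p>
  </div>
')

by_id    <- page |> html_elements("#data")    # the element with id="data"
by_type  <- page |> html_elements("p")        # every <p> element
by_class <- page |> html_elements(".note")    # every element with class="note"
nested   <- page |> html_elements("#data p")  # <p> elements inside #data only

length(by_type)
nested |> html_text2()
```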

HTML Element Selection

Example selectors:

  • main h2
  • table or tbody
  • .geo

HTML Anchors

Anchors (a elements) are probably the most important part of HTML

Confusingly, anchors are both links and destinations.

Anchors can reference:

  • Another page (http://URL)
  • A particular part of another page (http://URL#place)
  • A particular part of the same page (#place)

Quarto supports cross-linking with anchors

HTML Anchors

Incoming link anchor:

<a id="fish"> Content </a>

Links to http://page#fish will go straight to that spot

Outgoing link anchor:

<a href="https://baruch.cuny.edu"> Baruch </a>

Clicking the text “Baruch” will go to the Baruch website
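
A short sketch of reading both kinds of anchor from R with html_attr. The page is hand-written for illustration. Note that the id itself has no leading #; the # appears only in links that point at it.

```r
library(rvest)

page <- minimal_html('
  <a id="fish"> Content </a>
  <a href="https://baruch.cuny.edu"> Baruch </a>
  <a href="#fish"> Jump to fish </a>
')

anchors <- page |> html_elements("a")

# Link targets: NA for the anchor that is only a destination
hrefs <- anchors |> html_attr("href")

# Destination names: NA for anchors that are only links
ids <- anchors |> html_attr("id")

# Keep only anchors that actually link somewhere
links <- page |> html_elements("a[href]")

hrefs
ids
```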

HTML Tables

HTML Tables are often used for showing data:

  • <table> - Top level container
  • <thead> and <tbody> - Separate header and body
  • <tr> - Rows
  • <td> - Table data (cells)

Can be much fancier - will see several examples in exercises
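
A minimal sketch of the table structure described above, parsed into a data frame with html_table. The table contents are made-up numbers, and the header row uses <th> cells, the header counterpart of <td>.

```r
library(rvest)

# Hand-written table with made-up values
page <- minimal_html('
  <table>
    <thead>
      <tr><th>borough</th><th>trees</th></tr>
    </thead>
    <tbody>
      <tr><td>Manhattan</td><td>10</td></tr>
      <tr><td>Brooklyn</td><td>20</td></tr>
    </tbody>
  </table>
')

# Select the <table> element and let rvest parse it
trees <- page |> html_element("table") |> html_table()

trees
```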

rvest

The rvest package can be used to manipulate HTML in R:

  • Get HTML by either read_html or resp_body_html if using httr2
  • Select elements with html_elements("selector")
    • Same syntax as SelectorGadget
    • Use html_element if you only want the first
  • Eventually need to extract content
    • html_text (raw text) and html_text2 (collapses whitespace, roughly as a browser displays it)
    • html_attr gets attribute values (e.g., link targets)
    • html_table will attempt to parse a table automatically
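
To make the difference between html_text and html_text2 concrete, here is a small sketch on a hand-written paragraph with messy spacing. The input string is made up for illustration.

```r
library(rvest)

# A paragraph whose source contains extra spaces and a line break
page <- minimal_html("<p>Hello   \n   world</p>")
p <- page |> html_element("p")

raw_text   <- html_text(p)    # text exactly as written in the source
clean_text <- html_text2(p)   # whitespace collapsed

raw_text
clean_text
```

In most scraping work html_text2 is the more convenient default, with html_text held in reserve for cases where the exact source spacing matters.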

Example

Using rvest to get the names of all 5 Mini-Projects

library(rvest)
read_html("https://michael-weylandt.com/STA9750/miniprojects.html") |>
    html_element("#mini-projects") |>
    html_elements("h4") |>
    html_text2()
[1] "Mini-Project #00: Course Set-Up"                                                                            
[2] "Mini-Project #01: Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix"
[3] "Mini-Project #02: Making Backyards Affordable for All"                                                      
[4] "Mini-Project #03: Visualizing and Maintaining the Green Canopy of NYC"                                      
[5] "Mini-Project #04: Just the Fact(-Check)s, Ma’am!"                                                           

Breakout Exercise #01

Return to your breakout rooms for Exercise #01

  • Use html_element to extract the course table
  • Use html_table to convert to a data frame

JavaScript Websites

JavaScript is a great tool, but read_html does not run it

  • What you see might not be what R sees
  • Make sure to consider “raw” HTML
  • Often ‘hijacks’ important elements to add new features
    • Focus on standard HTML elements
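
A small sketch of why the browser and R can disagree. The hand-written page below contains a script that would add a table when run in a browser. Because read_html and minimal_html only parse the raw HTML and never execute the script, R sees no table at all.

```r
library(rvest)

# In a browser, the script would insert a <table> into the #app element
page <- minimal_html("
  <div id='app'></div>
  <script>
    var t = document.createElement('table');
    document.getElementById('app').appendChild(t);
  </script>
")

# R parses the raw HTML only, so the script never runs
tables_seen  <- page |> html_elements("table")
app_children <- page |> html_element("#app") |> html_children()

length(tables_seen)
length(app_children)
```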

Breakout Exercise #02

Return to your breakout rooms for Exercise #02

  • Finding a table in a more complex website
  • Pulling out desired data
  • Making a map of results (optional)

Non-Tabular Data

Often, data will not be in a nice table

  • Need to manually build a data.frame
  • Build a small data frame for each item, then ‘combine’ them
    • map |> list_rbind() idiom
    • DATA <- rbind(DATA, NEW_DATA) idiom
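
A sketch of the map |> list_rbind() idiom on a hand-written page. The class name item, the helper item_to_row, and the page contents are made up for illustration. The helper maps over positions rather than over the node set itself, which is one conservative way to write it.

```r
library(rvest)
library(purrr)
library(tibble)

page <- minimal_html('
  <div class="item"><h4>Item A</h4><p>First description</p></div>
  <div class="item"><h4>Item B</h4><p>Second description</p></div>
')

items <- page |> html_elements(".item")

# Hypothetical helper: turn the i-th item into a one-row tibble
item_to_row <- function(i) {
    node <- items[[i]]
    tibble(
        name        = node |> html_element("h4") |> html_text2(),
        description = node |> html_element("p")  |> html_text2()
    )
}

# One small tibble per item, then stack them into a single data frame
details <- map(seq_along(items), item_to_row) |> list_rbind()

details
```

Compared with the rbind idiom, this builds all the pieces first and combines them once at the end.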

Loops

Recall the basic structure of a loop in R:

for(item in container){
    do_something_with(item)
}

Accumulate data

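library(tibble)

all_results <- tibble()   # start with an empty tibble
for(item in container){
    new_results <- do_something_with(item)   # placeholder: process one item
    # val1 and val2 are placeholders for values taken from new_results
    new_tibble <- tibble(col1=val1, col2=val2)
    all_results <- rbind(all_results, new_tibble)   # append the new row(s)
}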

Example

Getting info about all pre-assignments

library(rvest)
library(tibble)  # provides tibble()

preassigns <- read_html("https://michael-weylandt.com/STA9750/preassignments.html") |> 
    html_element("#pre-assignments") |>
    html_elements("section")

pa_details <- tibble()
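for(pa in preassigns){
    # Each section holds one pre-assignment: title in <h4>, description in the last <p>
    name        <- pa |> html_element("h4") |> html_text2()
    description <- pa |> html_elements("p:last-of-type") |> html_text2()

    pa_df <- tibble(name = name, description = description)

    # Append this pre-assignment to the running results
    pa_details <- rbind(pa_details, pa_df)
}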

pa_details

Breakout Exercise #03

Return to your breakout rooms for Exercise #03

  • Examining a multi-page site
  • Pulling out data from non-tabular elements

Wrap Up

Processing HTML in R

  • HTML Structure
  • HTML Selectors
  • html_table and <table> elements
  • HTML Anchors and Links

Upcoming Work

Upcoming work from course calendar

Remaining Topics

  • Parsing messy (text) data
  • Statistical Modeling

Life Tip of the Week

End of the Semester is Upcoming

  • End of the semester is rough
    • More important than ever to plan ahead
    • Ask for extensions / accommodations early
    • Help us help you
    • Faculty are slammed as well
  • Don’t ‘grade grub’
    • Makes it harder for professors to curve in your favor
    • Seek extra credit from the syllabus
    • Don’t ask for special treatment - ask how to take advantage of existing opportunities
  • Take care of yourselves
    • Nasty bugs going around …

Musical Treat


You might recognize the finale from Fantasia 2000