| Breakout | Team |
|---|---|
| 1 | TBD |
Today: Lecture #09: Introduction to Web Technologies & Web Scraping
quarto ✅ · R Basics ✅ · R ✅ · R ✅ · R ⬅️ (rvest)
Charles Ramirez is our GTA
MP#03 - TBD
Due 2026-04-24 at 11:59pm ET
Topics covered:
Submissions so far look great!
MP#04 - TBD
Due 2026-05-15 at 11:59pm ET
Topics covered:
We owe you:
Several of you have reported issues with git complaining about large files
To list the 10 largest files in the current commit:
    git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
To remove everything under data/ from the repository history:
    git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch data/**' HEAD
⚠️ This rewrites history and is dangerous! I can help with it after class. ⚠️
HTML - HyperText Markup Language - is the language used to write web pages
For this course, you will mostly need to read HTML rather than write it
HTML is written as a nested series of elements:
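To make the nesting concrete, here is a minimal sketch that parses a tiny, made-up page with rvest and lists which elements sit directly inside which. The page content and the example.com link are assumptions for illustration only.

```r
library(rvest)

# A tiny, made-up page: <html> contains <body>, which contains <h1> and <p>,
# and the <p> in turn contains an <a>.
page <- read_html('<html>
  <body>
    <h1>My Page</h1>
    <p>Some text with a <a href="https://example.com">link</a>.</p>
  </body>
</html>')

# Names of the elements directly inside <body>
child_names <- page |> html_element("body") |> html_children() |> html_name()
print(child_names)

# Names of the elements directly inside the <p>
inner_names <- page |> html_element("p") |> html_children() |> html_name()
print(inner_names)
```

Each call to `html_children()` moves one level down the nesting, which is the same structure you see when you indent HTML by hand.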

There are many HTML elements; some important ones are:
- body: Body (the ‘meat’ of a page)
- h1, h2, …: Headers
- div: Division (generic container)
- p: Paragraph (most text goes in these)
- table: Table (often how data is displayed)
- ul, ol: Lists (unordered and ordered)
- a: Anchors (used for linking)

Other useful elements:
- script: JavaScript - the code that implements interactivity
- style: CSS - formatting and appearance

You won’t use these directly, but they are everywhere
The SelectorGadget can be used to practice selecting elements on web pages
- #id will select by ID (“name”)
- type will select all elements of that type
- .class will select all elements with that class
- sel1 sel2 will select elements matching sel2 that are inside a sel1

Example selectors:
- main h2
- table or tbody
- .geo
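The four selector forms can be tried out in R on a small inline page. This is a sketch only: the HTML fragment, the `intro` id, and the place names are made up for illustration.

```r
library(rvest)

# Made-up fragment: one <h2> inside <main>, one outside it
page <- minimal_html('
  <main>
    <h2 id="intro">Introduction</h2>
    <p class="geo">New York</p>
    <p>Other text</p>
  </main>
  <h2>Footer heading</h2>
  <p class="geo">Boston</p>
')

by_id    <- page |> html_elements("#intro")  |> html_text2()  # select by ID
by_type  <- page |> html_elements("h2")      |> html_text2()  # every <h2>
by_class <- page |> html_elements(".geo")    |> html_text2()  # class "geo"
nested   <- page |> html_elements("main h2") |> html_text2()  # <h2> inside <main>

print(list(by_id = by_id, by_type = by_type, by_class = by_class, nested = nested))
```

Note that `h2` matches both headers, while `main h2` matches only the one nested inside `main`.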
Anchors (a elements) are probably the most important part of HTML
Confusingly, anchors are both links and destinations.
Anchors can reference:
- Another page (http://URL)
- A specific place on another page (http://URL#place)
- A place on the same page (#place)

Quarto supports cross-linking with anchors
Incoming link anchor:
<a id="fish"> Content </a>
Links to http://page#fish will go straight to that spot
Outgoing link anchor:
<a href="https://baruch.cuny.edu"> Baruch </a>
Clicking the text “Baruch” will go to the Baruch website
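When scraping, the useful part of an anchor is usually its attributes rather than its text. A minimal sketch, reusing the two anchors above plus one assumed same-page link (`#fish`), shows how `html_attr` pulls out link targets and IDs:

```r
library(rvest)

# The first two anchors come from the slides; the third is an assumed example
page <- minimal_html('
  <a id="fish"> Content </a>
  <a href="https://baruch.cuny.edu"> Baruch </a>
  <a href="#fish">Jump to fish</a>
')

anchors <- html_elements(page, "a")

targets <- html_attr(anchors, "href")  # NA where an anchor has no href
ids     <- html_attr(anchors, "id")    # NA where an anchor has no id
labels  <- html_text2(anchors)         # the visible link text

print(data.frame(label = labels, href = targets, id = ids))
```

An anchor with only an `id` is a destination, one with an `href` is a link, and `html_attr` returns `NA` for whichever attribute is missing.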
HTML Tables are often used for showing data:
- <table> - Top level container
- <thead> and <tbody> - Separate header and body
- <tr> - Rows
- <td> - Table data (cells)

Can be much fancier - will see several examples in exercises
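As a sketch of how these pieces fit together, the following parses a small made-up table and converts it to a data frame. The column names and values are invented for illustration, and the `<th>` header cells are standard HTML that is not in the list above.

```r
library(rvest)

# Made-up table: a header row in <thead>, two data rows in <tbody>
page <- minimal_html('
  <table>
    <thead>
      <tr><th>Item</th><th>Count</th></tr>
    </thead>
    <tbody>
      <tr><td>Apples</td><td>10</td></tr>
      <tr><td>Pears</td><td>25</td></tr>
    </tbody>
  </table>
')

# Select the <table> element, then let rvest parse it into a data frame
items <- page |>
  html_element("table") |>
  html_table()

print(items)
```

`html_table` uses the header row for column names and tries to convert each column to a sensible type, so `Count` comes back as numbers rather than text.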
The rvest package can be used to manipulate HTML in R:
- read_html or resp_body_html if using httr2
- html_elements("selector")
  - html_element if you only want the first
- html_text and html_text2 (html_text2 cleans up whitespace)
- html_attr gets attribute values (e.g., link targets)
- html_table will attempt to parse a table automatically

Using httr2 to get the names of all 5 Mini-Projects:
[1] "Mini-Project #00: Course Set-Up"
[2] "Mini-Project #01: Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix"
[3] "Mini-Project #02: Making Backyards Affordable for All"
[4] "Mini-Project #03: Visualizing and Maintaining the Green Canopy of NYC"
[5] "Mini-Project #04: Just the Fact(-Check)s, Ma’am!"
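A minimal sketch of the kind of pipeline that yields output like the above, run on an inline HTML fragment rather than the live course site. The fragment, its file names, and the `a.project` selector are assumptions for illustration; the real page's markup may differ.

```r
library(rvest)

# Stand-in for the downloaded page. With httr2, the document would instead
# come from something like:
#   request(url) |> req_perform() |> resp_body_html()
page <- minimal_html('
  <ul>
    <li><a class="project" href="mini00.html">Mini-Project #00: Course Set-Up</a></li>
    <li><a class="project" href="mini02.html">Mini-Project #02: Making Backyards Affordable for All</a></li>
    <li><a href="syllabus.html">Syllabus</a></li>
  </ul>
')

# Select only the links marked as projects, then extract their text
project_names <- page |>
  html_elements("a.project") |>
  html_text2()

print(project_names)
```

The pattern is always the same three steps: get the document, select elements, extract text or attributes.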
Return to your breakout rooms for Exercise #01
- html_element to extract the course table
- html_table to convert to a data frame

JavaScript is a great tool, but R doesn’t know about it
- R sees only the HTML the server sends, before any JavaScript runs

Return to your breakout rooms for Exercise #02
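The point about JavaScript can be seen on a small inline example. In this sketch (the fragment and the `data` id are made up), a browser would run the script and fill in the `div`, but rvest only parses the HTML as written:

```r
library(rvest)

# In a browser, the script would put text into the <div>
page <- minimal_html('
  <div id="data"></div>
  <script>
    document.getElementById("data").textContent = "Filled in by JavaScript";
  </script>
')

# rvest parses the markup but never executes the script
static_text <- page |> html_element("#data") |> html_text()
n_scripts   <- length(html_elements(page, "script"))

print(static_text)  # the <div> is still empty
print(n_scripts)    # the script is present, just not run
```

If a selector that works in the browser returns nothing in R, checking the raw page source for the data is a good first step.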
Often, data will not be in a nice table
- Build up a data.frame piece by piece:
  - map |> list_rbind() idiom
  - DATA <- rbind(DATA, NEW_DATA) idiom

Recall the basic structure of a loop in R:
Accumulate data
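Both accumulation idioms can be sketched side by side. The two inline pages and the `extract_row` helper are hypothetical stand-ins for whatever you scrape in practice.

```r
library(rvest)
library(purrr)

# Made-up fragments standing in for several scraped pages
pages <- list(
  minimal_html('<h1>Page A</h1><p class="count">3</p>'),
  minimal_html('<h1>Page B</h1><p class="count">5</p>')
)

# Hypothetical helper: turn one page into a one-row data frame
extract_row <- function(page) {
  data.frame(
    title = page |> html_element("h1") |> html_text2(),
    count = page |> html_element(".count") |> html_text2() |> as.integer()
  )
}

# Idiom 1: loop over the inputs and accumulate with rbind
DATA <- data.frame()
for (page in pages) {
  NEW_DATA <- extract_row(page)
  DATA <- rbind(DATA, NEW_DATA)
}

# Idiom 2: map over the inputs, then bind all the pieces at once
DATA2 <- pages |> map(extract_row) |> list_rbind()

print(DATA)
print(DATA2)
```

The two approaches give the same rows here; the `map |> list_rbind()` form avoids repeatedly copying `DATA`, which matters more as the number of pages grows.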
Getting info about all pre-assignments
Return to your breakout rooms for Exercise #03
Processing HTML in R
- html_table and <table> elements

Upcoming work from course calendar
Remaining Topics
End of the Semester is Upcoming
You might recognize the finale from Fantasia 2000