HTML
HTML - HyperText Markup Language - is the language used to write web pages
- We have avoided writing HTML directly, in favor of Markdown
But you will have to read HTML
- In a web browser, right click and “View Source” on a page
HTML Elements
HTML is written as a nested series of elements: most elements have an opening tag and a matching closing tag, and can contain text or other elements.
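To make the nesting concrete, here is a minimal sketch in R. The page content is made up for illustration, and it assumes the rvest package is installed. It parses a tiny document and lists the elements that sit directly inside body.

```r
library(rvest)

# A tiny, made-up page: <h1> and <p> are nested inside <body>,
# and an <a> is nested inside the <p>
page <- read_html(
  "<html><body><h1>Title</h1><p>Some <a href='#top'>linked</a> text</p></body></html>"
)

# html_children() returns the elements directly inside another element
body <- html_element(page, "body")
child_names <- html_name(html_children(body))
child_names
```

Only h1 and p are reported, because the a element lives one level deeper, inside the p.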
Important Elements
There are many HTML elements; some important ones are:
- body: Body (the ‘meat’ of a page)
- h1, h2, …: Headings
- div: Division (generic container)
- p: Paragraph (most text goes in these)
- table: Table (often how data is displayed)
- ul, ol: Lists (unordered and ordered)
- a: Anchors (used for linking)
Important Elements
Other useful elements:
- script: JavaScript - the code that implements interactivity
- style: CSS - formatting and appearance
You won’t use these directly, but they are everywhere
HTML Element Selection
The SelectorGadget can be used to practice selecting elements on web pages
- #id will select by ID (“name”)
- type will select all elements of that type
- .class will select all elements with that class
- sel1 sel2 will select elements matching sel2 that are inside a sel1
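The four selector forms above can be tried out in R as well as in SelectorGadget. The following is a sketch on a made-up HTML fragment (the ids, classes, and text are hypothetical), assuming rvest is installed.

```r
library(rvest)

# Made-up fragment with one id, one class, and some nesting
page <- minimal_html('
  <div id="menu">
    <p class="note">First note</p>
    <p>Plain paragraph</p>
  </div>
  <p class="note">Second note</p>
')

by_id     <- html_elements(page, "#menu")    # by ID: the single <div>
by_type   <- html_elements(page, "p")        # by type: every <p>
by_class  <- html_elements(page, ".note")    # by class: both notes
by_nested <- html_elements(page, "#menu p")  # <p> elements inside #menu

c(id = length(by_id), type = length(by_type),
  class = length(by_class), nested = length(by_nested))
```

Counting the matches is a quick way to check that a selector is neither too broad nor too narrow before extracting anything.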
HTML Anchors
Anchors (a elements) are probably the most important part of HTML
Confusingly, anchors are both links and destinations.
Anchors can reference:
- Another page (http://URL)
- A particular part of another page (http://URL#place)
- A particular part of the same page (#place)
Quarto supports cross-linking with anchors
HTML Anchors
Incoming link anchor:
<a id="fish"> Content </a>
Links to http://page#fish will go straight to that spot
Outgoing link anchor:
<a href="https://baruch.cuny.edu"> Baruch </a>
Clicking the text “Baruch” will go to the Baruch website
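When scraping, the useful part of an anchor is usually its attributes rather than its text. As a small sketch (assuming rvest is installed), the two kinds of anchor can be parsed and their id and href attributes read back:

```r
library(rvest)

# One incoming (destination) anchor and one outgoing (link) anchor
page <- minimal_html('
  <a id="fish"> Content </a>
  <a href="https://baruch.cuny.edu"> Baruch </a>
')

anchors <- html_elements(page, "a")

ids   <- html_attr(anchors, "id")    # NA where an anchor has no id
hrefs <- html_attr(anchors, "href")  # NA where an anchor has no href

data.frame(text = html_text2(anchors), id = ids, href = hrefs)
```

Note that the id is written without the # character; the # only appears in the URL that points at it.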
HTML Tables
HTML Tables are often used for showing data:
- <table> - Top level container
- <thead> and <tbody> - Separate header and body
- <tr> - Rows
- <td> - Table data (cells)
Can be much fancier - will see several examples in exercises
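As a rough illustration of how these pieces fit together, the sketch below builds a tiny table by hand and converts it with rvest. The contents are invented, and the header cells use th, the usual element for header cells, which is not in the list above.

```r
library(rvest)

# Made-up table: one header row and two body rows
page <- minimal_html('
  <table>
    <thead>
      <tr><th>Name</th><th>Count</th></tr>
    </thead>
    <tbody>
      <tr><td>Apples</td><td>3</td></tr>
      <tr><td>Pears</td><td>5</td></tr>
    </tbody>
  </table>
')

# Select the <table> element, then let html_table() build a data frame
fruit <- html_table(html_element(page, "table"))
fruit
```

Each tr becomes one row of the data frame, and the header row supplies the column names.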
rvest
The rvest package can be used to manipulate HTML in R:
- Get HTML with either read_html or, if using httr2, resp_body_html
- Select elements with html_elements("selector")
  - Same syntax as SelectorGadget
  - Use html_element if you only want the first
- Eventually need to extract content
  - html_text and html_text2 (which cleans up whitespace, closer to how a browser shows the text)
  - html_attr gets attribute values (e.g., link targets)
  - html_table will attempt to parse a table automatically
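The difference between html_elements and html_element, and between html_text and html_text2, is easiest to see side by side. This is a sketch on a made-up list, assuming rvest is installed; a real page would come from read_html or resp_body_html instead of minimal_html.

```r
library(rvest)

# Made-up list; the first item contains a line break and extra spaces
page <- minimal_html("
  <ul>
    <li>First
        item</li>
    <li>Second item</li>
  </ul>
")

items <- html_elements(page, "li")  # all matches
first <- html_element(page, "li")   # only the first match

raw_text   <- html_text(items)      # text as written, line break included
clean_text <- html_text2(items)     # whitespace tidied up

length(items)
raw_text
clean_text
```

In most scraping work html_text2 is the more convenient default, since stray line breaks and indentation in the source rarely carry meaning.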
Example
Using httr2 to get the names of all 5 Mini-Projects
[1] "Mini-Project #00: Course Set-Up"
[2] "Mini-Project #01: Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix"
[3] "Mini-Project #02: Making Backyards Affordable for All"
[4] "Mini-Project #03: Visualizing and Maintaining the Green Canopy of NYC"
[5] "Mini-Project #04: Just the Fact(-Check)s, Ma’am!"
Breakout Exercise #01
Return to your breakout rooms for Exercise #01
- Use html_element to extract the course table
- Use html_table to convert to a data frame
JavaScript Websites
JavaScript is a great tool, but R doesn’t know about it
- What you see might not be what R sees
- Make sure to consider “raw” HTML
- Often ‘hijacks’ important elements to add new features
- Focus on standard HTML elements
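A small sketch of the problem, assuming rvest is installed: the made-up page below contains a script that would fill in a paragraph when opened in a browser, but rvest only parses the raw HTML and never runs the script.

```r
library(rvest)

# Made-up page: in a browser, the script would write text into the <p>
page <- minimal_html('
  <p id="greeting"></p>
  <script>
    document.getElementById("greeting").textContent = "Hello from JavaScript";
  </script>
')

# R sees the paragraph as it appears in the raw HTML: empty
seen_by_r <- html_text(html_element(page, "#greeting"))
seen_by_r
```

If the data you want is missing from the result of read_html but visible in the browser, this is the most likely reason.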
Breakout Exercise #02
Return to your breakout rooms for Exercise #02
- Finding a table in a more complex website
- Pulling out desired data
- Making a map of results (optional)
Non-Tabular Data
Often, data will not be in a nice table
- Need to manually build a data.frame
- Build each data frame and ‘combine’
  - map |> list_rbind() idiom
  - DATA <- rbind(DATA, NEW_DATA) idiom
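Both idioms can be sketched on a made-up page of "cards". The class name, the helper card_to_df, and the content are all hypothetical; the sketch assumes rvest and purrr (version 1.0.0 or later, for list_rbind) are installed.

```r
library(rvest)
library(purrr)

# Made-up page: the data live in repeated <div> blocks, not in a table
page <- minimal_html('
  <div class="card"><h2>Alpha</h2><a href="/alpha">details</a></div>
  <div class="card"><h2>Beta</h2><a href="/beta">details</a></div>
')

cards <- html_elements(page, ".card")

# Hypothetical helper: turn one card into a one-row data frame
card_to_df <- function(card) {
  data.frame(
    name = html_text2(html_element(card, "h2")),
    link = html_attr(html_element(card, "a"), "href")
  )
}

# Idiom 1: map over the cards, then stack the pieces with list_rbind()
result_map <- seq_along(cards) |>
  map(\(i) card_to_df(cards[[i]])) |>
  list_rbind()

# Idiom 2: grow a data frame one row at a time with rbind()
result_loop <- data.frame()
for (i in seq_along(cards)) {
  result_loop <- rbind(result_loop, card_to_df(cards[[i]]))
}

result_map
```

Both approaches give the same rows here. The map version tends to scale better, because rbind inside a loop copies the growing data frame on every pass.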
Loops
Recall the basic structure of a loop in R:
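As a reminder, a minimal sketch of a for loop (the variable names are only for illustration): the body between the braces runs once for each value in the collection.

```r
# General shape: for (ITEM in COLLECTION) { do something with ITEM }
squares <- c()
for (i in 1:5) {
  # The body runs once per value of i, appending one result each time
  squares <- c(squares, i^2)
}
squares
```

The same pattern applies when the collection is a set of page numbers or URLs and the body scrapes one page per pass.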
Example
Getting info about all pre-assignments
Breakout Exercise #03
Return to your breakout rooms for Exercise #03
- Examining a multi-page site
- Pulling out data from non-tabular elements