HTML
HTML - HyperText Markup Language - is the language used to write web pages
We have avoided writing HTML directly, in favor of Markdown
But you will have to read HTML
In a web browser, right-click on a page and choose “View Source”
HTML Elements
HTML is written as a nested series of elements:
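As a rough illustration of nesting, the sketch below builds a tiny made-up page with rvest's minimal_html() and looks at which elements sit inside which. The page content is invented for this example.

```r
library(rvest)

# A tiny, made-up page: <a> is nested in <p>, which is nested in <div>
page <- minimal_html('
  <div>
    <h1>Title</h1>
    <p>Some <a href="https://example.com">linked</a> text</p>
  </div>
')

# Element children of the <div>: the heading and the paragraph
div_children <- page |> html_element("div") |> html_children() |> html_name()
div_children

# The <a> sits one level deeper, inside the <p>
p_children <- page |> html_element("p") |> html_children() |> html_name()
p_children
```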
Important Elements
There are many HTML elements; some important ones are:
body: Body (the ‘meat’ of a page)
h1, h2, …: Headers
div: Division (generic container)
p: Paragraph (most text goes in these)
table: Table (often how data is displayed)
ul, ol: Lists (unordered and ordered)
a: Anchors (used for linking)
Important Elements
Other useful elements:
script: JavaScript - the code that implements interactivity
style: CSS - formatting and appearance
You won’t use these directly, but they are everywhere
HTML Element Selection
The SelectorGadget can be used to practice selecting elements on web pages
#id will select by ID (“name”)
type will select all elements of that type
.class will select all elements with that class
sel1 sel2 will select elements matching sel2 that are inside a sel1
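The four selector forms above can be tried out in R without SelectorGadget. This is a minimal sketch on an invented page; the id "menu" and the class "item" are made up for the example.

```r
library(rvest)

# Made-up page with one ID, one class, and a paragraph outside the container
page <- minimal_html('
  <div id="menu">
    <p class="item">Burger</p>
    <p class="item">Fries</p>
  </div>
  <p>Footer</p>
')

by_id     <- page |> html_elements("#menu")                   # select by ID
by_type   <- page |> html_elements("p") |> html_text2()       # every p element
by_class  <- page |> html_elements(".item") |> html_text2()   # every element with class "item"
by_nested <- page |> html_elements("#menu p") |> html_text2() # p elements inside #menu only

by_type
by_nested
```

Note that the nested selector leaves out the footer paragraph, because it is not inside the element with id "menu".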
HTML Anchors
Anchors (a elements) are probably the most important part of HTML
Confusingly, anchors are both links and destinations.
Anchors can reference:
Another page (http://URL)
A particular part of another page (http://URL#place)
A particular part of the same page (#place)
Quarto supports cross-linking with anchors
HTML Anchors
Incoming link anchor:
<a id="fish"> Content </a>
Links to http://page#fish will go straight to that spot
Outgoing link anchor:
<a href="https://baruch.cuny.edu"> Baruch </a>
Clicking the text “Baruch” will go to the Baruch website
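When scraping, the link target is usually what you want from an anchor. A small sketch, assuming an invented page with one destination anchor and two outgoing links, of pulling both the link text and the href attribute:

```r
library(rvest)

# Made-up page: one destination anchor (id only) and two links (href)
page <- minimal_html('
  <a id="fish">Content</a>
  <a href="https://baruch.cuny.edu">Baruch</a>
  <a href="#fish">Jump to fish</a>
')

# The attribute selector a[href] keeps only anchors that link somewhere
links <- page |> html_elements("a[href]")

link_text    <- html_text2(links)         # what the reader clicks on
link_targets <- html_attr(links, "href")  # where the click goes

link_text
link_targets
```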
HTML Tables
HTML Tables are often used for showing data:
<table> - Top level container
<thead> and <tbody> - Separate header and body
<tr> - Rows
<td> - Table data (cells)
Can be much fancier - will see several examples in exercises
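To see how these pieces fit together, here is a minimal sketch of a small invented table, with rows written as tr elements, turned into a data frame. The th header cells and the column names are assumptions added for the example.

```r
library(rvest)

# Made-up table: a header row in <thead>, two data rows in <tbody>
page <- minimal_html('
  <table>
    <thead>
      <tr><th>Name</th><th>Score</th></tr>
    </thead>
    <tbody>
      <tr><td>A</td><td>1</td></tr>
      <tr><td>B</td><td>2</td></tr>
    </tbody>
  </table>
')

# html_table() uses the header row for column names
scores <- page |> html_element("table") |> html_table()
scores
```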
rvest
The rvest package can be used to manipulate HTML in R:
Get HTML by either read_html or resp_body_html if using httr2
Select elements with html_elements("selector")
Same syntax as SelectorGadget
Use html_element if you only want the first
Eventually need to extract content
html_text and html_text2 (html_text2 cleans up whitespace, roughly as a browser would)
html_attr gets attribute values (e.g., link targets)
html_table will attempt to parse a table automatically
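The difference between the two text extractors is easiest to see side by side. A small sketch on an invented paragraph with messy spacing:

```r
library(rvest)

# Made-up paragraph with runs of spaces and a line break in the source
page <- minimal_html("<p>First   line<br>second\n   line</p>")
p <- page |> html_element("p")

raw_text   <- html_text(p)   # text exactly as stored in the HTML
clean_text <- html_text2(p)  # whitespace tidied, closer to what a browser displays

raw_text
clean_text
```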
Example
Using rvest to get the names of all 5 Mini-Projects
library(rvest)
read_html("https://michael-weylandt.com/STA9750/miniprojects.html") |>
    html_element("#mini-projects") |>
    html_elements("h4") |>
    html_text2()
[1] "Mini-Project #00: Course Set-Up"
[2] "Mini-Project #01: Gourmet Cheeseburgers Across the Globe: Exploring the Most Popular Programming on Netflix"
[3] "Mini-Project #02: Making Backyards Affordable for All"
[4] "Mini-Project #03: Visualizing and Maintaining the Green Canopy of NYC"
[5] "Mini-Project #04: Just the Fact(-Check)s, Ma’am!"
Breakout Exercise #01
Return to your breakout rooms for Exercise #01
Use html_element to extract the course table
Use html_table to convert to a data frame
JavaScript Websites
JavaScript is a great tool, but R doesn’t run it when reading a page
What you see might not be what R sees
Make sure to consider “raw” HTML
JavaScript often ‘hijacks’ important elements to add new features
Focus on standard HTML elements
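A tiny invented page makes the gap concrete. In a browser, the script below would fill in the div; rvest only parses the raw HTML, so the div stays empty. The id "out" is made up for this sketch.

```r
library(rvest)

# Made-up page: the visible text only exists after the script runs in a browser
page <- minimal_html('
  <div id="out"></div>
  <script>document.getElementById("out").textContent = "Hello";</script>
')

# rvest never executes the script, so R sees the div as written in the raw HTML
seen_by_r <- page |> html_element("#out") |> html_text()
seen_by_r
```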
Breakout Exercise #02
Return to your breakout rooms for Exercise #02
Finding a table in a more complex website
Pulling out desired data
Making a map of results (optional)
Non-Tabular Data
Often, data will not be in a nice table
Need to manually build a data.frame
Build each data frame and ‘combine’
map |> list_rbind() idiom
DATA <- rbind(DATA, NEW_DATA) idiom
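The map |> list_rbind() idiom might look like the sketch below. The HTML fragments stand in for pages you would normally download, and parse_one is a hypothetical helper written for this example; it is not part of rvest or purrr.

```r
library(rvest)
library(purrr)
library(tibble)

# Made-up fragments standing in for separate pages or sections
fragments <- c(
  "<h4>One</h4><p>First item</p>",
  "<h4>Two</h4><p>Second item</p>"
)

# Hypothetical helper: turn one fragment into a one-row data frame
parse_one <- function(html) {
  page <- minimal_html(html)
  tibble(
    name        = page |> html_element("h4") |> html_text2(),
    description = page |> html_element("p") |> html_text2()
  )
}

# Build each small data frame, then stack them into one
results <- fragments |>
  map(parse_one) |>
  list_rbind()

results
```

Compared with growing a data frame inside a loop, this keeps the per-item logic in one function that can be tested on a single item first.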
Loops
Recall the basic structure of a loop in R:
for (item in container) {
    do_something_with(item)
}
Accumulate data
all_results <- tibble()
for (item in container) {
    new_results <- do_something_with(item)
    # val1 and val2 stand for values pulled out of new_results
    new_tibble <- tibble(col1 = val1, col2 = val2)
    all_results <- rbind(all_results, new_tibble)
}
Example
Getting info about all pre-assignments
library(rvest)
library(tibble)  # provides tibble()
preassigns <- read_html("https://michael-weylandt.com/STA9750/preassignments.html") |>
    html_element("#pre-assignments") |>
    html_elements("section")
pa_details <- tibble()
for (pa in preassigns) {
    # Within each section: the h4 gives the name, the last paragraph the description
    name <- pa |> html_element("h4") |> html_text2()
    description <- pa |> html_elements("p:last-of-type") |> html_text2()
    pa_df <- tibble(name = name, description = description)
    pa_details <- rbind(pa_details, pa_df)
}
pa_details
Breakout Exercise #03
Return to your breakout rooms for Exercise #03
Examining a multi-page site
Pulling out data from non-tabular elements