STA 9750 Lecture #9 Pre-Class Assignment: Intro to HTML

Due Date: 2026-04-23 (Thursday) at 06:00pm (before Class Session #11)

After our discussion of getting data from standardized formats (csv, json) over the internet, we now turn to the data that makes up the bulk of the internet: data stored in HTML, i.e., data embedded within a web page. Recall our “hierarchy” of data storage preferences:

Relational Databases
Flat Files
Hierarchical Plain Text
“Reasonably Formatted” Spreadsheets
Served via API
“Unreasonable” Spreadsheets
Served via HTML
PDF

As the course has progressed, we have seen or worked with data in many of these formats. This week, we move to the most difficult data source of this course: HTML.¹ This will actually be a two-week discussion:

This week, we will focus on simply getting text into R
Next week, we will focus on transforming text into something more usable, e.g., how can we transform the string "2 oz" to the numeric value 2?

HTML Basics

HTML, short for HyperText Markup Language, is the lingua franca of the web. The vast majority of websites you visit or interact with are written using HTML. As such, HTML is ubiquitous in modern life. HTML is flexible, relatively easy to write by hand or programmatically, compressible, and has near universal built-in support on modern operating systems.

Don’t Confuse HTTP and HTML

HTTP is a protocol that controls how computers communicate with each other, while HTML is a language that specifies the contents of that communication. As a metaphor, HTTP is roughly “use a phone” while HTML is “French”.

Unfortunately, this ubiquity comes with a cost: because HTML is so universal, web browsers have been designed to “try their best” to read and render even broken HTML.² (Compare this to binary formats, where it’s very common to get a corrupted-file error with no option to ‘muddle through’ and fix it.) In response, web professionals and amateurs have released ever worse HTML upon the world, continuing a vicious cycle.

To escape this, modern programming practice has moved away from hand-written HTML in favor of tools like Markdown and Quarto, which allow correct HTML to be generated automatically. While this is a welcome trend, you will almost certainly still encounter malformatted HTML in your career, as nothing - however flawed - truly leaves the internet.

But, for now, we begin with relatively well-formatted HTML!

Right-click in your browser and view the source code of this page - what you’re seeing is HTML. It might look like a mess (and there’s a lot on this simple page!) but the basic structure of HTML actually isn’t too hard to understand.

HTML consists of a hierarchically nested set of elements that look something like this:

<p format="emph">This is some text.</p>

This whole thing is called an element and it has three main parts:

The start tag <p format="emph"> and matching end tag </p>. Here, p is the tag type, also called the element type (the two terms are used interchangeably); it is short for “paragraph” and specifies a block of paragraph text.

Occasionally, the end tag can be omitted for tags which have no internal content, such as <br>, which denotes a line break, but the end tag should typically be included.
The contents of the element, which is everything between the start and end tags. For our example, the text “This is some text.” is the contents of the p element.
A set of attributes, included in the start tag. These are essentially endlessly flexible “meta-data” of the form key="value". The most common use of these attributes is to control formatting, as in our example above, where format="emph" indicates that the paragraph should be emphasized, perhaps with bold or italics; the exact meaning is not fixed by HTML alone.
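To make these three parts concrete, here is a minimal sketch that takes the example element apart. It is written in Python purely for illustration (it is not how we will read HTML in this course) and uses only the standard-library html.parser module; the class name ElementAnatomy is a hypothetical name chosen for this example.

```python
from html.parser import HTMLParser


class ElementAnatomy(HTMLParser):
    """Record the start tag, attributes, contents, and end tag of each element."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, value) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (key, value) pairs taken from the start tag
        self.events.append(("start", tag))
        for key, value in attrs:
            self.events.append(("attribute", f'{key}="{value}"'))

    def handle_data(self, data):
        # Text between the start and end tags
        self.events.append(("contents", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = ElementAnatomy()
parser.feed('<p format="emph">This is some text.</p>')
parser.close()

for kind, value in parser.events:
    print(f"{kind:>9}: {value}")
```

Each printed line corresponds to one of the parts described above: the tag type, the key="value" attribute, the contents, and the closing tag.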

The power and flexibility of HTML comes from the fact that the contents of one element can include one or more additional elements. For instance, you might encounter an element like

<p format="emph">This semester, I am teaching <a href="https://michael-weylandt.com/STA9750">Introduction to R</a> at <a
href="https://baruch.cuny.edu">Baruch College</a>.</p>

which might render as:

This semester, I am teaching Introduction to R at Baruch College.

Here the a tag (anchor) is used to create hyperlinks, with the target specified by the href attribute. There are many more “standard” HTML tags in addition to those that websites might define for their own use. For comprehensive documentation, see the Mozilla Developer Network (MDN) Documentation.
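As a rough illustration of how nested a elements and their href attributes can be pulled out of a paragraph programmatically, the sketch below collects each link's text and target from the example above. It uses Python's standard-library html.parser only as a stand-in for the tools we will use later; LinkCollector is a hypothetical helper written for this example, and it assumes links are not nested inside one another.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


html = (
    '<p format="emph">This semester, I am teaching '
    '<a href="https://michael-weylandt.com/STA9750">Introduction to R</a> at '
    '<a href="https://baruch.cuny.edu">Baruch College</a>.</p>'
)

collector = LinkCollector()
collector.feed(html)
collector.close()

for text, href in collector.links:
    print(f"{text} -> {href}")
```

Note that the surrounding paragraph text ("This semester, I am teaching") is ignored: only the contents of the a elements are kept.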

The major ones you will use as a practicing data scientist are:

p: paragraph. This is “normal” text.
a: anchor. This gives rise to hyperlinks.
h1 to h6: Headers Level 1 to Level 6 (with Level 1 being the largest). These map to the # Header ## SubHeader ### SubSubHeader etc in Quarto.
table: table. Tables are further subdivided into:
- thead: table header. The non-data row of column names
- tbody: table body. The part of the table with the actual data.
  - tr: table row. A row of a table.
  - td: table datum, i.e., a single cell in the table.
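To see how these table elements nest, the following sketch walks a small hand-written table and groups the cell text by row. It is a minimal Python illustration using the standard-library html.parser, not the approach we will take in R. TableRows is a hypothetical helper, the sample HTML is an assumption about what a simple table looks like, and the header cells use th, a standard tag for header cells that is not in the list above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell strings per <tr>
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Whitespace between tags is ignored because no cell is open
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


html = """
<table>
  <thead>
    <tr><th>Course Code</th><th>Name</th></tr>
  </thead>
  <tbody>
    <tr><td>STA 9750</td><td>Software Tools for Data Analysis</td></tr>
    <tr><td>STA 9715</td><td>Applied Probability</td></tr>
  </tbody>
</table>
"""

parser = TableRows()
parser.feed(html)
parser.close()

header, *body = parser.rows
print(header)
for row in body:
    print(row)
```

The first row comes from thead and holds the column names; the remaining rows come from tbody and hold the data.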

Tables are obviously incredibly common for hosting data on web pages, so it’s worth being familiar with the structure. Right-click on this page and find the HTML source for the following two tables.

While you can do this by simply reviewing the HTML manually, your browser provides a much more convenient way of doing so. If you right-click on the page and select “Inspect” (at least in Firefox and Chrome; other browsers may use a different name), you will be given a ‘split window’ view in which you can hover your mouse over an HTML element and the corresponding (rendered) element will be highlighted. This makes it much easier to find the relevant element.³

First, a simple pure Markdown table:

| Course Code | Name |
|---|---|
| STA 9750 | Software Tools for Data Analysis |
| STA 9715 | Applied Probability |
| STA 9890 | Statistical Learning for Data Mining |

Next, a more complex table generated using gt:

Estimated Populations of Ten Largest Countries (2024)

| Country | 2024 Population |
|---|---:|
| India (IN/IND) | 1,450,935,791 |
| China (CN/CHN) | 1,408,975,000 |
| United States (US/USA) | 340,110,988 |
| Indonesia (ID/IDN) | 283,487,931 |
| Pakistan (PK/PAK) | 251,269,164 |
| Nigeria (NG/NGA) | 232,679,478 |
| Brazil (BR/BRA) | 211,998,573 |
| Bangladesh (BD/BGD) | 173,562,364 |
| Russia (RU/RUS) | 143,533,851 |
| Ethiopia (ET/ETH) | 132,059,767 |

As you review each table, make sure you see how the various element types fit together. In this case, you’ll see that there are some style components mixed into the HTML and that there are some class attributes which, in turn, invoke styles that are specified elsewhere in the document.

HTML Selectors

When extracting data from a website, we will typically want to select all the elements of a certain tag or with a certain attribute: e.g., all cells of a table or all bolded paragraph headers. We can do so efficiently using “CSS Selectors”.

CSS Selectors are a special language used to select multiple elements at once. The basic building blocks are as follows:

el: Selects all elements of type el.
.cls: Selects all elements with a class="cls" attribute.
#id: Selects the element with an id="id" attribute.
[attr]: Selects all elements with an attr attribute.
[attr="val"]: Selects all elements with an attr attribute equal to "val".
el1 el2: Selects all el2 elements that are within an el1 element.

We will use these to tell R what elements to import from a web page. A well constructed selector statement can usually highlight exactly the data we hope to extract.
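To build some intuition for what a selector is doing, here is a toy matcher for the basic selector forms (including #id, which selects by the id attribute). This is a minimal sketch in Python using only the standard library, not a real CSS engine: the names Element, TreeBuilder, matches, and select are all hypothetical, the sample HTML is made up, and the code assumes well-formed HTML (every start tag has an end tag) and selectors with no spaces inside brackets.

```python
from html.parser import HTMLParser


class Element:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, attrs, parent
        self.text = ""


class TreeBuilder(HTMLParser):
    """Record every element together with a pointer to its parent."""

    def __init__(self):
        super().__init__()
        self.elements = []   # all elements, in document order
        self._open = None    # innermost element that is still open

    def handle_starttag(self, tag, attrs):
        self._open = Element(tag, dict(attrs), self._open)
        self.elements.append(self._open)

    def handle_data(self, data):
        # Credit text to every open element, so an element's text
        # includes the text of its descendants
        node = self._open
        while node is not None:
            node.text += data
            node = node.parent

    def handle_endtag(self, tag):
        # Assumes well-formed input: an end tag closes the innermost element
        if self._open is not None:
            self._open = self._open.parent


def matches(element, simple):
    """Test one element against one simple selector."""
    if simple.startswith("."):   # .cls
        return simple[1:] in (element.attrs.get("class") or "").split()
    if simple.startswith("#"):   # #id
        return element.attrs.get("id") == simple[1:]
    if simple.startswith("["):   # [attr] or [attr="val"]
        name, has_value, value = simple[1:-1].partition("=")
        if name not in element.attrs:
            return False
        return not has_value or element.attrs[name] == value.strip("\"'")
    return element.tag == simple  # el


def select(elements, selector):
    """Return the elements matching a selector such as 'li a'."""
    *ancestors, target = selector.split()
    found = []
    for element in elements:
        if not matches(element, target):
            continue
        # Walk up the tree, matching the ancestor steps from right to left
        remaining = list(ancestors)
        node = element.parent
        while remaining and node is not None:
            if matches(node, remaining[-1]):
                remaining.pop()
            node = node.parent
        if not remaining:
            found.append(element)
    return found


html = """
<main id="main">
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b" class="ext big">Second</a></li>
  </ul>
  <p class="note">See <a href="/c">third</a>.</p>
</main>
"""

page = TreeBuilder()
page.feed(html)
page.close()

for selector in ["a", "li a", ".ext", '[href="/c"]', "#main a"]:
    print(selector, "->", [e.text for e in select(page.elements, selector)])
```

Notice that li a returns only the two links inside list items, while a and #main a return all three: adding an ancestor step narrows the match.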

For now, however, you will practice using CSS Selectors from within your browser.

Right-click the following link and add it to your bookmarks. Whenever you’re on a website, you can click that bookmark to open the CSS SelectorGadget.⁴

SelectorGadget.

If you’re on Google Chrome, you may instead use the SelectorGadget extension for the same effect.

Upon clicking, you will see a toolbar at the bottom of the page. If you type a CSS selector statement into that toolbar, it will highlight all elements on this page that match that selector. For now try a simple a and hit enter: you should see all links on the page highlighted. You can also try more advanced CSS selectors: li a will select all links (a) within list items (li), such as those appearing in the navigation bar at the top of the page.

You can also use SelectorGadget to create CSS Selectors. If you click several items that you want to select, SelectorGadget will attempt to create a suitable selector command. (You might need to Clear the input area before trying this.) For instance, try clicking a link in this text and seeing what SelectorGadget automatically selects for you. In this case, SelectorGadget comes up with a for all links on the page. If we want to exclude the links in the navigation bar, we can click them again, marking them in red and SelectorGadget will attempt to exclude them. Here, it creates a CSS selector that selects only links within the main body of the page. For our purposes, two clicks are enough, but you could extend this further. SelectorGadget isn’t perfect, but it’s often a very good starting guess.

Let’s practice using SelectorGadget to create and to verify potential selectors for some different pages:

Open the rvest Star Wars example page in a new tab and use SelectorGadget to select the 7 movie names in the main section. We want only the movie names and not the text below them. We also don’t want the clickable links in the sidebar.

Solution

Possible valid selectors are #main h2 (all second-level headers within the element whose ID is main) or just main h2 (second-level headers within an element of type main). The fact that there is a main-type element whose ID is also main is a bit confusing, but not uncommon.

If you just use h2, you will also pick up the links in the navigation bar on the side of the page.

As the name suggests, the site scrapethissite.com is designed to provide some simple examples on which we can practice webscraping. Open the “Countries of the World” example page and use the SelectorGadget to find a selector for the country capitals, but not the populations or areas.

Solution

You should wind up with the selector .country-capital, which will select all elements with the class country-capital. On this site, all three facts for each country are of type span, so we need to further identify which spans we’re interested in. Since the HTML looks like:
```
<div class="country-info">
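 Capital: <span class="country-capital">Yerevan</span>
 Population: <span class="country-population">2968000</span>
 Area (km2): <span class="country-area">29800.0</span>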
</div>
```
for each country, we can see that selecting on the class gets us exactly what we want.

Next, open the Wikipedia page listing all CUNY Colleges and confirm that the tbody selector selects the entirety of the main table.

Note that if you use SelectorGadget here to create your selector, you might get something like .jquery-tablesorter instead of tbody. For reasons we will discuss in class, this won’t work in R.

Note: Sometimes SelectorGadget seems to choke on Wikipedia pages, as they are rather complicated. If this happens to you, feel free to skip this exercise and move on.

Finally, open the Baruch College Wikipedia page and create a selector for just the GPS coordinates in the top right corner of the page. You should try to select just the coordinates themselves and not the text “Coordinates” preceding them.

After finishing this document, complete the Weekly Pre-Assignment Quiz on Brightspace.

Footnotes

¹ We won’t cover reading data from PDFs in this course.
² If you are of a certain age, you will remember an era when websites would work in one browser and not others. Proper HTML should work in all browsers, but each browser had its own way of handling malformatted HTML. Developers were, in essence, requiring users to use a piece of software that would automatically correct their mistakes. These were dark times…
³ I slightly prefer Firefox for this as it also provides a visual indicator of whether the highlighted element is above or below the visible part of the page, but the functionality is otherwise pretty interchangeable.
⁴ All credit to Andrew Cantino at https://selectorgadget.com/. Use here inspired by the rvest documentation.