Estimated Populations of Ten Largest Countries (2024)

| Country | 2024 Population |
|---|---|
| India (IN/IND) | 1,450,935,791 |
| China (CN/CHN) | 1,408,975,000 |
| United States (US/USA) | 340,110,988 |
| Indonesia (ID/IDN) | 283,487,931 |
| Pakistan (PK/PAK) | 251,269,164 |
| Nigeria (NG/NGA) | 232,679,478 |
| Brazil (BR/BRA) | 211,998,573 |
| Bangladesh (BD/BGD) | 173,562,364 |
| Russia (RU/RUS) | 143,533,851 |
| Ethiopia (ET/ETH) | 132,059,767 |
STA 9750 Lecture #9 Pre-Class Assignment: Intro to HTML
Due Date: 2026-04-23 (Thursday) at 06:00pm (before Class Session #11)
Submission: CUNY Brightspace
After our discussion of getting data from standardized formats (CSV, JSON) over the internet, we now turn to the data that makes up the bulk of the internet: data stored in HTML, i.e., data embedded within a web page. Recall our “hierarchy” of data storage preferences:
- Relational Databases
- Flat Files
- Hierarchical Plain Text
- “Reasonably Formatted” Spreadsheets
- Served via API
- “Unreasonable” Spreadsheets
- Served via HTML
As the course has progressed, we have seen or worked with data in many of these formats. This week, we move to the most difficult data source of this course: HTML.[1] This will actually be a two-week discussion:
- This week, we will focus on simply getting text into `R`.
- Next week, we will focus on transforming text into something more usable, e.g., how can we transform the string `"2 oz"` to the numeric value `2`?
HTML Basics
HTML, short for HyperText Markup Language, is the lingua franca of the web. The vast majority of websites you visit or interact with are written using HTML. As such, HTML is ubiquitous in modern life. HTML is flexible, relatively easy to write by hand or programmatically, compressible, and has near universal built-in support on modern operating systems.
Don’t confuse HTTP and HTML: HTTP is a protocol that controls how computers communicate with each other, while HTML is a language that specifies the contents of that communication. As a metaphor, HTTP is roughly “use a phone” while HTML is “French”.
Unfortunately, this ubiquity comes with a cost: because HTML is so universal, web browsers have been designed to “try their best” to read and render even broken HTML.[2] (Compare this to binary formats, where it’s very common to get a corrupted file error with no option to ‘muddle through’ and fix it.) In response, web professionals and amateurs have released ever worse HTML upon the world, continuing a vicious cycle.
To escape this, modern programming practice has moved away from hand-written HTML in favor of tools like Markdown and Quarto, which allow correct HTML to be generated automatically. While this is a welcome trend, you will almost certainly still encounter malformatted HTML in your career, as nothing - however flawed - truly leaves the internet.
But, for now, we begin with relatively well-formatted HTML!
Right-click in your browser and view the source code of this page - what you’re seeing is HTML. It might look like a mess (and there’s a lot on this simple page!) but the basic structure of HTML actually isn’t too hard to understand.
HTML consists of a hierarchically nested set of elements that look something like this:
<p format="emph">This is some text.</p>
This whole thing is called an element and it has three main parts:
- The start tag `<p>` and matching end tag `</p>`. Here, `p` is the type of the tag or of the element (the terms are used interchangeably), short for “paragraph” and used to specify a block of paragraph text.

  Occasionally, the end tag can be omitted for elements which have no internal content, such as `<br>`, which denotes a line break, but the end tag should typically be included.
- The contents of the element, which is everything between the start and end tags. For our example, the text “This is some text.” is the contents of the `p` element.
- A set of attributes, included in the start tag. These are essentially endlessly flexible “meta-data” of the form `key="value"`. The most common use of these attributes is to control formatting, as in our example above, which indicates that the paragraph should be emphasized, perhaps by bold or italics, but the meaning is not fixed by HTML alone.
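The three parts described above can also be pulled apart programmatically. The following is a minimal sketch using Python's standard library, chosen purely for illustration (this course does the equivalent work in R). It relies on the XML parser `xml.etree.ElementTree`, which accepts this snippet only because it is well-formed; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The example element from the text. ElementTree is an XML parser, so this
# only works because the snippet is well-formed (every tag is closed).
snippet = '<p format="emph">This is some text.</p>'
element = ET.fromstring(snippet)

tag = element.tag            # the type of the element
attributes = element.attrib  # the key="value" pairs from the start tag
contents = element.text      # everything between the start and end tags

print(tag)         # p
print(attributes)  # {'format': 'emph'}
print(contents)    # This is some text.
```

Real web pages are rarely this tidy, so a strict parser like this one would likely fail on them; it is used here only to make the tag / attributes / contents split concrete.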
The power and flexibility of HTML comes from the fact that the contents of one element can include one or more additional elements. For instance, you might encounter an element like
<p format="emph">This semester, I am teaching <a href="https://michael-weylandt.com/STA9750">Introduction to R</a> at <a
href="https://baruch.cuny.edu">Baruch College</a>.</p>
which might render as:
This semester, I am teaching Introduction to R at Baruch College.
Here the `a` tag (anchor) is used to create hyperlinks, with the target specified by the `href` attribute. There are many more “standard” HTML tags in addition to those that websites might define for their own use. For comprehensive documentation, see the Mozilla Developer Network (MDN) Documentation.
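To see how nesting works in practice, here is a small sketch in Python's standard library (used only for illustration; the course itself works in R). It parses the paragraph above, lists each nested `a` element with its `href` target, and reassembles the rendered text. It assumes the snippet is well-formed, which is what lets the XML parser read it.

```python
import xml.etree.ElementTree as ET

# The nested example from the text, joined into a single string.
snippet = (
    '<p format="emph">This semester, I am teaching '
    '<a href="https://michael-weylandt.com/STA9750">Introduction to R</a> at '
    '<a href="https://baruch.cuny.edu">Baruch College</a>.</p>'
)
paragraph = ET.fromstring(snippet)

# Collect (link text, target) pairs from every <a> nested inside the <p>.
links = [(a.text, a.get("href")) for a in paragraph.iter("a")]

# itertext() walks the whole subtree, so nested elements contribute their text.
full_text = "".join(paragraph.itertext())

for text, target in links:
    print(f"{text} -> {target}")
print(full_text)
```

Note that the link text belongs both to the `a` elements and, through nesting, to the enclosing `p` element.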
The major ones you will use as a practicing data scientist are:
- `p`: paragraph. This is “normal” text.
- `a`: anchor. This gives rise to hyperlinks.
- `h1` to `h6`: Headers Level 1 to Level 6 (with Level 1 being the largest). These map to the `# Header`, `## SubHeader`, `### SubSubHeader`, etc. in Quarto.
- `table`: table. Tables are further subdivided into:
  - `thead`: table header. The non-data row of column names.
  - `tbody`: table body. The part of the table with the actual data.
    - `tr`: table row. A row of a table.
    - `td`: table datum, i.e., a single cell in the table.
Tables are obviously incredibly common for hosting data on web pages, so it’s worth being familiar with the structure. Right click on this page and find the HTML source for the following two tables.
While you can do this by simply reviewing the HTML manually, your browser provides a much more convenient way of doing so. If you right click on the page and select “Inspect” (at least in Firefox and Chrome; other browsers may have other names), you will be given a ‘split window’ view in which you can hover your mouse over an HTML element and the corresponding (rendered) element will be highlighted. This makes it much easier to find the relevant element.[3]
First, a simple pure Markdown table:
| Course Code | Name |
|---|---|
| STA 9750 | Software Tools for Data Analysis |
| STA 9715 | Applied Probability |
| STA 9890 | Statistical Learning for Data Mining |
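Next, a more complex table generated using `gt` (the “Estimated Populations of Ten Largest Countries (2024)” table, which appears at the top of this document):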
As you review each table, make sure you see how the various element types fit together. In this case, you’ll see that there are some `style` components mixed into the HTML and that there are some `class` attributes which, in turn, invoke styles that are specified elsewhere in the document.
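The way the table elements fit together can be sketched in code. The example below uses Python's standard library purely for illustration (the course does this in R) and a hand-written, well-formed HTML version of the course table; it is an assumption about what a stripped-down table looks like, not a copy of this page's source. It also uses `th`, the standard tag for a header cell, which is not in the list above.

```python
import xml.etree.ElementTree as ET

# A hand-written, well-formed HTML version of the course table above.
# Real pages usually add style and class attributes on top of this skeleton.
table_html = """
<table>
  <thead>
    <tr><th>Course Code</th><th>Name</th></tr>
  </thead>
  <tbody>
    <tr><td>STA 9750</td><td>Software Tools for Data Analysis</td></tr>
    <tr><td>STA 9715</td><td>Applied Probability</td></tr>
    <tr><td>STA 9890</td><td>Statistical Learning for Data Mining</td></tr>
  </tbody>
</table>
""".strip()
table = ET.fromstring(table_html)

# Column names live in the header row; th is the header-cell tag.
header = [cell.text for cell in table.findall("./thead/tr/th")]

# Each tr in tbody is one data row; each td is one cell.
rows = [[cell.text for cell in tr.findall("td")]
        for tr in table.findall("./tbody/tr")]

# Pair every row with the column names to get one record per row.
records = [dict(zip(header, row)) for row in rows]
print(records[0])
```

Walking `thead` for names and `tbody` for data mirrors the hierarchy listed earlier: table, then header or body, then rows, then cells.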
HTML Selectors
When extracting data from a website, we will typically want to select all the elements of a certain tag or with a certain attribute: e.g., all cells of a table or all bolded paragraph headers. We can do so efficiently using “CSS Selectors”.
CSS Selectors are a special language used to select multiple elements at once. The basic building blocks are as follows:
- `el`: Selects all elements of type `el`, i.e., those delimited by an `el` tag.
- `.cls`: Selects all elements with a `class="cls"` attribute.
- `#name`: Selects the element with an `id="name"` attribute.
- `[attr]`: Selects all elements with an `attr` attribute.
- `[attr="val"]`: Selects all elements with an `attr` attribute equal to `"val"`.
- `el1 el2`: Selects all `el2` elements that are within an `el1` element.
We will use these to tell R what elements to import from a web page. A well constructed selector statement can usually highlight exactly the data we hope to extract.
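To make the selector types above concrete, here is a rough sketch in Python's standard library (for illustration only; the course uses R). Python's standard library has no CSS selector engine, so this swaps in the limited XPath-style paths supported by `xml.etree.ElementTree` as approximate equivalents. The tiny page and all names in it are made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page: one navigation link and two paragraphs.
page = ET.fromstring("""
<body>
  <ul><li><a href="/home">Home</a></li></ul>
  <p class="intro">Welcome to <a href="/course">the course</a>.</p>
  <p>Plain paragraph.</p>
</body>
""".strip())

# CSS selector      -> ElementTree path (approximate equivalent)
# a                 -> .//a
# .intro            -> .//*[@class='intro']   (exact match on the attribute)
# [href]            -> .//*[@href]
# [href="/home"]    -> .//*[@href='/home']
# li a              -> .//li//a
all_links = [a.text for a in page.findall(".//a")]
intro = [el.tag for el in page.findall(".//*[@class='intro']")]
with_href = len(page.findall(".//*[@href]"))
home = [a.text for a in page.findall(".//*[@href='/home']")]
nav_links = [a.text for a in page.findall(".//li//a")]

print(all_links)  # every link on the page
print(nav_links)  # only links nested inside a list item
```

One difference worth knowing: a true `.intro` selector also matches elements with several classes (e.g., `class="intro wide"`), while the exact attribute comparison used here would not.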
For now, however, you will practice using CSS Selectors from within your browser.
Right click the following link and add it to your bookmarks. Whenever you’re on a website, you can click that bookmark to open the CSS SelectorGadget.[4]
If you’re on Google Chrome, you may instead use the SelectorGadget extension for the same effect.
Upon clicking, you will see a toolbar at the bottom of the page. If you type a CSS selector statement into that toolbar, it will highlight all elements on this page that match that selector. For now, try a simple `a` and hit enter: you should see all links on the page highlighted. You can also try more advanced CSS selectors: `li a` will select all links (`a`) within list items (`li`), such as those appearing in the navigation bar at the top of the page.
You can also use SelectorGadget to create CSS Selectors. If you click several items that you want to select, SelectorGadget will attempt to create a suitable selector command. (You might need to Clear the input area before trying this.) For instance, try clicking a link in this text and seeing what SelectorGadget automatically selects for you. In this case, SelectorGadget comes up with `a` for all links on the page. If we want to exclude the links in the navigation bar, we can click them again, marking them in red, and SelectorGadget will attempt to exclude them. Here, it creates a CSS selector that selects only links within the main body of the page. For our purposes, two clicks are enough, but you could extend this further. SelectorGadget isn’t perfect, but it’s often a very good starting guess.
Let’s practice using SelectorGadget to create and to verify potential selectors for some different pages:
- Open the `rvest` Star Wars example page in a new tab and use SelectorGadget to select the 7 movie names in the main section. We want only the movie names and not the text below them. We also don’t want the clickable links in the sidebar.

  Solution: Possible valid selectors are `#main h2` (all second-level headers within the element with the ID `main`) or just `main h2` (second-level headers within an element of type `main`). The fact that there is a `main` type element with an ID of `main` is a bit confusing, but not uncommon.

  If you just use `h2`, you will also be picking up the links in the navigation bar on the side of the page.
- As the name suggests, the site `scrapethissite.com` is designed to provide some simple examples on which we can practice webscraping. Open the “Countries of the World” example page and use SelectorGadget to find a selector for the country capitals, but not the populations or areas.

  Solution: You should wind up with the selector `.country-capital`, which will select all elements with the class `country-capital`. On this site, all three facts for each country are of type `span`, so we need to further identify which `span`s we’re interested in. Since the HTML looks like:

      <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Yerevan</span><br>
        <strong>Population:</strong> <span class="country-population">2968000</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">29800.0</span><br>
      </div>

  for each country, we can see that selecting on the `class` gets us exactly what we want.
- Next, open the Wikipedia page listing all CUNY Colleges and confirm that the `tbody` selector selects the entirety of the main table.

  Note that if you use SelectorGadget here to create your selector, you might get something like `.jquery-tablesorter` instead of `tbody`. For reasons we will discuss in class, this won’t work in `R`.

  Note: Sometimes SelectorGadget seems to choke on processing Wikipedia pages as they are rather complicated. If this is happening to you, feel free to move on and skip this.
- Finally, open the Baruch College Wiki page and create a selector for just the GPS coordinates in the top right corner of the page. You should try to select just the coordinates themselves and not the text “Coordinates” preceding them.
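As a rough illustration of what a class-based selector does, the sketch below applies the idea behind `.country-capital` to the HTML snippet from the second exercise. It is written with Python's standard library for illustration only (the course does this in R), and the `ClassTextCollector` helper is a hypothetical name invented for this example. It handles only `span` elements with an exactly matching `class`, so it is a simplification of a real selector engine.

```python
from html.parser import HTMLParser

# Snippet copied from the solution above. The <br> tags are never closed, so a
# strict XML parser would reject it; html.parser is more forgiving.
snippet = """
<div class="country-info">
  <strong>Capital:</strong> <span class="country-capital">Yerevan</span><br>
  <strong>Population:</strong> <span class="country-population">2968000</span><br>
  <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">29800.0</span><br>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every <span> whose class equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.inside = False   # True while we are inside a matching span
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == self.target_class:
            self.inside = True
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.matches[-1] += data


collector = ClassTextCollector("country-capital")
collector.feed(snippet)
collector.close()
capitals = collector.matches
print(capitals)  # ['Yerevan']
```

Swapping the target to `"country-population"` would pick out the population instead, which is why selecting on the class is enough to separate the three `span`s.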
After finishing this document, complete the Weekly Pre-Assignment Quiz on Brightspace.
Footnotes
1. We won’t cover reading data from PDFs in this course.
2. If you are of a certain age, you will remember an era when websites would work in one browser and not others. Proper HTML should work in all browsers, but each browser had its own way of handling malformatted HTML. Developers were, in essence, requiring users to use a piece of software that would automatically correct their mistakes. These were dark times…
3. I slightly prefer Firefox for this as it also provides a visual indicator of whether the highlighted element is above or below the visible part of the page, but the functionality is otherwise pretty interchangeable.
4. All credit to Andrew Cantino at https://selectorgadget.com/. Use here inspired by the `rvest` documentation.