Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • Understand the basics of HTML and CSS
  • Use the rvest package to scrape data from web pages
    • read_html() for reading in an HTML page
    • html_element(), html_elements() for selecting elements
    • html_text(), html_table() for parsing elements to text or tables

HTML basics

https://html-css-js.com/

Let’s play

Press F12 (or right click and select “Inspect”) to open Developer Tools on this page: https://en.wikipedia.org/wiki/United_States

CSS - Under the Styles tab:

  • find the section for body{}, click the color box next to color, and change the text color to anything you like.

  • find the html, body {} section, and change the font-family from sans-serif to serif or fantasy.

Let’s play

HTML - Under the Elements tab:

  • Use the element picker (the arrow icon in the top left corner of the Developer Tools) to select the title “United States” and see the corresponding HTML highlighted in the Elements tab.

  • Change the text to “United States of America” and hit Enter to see the change on the webpage. (This change is only temporary and will be lost when you refresh the page.)

  • Scroll down to a table under Demographics > Population and change the values in a table cell.

Basic table structure

This example shows an HTML table with a <tbody> element. The table is defined row-wise: <tr> defines a table row and <td> defines a table data cell:

The HTML code:

<table>
<tbody>
<tr>
  <td>Name</td>
  <td>Region</td>
  <td>Population</td>
</tr>
<tr>
  <td>New York</td>
  <td>Northeast</td>
  <td>19,940,274</td>
</tr>
</tbody>
</table>

renders as:

Name      Region     Population
New York  Northeast  19,940,274
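
The row-wise structure can also be seen from R. The following is a minimal sketch, assuming the rvest package is installed; it wraps the same table in a small document with minimal_html() and pulls out the rows and cells by tag name. The object names page, rows, and cells are illustrative only.

```r
library(rvest)

# The same two-row table as above, wrapped in a minimal HTML document.
page <- minimal_html("
<table>
<tbody>
<tr><td>Name</td><td>Region</td><td>Population</td></tr>
<tr><td>New York</td><td>Northeast</td><td>19,940,274</td></tr>
</tbody>
</table>
")

# Each <tr> is one row of the table.
rows <- page |> html_elements("tr")
length(rows)

# Each <td> inside a row is one cell; collect the cell text row by row.
cells <- lapply(rows, function(row) row |> html_elements("td") |> html_text())
cells[[2]]
```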

Summary

HTML

  • Every HTML page must be in an <html> element, and it must have two children: <head>, which contains document metadata like the page title, and <body>, which contains the content you see in the browser.

  • Block tags like <h1> (heading 1), <p> (paragraph), <table> (table), and <ol> (ordered list) form the overall structure of the page.
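
To check this structure from R, here is a small sketch, assuming the rvest package is installed. The page content is made up for illustration; html_children() and html_name() list the child elements and their tag names.

```r
library(rvest)

# A made-up page with a heading, a paragraph, and an ordered list.
page <- minimal_html("
<h1>A heading</h1>
<p>A paragraph.</p>
<ol><li>First</li><li>Second</li></ol>
")

# Tag names of the children of the root <html> element.
top_level <- page |> html_children() |> html_name()
top_level

# Tag names of the block elements sitting directly inside <body>.
body_blocks <- page |> html_element("body") |> html_children() |> html_name()
body_blocks
```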

CSS

  • CSS is short for Cascading Style Sheets, and is a tool for defining the visual styling of HTML documents.

  • CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.
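
As a rough illustration of the three most common selector types (tag, class, and id), the sketch below uses a made-up page; the class name intro and the paragraph text are assumptions for the example, not part of any real site.

```r
library(rvest)

# A made-up page: one heading with an id, one paragraph with a class.
page <- minimal_html("
<h1 id='welcome'>Page title</h1>
<p class='intro'>First paragraph.</p>
<p>Second paragraph.</p>
")

# Tag selector: every <p> element.
by_tag <- page |> html_elements("p") |> html_text()

# Class selector: elements with class 'intro' (note the leading dot).
by_class <- page |> html_elements(".intro") |> html_text()

# ID selector: the element with id 'welcome' (note the leading hash).
by_id <- page |> html_element("#welcome") |> html_text()
```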

Web scraping with rvest

Toy example

library(rvest)

html <- minimal_html("
<head>
<title></title>
</head>
<body>
<h1 id='welcome'>Page title</h1>
<p>Welcome to HTML-CSS-JS.com</p>
<p>Online HTML, CSS and JavaScript editor with instant preview.</p>
  
<table>
<tr>
<td>Name</td>
<td>Region</td>
<td>Population</td>
</tr>
<tr>
<td>New York</td>
<td>Northeast</td>
<td>19,940,274</td>
</tr>
</table>
</body>
")
html
{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<h1 id="welcome">Page title</h1>\n<p>Welcome to HTML-CSS-JS.com</ ...

Extract elements

  • html_element(): extract the first matching element

  • html_elements(): extract all matching elements

  • html_text(): parse the results from html_element()/ html_elements() to text

  • html_table(): parse the results from html_element()/ html_elements() to a table (data frame)

Example:

Take the html object created above, extract the <h1> element and parse it to text.

html |> html_element("h1")
{html_node}
<h1 id="welcome">
html |> html_element("h1") |> html_text()
[1] "Page title"
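
The difference between the two selection functions matters most when nothing matches. The sketch below, on a made-up page without any table, suggests how each one behaves in that case; the object names are illustrative.

```r
library(rvest)

# A made-up page that contains no <table> element.
page <- minimal_html("<h1>Page title</h1><p>Only paragraph.</p>")

# html_element() returns a single result, even when nothing matches,
# so the parsed text is a missing value.
missing_one <- page |> html_element("table") |> html_text()

# html_elements() returns every match, so no match gives an empty result.
missing_all <- page |> html_elements("table") |> html_text()
```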

Take the html object created above, extract all <p> elements and parse them to text.

html |> html_elements("p") |> html_text()
[1] "Welcome to HTML-CSS-JS.com"                                  
[2] "Online HTML, CSS and JavaScript editor with instant preview."

Take the html object created above, extract the <table> element and parse it to a data frame.

html |> html_element("table") |> html_table()
# A tibble: 2 × 3
  X1       X2        X3        
  <chr>    <chr>     <chr>     
1 Name     Region    Population
2 New York Northeast 19,940,274
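
Because the toy table uses <td> for its first row, the column names default to X1, X2, X3. One possible way to tidy this, sketched below, is the header argument of html_table(), followed by a base R conversion of the population text to a number. The object names page and states are illustrative.

```r
library(rvest)

page <- minimal_html("
<table>
<tr><td>Name</td><td>Region</td><td>Population</td></tr>
<tr><td>New York</td><td>Northeast</td><td>19,940,274</td></tr>
</table>
")

# header = TRUE promotes the first row to column names instead of X1, X2, X3.
states <- page |> html_element("table") |> html_table(header = TRUE)
names(states)

# Population is read as text because of the thousands separators;
# strip the commas before converting to a number.
states$Population <- as.numeric(gsub(",", "", states$Population))
states
```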

A full example: US population table

Steps:

  • read_html(): read in an HTML page from a URL
  • html_element()/ html_elements(): select the element(s) of interest
  • html_text()/ html_table(): parse the element(s) to text or a data frame

url <- "https://en.wikipedia.org/wiki/United_States"
html <- read_html(url)
tables <- html |> html_elements("table") |> html_table()
tables[[2]]
# A tibble: 10 × 2
   State          `Population (millions)`
   <chr>                            <dbl>
 1 California                        39.4
 2 Texas                             31.3
 3 Florida                           23.4
 4 New York                          19.9
 5 Pennsylvania                      13.1
 6 Illinois                          12.7
 7 Ohio                              11.9
 8 Georgia                           11.2
 9 North Carolina                    11  
10 Michigan                          10.1
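
Picking a table by position can break when the page changes. A hedged alternative is to pick the table by its column names. The sketch below uses an offline stand-in page: the first table is made up, and the second reuses two rows from the output above.

```r
library(rvest)

# Offline stand-in for a page that contains several tables.
# The first table is a made-up placeholder.
page <- minimal_html("
<table>
<tr><th>Year</th><th>Event</th></tr>
<tr><td>1776</td><td>Independence</td></tr>
</table>
<table>
<tr><th>State</th><th>Population (millions)</th></tr>
<tr><td>California</td><td>39.4</td></tr>
<tr><td>Texas</td><td>31.3</td></tr>
</table>
")

tables <- page |> html_elements("table") |> html_table()

# Keep only the tables that have a 'State' column, then take the first.
has_state <- vapply(tables, function(tbl) "State" %in% names(tbl), logical(1))
population <- tables[has_state][[1]]
population
```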

A full example: US population table

You may also use other CSS selectors. For example, the CSS selector for the class “bar-chart” is “.bar-chart”:

html |> html_element(".bar-chart") |> html_table()

Learn more about the CSS selectors: https://flukeout.github.io/

Or use XPath:

  • find the table element in the Elements tab, right click and select “Copy” > “Copy XPath”

html |> html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[2]') |> html_table()
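
To try XPath without a network connection, here is a small sketch on a made-up page. The id content and the table text are assumptions for the example; the CSS selector on the last line is one possible equivalent of the XPath expression.

```r
library(rvest)

# A made-up page with two tables inside one <div>.
page <- minimal_html("
<div id='content'>
<table><tr><td>first table</td></tr></table>
<table><tr><td>second table</td></tr></table>
</div>
")

# XPath: the second <table> child of the element whose id is 'content'.
by_xpath <- page |>
  html_element(xpath = '//*[@id="content"]/table[2]') |>
  html_text()

# A CSS selector that targets the same element.
by_css <- page |>
  html_element("#content > table:nth-of-type(2)") |>
  html_text()
```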

List of minimum annual leave by country

url <- "https://en.wikipedia.org/wiki/List_of_minimum_annual_leave_by_country"
read_html(url) |>
  html_elements("h2") |> 
  html_text()
[1] "Contents"       "Methodology"    "Countries"      "By country"    
[5] "See also"       "References"     "External links"

List of minimum annual leave by country

raw <- read_html(url) |> html_elements("table") |> html_table()
raw <- raw[[2]]
raw
# A tibble: 193 × 5
   Country  Paid vacation days b…¹ Paid public holidays…² Total paid leave(fiv…³
   <chr>    <chr>                  <chr>                  <chr>                 
 1 Afghani… 20                     15                     35                    
 2 Albania  28                     12                     40                    
 3 Algeria  30                     11                     41                    
 4 Andorra  31                     14                     45                    
 5 Angola   22                     11                     33                    
 6 Antigua… 12                     11                     23                    
 7 Argenti… 10                     19[12]                 29                    
 8 Armenia  20                     12[14][15]             36                    
 9 Austral… 20                     10                     30                    
10 Austria  25                     13                     38                    
# ℹ 183 more rows
# ℹ abbreviated names: ¹​`Paid vacation days by year (five-day workweek)[1][2]`,
#   ²​`Paid public holidays (bank holidays)[3][4]`,
#   ³​`Total paid leave(five-day workweek)`
# ℹ 1 more variable: Notes <chr>

Alternatively, select the table with the CSS selector for the class “wikitable”:

raw <- read_html(url) |> html_element(".wikitable") |> html_table()

List of minimum annual leave by country

# The new names follow the scraped column order:
# country, paid vacation, public holidays, total, notes.
holiday_df <- raw |> janitor::clean_names() |> dplyr::select(1:5)
names(holiday_df) <- c("country", "paid_vacation", "public_holiday", "total", "notes")
holiday_df
# A tibble: 193 × 5
   country             paid_vacation public_holiday total notes                 
   <chr>               <chr>         <chr>          <chr> <chr>                 
 1 Afghanistan         20            15             35    "Employees are entitl…
 2 Albania             28            12             40    "Employees are entitl…
 3 Algeria             30            11             41    "The paid annual leav…
 4 Andorra             31            14             45    "Workers are entitled…
 5 Angola              22            11             33    "The annual leave for…
 6 Antigua and Barbuda 12            11             23    "The annual leave for…
 7 Argentina           10            19[12]         29    "14 calendar days (10…
 8 Armenia             20            12[14][15]     36    "Generally, the durat…
 9 Australia           20            10             30    "An employee is entit…
10 Austria             25            13             38    "Employees with fewer…
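
Some values in the output above, such as 19[12] and 12[14][15], still carry Wikipedia footnote markers, so the columns are read as text. One possible cleaning step, sketched below with base R on a few values copied from that output, removes the bracketed markers before converting to numbers. The vector names are illustrative.

```r
# A few values copied from the output above, including footnote markers.
public_holiday <- c("11", "19[12]", "12[14][15]", "10")

# Remove every bracketed footnote marker such as [12], then convert to numbers.
public_holiday_clean <- as.numeric(gsub("\\[[0-9]+\\]", "", public_holiday))
public_holiday_clean
```

The same gsub() call could be applied to each count column of holiday_df before any numeric summary.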
# ℹ 183 more rows