Web-Scraping

Required reading

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions

Pre-quiz

  1. Name three reasons why we should be respectful when scraping data from websites.
  2. What features of a website do we typically take advantage of when we parse the code?
  3. What are three advantages and three disadvantages of scraping compared with using an API?

Introduction

Web-scraping is a way to get data from websites into R. Rather than going to a website ourselves through a browser, we write code that does it for us. This opens up a lot of data to us, but it is typically not data that is being made available for this purpose, and so it is important to be respectful of it. While web-scraping is generally not illegal, its legality depends on the jurisdiction and on the specifics of what you are doing, and so it is also important to be mindful of this. And finally, web-scraping imposes a cost on the website host, and so it is important to reduce this cost to the extent possible.

That all said, web-scraping is an invaluable source of data. But the datasets it yields are typically by-products of someone trying to achieve another aim. For instance, a retailer may have a website with their products and their prices. That website has not been created deliberately as a source of data, but we can scrape it to create a dataset. As such, the following principles guide my web-scraping.

  1. Avoid it. Try to use an API wherever possible.
  2. Abide by their desires. Some websites have a file ‘robots.txt’ that contains information about what they are comfortable with scrapers doing, for instance ‘https://www.google.com/robots.txt’. If they have one of these then you should read it and abide by it.
  3. Reduce the impact.
    • Firstly, slow down your scraper: for instance, rather than having it visit the website every second, have it pause between visits (using Sys.sleep()). If you’re only after a few hundred files then why not just have it visit once a minute, running in the background overnight? (A small sketch of this appears after this list.)
    • Secondly, consider the timing of when you run the scraper. For instance, if it’s a retailer then why not set your script to run from 10pm through to the morning, when fewer customers are likely to need the site? If it’s a government website and they have a big monthly release then why not avoid that day?
  4. Take only what you need. For instance, don’t scrape the entirety of Wikipedia if all you need is to know the names of the 10 largest cities in Canada. This reduces the impact on their website and allows you to more easily justify what you are doing.
  5. Only scrape once. Save everything as you go so that you don’t have to re-collect data. Similarly, once you have the data, you should keep that separate and not modify it. Of course, if you need data over time then you will need to go back, but this is different to needlessly re-scraping a page.
  6. Don’t republish the pages that you scraped. (This is in contrast to datasets that you create from it.)
  7. Take ownership and ask permission if possible. At a minimum level your scripts should have your contact details in them. Depending on the circumstances, it may be worthwhile asking for permission before you scrape.
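
To make points 3 and 5 a little more concrete, here is a minimal sketch of what a polite scraping loop might look like, assuming a hypothetical vector of pages called pages_to_get and an existing inputs folder; it only visits pages that have not already been saved, and it pauses between requests.

library(rvest)

# A hypothetical set of pages that we want local copies of
pages_to_get <- c("https://rohanalexander.com/bookshelf.html")

for (page in pages_to_get) {
  # Work out where the local copy should be saved, e.g. inputs/bookshelf.html
  save_to <- file.path("inputs", basename(page))
  # Only visit the website if we don't already have a saved copy
  if (!file.exists(save_to)) {
    write_html(read_html(page), save_to)
    # Pause for a minute between requests to reduce the load on the host
    Sys.sleep(60)
  }
}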

Getting started

Web-scraping is possible because we can take advantage of the underlying structure of a webpage. We use patterns in the HTML/CSS to get the data that we want. To look at the underlying HTML/CSS you can either: 1) open a browser, right-click, and choose something like ‘Inspect’; or 2) save the website and then open it with a text editor rather than a browser.

HTML is a markup language made up of matching tags. So if you want text to be bold then you would use something like:

<b>My bold text</b>

Similarly, if you want a list then you start and end the list as well as each item.

<ul>
  <li>Learn webscraping</li>
  <li>Do data science</li>
  <li>Profit</li>
</ul>

When web-scraping we will search for these tags.

To get started, this is some HTML/CSS from my website. Let’s say that we want to grab my name from it. We can see that the name is in bold, so we probably want to focus on that feature and use it to extract the name.

website_extract <- "<p>Hi, I’m <b>Rohan</b> Alexander.</p>"

We will use the rvest package (Wickham 2019).

# install.packages("rvest")
library(rvest)

rohans_data <- read_html(website_extract)

rohans_data
{html_document}
<html>
[1] <body><p>Hi, I’m <b>Rohan</b> Alexander.</p></body>

The language used by rvest for tags is ‘nodes’, so we will focus on the bold nodes. By default, html_nodes() returns the tags as well as the text; to focus on just the text that they contain, we use html_text().

rohans_data %>% 
  html_nodes("b")
{xml_nodeset (1)}
[1] <b>Rohan</b>
first_name <- 
  rohans_data %>% 
  html_nodes("b") %>%
  html_text()

first_name
[1] "Rohan"

The result is that we learn my first name.
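
In the same way, if we wanted the items from the to-do list example above, then we could search for the list-item nodes. A quick sketch, using the same functions:

to_do_list <- read_html("<ul><li>Learn webscraping</li><li>Do data science</li><li>Profit</li></ul>")

to_do_list %>% 
  html_nodes("li") %>% 
  html_text()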

Case study - Rohan’s books

Introduction

In this case study we are going to scrape a list of books that I own, clean it, and look at the distribution of the first letters of author surnames. It is slightly more complicated than the example above, but the underlying approach is the same - download the website, look for the nodes of interest, extract the information, clean it.

Gather

Again, the key library that we are using is rvest. This makes it easier to download a website, and to then navigate the html to find the aspects that we are interested in. You should create a new project in a new folder (File -> New Project). Within that new folder you should make three new folders: inputs, outputs, and scripts.
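
If you prefer, one way to create those folders from the R console, rather than by pointing and clicking, is something like the following (the inputs/my_website sub-folder is where the script below will save the raw html):

dir.create("inputs")
dir.create("inputs/my_website")
dir.create("outputs")
dir.create("scripts")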

In the scripts folder you should write and save a script along these lines. This script loads the libraries that we need, then visits my website, and saves a local copy.

#### Contact details ####
# Title: Get data from rohanalexander.com
# Purpose: This script gets data from Rohan's website about the books that he 
# owns. It calls his website and then saves the dataset to inputs.
# Author: Rohan Alexander
# Contact: rohan.alexander@utoronto.ca
# Last updated: 20 May 2020


#### Set up workspace ####
library(rvest)
library(tidyverse)


#### Get html ####
rohans_data <- read_html("https://rohanalexander.com/bookshelf.html")
# This takes a website as an input and will read it into R, in the same way that we 
# can read a, say, CSV into R.

write_html(rohans_data, "inputs/my_website/raw_data.html") 
# Always save your raw dataset as soon as you get it so that you have a record 
# of it. This is the equivalent of, say, write_csv() that we have used earlier.

Clean

Now we need to navigate the HTML to get the aspects that we want, and to then put them into some sensible structure. I always try to get the data into a tibble as early as possible. While it’s possible to work with the nested data, I move to a tibble so that I can use the usual dplyr verbs that I’m used to.

In the scripts folder you should write and save a new R script along these lines. First, we need to add the top matter, read in the libraries and the data that we scraped.

#### Contact details ####
# Title: Clean data from rohanalexander.com
# Purpose: This script cleans data that was downloaded in 01-get_data.R.
# Author: Rohan Alexander
# Contact: rohan.alexander@utoronto.ca
# Pre-requisites: Need to have run 01_get_data.R and have saved the data.
# Last updated: 20 May 2020


#### Set up workspace ####
library(tidyverse)
library(rvest)

rohans_data <- read_html("inputs/my_website/raw_data.html")

rohans_data
{html_document}
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n\n<!--radix_placeholder_front_matter-->\n\n<script id="distill-fr ...

Now we need to use the html tags to identify the data that we are interested in, and then convert it to a tibble. If you look at the website, then you should notice that we likely want to focus on list items (Figure 1).

Some of Rohan's books

Figure 1: Some of Rohan’s books

Let’s look at the source (Figure 2).

Source code for top of the page

Figure 2: Source code for top of the page

There’s a lot of debris, but scrolling down we eventually get to a list (Figure 3).

Source code for list

Figure 3: Source code for list

The tag for a list item is ‘li’, so we modify the earlier code to focus on that and to get the text.

#### Clean data ####
# Identify the lines that have books on them based on the list html tag
text_data <- rohans_data %>%
  html_nodes("li") %>%
  html_text()

all_books <- tibble(books = text_data)

head(all_books)
# A tibble: 6 x 1
  books                                                                         
  <chr>                                                                         
1 "-“A Little Life”, Hanya Yanighara. Recommended by Lauren."                   
2 "“The Andromeda Strain”, Michael Crichton."                                   
3 "“Is There Life After Housework”, Don Aslett.\nGot given this at the Museum o…
4 "“The Chosen”, Chaim Potok."                                                  
5 "“The Forsyth Saga”, John Galsworthy."                                        
6 "“Freakonomics”, Steven Levitt and Stephen Dubner."                           

We now need to clean the data. First we want to separate the title and the author.

# All content is just one string, so need to separate title and author
all_books <-
  all_books %>%
  separate(books, into = c("title", "author"), sep = "”")

# Remove leading comma and clean up the titles a little
all_books <-
  all_books %>%
  mutate(author = str_remove_all(author, "^, "),
         author = str_squish(author),
         title = str_remove(title, "“"),
         title = str_remove(title, "^-")
         )

head(all_books)
# A tibble: 6 x 2
  title                  author                                                 
  <chr>                  <chr>                                                  
1 A Little Life          Hanya Yanighara. Recommended by Lauren.                
2 The Andromeda Strain   Michael Crichton.                                      
3 Is There Life After H… Don Aslett. Got given this at the Museum of Clean in P…
4 The Chosen             Chaim Potok.                                           
5 The Forsyth Saga       John Galsworthy.                                       
6 Freakonomics           Steven Levitt and Stephen Dubner.                      

Finally, some specific cleaning is needed.

# Some authors have comments after their name, so need to get rid of them, although there are some exceptions that will not work
# J. K. Rowling.
# M. Mitchell Waldrop.
# David A. Price
all_books <-
  all_books %>%
  mutate(author = str_replace_all(author,
                              c("J. K. Rowling." = "J K Rowling.",
                                "M. Mitchell Waldrop." = "M Mitchell Waldrop.",
                                "David A. Price" = "David A Price")
                              )
         ) %>%
  separate(author, into = c("author_correct", "throw_away"), sep = "\\.", extra = "drop") %>%
  select(-throw_away) %>%
  rename(author = author_correct)

# Some books have multiple authors, so need to separate them
# One has multiple authors:
# "Daniela Witten, Gareth James, Robert Tibshirani, and Trevor Hastie"
all_books <-
  all_books %>%
  mutate(author = str_replace(author,
                              "Daniela Witten, Gareth James, Robert Tibshirani, and Trevor Hastie",
                              "Daniela Witten and Gareth James and Robert Tibshirani and Trevor Hastie")) %>%
  separate(author, into = c("author_first", "author_second", "author_third", "author_fourth"), sep = " and ", fill = "right") %>%
  pivot_longer(cols = starts_with("author_"),
               names_to = "author_position",
               values_to = "author") %>%
  select(-author_position) %>%
  filter(!is.na(author))

head(all_books)
# A tibble: 6 x 2
  title                         author          
  <chr>                         <chr>           
1 A Little Life                 Hanya Yanighara 
2 The Andromeda Strain          Michael Crichton
3 Is There Life After Housework Don Aslett      
4 The Chosen                    Chaim Potok     
5 The Forsyth Saga              John Galsworthy 
6 Freakonomics                  Steven Levitt   

It looks like there are some leftover rows at the end because the list finishes with a ‘best of’ section. I’ll just get rid of those manually because they are not the focus.

all_books <- 
  all_books %>% 
  slice(1:118)

Explore

Finally, now that we have the data we may as well try to do something with it, so let’s look at the distribution of the first letters of the author names.

all_books %>% 
  mutate(
    first_letter = str_sub(author, 1, 1)
    ) %>% 
  group_by(first_letter) %>% 
  count()
# A tibble: 21 x 2
# Groups:   first_letter [21]
   first_letter     n
   <chr>        <int>
 1 ""               1
 2 "A"              8
 3 "B"              5
 4 "C"              4
 5 "D"             10
 6 "E"              3
 7 "F"              1
 8 "G"             10
 9 "H"              6
10 "I"              1
# … with 11 more rows
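
If we wanted to see that distribution as a graph rather than a table, then a minimal sketch using ggplot2 (loaded as part of the tidyverse) might be:

all_books %>% 
  mutate(first_letter = str_sub(author, 1, 1)) %>% 
  ggplot(aes(x = first_letter)) +
  geom_bar() +
  labs(x = "First letter of author's name",
       y = "Number of books") +
  theme_minimal()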

Case study - Canadian Prime Ministers

Introduction

In this case study we are interested in how long Canadian prime ministers lived, based on the year that they were born. We will scrape data from Wikipedia, clean it, and then make a graph.

The key library that we will use for scraping is rvest. This adds a lot of functions that make life easier. That said, every website is different and websites change over time, so each scrape will largely be bespoke, even if you can borrow some code from earlier projects that you have completed. It is completely normal to feel frustrated at times. It helps to begin with an end in mind.

To that end, let’s generate some simulated data. Ideally, we want a table that has a row for each prime minister, a column for their name, and a column each for the birth and death years. If they are still alive, then that death year can be empty. We know that birth and death years should be somewhere between 1700 and 1990, and that death year should be larger than birth year. Finally, we also know that the years should be integers, and the names should be characters. So, we want something that looks roughly like this:

library(babynames)
library(tidyverse)

simulated_dataset <- 
  tibble(prime_minister = sample(x = babynames %>% filter(prop > 0.01) %>% 
                                   select(name) %>% unique() %>% unlist(), 
                                 size = 10, replace = FALSE),
         birth_year = sample(x = c(1700:1990), size = 10, replace = TRUE),
         years_lived = sample(x = c(50:100), size = 10, replace = TRUE),
         death_year = birth_year + years_lived) %>% 
  select(prime_minister, birth_year, death_year, years_lived) %>% 
  arrange(birth_year)

head(simulated_dataset)
# A tibble: 6 x 4
  prime_minister birth_year death_year years_lived
  <chr>               <int>      <int>       <int>
1 Marie                1725       1806          81
2 Robert               1741       1823          82
3 Hannah               1764       1829          65
4 Margaret             1885       1970          85
5 Ashley               1892       1961          69
6 Donna                1912       1965          53

One of the advantages of generating a simulated dataset is that if you are working in groups then one person can start making the graph, using the simulated dataset, while the other person gathers the data. In terms of a graph, we want something like Figure 4.

Sketch of planned graph.

Figure 4: Sketch of planned graph.
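
For instance, while one person gathers the data, another could start on a rough version of that graph using the simulated dataset. A sketch along these lines, using the column names from the simulated data above, should be close to what we will eventually need:

simulated_dataset %>% 
  mutate(prime_minister = as_factor(prime_minister)) %>% 
  ggplot(aes(x = birth_year, xend = death_year,
             y = prime_minister, yend = prime_minister)) +
  geom_segment() +
  labs(x = "Year of birth",
       y = "Prime minister") +
  theme_minimal()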

Gather

We are starting with a question that is of interest, which is how long each Canadian prime minister lived. As such, we need to identify a source of data. While there are likely to be plenty of data sources that have the births and deaths of each prime minister, we want one that we can trust, and as we are going to be scraping, we want one that has some structure to it. The Wikipedia page (https://en.wikipedia.org/wiki/List_of_prime_ministers_of_Canada) fits both these criteria. As it is a popular page, the information is more likely to be correct, and the data are available in a table.

We load the library and then we read in the data from the relevant page. The key function here is read_html(), which you can use in the same way as, say, read_csv(), except that it takes an html page as an input. Once you call read_html() then the page is downloaded to your own computer, and it is usually a good idea to save this, using write_html(), as it is your raw data. Saving it also means that we don’t have to keep visiting the website when we want to start again with our cleaning, and so it is part of being polite. However, it is likely not our property (in the case of Wikipedia, we might be okay), and so you should probably not share it.

raw_data <- read_html("https://en.wikipedia.org/wiki/List_of_prime_ministers_of_Canada")
write_html(raw_data, "inputs/wiki/pms.html") # Note that we save the file as a html file.

Clean

Websites are made up of html, which is a markup language. We are looking for patterns in the mark-up that we can use to help us get closer to the data that we want. This is an iterative process and requires a lot of trial and error. Even simple examples will take time. You can look at the html by using a browser, right clicking, and then selecting view page source. Similarly, you could open the html file using a text editor.

By inspection

We are looking for patterns that we can use to select the information that is of interest - names, birth year, and death year. When we look at the html it looks like there is something going on with <tr>, and then <td> (thanks to Thomas Rosenthal for identifying this). We select those nodes using html_nodes(), which takes the tags as an input. If you only want the first one then there is a singular version, html_node().

# Read in our saved data
raw_data <- read_html("inputs/wiki/pms.html")

# We can parse tags in order
parse_data_inspection <- 
  raw_data %>% 
  html_nodes("tr") %>% 
  html_nodes("td") %>% 
  html_text() # html_text removes any remaining html tags

# But this code does exactly the same thing - the nodes are just pushed into 
# the one function call
parse_data_inspection <- 
  raw_data %>% 
  html_nodes("tr td") %>% 
  html_text()

head(parse_data_inspection)
[1] "Abbreviation key:"                                                                                                                                                                                                                              
[2] "No.: Incumbent number, Min.: Ministry, Refs: References\n"                                                                                                                                                                                      
[3] "Colour key:"                                                                                                                                                                                                                                    
[4] "\n\n  Liberal Party of Canada\n \n  Historical Conservative parties (including Liberal-Conservative, Conservative (Historical),     Unionist, National Liberal and Conservative, Progressive Conservative) \n  Conservative Party of Canada\n\n"
[5] "Provinces key:"                                                                                                                                                                                                                                 
[6] "AB: Alberta, BC: British Columbia, MB: Manitoba, NS: Nova Scotia,ON: Ontario, QC: Quebec, SK: Saskatchewan\n"                                                                                                                                   

At this point our data is in a character vector. We want to convert it to a table and reduce it down to just the information that we want. The key that is going to allow us to do this is the fact that there seems to be a blank line (which shows up in the parsed text as "\n\n") before the rows that we need. So, once we identify that line then we can filter to just the line below it!

parsed_data <- 
  tibble(raw_text = parse_data_inspection) %>% # Convert the character vector to a table
  mutate(is_PM = if_else(raw_text == "\n\n", 1, 0), # Look for the blank line that is 
         # above the row that we want
         is_PM = lag(is_PM, n = 1)) %>% # Identify the actual row that we want
  filter(is_PM == 1) # Just get the rows that we want

head(parsed_data)
# A tibble: 6 x 2
  raw_text                                                                 is_PM
  <chr>                                                                    <dbl>
1 "\nSir John A. MacDonald(1815–1891)MP for Kingston, ON\n"                    1
2 "\nAlexander Mackenzie(1822–1892)MP for Lambton, ON\n"                       1
3 "\nSir John A. MacDonald(1815–1891)MP for Victoria, BC until 1882MP for…     1
4 "\nSir John Abbott(1821–1893)Senator for Quebec\n"                           1
5 "\nSir John Thompson(1845–1894)MP for Antigonish, NS\n"                      1
6 "\nSir Mackenzie Bowell(1823–1917)Senator for Ontario\n"                     1

Using the selector gadget

If you are comfortable with html then you might be able to see patterns, but one tool that may help is the SelectorGadget: https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html. This allows you to pick and choose the elements that you want, and then gives you the input to give to html_nodes() (Figure 5)

Using the Selector Gadget to identify the tag, as at 13 March 2020.

Figure 5: Using the Selector Gadget to identify the tag, as at 13 March 2020.

# Read in our saved data
raw_data <- read_html("inputs/wiki/pms.html")

# We can parse tags in order
parse_data_selector_gadget <- 
  raw_data %>% 
  html_nodes("td:nth-child(3)") %>% 
  html_text() # html_text removes any remaining html tags

head(parse_data_selector_gadget)
[1] "\nSir John A. MacDonald(1815–1891)MP for Kingston, ON\n"                                                            
[2] "\nAlexander Mackenzie(1822–1892)MP for Lambton, ON\n"                                                               
[3] "\nSir John A. MacDonald(1815–1891)MP for Victoria, BC until 1882MP for Carleton, ON until 1887MP for Kingston, ON\n"
[4] "\nSir John Abbott(1821–1893)Senator for Quebec\n"                                                                   
[5] "\nSir John Thompson(1845–1894)MP for Antigonish, NS\n"                                                              
[6] "\nSir Mackenzie Bowell(1823–1917)Senator for Ontario\n"                                                             

In this case there is one prime minister - Robert Borden - who changed party, and we would need to filter away that row: "\nUnionist Party\n".
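
A sketch of how that filtering might look, assuming we again put the character vector into a tibble first:

parsed_data_selector <- 
  tibble(raw_text = parse_data_selector_gadget) %>% 
  filter(raw_text != "\nUnionist Party\n")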

Clean data

Now that we have the parsed data, we need to clean it to match what we wanted. In particular we want a names column, as well as columns for birth year and death year. We will use separate() to take advantage of the fact that it looks like the dates are distinguished by brackets.

initial_clean <- 
  parsed_data %>% 
  separate(raw_text, 
            into = c("Name", "not_name"), 
            sep = "\\(",
            remove = FALSE) %>% # The remove = FALSE option here means that we 
  # keep the original column that we are separating.
  separate(not_name, 
            into = c("Date", "all_the_rest"), 
            sep = "\\)",
            remove = FALSE)

head(initial_clean)
# A tibble: 6 x 6
  raw_text           Name     not_name          Date  all_the_rest         is_PM
  <chr>              <chr>    <chr>             <chr> <chr>                <dbl>
1 "\nSir John A. Ma… "\nSir … "1815–1891)MP fo… 1815… "MP for Kingston, O…     1
2 "\nAlexander Mack… "\nAlex… "1822–1892)MP fo… 1822… "MP for Lambton, ON…     1
3 "\nSir John A. Ma… "\nSir … "1815–1891)MP fo… 1815… "MP for Victoria, B…     1
4 "\nSir John Abbot… "\nSir … "1821–1893)Senat… 1821… "Senator for Quebec…     1
5 "\nSir John Thomp… "\nSir … "1845–1894)MP fo… 1845… "MP for Antigonish,…     1
6 "\nSir Mackenzie … "\nSir … "1823–1917)Senat… 1823… "Senator for Ontari…     1

Finally, we need to clean up the columns.

cleaned_data <- 
  initial_clean %>% 
  select(Name, Date) %>% 
  separate(Date, into = c("Birth", "Died"), sep = "–", remove = FALSE) %>% # The 
  # PMs who have died have their birth and death years separated by an en dash ("–"), 
  # not a regular hyphen, so you may need to copy/paste it from the data.
  mutate(Birth = str_remove(Birth, "b. ")) %>% # Alive PMs have slightly different format
  select(-Date) %>% 
  mutate(Name = str_remove(Name, "\n")) %>% # Remove some html tags that remain
  mutate_at(vars(Birth, Died), ~as.integer(.)) %>% # Change birth and death to integers
  mutate(Age_at_Death = Died - Birth) %>%  # Add column of the number of years they lived
  distinct() # Some of the PMs had two goes at it.

head(cleaned_data)
# A tibble: 6 x 4
  Name                  Birth  Died Age_at_Death
  <chr>                 <int> <int>        <int>
1 Sir John A. MacDonald  1815  1891           76
2 Alexander Mackenzie    1822  1892           70
3 Sir John Abbott        1821  1893           72
4 Sir John Thompson      1845  1894           49
5 Sir Mackenzie Bowell   1823  1917           94
6 Sir Charles Tupper     1821  1915           94

Explore

At this point we’d like to make a graph that illustrates how long each prime minister lived. If they are still alive then we would like them to run to the end, but we would like to colour them differently.

cleaned_data %>% 
  mutate(still_alive = if_else(is.na(Died), "Yes", "No"),
         Died = if_else(is.na(Died), as.integer(2020), Died)) %>% 
  mutate(Name = as_factor(Name)) %>% 
  ggplot(aes(x = Birth, 
             xend = Died,
             y = Name,
             yend = Name, 
             color = still_alive)) +
  geom_segment() +
  labs(x = "Year of birth",
       y = "Prime minister",
       color = "PM is alive",
       title = "Canadian Prime Minister, by year of birth") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
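
If we wanted to keep a copy of the graph then we could save it to the outputs folder with ggsave(), which by default saves the most recent plot; the file name here is just an example:

ggsave("outputs/canadian_pms_by_birth_year.pdf", width = 8, height = 6)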

References

Wickham, Hadley. 2019. Rvest: Easily Harvest (Scrape) Web Pages. https://CRAN.R-project.org/package=rvest.