Getting text data

Text as data

Aspects of this section have been previously published.

Required reading

Recommended reading

Recommended viewing

Key concepts/skills/etc

Key libraries

Key functions/etc



Text data is all around us, and in many cases is some of the earliest data that we are exposed to. Recent increases in computational power, the development of new methods, and the enormous availability of text mean that there has been a great deal of interest in using text as data. Early methods tended to focus, essentially, on converting text into numbers and then analysing them using traditional methods. More recent methods have begun to take advantage of the structure that is inherent in text to draw out additional meaning. The difference is perhaps akin to a child who can group similar colors, compared with a child who knows what objects are; although both crocodiles and trees are green, and you can do something with that knowledge, you can do more by knowing that a crocodile could eat you, and a tree probably won’t.

In this section we cover a variety of techniques designed to equip you with the basics of using text as data. One of the great things about text data is that it is typically not generated for the purposes of our analysis. That’s great because it removes one of the unobservable variables that we typically have to worry about. The trade-off is that we typically have to do a bunch more work to get it into a form that we can work with.

Getting text data

Text as data is an exciting tool to apply. But many guides assume that you already have a nice dataset. Because we’ve focused on workflow in these notes, we know that’s not likely to be true! In this section we will scrape some text from a website. We’ve already seen examples of scraping, but in general those were focused on exploiting tables in the website. Here we’re going to instead focus on paragraphs of text, hence we’ll focus on different html/css tags.

We’re going to use the rvest package to make it easier to scrape data. We’re also going to use the purrr package to apply a function to a bunch of different URLs. For those of you with a little bit of programming experience, this is an alternative to using a for loop. For those of you with a bit of CS, this is a package that adds functional programming to R.
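As a minimal sketch of what focusing on paragraphs rather than tables looks like, the following parses a small, made-up page with rvest and pulls out the text of each paragraph inside the content section. The HTML and the names toy_page and toy_paragraphs are invented for illustration; html_element() and html_elements() are the newer names for html_node() and html_nodes() in rvest 1.0 and later.

```r
library(rvest)

# A made-up page, so that nothing needs to be downloaded
toy_page <- minimal_html('
  <div id="content">
    <h1>Minutes</h1>
    <p>First paragraph of text.</p>
    <p>Second paragraph of text.</p>
  </div>
  <div id="footer"><p>Not the content we want.</p></div>
')

toy_paragraphs <- toy_page %>%
  html_element("#content") %>% # Keep only the part with id "content"
  html_elements("p") %>% # Find every paragraph tag within it
  html_text2() # Extract the text of each paragraph

toy_paragraphs
```

The footer paragraph is dropped because we restricted ourselves to the content section before looking for paragraph tags.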


library(rvest)
library(tidyverse)

# Some websites
address_to_visit <- c("")

# Save names
save_name <- address_to_visit %>% 
  str_remove("") %>% 
  str_remove("\\.html$") %>%
  str_remove("20[:digit:]{2}/") %>% 
  str_c("inputs/rba/", ., ".csv")
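To see what that chain of string operations does, here is a small sketch that applies the same steps to a made-up address. The example.com URL and the names toy_address and toy_save_name are assumptions for illustration only; they are not the addresses used in these notes.

```r
library(stringr)

# A made-up address with the same shape: a prefix, a year folder, then a dated page
toy_address <- "https://example.com/minutes/2020/2020-03-03.html"

toy_save_name <- toy_address %>%
  str_remove(fixed("https://example.com/minutes/")) %>% # Drop the common prefix
  str_remove("\\.html$") %>% # Drop the file extension
  str_remove("20[:digit:]{2}/") %>% # Drop the year folder, e.g. "2020/"
  str_c("inputs/rba/", ., ".csv") # Build the path that we will save to

toy_save_name
```

Each step strips one piece of the address, so what is left is just the date, which then becomes the name of the saved file.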

Create the function that will visit address_to_visit and save to save_name files.

visit_address_and_save_content <-
  function(name_of_address_to_visit,
           name_of_file_to_save_as) {
    # The function takes two inputs: the address to visit and the file to save to
    read_html(name_of_address_to_visit) %>% # Go to the website and read the html
      html_node("#content") %>% # Find the content part
      html_text() %>% # Extract the text of the content part
      write_lines(name_of_file_to_save_as) # Save as a text file
    print(paste("Done with", name_of_address_to_visit, "at", Sys.time()))
    # Helpful so that you know progress when running it on all the records
    Sys.sleep(sample(30:60, 1)) # Space out each request by somewhere between
    # 30 and 60 seconds each so that we don't overwhelm their server
  }

# If there is an error then ignore it and move to the next one
visit_address_and_save_content <-
  safely(visit_address_and_save_content)

We now apply that function to our list of URLs.

# Walk through the addresses and apply the function to each
walk2(address_to_visit,
      save_name,
      ~ visit_address_and_save_content(.x, .y))

The result is a bunch of files with saved text data.
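If the combination of purrr functions is unfamiliar, the following sketch shows the same pattern without touching the internet. The function save_text() is a made-up stand-in for the scraping function, and it fails on purpose for one input so that we can see what wrapping a function in safely() does when we walk over paired inputs with walk2().

```r
library(purrr)

# A made-up stand-in for the scraping function: it writes a string to a
# file, and deliberately fails for one particular input
save_text <- function(text_to_save, file_to_save_as) {
  if (text_to_save == "bad") stop("Could not read this one")
  writeLines(text_to_save, file_to_save_as)
}

# Wrap it so that an error is captured instead of stopping everything
safe_save_text <- safely(save_text)

toy_inputs <- c("first page", "bad", "third page")
toy_files <- file.path(tempdir(), paste0("toy_", 1:3, ".txt"))
unlink(toy_files) # Start from a clean slate

# Pair each input with its file name, as a for loop over an index would
walk2(toy_inputs, toy_files, ~ safe_save_text(.x, .y))

file.exists(toy_files) # TRUE FALSE TRUE
```

The second file is never written, but the third one is, which is the behaviour that we want when one page out of many is broken.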

In this case we used scraping, but there are, of course, many other ways to get text data. We may be able to use APIs, for instance, in the case of the Airbnb dataset that we examined earlier in the notes. If you are lucky then it may simply be that there is a column that contains text data in your dataset.

Preparing text datasets

This section draws on Sharla Gelfand’s blog post, linked in the required readings.

As much as I would like to stick with Australian economics and politics examples, I realise that this is probably only of limited interest to most of you. As such, in this section we will consider a dataset of Sephora reviews. Please read Sharla’s blog post for another take on this dataset.

In this section we assume that there is some text data that you have gathered. At this point we need to change it into a form that we can work with. For some applications this will be counts of words. For others it may be some variant of this. The dataset that we are going to use is from Sephora and was scraped by Connie; I originally became aware of it because of Sharla.

First let’s read in the data.

# This code is taken from Sharla Gelfand's blog post
library(jsonlite)

crying <- fromJSON("",
  simplifyDataFrame = TRUE
)

crying <- as_tibble(crying[["reviews"]])

head(crying)
names(crying)

# A tibble: 6 x 6
  date  product_info$br… $name $type $url  review_body review_title stars
  <chr> <chr>            <chr> <chr> <chr> <chr>       <chr>        <chr>
1 29 M… Too Faced        Bett… Masc… http… "Now I can… AWESOME      5 st…
2 29 S… Too Faced        Bett… Masc… http… "This hold… if you're s… 5 st…
3 23 M… Too Faced        Bett… Masc… http… "I just bo… Hate it      1 st…
4 15 A… Too Faced        Bett… Masc… http… "To start … Nearly perf… 5 st…
5 21 S… Too Faced        Bett… Masc… http… "This masc… Amazing!!    5 st…
6 30 M… Too Faced        Bett… Masc… http… "Let's tal… Tricky but … 5 st…
# … with 1 more variable: userid <dbl>
[1] "date"         "product_info" "review_body"  "review_title" "stars"       
[6] "userid"      

We’ll focus on the review_body variable and the number of stars, stars, that the reviewer gave. Most of them are 5 stars, so we’ll just focus on whether or not the review is five stars.

crying <- 
  crying %>% 
  select(review_body, stars) %>% 
  mutate(stars = str_remove(stars, " stars?"),  # The question mark at the end means it'll get rid of both ' star' and ' stars'.
         stars = as.integer(stars)
         ) %>% 
  mutate(five_stars = if_else(stars == 5, 1, 0))

table(crying$stars)


 1  2  3  4  5 
 6  2  4 14 79 
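As a quick check of what that regular expression does, here is a sketch on a few made-up values. The vector toy_stars is invented for illustration; the pattern is the same one used above.

```r
library(stringr)

toy_stars <- c("5 stars", "1 star", "4 stars")

# "s?" means "an optional s", so both " star" and " stars" are removed
toy_stars_cleaned <- as.integer(str_remove(toy_stars, " stars?"))

toy_stars_cleaned # 5 1 4
```

Without the question mark, "1 star" would be left untouched and as.integer() would turn it into NA.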

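In this example we are going to split everything into separate words. When we do this, the default tokenizer does a little more than search for spaces: it also converts everything to lower case and strips punctuation. Even so, what other types of elements are going to be considered ‘words’?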

library(tidytext)

crying_by_words <- 
  crying %>%
  unnest_tokens(word, review_body, token = "words")
head(crying_by_words)

# A tibble: 6 x 3
  stars five_stars word 
  <int>      <dbl> <chr>
1     5          1 now  
2     5          1 i    
3     5          1 can  
4     5          1 cry  
5     5          1 all  
6     5          1 i    
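To get a feel for what ends up being treated as a word, it can help to run the same function on a couple of made-up reviews. The tibble toy_reviews below is invented for illustration and only uses the default settings of unnest_tokens().

```r
library(dplyr)
library(tidytext)

# Two made-up reviews with mixed case and stray punctuation
toy_reviews <- tibble(
  stars = c(5, 1),
  review_body = c("Now I can CRY all I want!!", "Hate it. Smudges everywhere...")
)

toy_by_words <- toy_reviews %>%
  unnest_tokens(word, review_body, token = "words")

toy_by_words$word
```

Everything comes back in lower case and without punctuation, with one row per word and the stars column carried along for each row.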

We now want to count the number of times each word is used by each of the star classifications.

crying_by_words <- 
  crying_by_words %>% 
  count(stars, word, sort = TRUE)

# A tibble: 6 x 3
  stars word      n
  <int> <chr> <int>
1     5 i       348
2     5 and     249
3     5 the     239
4     5 it      211
5     5 a       193
6     5 this    178

crying_by_words %>% 
  filter(stars == 1) %>% 
  head()
# A tibble: 6 x 3
  stars word      n
  <int> <chr> <int>
1     1 the      39
2     1 i        24
3     1 and      21
4     1 it       21
5     1 to       19
6     1 my       16

So you can see that the most common word for five star reviews is ‘i’, and that the most common word for one star reviews is ‘the’.

At this point, we can use the data to do a whole bunch of different things, but one nice measure to look at is term frequency, which in this case is how often a word is used in reviews with a particular star rating. The issue is that there are a lot of words that are commonly used regardless of context. As such, we may also like to look at the inverse document frequency, in which we ‘penalise’ words that occur in reviews across many star ratings. For instance, ‘the’ probably occurs in both one star and five star reviews and so its idf is lower than that of ‘hate’, which probably only occurs in one star reviews. The term frequency–inverse document frequency (tf-idf) is then the product of these.
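Written out, and treating each star rating as a ‘document’, one common way to define these quantities is as follows. The symbols are introduced here for illustration: n_{w,s} is the number of times word w appears in reviews with star rating s, S is the number of distinct star ratings, and S_w is the number of star ratings in which w appears at least once. The natural log is an assumption that matches the tidytext function used in these notes; other references use other bases or add smoothing.

```latex
\mathrm{tf}(w, s) = \frac{n_{w,s}}{\sum_{w'} n_{w',s}},
\qquad
\mathrm{idf}(w) = \ln\!\left(\frac{S}{S_w}\right),
\qquad
\text{tf-idf}(w, s) = \mathrm{tf}(w, s) \times \mathrm{idf}(w)
```

A word that appears in every star rating has S_w = S, so its idf is ln(1) = 0 and its tf-idf is zero no matter how often it is used.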

We can create this value using the bind_tf_idf() function from the tidytext package. This adds three new columns (tf, idf, and tf_idf), with one value for each word and star combination.

# This code, and the one in the next block, is from Julia Silge
crying_by_words_tf_idf <- 
  crying_by_words %>%
  bind_tf_idf(word, stars, n) %>%
  arrange(desc(tf_idf))

head(crying_by_words_tf_idf)

# A tibble: 6 x 6
  stars word              n      tf   idf tf_idf
  <int> <chr>         <int>   <dbl> <dbl>  <dbl>
1     2 below             1 0.00826  1.61 0.0133
2     2 boy               1 0.00826  1.61 0.0133
3     2 choice            1 0.00826  1.61 0.0133
4     2 contrary          1 0.00826  1.61 0.0133
5     2 exceptionally     1 0.00826  1.61 0.0133
6     2 migrates          1 0.00826  1.61 0.0133
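
The numbers in that output can be checked by hand: each of those words appears in only one of the five star ratings, so its idf is ln(5), which is about 1.61, and multiplying by the tf of 0.00826 gives about 0.0133. As a sketch of the same calculation on a tiny, made-up table of counts, the following compares bind_tf_idf() with a version computed step by step. The names toy_counts, toy_tf_idf and by_hand are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Made-up counts: 'the' appears in both ratings, the other words in only one
toy_counts <- tibble(
  stars = c(1, 1, 5, 5),
  word = c("the", "hate", "the", "love"),
  n = c(3, 1, 3, 1)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, stars, n)

number_of_ratings <- n_distinct(toy_counts$stars)

by_hand <- toy_counts %>%
  group_by(stars) %>%
  mutate(tf = n / sum(n)) %>% # Share of all words within that star rating
  group_by(word) %>%
  mutate(idf = log(number_of_ratings / n_distinct(stars))) %>% # Natural log
  ungroup() %>%
  mutate(tf_idf = tf * idf)

all.equal(toy_tf_idf$tf_idf, by_hand$tf_idf)
```

Here ‘the’ gets a tf-idf of zero in both ratings, while ‘hate’ and ‘love’ each get 0.25 multiplied by ln(2), or about 0.173.
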

crying_by_words_tf_idf %>% 
  group_by(stars) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>% 
  mutate(word = reorder_within(word, tf_idf, stars)) %>%
  mutate(stars = as_factor(stars)) %>%
  filter(stars %in% c(1, 5)) %>% 
  ggplot(aes(word, tf_idf, fill = stars)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(vars(stars), scales = "free") +
    scale_x_reordered() +
    coord_flip() +
    labs(x = "Word", 
         y = "tf-idf") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")