Text as data

Rohan Alexander https://rohanalexander.com/ (University of Toronto)
November 26, 2020

Required reading

Required viewing

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions/etc

Pre-quiz

Introduction

TBD

Lasso regression

This subsection, and much of the code that is used, directly draws on Julia Silge’s notes, in particular: https://juliasilge.com/blog/tidy-text-classification/ (Silge 2018).

One of the nice aspects of text is that we can adapt our existing methods to use it as an input. Here we are going to use a variation of logistic regression, along with text inputs, to forecast. If you want to learn more about lasso regression, then you should consider taking Arik’s course over the summer, where he will dive into machine learning using Python.

In this section we are going to have two different text inputs, train a model on a sample of text from each of them, and then try to use that model to classify text in a test set. Although this is an arbitrary example, you could imagine many real-world applications. For instance, if you work at Twitter, then you may want to know whether a tweet was likely written by a bot or by a human. Or, similarly, imagine that you work for a political party: you may like to know whether an email was likely from an email campaign organised by a group, or from an individual.

First we need to get some data. Julia Silge’s example, nicely, uses book text as input. Seeing as I am jointly appointed at a Faculty of Information, that seems especially nice. The wonderful thing about this is that there is an R package - gutenbergr - that makes it easy to get text from Project Gutenberg into R. The key function is gutenberg_download(), which needs a key for the book that you want. We’ll consider Jane Eyre and Alice’s Adventures in Wonderland, which have the keys of 1260 and 11, respectively.

library(gutenbergr)
library(tidyverse)

# Download the two books from Project Gutenberg (run once)
alice_and_jane <- gutenberg_download(c(1260, 11), meta_fields = "title")
# Save the dataset so that we don't need to overwhelm the servers each time
write_csv(alice_and_jane, "inputs/books/alice_and_jane.csv")

# In later sessions, read the saved copy rather than downloading again
alice_and_jane <- read_csv("inputs/books/alice_and_jane.csv")

head(alice_and_jane)
gutenberg_id   text                                  title
          11   ALICE'S ADVENTURES IN WONDERLAND      Alice's Adventures in Wonderland
          11                                         Alice's Adventures in Wonderland
          11   Lewis Carroll                         Alice's Adventures in Wonderland
          11                                         Alice's Adventures in Wonderland
          11   THE MILLENNIUM FULCRUM EDITION 3.0    Alice's Adventures in Wonderland
          11                                         Alice's Adventures in Wonderland

One of the great things about this is that the dataset is a tibble, so we can use all our familiar skills. The package has a lot more functionality, so I’d encourage you to look at the package’s website: https://github.com/ropensci/gutenbergr. Each line of the book is read in as a different row in the dataset. Notice that we downloaded the two books at once, which is why we added the title as a meta field. The two books are stacked one after the other, and you can see that we have both by looking at some summary statistics.

table(alice_and_jane$title)

Alice's Adventures in Wonderland      Jane Eyre: An Autobiography 
                            3339                            20659 

So it looks like Jane Eyre is much longer than Alice in Wonderland, which isn’t a surprise to those who have read them. I don’t want to step into Digital Humanities too much, as I don’t know anything about it, but looking at things like the broader context of when these books were written, or other books that were written at similar times, is likely a fascinating area.

We’ll just get rid of the blank lines.

# Remove the blank lines
alice_and_jane <- 
  alice_and_jane %>% 
  mutate(blank_line = if_else(text == "", 1, 0)) %>% 
  filter(blank_line == 0) %>% 
  select(-blank_line)

table(alice_and_jane$title)

Alice's Adventures in Wonderland      Jane Eyre: An Autobiography 
                            2481                            16395 

There’s still an overwhelming amount of Jane Eyre in there. So we’ll just sample from Jane Eyre to make it more equal.

set.seed(853)

# Add a row index, then sample as many lines of Jane Eyre as there are of Alice
alice_and_jane$rows <- c(1:nrow(alice_and_jane))
sample_from_me <- alice_and_jane %>% filter(title == "Jane Eyre: An Autobiography")
keep_me <- sample(x = sample_from_me$rows, size = 2481, replace = FALSE)

alice_and_jane <- 
  alice_and_jane %>% 
  filter(title == "Alice's Adventures in Wonderland" | rows %in% keep_me) %>% 
  select(-rows)

table(alice_and_jane$title)

Alice's Adventures in Wonderland      Jane Eyre: An Autobiography 
                            2481                             2481 

There are a bunch of issues here. For instance, we have the whole of Alice, but only random lines of Jane. Nonetheless, let’s continue, and we’ll try to do something about that in a moment.

Now we want to get samples of text from each book. We will use the lines to distinguish these samples, so we add a line number within each book.

alice_and_jane <- 
  alice_and_jane %>% 
  group_by(title) %>% 
  mutate(line_number = paste(gutenberg_id, row_number(), sep = "_")) %>% 
  ungroup()

We now want to separate out the words. We’ll just use tidytext, because the focus here is on modelling, but there are a bunch of alternatives, and one especially good one is the quanteda package, specifically its tokens() function.

library(tidytext)
alice_and_jane_by_word <- 
  alice_and_jane %>% 
  unnest_tokens(word, text) %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

Notice here that we removed any word that wasn’t used more than 10 times. Nonetheless, we still have a lot of unique words. (If we didn’t require that a word be used more than 10 times, then we would end up with more than 6,000 unique words.)

alice_and_jane_by_word$word %>% unique() %>% length()
[1] 585

The reason this is relevant is that these are our independent variables. So, where you may be used to having fewer than 10 explanatory variables, in this case we are going to have 585. As such, we need a model that can handle this.

However, as mentioned before, we are going to have some rows that essentially have just one word. While we could allow that, it might also be nice to give the model at least a few words to work with.

alice_and_jane_by_word <- 
  alice_and_jane_by_word %>% 
  group_by(title, line_number) %>% 
  mutate(number_of_words_in_line = n()) %>% 
  ungroup() %>% 
  filter(number_of_words_in_line > 2) %>% 
  select(-number_of_words_in_line)

We’ll create a test/training split, and load in tidymodels.

library(tidymodels)

set.seed(853)

alice_and_jane_by_word_split <- 
  alice_and_jane_by_word %>%
  select(title, line_number) %>% 
  distinct() %>% 
  initial_split(prop = 3/4, strata = title)

Now we need to create a document-term matrix.

alice_and_jane_dtm_training <- 
  alice_and_jane_by_word %>% 
  count(line_number, word) %>% 
  inner_join(training(alice_and_jane_by_word_split) %>% select(line_number),
             by = "line_number") %>% 
  cast_dtm(term = word, document = line_number, value = n)

dim(alice_and_jane_dtm_training)
[1] 3415  585

So we have our independent variables sorted, now we need our binary dependent variable, which is whether the book is Alice in Wonderland or Jane Eyre.

response <- 
  data.frame(id = dimnames(alice_and_jane_dtm_training)[[1]]) %>% 
  separate(id, into = c("book", "line"), sep = "_") %>% 
  mutate(is_alice = if_else(book == 11, 1, 0))
  

predictor <- alice_and_jane_dtm_training[] %>% as.matrix()

Now we can run our model.

library(glmnet)

model <- cv.glmnet(x = predictor,
                   y = response$is_alice,
                   family = "binomial",
                   keep = TRUE
                   )

save(model, file = "outputs/models/alice_vs_jane.rda")

# In later sessions, load the saved model rather than re-fitting it
load("outputs/models/alice_vs_jane.rda")
library(glmnet)
library(broom)

coefs <- model$glmnet.fit %>%
  tidy() %>%
  filter(lambda == model$lambda.1se)

coefs %>% head()
term          step   estimate   lambda    dev.ratio
(Intercept)     36   -0.335     0.00597   0.562
in              36   -0.144     0.00597   0.562
she             36    0.39      0.00597   0.562
so              36    0.00249   0.00597   0.562
a               36   -0.117     0.00597   0.562
about           36    0.279     0.00597   0.562

coefs %>%
  group_by(estimate > 0) %>%
  top_n(10, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(x = "Word",
       y = "Coefficient") +
  scale_fill_brewer(palette = "Set1")

Perhaps unsurprisingly, if a line mentions Alice then it’s likely to be from Alice in Wonderland, and if it mentions Jane then it’s likely to be from Jane Eyre.
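
To check how the model does out of sample, we could also score the held-out lines. The following is a minimal sketch, not shown in the output above, that builds a test document-term matrix, aligns its columns to the training vocabulary, and then predicts with the fitted model; it assumes the objects created earlier (alice_and_jane_by_word, alice_and_jane_by_word_split, predictor, and model).

# A sketch of evaluating the model on the held-out lines
alice_and_jane_dtm_test <- 
  alice_and_jane_by_word %>% 
  count(line_number, word) %>% 
  inner_join(testing(alice_and_jane_by_word_split) %>% select(line_number),
             by = "line_number") %>% 
  cast_dtm(term = word, document = line_number, value = n)

# Align the test matrix to the training vocabulary, filling absent words with zeros
test_matrix <- as.matrix(alice_and_jane_dtm_test)
aligned <- matrix(0,
                  nrow = nrow(test_matrix),
                  ncol = ncol(predictor),
                  dimnames = list(rownames(test_matrix), colnames(predictor)))
shared_words <- intersect(colnames(test_matrix), colnames(predictor))
aligned[, shared_words] <- test_matrix[, shared_words]

# Predicted probability that each line is from Alice's Adventures in Wonderland
predicted_probability <- predict(model, newx = aligned, s = "lambda.1se", type = "response")

# The truth is encoded in the line_number prefix (gutenberg_id 11 is Alice)
actually_alice <- as.numeric(grepl("^11_", rownames(aligned)))
mean(as.numeric(predicted_probability > 0.5) == actually_alice)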

Topic models

A version of these notes was previously circulated as part of Alexander and Alexander (2020).

Overview

Sometimes we have a statement and we want to know what it is about. Sometimes this will be easy, but we don’t always have titles for statements, and even when we do, sometimes we do not have titles that define topics in a well-defined and consistent way. One way to get consistent estimates of the topics of each statement is to use topic models. While there are many variants, one way is to use the latent Dirichlet allocation (LDA) method of Blei, Ng, and Jordan (2003), as implemented by the R package ‘topicmodels’ by Grün and Hornik (2011).

The key assumption behind the LDA method is that each statement, ‘a document’, is made by a person who decides the topics they would like to talk about in that document, and then chooses words, ‘terms’, that are appropriate to those topics. A topic could be thought of as a collection of terms, and a document as a collection of topics. The topics are not specified ex ante; they are an outcome of the method. Terms are not necessarily unique to a particular topic, and a document could be about more than one topic. This provides more flexibility than other approaches such as a strict word count method. The goal is to have the words found in documents group themselves to define topics.

Document generation process

The LDA method considers each statement to be a result of a process where a person first chooses the topics they want to speak about. After choosing the topics, the person then chooses appropriate words to use for each of those topics. More generally, the LDA topic model works by considering each document as having been generated by some probability distribution over topics. For instance, if there were five topics and two documents, then the first document may be comprised mostly of the first few topics; the other document may be mostly about the final few topics (Figure 1).

Figure 1: Probability distributions over topics

Similarly, each topic could be considered a probability distribution over terms. To choose the terms used in each document the speaker picks terms from each topic in the appropriate proportion. For instance, if there were ten terms, then one topic could be defined by giving more weight to terms related to immigration; and some other topic may give more weight to terms related to the economy (Figure 2).

Figure 2: Probability distributions over terms

Following Blei and Lafferty (2009), Blei (2012) and Griffiths and Steyvers (2004), the process by which a document is generated is more formally considered to be:

  1. There are \(1, 2, \dots, k, \dots, K\) topics and the vocabulary consists of \(1, 2, \dots, V\) terms. For each topic, decide the terms that the topic uses by randomly drawing distributions over the terms. The distribution over the terms for the \(k\)th topic is \(\beta_k\). Typically a topic would be a small number of terms and so the Dirichlet distribution with hyperparameter \(0<\eta<1\) is used: \(\beta_k \sim \mbox{Dirichlet}(\eta)\).1 Strictly, \(\eta\) is actually a vector of hyperparameters, one for each \(K\), but in practice they all tend to be the same value.
  2. Decide the topics that each document will cover by randomly drawing distributions over the \(K\) topics for each of the \(1, 2, \dots, d, \dots, D\) documents. The topic distributions for the \(d\)th document are \(\theta_d\), and \(\theta_{d,k}\) is the topic distribution for topic \(k\) in document \(d\). Again, the Dirichlet distribution with the hyperparameter \(0<\alpha<1\) is used here because usually a document would only cover a handful of topics: \(\theta_d \sim \mbox{Dirichlet}(\alpha)\). Strictly, \(\alpha\) is a vector of length \(K\) of hyperparameters, but in practice each is usually the same value.
  3. If there are \(1, 2, \dots, n, \dots, N\) terms in the \(d\)th document, then to choose the \(n\)th term, \(w_{d, n}\):
    1. Randomly choose a topic for that term \(n\), in that document \(d\), \(z_{d,n}\), from the multinomial distribution over topics in that document, \(z_{d,n} \sim \mbox{Multinomial}(\theta_d)\).
    2. Randomly choose a term from the relevant multinomial distribution over the terms for that topic, \(w_{d,n} \sim \mbox{Multinomial}(\beta_{z_{d,n}})\).

Given this set-up, the joint distribution for the variables is (Blei (2012), p.6): \[p(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N}, w_{1:D, 1:N}) = \prod^{K}_{i=1}p(\beta_i) \prod^{D}_{d=1}p(\theta_d) \left(\prod^N_{n=1}p(z_{d,n}|\theta_d)p\left(w_{d,n}|\beta_{1:K},z_{d,n}\right) \right).\]
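
To make this notation concrete, the following is a minimal sketch, in base R, that simulates the document generation process just described. The numbers of topics, terms, and documents, and the hyperparameter values, are purely illustrative.

# A minimal simulation of the LDA generative process described above
set.seed(853)

number_of_topics <- 5       # K
number_of_terms <- 10       # V
number_of_documents <- 2    # D
terms_per_document <- 50    # N
eta <- 0.1                  # hyperparameter for the distributions over terms
alpha <- 0.5                # hyperparameter for the distributions over topics

# Draw from a symmetric Dirichlet by normalising independent gamma draws
draw_dirichlet <- function(number_of_draws, parameters) {
  draws <- matrix(rgamma(number_of_draws * length(parameters), shape = parameters),
                  nrow = number_of_draws, byrow = TRUE)
  draws / rowSums(draws)
}

# Step 1: a distribution over terms for each topic
beta <- draw_dirichlet(number_of_topics, rep(eta, number_of_terms))
# Step 2: a distribution over topics for each document
theta <- draw_dirichlet(number_of_documents, rep(alpha, number_of_topics))

# Step 3: for each term in each document, choose a topic, then choose a term
documents <- lapply(seq_len(number_of_documents), function(d) {
  topics_chosen <- sample(seq_len(number_of_topics), size = terms_per_document,
                          replace = TRUE, prob = theta[d, ])
  vapply(topics_chosen,
         function(k) sample(seq_len(number_of_terms), size = 1, prob = beta[k, ]),
         integer(1))
})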

Based on this document generation process the analysis problem, discussed in the next section, is to compute a posterior over \(\beta_{1:K}\) and \(\theta_{1:D}\), given \(w_{1:D, 1:N}\). This is intractable directly, but can be approximated (Griffiths and Steyvers (2004) and Blei (2012)).

Analysis process

After the documents are created, they are all that we have to analyse. The term usage in each document, \(w_{1:D, 1:N}\), is observed, but the topics are hidden, or ‘latent’. We do not know the topics of each document, nor how terms defined the topics. That is, we do not know the probability distributions of Figures 1 or 2. In a sense we are trying to reverse the document generation process – we have the terms and we would like to discover the topics.

If the earlier process around how the documents were generated is assumed and we observe the terms in each document, then we can obtain estimates of the topics (Steyvers and Griffiths (2006)). The outcomes of the LDA process are probability distributions and these define the topics. Each term will be given a probability of being a member of a particular topic, and each document will be given a probability of being about a particular topic. That is, we are trying to calculate the posterior distribution of the topics given the terms observed in each document (Blei (2012), p.7): \[p(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N} | w_{1:D, 1:N}) = \frac{p\left(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N}, w_{1:D, 1:N}\right)}{p(w_{1:D, 1:N})}.\]

The initial practical step when implementing LDA given a corpus of documents is to remove ‘stop words’. These are words that are common but that do not typically help to define topics, such as “a”, “able”, “about”, and “above”. We also remove punctuation and capitalisation. The documents then need to be transformed into a document-term matrix: a table in which each row corresponds to a document, each column corresponds to a term, and each cell counts the number of times that term appears in that document.
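
As a sketch of what this preparation could look like in R, assuming a tibble called documents with columns id and text (hypothetical names), the tidytext approach used earlier in these notes can be re-used.

# A sketch of the preparation described above: tokenise, remove stop words,
# and cast to a document-term matrix
library(tidyverse)
library(tidytext)

documents_dtm <- 
  documents %>% 
  unnest_tokens(word, text) %>%            # also lower-cases and strips punctuation
  anti_join(stop_words, by = "word") %>%   # remove common stop words
  count(id, word) %>% 
  cast_dtm(document = id, term = word, value = n)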

After the dataset is ready, the R package ‘topicmodels’ by Grün and Hornik (2011) can be used to implement LDA and approximate the posterior. It does this using Gibbs sampling or the variational expectation-maximization algorithm. Following Steyvers and Griffiths (2006) and Darling (2011), the Gibbs sampling process attempts to find a topic for a particular term in a particular document, given the topics of all other terms for all other documents. Broadly, it does this by first assigning every term in every document to a random topic, specified by Dirichlet priors with \(\alpha = \frac{50}{K}\) and \(\eta = 0.1\) (Steyvers and Griffiths (2006) recommends \(\eta = 0.01\)), where \(\alpha\) refers to the distribution over topics and \(\eta\) refers to the distribution over terms (Grün and Hornik (2011), p.7). It then selects a particular term in a particular document and assigns it to a new topic based on the conditional distribution where the topics for all other terms in all documents are taken as given (Grün and Hornik (2011), p.6): \[p(z_{d, n}=k | w_{1:D, 1:N}, z'_{d, n}) \propto \frac{\lambda'_{n\rightarrow k}+\eta}{\lambda'_{.\rightarrow k}+V\eta} \frac{\lambda'^{(d)}_{n\rightarrow k}+\alpha}{\lambda'^{(d)}_{-i}+K\alpha} \] where \(z'_{d, n}\) refers to all other topic assignments; \(\lambda'_{n\rightarrow k}\) is a count of how many other times that term has been assigned to topic \(k\); \(\lambda'_{.\rightarrow k}\) is a count of how many other times that any term has been assigned to topic \(k\); \(\lambda'^{(d)}_{n\rightarrow k}\) is a count of how many other times that term has been assigned to topic \(k\) in that particular document; and \(\lambda'^{(d)}_{-i}\) is a count of how many other times that term has been assigned in that document. Once \(z_{d,n}\) has been estimated, then estimates for the distribution of words into topics and topics into documents can be backed out.
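
As a sketch of what this looks like in practice, assuming the document-term matrix documents_dtm from the sketch above and an illustrative choice of ten topics:

# A sketch of fitting LDA with the topicmodels package
library(topicmodels)
library(tidytext)

lda_fit <- LDA(documents_dtm, k = 10, method = "Gibbs", control = list(seed = 853))

# beta: distributions over terms within topics; gamma: distributions over topics within documents
terms_in_topics <- tidy(lda_fit, matrix = "beta")
topics_in_documents <- tidy(lda_fit, matrix = "gamma")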

This conditional distribution assigns topics depending on how often a term has been assigned to that topic previously, and how common the topic is in that document (Steyvers and Griffiths (2006)). The initial random allocation of topics means that the results of early passes through the corpus of documents are poor, but given enough time the algorithm converges to an appropriate estimate.

Warnings and extensions

The choice of the number of topics, \(K\), affects the results, and must be specified a priori. If there is a strong reason for a particular number, then this can be used. Otherwise, one way to choose an appropriate number is to use a test and training set process. Essentially, this means running the process for a variety of possible values of \(K\) and then picking a value that performs well on held-out documents, as sketched below.
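
One way to operationalise this, as a sketch, is to compare held-out perplexity across candidate values of \(K\), where lower perplexity is better; dtm_train and dtm_test are hypothetical training and test document-term matrices.

# A sketch of comparing candidate numbers of topics using held-out perplexity
library(topicmodels)
library(purrr)

candidate_k <- c(5, 10, 20, 40)

perplexity_by_k <- 
  map_dbl(candidate_k, function(k) {
    fit <- LDA(dtm_train, k = k, control = list(seed = 853))
    perplexity(fit, newdata = dtm_test)
  })

candidate_k[which.min(perplexity_by_k)]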

One weakness of the LDA method is that it considers a ‘bag of words’ where the order of those words does not matter (Blei (2012)). It is possible to extend the model to reduce the impact of the bag-of-words assumption and add conditionality to word order. Additionally, alternatives to the Dirichlet distribution can be used to extend the model to allow for correlation between topics. For instance, in Hansard, topics related to the army may be expected to be commonly found alongside topics related to the navy, but less commonly alongside topics related to banking.
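
For instance, the topicmodels package also provides an implementation of the correlated topic model, which replaces the Dirichlet distribution over topics with a logistic normal distribution so that topics can be correlated. As a minimal sketch, assuming the hypothetical documents_dtm from earlier:

# A sketch of fitting a correlated topic model
library(topicmodels)

ctm_fit <- CTM(documents_dtm, k = 10)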

Word embedding

Conclusion

Using text as data is exciting because of the quantity and variety of text that is available to us. In general, dealing with text datasets is messy. There is a lot of cleaning and preparation typically required, and text datasets are often large. As such, having a workflow in place in which you work reproducibly, simulate data first, and clearly communicate your findings becomes critical, if only to keep everything organised in your own mind. Nonetheless, it is an exciting area, and I encourage you to use text analysis regularly where possible.

In terms of next steps there are two, related, concerns: data and analysis.

In terms of data there are many places to get large amounts of text data relatively easily, including:

In terms of analysis:

References

Alexander, Rohan, and Monica Alexander. 2020. “The Increased Effect of Elections and Changing Prime Ministers on Topics Discussed in the Australian Federal Parliament Between 1901 and 2018.” https://rohanalexander.com/pdfs/AlexanderAlexander-EffectofElectionsandPrimeMinisters.pdf.

Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4): 77–84.

Blei, David M, and John D Lafferty. 2009. “Topic Models.” In Text Mining, 101–24. Chapman; Hall/CRC.

Blei, David M, Andrew Y Ng, and Michael I Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.

Darling, William M. 2011. “A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 642–47.

Griffiths, Thomas, and Mark Steyvers. 2004. “Finding Scientific Topics.” PNAS 101: 5228–35.

Grün, Bettina, and Kurt Hornik. 2011. “topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software 40 (13): 1–30. https://doi.org/10.18637/jss.v040.i13.

Silge, Julia. 2018. Text Classification with Tidy Data Principles. https://juliasilge.com/blog/tidy-text-classification/.

Steyvers, Mark, and Tom Griffiths. 2006. “Probabilistic Topic Models.” In Latent Semantic Analysis: A Road to Meaning, edited by T. Landauer, D McNamara, S. Dennis, and W. Kintsch.


  1. The Dirichlet distribution is a variation of the beta distribution that is commonly used as a prior for categorical and multinomial variables. If there are just two categories, then the Dirichlet and the beta distributions are the same. In the special case of a symmetric Dirichlet distribution, \(\eta=1\), it is equivalent to a uniform distribution. If \(\eta<1\), then the distribution is sparse and concentrated on a smaller number of the values, and this number decreases as \(\eta\) decreases. A hyperparameter is a parameter of a prior distribution.↩︎
