Extracting data from PDFs

Required reading

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions

Pre-quiz

  1. Name three functions from the purrr package and situations in which you may use them.
  2. What are three delimiters that could be useful when trying to bring order to the PDF that you read in as a character vector?
  3. What do I need to put inside “SOMETHING_HERE” if I want to match regular expressions for a full stop i.e. “.”?
  4. Name three reasons for sketching out what you want before starting to try to extract data from a PDF.
  5. If you are interested in demographic data then what are three checks that you might like to do? What are three if you are interested in economic data such as GDP, interest rates, and exchange rates?

Introduction

In contrast to an API, a PDF is usually only produced for human (not computer) consumption. The nice thing about PDFs is that they are static and constant. And it is nice that they make data available at all. But the trade-off is that it is usually not easy to get the data out of them and into a form that we can analyse.

Indeed, sometimes governments publish data as PDFs because they don’t actually want you to be able to analyse it! Being able to get data from PDFs opens up a large number of datasets for you, some of which we’ll see in this chapter.

There are two important aspects to keep in mind when approaching a PDF with a mind to extracting data from it:

  1. Begin with an end in mind. Planning and then literally sketching out what you want from a final dataset/graph/paper stops you wasting time and keeps you focused.
  2. Start simple, then iterate. The quickest way to make a complicated model is often to first build a simple model and then complicate it. Start with just trying to get one page of the PDF working or even just one line. Then iterate from there.

In this chapter we start by walking through several examples and then go through three case studies of varying difficulty.

Getting started

Figure 1 is a PDF that consists of just the first sentence of Jane Eyre taken from Project Gutenberg (Bronte 1847).

First sentence of Jane Eyre

Figure 1: First sentence of Jane Eyre

We will use the pdftools package (Ooms 2019a) to get the text in this one-page PDF into R.

# install.packages("pdftools")
library(pdftools)
library(tidyverse)

first_example <- pdftools::pdf_text("inputs/pdfs/first_example.pdf")

first_example
[1] "There was no possibility of taking a walk that day."
class(first_example)
[1] "character"

We can see that the PDF has been correctly read in, as a character vector.

We will now try a slightly more complicated example that consists of the first few paragraphs of Jane Eyre (Figure 2). Notice that we now also have the chapter heading.

First few paragraphs of Jane Eyre

Figure 2: First few paragraphs of Jane Eyre

We use the same function as before.

second_example <- pdftools::pdf_text("inputs/pdfs/second_example.pdf")

second_example
[1] "CHAPTER I\nThere was no possibility of taking a walk that day. We had been wandering, indeed, in the\nleafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no\ncompany, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the\ncoming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the\nchidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed.\nThe said Eliza, John, and Georgiana were now clustered round their mama in the drawing-room:\nshe lay reclined on a sofa by the fireside, and with her darlings about her (for the time neither\nquarrelling nor crying) looked perfectly happy. Me, she had dispensed from joining the group;\nsaying, “She regretted to be under the necessity of keeping me at a distance; but that until she\nheard from Bessie, and could discover by her own observation, that I was endeavouring in good\nearnest to acquire a more sociable and childlike disposition, a more attractive and sprightly\nmanner—something lighter, franker, more natural, as it were—she really must exclude me from\nprivileges intended only for contented, happy, little children.”\n“What does Bessie say I have done?” I asked.\n“Jane, I don’t like cavillers or questioners; besides, there is something truly forbidding in a child\ntaking up her elders in that manner. Be seated somewhere; and until you can speak pleasantly,\nremain silent.”\nA breakfast-room adjoined the drawing-room, I slipped in there. It contained a bookcase: I soon\npossessed myself of a volume, taking care that it should be one stored with pictures. I mounted\ninto the window-seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the\nred moreen curtain nearly close, I was shrined in double retirement.\nFolds of scarlet drapery shut in my view to the right hand; to the left were the clear panes of\nglass, protecting, but not separating me from the drear November day. At intervals, while\nturning over the leaves of my book, I studied the aspect of that winter afternoon. Afar, it offered\na pale blank of mist and cloud; near a scene of wet lawn and storm-beat shrub, with ceaseless\nrain sweeping away wildly before a long and lamentable blast.\n"
class(second_example)
[1] "character"

Again, we have a character vector. The end of each line is signalled by ‘\n’, but other than that it looks pretty good.

Finally, we consider the first two pages.

We use the same function as before.

third_example <- pdftools::pdf_text("inputs/pdfs/third_example.pdf")

third_example
[1] "CHAPTER I\nThere was no possibility of taking a walk that day. We had been wandering, indeed, in the\nleafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no\ncompany, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the\ncoming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the\nchidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed.\nThe said Eliza, John, and Georgiana were now clustered round their mama in the drawing-room:\nshe lay reclined on a sofa by the fireside, and with her darlings about her (for the time neither\nquarrelling nor crying) looked perfectly happy. Me, she had dispensed from joining the group;\nsaying, “She regretted to be under the necessity of keeping me at a distance; but that until she\nheard from Bessie, and could discover by her own observation, that I was endeavouring in good\nearnest to acquire a more sociable and childlike disposition, a more attractive and sprightly\nmanner—something lighter, franker, more natural, as it were—she really must exclude me from\nprivileges intended only for contented, happy, little children.”\n“What does Bessie say I have done?” I asked.\n“Jane, I don’t like cavillers or questioners; besides, there is something truly forbidding in a child\ntaking up her elders in that manner. Be seated somewhere; and until you can speak pleasantly,\nremain silent.”\nA breakfast-room adjoined the drawing-room, I slipped in there. It contained a bookcase: I soon\npossessed myself of a volume, taking care that it should be one stored with pictures. I mounted\ninto the window-seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the\nred moreen curtain nearly close, I was shrined in double retirement.\nFolds of scarlet drapery shut in my view to the right hand; to the left were the clear panes of\nglass, protecting, but not separating me from the drear November day. At intervals, while\nturning over the leaves of my book, I studied the aspect of that winter afternoon. Afar, it offered\na pale blank of mist and cloud; near a scene of wet lawn and storm-beat shrub, with ceaseless\nrain sweeping away wildly before a long and lamentable blast.\nI returned to my book—Bewick’s History of British Birds: the letterpress thereof I cared little\nfor, generally speaking; and yet there were certain introductory pages that, child as I was, I could\nnot pass quite as a blank. They were those which treat of the haunts of sea-fowl; of “the solitary\nrocks and promontories” by them only inhabited; of the coast of Norway, studded with isles from\nits southern extremity, the Lindeness, or Naze, to the North Cape—\n“Where the Northern Ocean, in vast whirls,\nBoils round the naked, melancholy isles\n"
[2] "Of farthest Thule; and the Atlantic surge\nPours in among the stormy Hebrides.”\nNor could I pass unnoticed the suggestion of the bleak shores of Lapland, Siberia, Spitzbergen,\nNova Zembla, Iceland, Greenland, with “the vast sweep of the Arctic Zone, and those forlorn\nregions of dreary space,—that reservoir of frost and snow, where firm fields of ice, the\naccumulation of centuries of winters, glazed in Alpine heights above heights, surround the pole,\nand concentre the multiplied rigours of extreme cold.” Of these death-white realms I formed an\nidea of my own: shadowy, like all the half-comprehended notions that float dim through\nchildren’s brains, but strangely impressive. The words in these introductory pages connected\nthemselves with the succeeding vignettes, and gave significance to the rock standing up alone in\na sea of billow and spray; to the broken boat stranded on a desolate coast; to the cold and ghastly\nmoon glancing through bars of cloud at a wreck just sinking.\nI cannot tell what sentiment haunted the quite solitary churchyard, with its inscribed headstone;\nits gate, its two trees, its low horizon, girdled by a broken wall, and its newly-risen crescent,\nattesting the hour of eventide.\nThe two ships becalmed on a torpid sea, I believed to be marine phantoms.\nThe fiend pinning down the thief’s pack behind him, I passed over quickly: it was an object of\nterror.\nSo was the black horned thing seated aloof on a rock, surveying a distant crowd surrounding a\ngallows.\nEach picture told a story; mysterious often to my undeveloped understanding and imperfect\nfeelings, yet ever profoundly interesting: as interesting as the tales Bessie sometimes narrated on\nwinter evenings, when she chanced to be in good humour; and when, having brought her ironing-\ntable to the nursery hearth, she allowed us to sit about it, and while she got up Mrs. Reed’s lace\nfrills, and crimped her nightcap borders, fed our eager attention with passages of love and\nadventure taken from old fairy tales and other ballads; or (as at a later period I discovered) from\nthe pages of Pamela, and Henry, Earl of Moreland.\nWith Bewick on my knee, I was then happy: happy at least in my way. I feared nothing but\ninterruption, and that came too soon. The breakfast-room door opened.\n“Boh! Madam Mope!” cried the voice of John Reed; then he paused: he found the room\napparently empty.\n“Where the dickens is she!” he continued. “Lizzy! Georgy! (calling to his sisters) Joan is not\nhere: tell mama she is run out into the rain—bad animal!”\n“It is well I drew the curtain,” thought I; and I wished fervently he might not discover my hiding-\nplace: nor would John Reed have found it out himself; he was not quick either of vision or\nconception; but Eliza just put her head in at the door, and said at once—\n"                                                                                                                                                                                                                
class(third_example)
[1] "character"

Now, notice that the first page is the first element of the character vector and the second page is the second element.

As we’re most familiar with rectangular data we’ll try to get it into that format as quickly as possible. And then we can use our regular tools to deal with it.

First we want to convert the character vector into a tibble. At this point we may like to add page numbers as well.

jane_eyre <- tibble(raw_text = third_example,
                    page_number = c(1:2))

We probably now want to separate the lines so that each line is an observation. We can do that by looking for the ‘\n’, remembering that we need to escape the backslash as it’s a special character.

jane_eyre <- separate_rows(jane_eyre, raw_text, sep = "\\n", convert = FALSE)
head(jane_eyre)
# A tibble: 6 x 2
  raw_text                                                           page_number
  <chr>                                                                    <int>
1 CHAPTER I                                                                    1
2 There was no possibility of taking a walk that day. We had been w…           1
3 leafless shrubbery an hour in the morning; but since dinner (Mrs.…           1
4 company, dined early) the cold winter wind had brought with it cl…           1
5 penetrating, that further out-door exercise was now out of the qu…           1
6 I was glad of it: I never liked long walks, especially on chilly …           1

Case-study: US Total Fertility Rate, by state and year (2000-2018)

Introduction

If you’re married to a demographer it is not too long until you are asked to look at a US Department of Health and Human Services Vital Statistics Report. In this case we are interested in trying to get the total fertility rate (the average number of births per woman assuming that women experience the current age-specific fertility rates throughout their reproductive years)1 for each state for nineteen years. Annoyingly, the US persists in only making this data available in PDFs, but it makes a nice case study.
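(A note on units: these reports state the TFR per 1,000 women, which is why the values we extract below are around 2,000 rather than 2. With 5-year age groups, the tables compute the TFR as the sum of the age-specific birth rates multiplied by 5.)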

In the case of the year 2000 the table that we are interested in is on page 40 of a PDF that is available at https://www.cdc.gov/nchs/data/nvsr/nvsr50/nvsr50_05.pdf and it is the column labelled “Total fertility rate” (Figure 3).

Example Vital Statistics Report, from 2000

Figure 3: Example Vital Statistics Report, from 2000

Begin with an end in mind

The first step when getting data out of a PDF is to sketch out what you eventually want. A PDF typically contains a lot of information, and so it is handy to be very clear about what you need. This helps keep you focused, and prevents scope creep, but it is also helpful when thinking about data checks. Literally write down on paper what you have in mind.

In this case, what is needed is a table with a column for state, year and TFR (Figure 4).

Desired output from the PDF

Figure 4: Desired output from the PDF

Start simple, then iterate

There are 19 different PDFs and we are interested in a particular column in a particular table in each of them. Unfortunately, there is nothing magical about what is coming: this first step requires working out the link for each report, along with the page and the column name of interest. In the end, that information looks like this.

library(gt)

monicas_data <- read_csv("inputs/tfr_tables_info.csv")

monicas_data %>% 
  select(year, page, table, column_name, url) %>% 
  gt()
year page table column_name url
2000 40 10 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr50/nvsr50_05.pdf
2001 41 10 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr51/nvsr51_02.pdf
2002 46 10 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr52/nvsr52_10.pdf
2003 45 10 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54_02.pdf
2004 52 11 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr55/nvsr55_01.pdf
2005 52 11 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr56/nvsr56_06.pdf
2006 49 11 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr57/nvsr57_07.pdf
2007 41 11 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr58/nvsr58_24.pdf
2008 43 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr59/nvsr59_01.pdf
2009 43 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr60/nvsr60_01.pdf
2010 42 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_01.pdf
2011 40 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr62/nvsr62_01.pdf
2012 38 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr62/nvsr62_09.pdf
2013 37 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr64/nvsr64_01.pdf
2014 38 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr64/nvsr64_12.pdf
2015 42 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr66/nvsr66_01.pdf
2016 29 8 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr67/nvsr67_01.pdf
2016 30 8 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr67/nvsr67_01.pdf
2017 23 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr67/nvsr67_08-508.pdf
2017 24 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr67/nvsr67_08-508.pdf
2018 23 12 Total fertility rate https://www.cdc.gov/nchs/data/nvsr/nvsr68/nvsr68_13-508.pdf

The first step is to get some code that works for one of them. I’ll step through the code in a lot more detail than normal because we’re going to use these pieces a lot.

We will choose the year 2000. We first download the data and save it.

download.file(url = monicas_data$url[1], 
              destfile = "inputs/pdfs/dhs/year_2000.pdf")

We now want to read the PDF in as a character vector.

dhs_2000 <- pdftools::pdf_text("inputs/pdfs/dhs/year_2000.pdf")

Convert it to a tibble, so that we can use familiar verbs on it.

dhs_2000 <- tibble(raw_data = dhs_2000)

head(dhs_2000)
# A tibble: 6 x 1
  raw_data                                                                      
  <chr>                                                                         
1 "Volume 50, Number 5                                                         …
2 "2     National Vital Statistics Report, Vol. 50, No. 5, February 12, 2002\nH…
3 "                                                                            …
4 "4     National Vital Statistics Report, Vol. 50, No. 5, February 12, 2002\nD…
5 "                                                                            …
6 "6     National Vital Statistics Report, Vol. 50, No. 5, February 12, 2002\n …

Grab the page that is of interest (remembering that each page is an element of the character vector, hence a row in the tibble).

dhs_2000 <- 
  dhs_2000 %>% 
  slice(monicas_data$page[1])

head(dhs_2000)
# A tibble: 1 x 1
  raw_data                                                                      
  <chr>                                                                         
1 "40 National Vital Statistics Report, Vol. 50, No. 5, Revised May 15, 20022\n…

Now we want to separate the rows.

dhs_2000 <- 
  dhs_2000 %>% 
  separate_rows(raw_data, sep = "\\n", convert = FALSE)

head(dhs_2000)
# A tibble: 6 x 1
  raw_data                                                                      
  <chr>                                                                         
1 40 National Vital Statistics Report, Vol. 50, No. 5, Revised May 15, 20022    
2 Table 10. Number of births, birth rates, fertility rates, total fertility rat…
3 United States, each State and territory, 2000                                 
4 [By place of residence. Birth rates are live births per 1,000 estimated popul…
5 estimated in each area; total fertility rates are sums of birth rates for 5-y…
6 age group estimated in each area]                                             

Now we are searching for patterns that we can use. (If you have a lot of tables that you are interested in grabbing from PDFs then it may also be worthwhile considering the tabulizer package, which is specifically designed for that. The issue is that it depends on Java, and I always seem to run into trouble when I need to use Java, so I avoid it when I can. A sketch of its use follows.)
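# A sketch only, assuming Java and tabulizer are installed; the object
# name is illustrative. extract_tables() tries to detect table boundaries
# itself and returns a list with one matrix per table found on the page.
# install.packages("tabulizer")
library(tabulizer)
tfr_table_2000 <- extract_tables("inputs/pdfs/dhs/year_2000.pdf", pages = 40)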

Let’s look at the first ten lines of content.

dhs_2000[13:22,]
# A tibble: 10 x 1
   raw_data                                                                     
   <chr>                                                                        
 1 United States 1 ......................................................      …
 2 Alabama ...............................................................     …
 3 Alaska ...................................................................  …
 4 Arizona .................................................................   …
 5 Arkansas ...............................................................    …
 6 California ..............................................................   …
 7 Colorado ...............................................................    …
 8 Connecticut ...........................................................     …
 9 Delaware ..............................................................     …
10 District of Columbia ..............................................         …

It doesn’t get much better than this:

  1. We have dots separating the states from the data.
  2. We have a space between each of the columns.

So we can now separate this into columns. First we want to match wherever there are at least two dots (remembering that the dot is a special character and so needs to be escaped).

dhs_2000 <- 
  dhs_2000 %>% 
  separate(col = raw_data, 
           into = c("state", "data"), 
           sep = "\\.{2,}", 
           remove = FALSE,
           fill = "right"
           )

head(dhs_2000)
# A tibble: 6 x 3
  raw_data                              state                              data 
  <chr>                                 <chr>                              <chr>
1 40 National Vital Statistics Report,… 40 National Vital Statistics Repo… <NA> 
2 Table 10. Number of births, birth ra… Table 10. Number of births, birth… <NA> 
3 United States, each State and territ… United States, each State and ter… <NA> 
4 [By place of residence. Birth rates … [By place of residence. Birth rat… <NA> 
5 estimated in each area; total fertil… estimated in each area; total fer… <NA> 
6 age group estimated in each area]     age group estimated in each area]  <NA> 

We get the expected warnings about the top and the bottom as they don’t have multiple dots.

(Another option here is to use the pdf_data() function, which would allow us to use location rather than delimiters; a sketch follows.)
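# A sketch only; the object name is illustrative. pdf_data() returns a
# list with one data frame per page, giving the x-y position, width, and
# height of every word on that page, which we could filter on instead of
# relying on dots and spaces.
page_words <- pdftools::pdf_data("inputs/pdfs/dhs/year_2000.pdf")[[40]]
head(page_words)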

We can now separate the data based on spaces. There is an inconsistent number of spaces, so we first squish any example of more than one space into just one.

dhs_2000 <- 
  dhs_2000 %>%
  mutate(data = str_squish(data)) %>% 
  tidyr::separate(col = data, 
           into = c("number_of_births", 
                    "birth_rate", 
                    "fertility_rate", 
                    "TFR", 
                    "teen_births_all", 
                    "teen_births_15_17", 
                    "teen_births_18_19"), 
           sep = "\\s", 
           remove = FALSE
           )

head(dhs_2000)
# A tibble: 6 x 10
  raw_data state data  number_of_births birth_rate fertility_rate TFR  
  <chr>    <chr> <chr> <chr>            <chr>      <chr>          <chr>
1 40 Nati… 40 N… <NA>  <NA>             <NA>       <NA>           <NA> 
2 Table 1… Tabl… <NA>  <NA>             <NA>       <NA>           <NA> 
3 United … Unit… <NA>  <NA>             <NA>       <NA>           <NA> 
4 [By pla… [By … <NA>  <NA>             <NA>       <NA>           <NA> 
5 estimat… esti… <NA>  <NA>             <NA>       <NA>           <NA> 
6 age gro… age … <NA>  <NA>             <NA>       <NA>           <NA> 
# … with 3 more variables: teen_births_all <chr>, teen_births_15_17 <chr>,
#   teen_births_18_19 <chr>

This is all looking pretty good. The only thing left is to clean it up.

dhs_2000 <- 
  dhs_2000 %>% 
  select(state, TFR) %>% 
  slice(13:69) %>% 
  mutate(year = 2000)

dhs_2000
# A tibble: 57 x 3
   state                   TFR      year
   <chr>                   <chr>   <dbl>
 1 "United States 1 "      2,130.0  2000
 2 "Alabama "              2,021.0  2000
 3 "Alaska "               2,437.0  2000
 4 "Arizona "              2,652.5  2000
 5 "Arkansas "             2,140.0  2000
 6 "California "           2,186.0  2000
 7 "Colorado "             2,356.5  2000
 8 "Connecticut "          1,931.5  2000
 9 "Delaware "             2,014.0  2000
10 "District of Columbia " 1,975.5  2000
# … with 47 more rows

And we’re done for that year. Now we want to take these pieces, put them into a function and then run that function over all 19 years.

Iterating

Get the PDFs

The first part is downloading each of the 19 PDFs that we need. We’re going to build on the code that we used before. That code was:

download.file(url = monicas_data$url[1], destfile = "inputs/pdfs/dhs/year_2000.pdf")

To modify this we need:

  1. To have it iterate through each of the rows in the dataset that contains our URLs (i.e. where it says 1, we want 1, then 2, then 3, etc.).
  2. Where it has a filename, we need it to iterate through our desired filenames (i.e. year_2000, then year_2001, then year_2002, etc).
  3. We’d like for it to do all of this in a way that is a little robust to errors. For instance, if one of the URLs is wrong or the internet drops out then we’d like it to just move onto the next PDF, and then warn us at the end that it missed one, not to stop. (This doesn’t really matter because it’s only 19 files, but it’s pretty easy to find yourself doing this for thousands of files).

We will draw on the purrr package for this (Henry and Wickham 2020).

library(purrr)
monicas_data <- 
  monicas_data %>% 
  mutate(pdf_name = paste0("inputs/pdfs/dhs/year_", year, ".pdf"))
purrr::walk2(monicas_data$url, monicas_data$pdf_name, purrr::safely(~download.file(.x , .y)))

What this code does is take the function download.file() and give it two arguments: .x and .y. The function walk2() then applies that function to the inputs that we give it; in this case, the url column is .x and the pdf_name column is .y. Finally, wrapping it all in safely() means that if there are any failures then it just moves on to the next file instead of throwing an error.
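To see what safely() does, here is a minimal sketch with a function that we know can fail (the object name is just illustrative):

safe_log <- purrr::safely(log)
safe_log(10)   # Works: $result is 2.302585 and $error is NULL
safe_log("a")  # Fails: $result is NULL and $error holds the captured error

Rather than ever stopping, the wrapped function returns a list with a result element and an error element, one of which is always NULL.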

We now have each of the PDFs saved and we can move onto getting the data from them.

Get data from the PDFs

Now we need to get the data from the PDFs. Again, we are going to build on the code that we used earlier. That code (overly condensed) was:

dhs_2000 <- pdftools::pdf_text("inputs/pdfs/dhs/year_2000.pdf")

dhs_2000 <- 
  tibble(raw_data = dhs_2000) %>% 
  slice(monicas_data$page[1]) %>% 
  separate_rows(raw_data, sep = "\\n", convert = FALSE) %>% 
  separate(col = raw_data, into = c("state", "data"), sep = "\\.{2,}", remove = FALSE) %>% 
  mutate(data = str_squish(data)) %>% 
  separate(col = data, 
           into = c("number_of_births", "birth_rate", "fertility_rate", "TFR", "teen_births_all", "teen_births_15_17", "teen_births_18_19"), 
           sep = "\\s", 
           remove = FALSE) %>% 
  select(state, TFR) %>% 
  slice(13:69) %>% 
  mutate(year = 2000)

dhs_2000

There are a bunch of aspects here that have been hardcoded. The first thing that we want to iterate is the argument to pdf_text(), and then the number in slice() will also need to change (that is doing the work of getting only the page that we are interested in).

Two other aspects are hardcoded and may need to be updated. In particular: 1) the separate() only works if each of the tables has the same columns in the same order; and 2) the slice() (which restricts the data to just the states) only works if the states are in the same rows of each table. Finally, we add the year only at the end, whereas we’d need to bring that up earlier in the process.

We’ll start by writing a function that will go through all the files, grab the data, get the page of interest, and then expand the rows. We’ll then use a function from purrr to apply that function to all of the PDFs and to output a tibble.

get_pdf_convert_to_tibble <- function(pdf_name, page, year){
  
  dhs_table_of_interest <- 
    tibble(raw_data = pdftools::pdf_text(pdf_name)) %>% 
    slice(page) %>% 
    separate_rows(raw_data, sep = "\\n", convert = FALSE) %>% 
    separate(col = raw_data, 
             into = c("state", "data"), 
             sep = "[�|\\.]\\s+(?=[[:digit:]])", 
             remove = FALSE) %>% 
    mutate(
      data = str_squish(data),
      year_of_data = year)

  print(paste("Done with", year))
  
  return(dhs_table_of_interest)
}

raw_dhs_data <- purrr::pmap_dfr(monicas_data %>% select(pdf_name, page, year),
                                get_pdf_convert_to_tibble)
[1] "Done with 2000"
[1] "Done with 2001"
[1] "Done with 2002"
[1] "Done with 2003"
[1] "Done with 2004"
[1] "Done with 2005"
[1] "Done with 2006"
[1] "Done with 2007"
[1] "Done with 2008"
[1] "Done with 2009"
[1] "Done with 2010"
[1] "Done with 2011"
[1] "Done with 2012"
[1] "Done with 2013"
[1] "Done with 2014"
[1] "Done with 2015"
[1] "Done with 2016"
[1] "Done with 2016"
[1] "Done with 2017"
[1] "Done with 2017"
[1] "Done with 2018"
head(raw_dhs_data)
# A tibble: 6 x 4
  raw_data                       state                        data  year_of_data
  <chr>                          <chr>                        <chr>        <dbl>
1 40 National Vital Statistics … 40 National Vital Statistic… 50, …         2000
2 Table 10. Number of births, b… Table 10. Number of births,… <NA>          2000
3 United States, each State and… United States, each State a… <NA>          2000
4 [By place of residence. Birth… [By place of residence. Bir… <NA>          2000
5 estimated in each area; total… estimated in each area; tot… <NA>          2000
6 age group estimated in each a… age group estimated in each… <NA>          2000

Now we need to clean up the state names and then filter on them.

states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", 
            "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", 
            "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", 
            "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", 
            "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", 
            "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", 
            "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", 
            "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", 
            "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", 
            "Wyoming", "District of Columbia")

raw_dhs_data <- 
  raw_dhs_data %>% 
  mutate(state = str_remove_all(state, "\\."),
         state = str_remove_all(state, "�"),
         state = str_remove_all(state, ""),
         state = str_replace_all(state, "United States 1", "United States"),
         state = str_replace_all(state, "United States1", "United States"),
         state = str_replace_all(state, "United States 2", "United States"),
         state = str_replace_all(state, "United States2", "United States"),
         state = str_replace_all(state, "United States²", "United States"),
         ) %>% 
  mutate(state = str_squish(state)) %>% 
  filter(state %in% states)

head(raw_dhs_data)
# A tibble: 6 x 4
  raw_data                              state   data                year_of_data
  <chr>                                 <chr>   <chr>                      <dbl>
1 Alabama ............................… Alabama 63,299 14.4 65.0 2…         2000
2 Alaska .............................… Alaska  9,974 16.0 74.6 2,…         2000
3 Arizona ............................… Arizona 85,273 17.5 84.4 2…         2000
4 Arkansas ...........................… Arkans… 37,783 14.7 69.1 2…         2000
5 California .........................… Califo… 531,959 15.8 70.7 …         2000
6 Colorado ...........................… Colora… 65,438 15.8 73.1 2…         2000

The next step is to separate the data and get the correct column from it. We’re going to separate based on spaces once it is cleaned up.

raw_dhs_data <- 
  raw_dhs_data %>% 
  mutate(data = str_remove_all(data, "\\*")) %>% 
  separate(data, into = c("col_1", "col_2", "col_3", "col_4", "col_5", 
                          "col_6", "col_7", "col_8", "col_9", "col_10"), 
           sep = " ",
           remove = FALSE)
head(raw_dhs_data)
# A tibble: 6 x 14
  raw_data state data  col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
  <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Alabama… Alab… 63,2… 63,2… 14.4  65.0  2,02… 62.9  37.9  97.3  <NA>  <NA> 
2 Alaska … Alas… 9,97… 9,974 16.0  74.6  2,43… 42.4  23.6  69.4  <NA>  <NA> 
3 Arizona… Ariz… 85,2… 85,2… 17.5  84.4  2,65… 69.1  41.1  111.3 <NA>  <NA> 
4 Arkansa… Arka… 37,7… 37,7… 14.7  69.1  2,14… 68.5  36.7  114.1 <NA>  <NA> 
5 Califor… Cali… 531,… 531,… 15.8  70.7  2,18… 48.5  28.6  75.6  <NA>  <NA> 
6 Colorad… Colo… 65,4… 65,4… 15.8  73.1  2,35… 49.2  28.6  79.8  <NA>  <NA> 
# … with 2 more variables: col_10 <chr>, year_of_data <dbl>

We can now grab the correct column.

tfr_data <- 
  raw_dhs_data %>% 
  mutate(TFR = if_else(year_of_data < 2008, col_4, col_3)) %>% 
  select(state, year_of_data, TFR) %>% 
  rename(year = year_of_data)
head(tfr_data)
# A tibble: 6 x 3
  state       year TFR    
  <chr>      <dbl> <chr>  
1 Alabama     2000 2,021.0
2 Alaska      2000 2,437.0
3 Arizona     2000 2,652.5
4 Arkansas    2000 2,140.0
5 California  2000 2,186.0
6 Colorado    2000 2,356.5

Finally, we need to convert the class of TFR from character to numeric.

tfr_data <- 
  tfr_data %>% 
  mutate(TFR = str_remove_all(TFR, ","),
         TFR = as.numeric(TFR))

head(tfr_data)
# A tibble: 6 x 3
  state       year   TFR
  <chr>      <dbl> <dbl>
1 Alabama     2000 2021 
2 Alaska      2000 2437 
3 Arizona     2000 2652.
4 Arkansas    2000 2140 
5 California  2000 2186 
6 Colorado    2000 2356.

And run some checks.

tfr_data %>% 
  skimr::skim()
Table 1: Data summary
Name Piped data
Number of rows 969
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 51 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.00 5.48 2000.0 2004 2009 2014 2018.0 ▇▇▆▇▇
TFR 0 1 1962.76 217.03 1346.5 1816 1950 2091 2761.5 ▁▇▇▂▁

In particular we want there to be 51 states (the 50 states plus the District of Columbia) and 19 years.
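A quick sketch of those checks in code; given the skim results above (51 unique states, and 969 = 51 × 19 rows), both should pass.

stopifnot(
  length(unique(tfr_data$state)) == 51, # The 50 states plus the District of Columbia
  length(unique(tfr_data$year)) == 19   # 2000 through 2018
)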

And we’re done.

head(tfr_data)
# A tibble: 6 x 3
  state       year   TFR
  <chr>      <dbl> <dbl>
1 Alabama     2000 2021 
2 Alaska      2000 2437 
3 Arizona     2000 2652.
4 Arkansas    2000 2140 
5 California  2000 2186 
6 Colorado    2000 2356.
write_csv(tfr_data, "outputs/monicas_tfr.csv")

Case-study: Kenyan census data

The distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded here: https://www.knbs.or.ke/?wpdmpro=2019-kenya-population-and-housing-census-volume-iii-distribution-of-population-by-age-sex-and-administrative-units.

And while it is great that they make it easily available, and it is easy to look up a particular result, the format is not overly useful for larger-scale data analysis, such as building a Bayesian hierarchical model.

In this section we will convert a PDF of Kenyan census results of counts, by age and sex, by county and sub-county, into a tidy dataset that can be analysed. I will draw on and introduce a bunch of handy packages, including janitor (Firke 2020), pdftools (Ooms 2019b), the tidyverse (Wickham et al. 2019), and stringi (Gagolewski 2020).

Set-up

To get started I need to load the necessary packages.
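Based on the packages introduced above, that is:

library(janitor)
library(pdftools)
library(stringi)
library(tidyverse)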

And then I need to read in the PDF that I want to convert.

# Read in the PDF
all_content <- pdftools::pdf_text("inputs/pdfs/2019_Kenya_census.pdf")

The pdf_text function from pdftools is useful when you have a PDF and you want to read the content into R. For many recently produced PDFs it’ll work pretty well, but there are alternatives. If the PDF is an image, then it won’t work and you’ll need to turn to OCR.

You can see a page of the PDF here:

knitr::include_graphics("images/2020-04-10-screenshot-of-census.png") 

Extract

The first challenge is to get the dataset into a format that we can more easily manipulate. The way that I am going to do this is to consider each page of the PDF and extract the relevant parts. To do this, I first write a function that I want to apply to each page.

# The function is going to take an input of a page
get_data <- function(i){
  # Just look at the page of interest
  # Based on https://stackoverflow.com/questions/47793326/tabulize-function-in-r
  just_page_i <- stringi::stri_split_lines(all_content[[i]])[[1]] 
  
  # Grab the name of the location
  area <- just_page_i[3] %>% str_squish()
  area <- str_to_title(area)
  
  # Grab the type of table
  type_of_table <- just_page_i[2] %>% str_squish()
  
  # Get rid of the top matter
  just_page_i_no_header <- just_page_i[5:length(just_page_i)] # Just manually for now, but could create some rules if needed
  
  # Get rid of the bottom matter
  just_page_i_no_header_no_footer <- just_page_i_no_header[1:62] # Just manually for now, but could create some rules if needed
  
  # Convert into a tibble
  demography_data <- tibble(all = just_page_i_no_header_no_footer)
  
  # Split columns
  demography_data <-
    demography_data %>%
    mutate(all = str_squish(all)) %>% # Any run of more than one space is squished down to one
    mutate(all = str_replace(all, "10 -14", "10-14")) %>% # Fix a stray space in this age group
    mutate(all = str_replace(all, "Not Stated", "NotStated")) %>% # Make this one word so that it is not split across two columns
    separate(col = all,
             into = c("age", "male", "female", "total", "age_2", "male_2", "female_2", "total_2"),
             sep = " ", # Just looking for a space. Seems to work fine because the tables are pretty nicely laid out
             remove = TRUE,
             fill = "right"
    )
  
  # They are side by side at the moment, need to append to bottom
  demography_data_long <-
    rbind(demography_data %>% select(age, male, female, total),
          demography_data %>%
            select(age_2, male_2, female_2, total_2) %>%
            rename(age = age_2, male = male_2, female = female_2, total = total_2)
    )
  
  # There is one row of NAs, so remove it
  demography_data_long <- 
    demography_data_long %>% 
    janitor::remove_empty(which = c("rows"))
  
  # Add the area and the page
  demography_data_long$area <- area
  demography_data_long$table <- type_of_table
  demography_data_long$page <- i
  
  rm(just_page_i,
     i,
     area,
     type_of_table,
     just_page_i_no_header,
     just_page_i_no_header_no_footer,
     demography_data)
  
  return(demography_data_long)
}

At this point, I have a function that does what I need to each page of the PDF. I’m going to use the function map_dfr from the purrr package to apply that function to each page, and then combine all the outputs into one tibble.

# Run through each relevant page and get the data
pages <- c(30:513)
all_tables <- map_dfr(pages, get_data)
rm(pages, get_data, all_content)

Clean

I now need to clean the dataset to make it useful.

Values

The first step is to make the numbers into actual numbers, rather than characters. Before I can convert the type, I need to remove anything that is not a number, otherwise it’ll be converted into an NA. So I first identify any values that are not numbers so that I can remove them.

# Need to convert male, female, and total to integers
# First find the characters that should not be in there
all_tables %>% 
  select(male, female, total) %>%
  mutate_all(~str_remove_all(., "[:digit:]")) %>% 
  mutate_all(~str_remove_all(., ",")) %>%
  mutate_all(~str_remove_all(., "_")) %>%
  mutate_all(~str_remove_all(., "-")) %>% 
  distinct()
# A tibble: 3 x 3
  male  female total
  <chr> <chr>  <chr>
1 ""    ""     ""   
2 "Aug" ""     ""   
3 "Jun" ""     ""   
# We clearly need to remove ",", "_", and "-". 
# This also highlights a few issues on p. 185 that need to be manually adjusted
# https://twitter.com/RohanAlexander/status/1244337583016022018
all_tables$male[all_tables$male == "23-Jun"] <- 4923
all_tables$male[all_tables$male == "15-Aug"] <- 4611

While you could use the janitor package here, it is worthwhile at least first looking at what is going on, because sometimes there is odd stuff that janitor (and other packages) will deal with, but not in a way that you want. In this case, Excel (or similar) has converted a couple of the entries into dates. If we just took the numbers from the column then we’d have 23 and 15 here, but by inspecting the column we can use Excel to reverse the process and enter the correct values of 4,923 and 4,611, respectively.

Having identified everything that needs to be removed, we can do the actual removal and convert our character column of numbers to integers.

all_tables <-
  all_tables %>%
  mutate_at(vars(male, female, total), ~str_remove_all(., ",")) %>% # First get rid of commas
  mutate_at(vars(male, female, total), ~str_replace(., "_", "0")) %>%
  mutate_at(vars(male, female, total), ~str_replace(., "-", "0")) %>%
  mutate_at(vars(male, female, total), ~as.integer(.))

Areas

The next thing to clean is the areas. We know that there are 47 counties in Kenya, and a whole bunch of sub-counties. They give us a list on pages 19 to 22 of the PDF (document pages 7 to 10). However, this list is not complete, and there are a few minor issues that we’ll deal with later.

In any case, I first need to fix a few inconsistencies.

# Fix some area names
all_tables$area[all_tables$area == "Taita/ Taveta"] <- "Taita/Taveta"
all_tables$area[all_tables$area == "Elgeyo/ Marakwet"] <- "Elgeyo/Marakwet"
all_tables$area[all_tables$area == "Nairobi City"] <- "Nairobi"

Kenya has 47 counties, each of which has sub-counties. The PDF has them arranged as the county data and then the sub-counties, without designating which is which. We can use the names, to a certain extent, but in a handful of cases there is a sub-county that has the same name as a county, so we need to first fix that.

The PDF is made up of three tables.

all_tables$table %>% table()
.
Table 2.3: Distribution of Population by Age, Sex*, County and Sub- County 
                                                                     48216 
      Table 2.4a: Distribution of Rural Population by Age, Sex* and County 
                                                                      5535 
      Table 2.4b: Distribution of Urban Population by Age, Sex* and County 
                                                                      5781 

So I can first get the names of the counties based on those two county-level tables (2.4a and 2.4b) and then reconcile them to get a list of the counties.

# Get a list of the counties 
list_counties <- 
  all_tables %>% 
  filter(table %in% c("Table 2.4a: Distribution of Rural Population by Age, Sex* and County",
                      "Table 2.4b: Distribution of Urban Population by Age, Sex* and County")
         ) %>% 
  select(area) %>% 
  distinct()

As I hoped, there are 47 of them. But before I can add a flag based on those names, I need to deal with the sub-counties that share a name with a county. We will do this based on the page number: looking each one up and deciding which page is the county and which is the sub-county.

# The following have the issue of the name being used for both a county and a sub-county:
all_tables %>% 
  filter(table == "Table 2.3: Distribution of Population by Age, Sex*, County and Sub- County") %>% 
  filter(area %in% c("Busia",
                     "Garissa",
                     "Homa Bay",
                     "Isiolo",
                     "Kiambu",
                     "Machakos",
                     "Makueni",
                     "Samburu",
                     "Siaya",
                     "Tana River",
                     "Vihiga",
                     "West Pokot")
         ) %>% 
  select(area, page) %>% 
  distinct()
# A tibble: 24 x 2
   area        page
   <chr>      <int>
 1 Samburu       42
 2 Tana River    53
 3 Tana River    56
 4 Garissa       65
 5 Garissa       69
 6 Isiolo        98
 7 Isiolo       100
 8 Machakos     149
 9 Machakos     154
10 Makueni      159
# … with 14 more rows

Now we can add the flag for whether the area is a county, and adjust the ones that are troublesome.

# Add flag for whether it is a county or a sub-county
all_tables <- 
  all_tables %>% 
  mutate(area_type = if_else(area %in% list_counties$area, "county", "sub-county"))
# Fix the flag for the ones that have their names used twice
all_tables <- 
  all_tables %>% 
  mutate(area_type = case_when(
    area == "Samburu" & page == 42 ~ "sub-county",
    area == "Tana River" & page == 56 ~ "sub-county",
    area == "Garissa" & page == 69 ~ "sub-county",
    area == "Isiolo" & page == 100 ~ "sub-county",
    area == "Machakos" & page == 154 ~ "sub-county",
    area == "Makueni" & page == 164 ~ "sub-county",
    area == "Kiambu" & page == 213 ~ "sub-county",
    area == "West Pokot" & page == 233 ~ "sub-county",
    area == "Vihiga" & page == 333 ~ "sub-county",
    area == "Busia" & page == 353 ~ "sub-county",
    area == "Siaya" & page == 360 ~ "sub-county",
    area == "Homa Bay" & page == 375 ~ "sub-county",
    TRUE ~ area_type
    )
  )

rm(list_counties)

Ages

Now we can deal with the ages.

First we need to fix some errors.

# Clean up ages
table(all_tables$age) %>% head()

    0   0-4     1    10 10-14 10-19 
  484   484   484   484   482     1 
unique(all_tables$age) %>% head()
[1] "Total" "0"     "1"     "2"     "3"     "4"    
# Looks like there should be 484, so need to follow up on some:
all_tables$age[all_tables$age == "NotStated"] <- "Not Stated"
all_tables$age[all_tables$age == "43594"] <- "5-9"
all_tables$age[all_tables$age == "43752"] <- "10-14"
all_tables$age[all_tables$age == "9-14"] <- "5-9"
all_tables$age[all_tables$age == "10-19"] <- "10-14"

The census has done some of the work of putting together age-groups for us, but we want to make it easy to just focus on the counts by single-year age. As such I’ll add a flag as to the type of age it is: an age group, such as ages 0 to 4, or a single age, such as 1.

# Add a flag as to whether it's a summary or not
all_tables$age_type <- if_else(str_detect(all_tables$age, c("-")), "age-group", "single-year")
all_tables$age_type <- if_else(str_detect(all_tables$age, c("Total")), "age-group", all_tables$age_type)

At the moment, age is a character variable. We have a decision to make here, because we don’t want it to be a character variable (because it won’t graph properly), but we also don’t want it to be numeric, because there are values such as “Total” and “100+” in there. So for now, we’ll just make it into a factor, and at least that will be able to be nicely graphed.

all_tables$age <- as_factor(all_tables$age)

Check

Gender sum

Given the format of the data, at this point it is easy to check that total is the sum of male and female.

# Check the parts and the sums
follow_up <- 
  all_tables %>% 
  mutate(check_sum = male + female,
         totals_match = if_else(total == check_sum, 1, 0)
         ) %>% 
  filter(totals_match == 0)

There is just one that seems wrong.

# There is just one that looks wrong
all_tables$male[all_tables$age == "10" & all_tables$page == 187] <- as.integer(1)

rm(follow_up)

Rural-urban split

The census provides different tables for the total of each county and sub-county; and then, within each county, for the number in a rural area in that county, and the number in an urban area in that county. Some counties only have an urban count, but we’d like to make sure that the sum of the rural and urban counts equals the total count. This requires reshaping the data from a long to a wide format.

First, construct different tables for each of the three. I just do it manually, but I could probably do this in a nicer way.

# Table 2.3
table_2_3 <- all_tables %>% 
  filter(table == "Table 2.3: Distribution of Population by Age, Sex*, County and Sub- County")
table_2_4a <- all_tables %>% 
  filter(table == "Table 2.4a: Distribution of Rural Population by Age, Sex* and County")
table_2_4b <- all_tables %>% 
  filter(table == "Table 2.4b: Distribution of Urban Population by Age, Sex* and County")

Having constructed the constituent parts, I now join them based on age, area, and whether it is a county.

both_2_4s <- full_join(table_2_4a, table_2_4b, by = c("age", "area", "area_type"), suffix = c("_rural", "_urban"))

all <- full_join(table_2_3, both_2_4s, by = c("age", "area", "area_type"), suffix = c("_all", "_"))

all <- 
  all %>% 
  mutate(page = glue::glue('Total from p. {page}, rural from p. {page_rural}, urban from p. {page_urban}')) %>% 
  select(-page, -page_rural, -page_urban,
         -table, -table_rural, -table_urban,
         -age_type_rural, -age_type_urban
         )

rm(both_2_4s, table_2_3, table_2_4a, table_2_4b)

We can now check that the sum of rural and urban is the same as the total.

# Check that the urban + rural = total
follow_up <- 
  all %>% 
  mutate(total_from_bits = total_rural + total_urban,
         check_total_is_rural_plus_urban = if_else(total == total_from_bits, 1, 0),
         total_from_bits - total) %>% 
  filter(check_total_is_rural_plus_urban == 0)

head(follow_up)
# A tibble: 3 x 16
  age     male female  total area  area_type age_type male_rural female_rural
  <fct>  <int>  <int>  <int> <chr> <chr>     <chr>         <int>        <int>
1 Not …     31     10     41 Naku… county    single-…          8            6
2 Total 434287 441379 875666 Bomet county    age-gro…     420119       427576
3 Not …      3      2      5 Bomet county    single-…          2            1
# … with 7 more variables: total_rural <int>, male_urban <int>,
#   female_urban <int>, total_urban <int>, total_from_bits <int>,
#   check_total_is_rural_plus_urban <dbl>, `total_from_bits - total` <int>
rm(follow_up)

There are just a few, and they only have a difference of 1, so I’ll just move on.

Ages sum to age-groups

Finally, I want to check that the single age counts sum to the age-groups.

# One last thing to check is that the ages sum to their age-groups.
follow_up <- 
  all %>% 
  mutate(groups = case_when(age %in% c("0", "1", "2", "3", "4", "0-4") ~ "0-4",
                            age %in% c("5", "6", "7", "8", "9", "5-9") ~ "5-9",
                            age %in% c("10", "11", "12", "13", "14", "10-14") ~ "10-14",
                            age %in% c("15", "16", "17", "18", "19", "15-19") ~ "15-19",
                            age %in% c("20", "21", "22", "23", "24", "20-24") ~ "20-24",
                            age %in% c("25", "26", "27", "28", "29", "25-29") ~ "25-29",
                            age %in% c("30", "31", "32", "33", "34", "30-34") ~ "30-34",
                            age %in% c("35", "36", "37", "38", "39", "35-39") ~ "35-39",
                            age %in% c("40", "41", "42", "43", "44", "40-44") ~ "40-44",
                            age %in% c("45", "46", "47", "48", "49", "45-49") ~ "45-49",
                            age %in% c("50", "51", "52", "53", "54", "50-54") ~ "50-54",
                            age %in% c("55", "56", "57", "58", "59", "55-59") ~ "55-59",
                            age %in% c("60", "61", "62", "63", "64", "60-64") ~ "60-64",
                            age %in% c("65", "66", "67", "68", "69", "65-69") ~ "65-69",
                            age %in% c("70", "71", "72", "73", "74", "70-74") ~ "70-74",
                            age %in% c("75", "76", "77", "78", "79", "75-79") ~ "75-79",
                            age %in% c("80", "81", "82", "83", "84", "80-84") ~ "80-84",
                            age %in% c("85", "86", "87", "88", "89", "85-89") ~ "85-89",
                            age %in% c("90", "91", "92", "93", "94", "90-94") ~ "90-94",
                            age %in% c("95", "96", "97", "98", "99", "95-99") ~ "95-99",
                            TRUE ~ "Other")
         ) %>% 
  group_by(area_type, area, groups) %>% 
  mutate(group_sum = sum(total, na.rm = FALSE),
         group_sum = group_sum / 2, # The sum includes both the single years and the age-group row itself, so halving it gives the expected group total
         difference = total - group_sum) %>% 
  ungroup() %>% 
  filter(age == groups) %>% 
  filter(total != group_sum) 

head(follow_up)
# A tibble: 6 x 16
  age    male female total area  area_type age_type male_rural female_rural
  <fct> <int>  <int> <int> <chr> <chr>     <chr>         <int>        <int>
1 0-4       1      5     6 Mt. … sub-coun… age-gro…         NA           NA
2 5-9       1      2     3 Mt. … sub-coun… age-gro…         NA           NA
3 10-14     6      0     6 Mt. … sub-coun… age-gro…         NA           NA
4 15-19     9      1    10 Mt. … sub-coun… age-gro…         NA           NA
5 20-24    21      4    25 Mt. … sub-coun… age-gro…         NA           NA
6 25-29    59      9    68 Mt. … sub-coun… age-gro…         NA           NA
# … with 7 more variables: total_rural <int>, male_urban <int>,
#   female_urban <int>, total_urban <int>, groups <chr>, group_sum <dbl>,
#   difference <dbl>
rm(follow_up)

Mt. Kenya Forest, Aberdare Forest, and Kakamega Forest are all slightly dodgy. I can’t see it in the documentation, but it looks like they have apportioned these between various counties. It’s understandable why they’d do this and it’s probably not a big deal, so I’ll just move on.

Tidy-up

Now that we are confident that everything is looking good, we can convert it to a long format so that it is easy to work with.

all <- 
  all %>% 
  rename(male_total = male,
         female_total = female,
         total_total = total) %>% 
  pivot_longer(cols = c(male_total, female_total, total_total, male_rural, female_rural, total_rural, male_urban, female_urban, total_urban),
               names_to = "type",
               values_to = "number"
               ) %>% 
  separate(col = type, into = c("gender", "part_of_area"), sep = "_") %>% 
  select(area, area_type, part_of_area, age, age_type, gender, number)

write_csv(all, path = "outputs/data/cleaned_kenya_2019_census.csv")

head(all)
# A tibble: 6 x 7
  area    area_type part_of_area age   age_type  gender  number
  <chr>   <chr>     <chr>        <fct> <chr>     <chr>    <int>
1 Mombasa county    total        Total age-group male    610257
2 Mombasa county    total        Total age-group female  598046
3 Mombasa county    total        Total age-group total  1208303
4 Mombasa county    rural        Total age-group male        NA
5 Mombasa county    rural        Total age-group female      NA
6 Mombasa county    rural        Total age-group total       NA

Make Monica’s dataset

The original purpose of all of this was to make a table for Monica. She needed single-year counts, by gender, for the counties.

monicas_dataset <- 
  all %>% 
  filter(area_type == "county") %>% 
  filter(part_of_area == "total") %>%
  filter(age_type == "single-year") %>% 
  select(area, age, gender, number)

head(monicas_dataset)
# A tibble: 6 x 4
  area    age   gender number
  <chr>   <fct> <chr>   <int>
1 Mombasa 0     male    15111
2 Mombasa 0     female  15009
3 Mombasa 0     total   30120
4 Mombasa 1     male    15805
5 Mombasa 1     female  15308
6 Mombasa 1     total   31113
write_csv(monicas_dataset, "outputs/data/monicas_dataset.csv")

I’ll leave the fancy stats to Monica, but I’ll just make a quick graph of Nairobi.

monicas_dataset %>% 
  filter(area == "Nairobi") %>%
  ggplot() +
  geom_col(aes(x = age, y = number, fill = gender), position = "dodge") + 
  scale_y_continuous(labels = scales::comma) +
  scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), "100+")) +
  theme_classic()+
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Number",
       x = "Age",
       fill = "Gender",
       title = "Distribution of age and gender in Nairobi in 2019",
       caption = "Data source: 2019 Kenya Census")

Next steps: OCR

All of the above is predicated on having a PDF that is already ‘digitized’. But what if it is made up of images? In that case you need to first use Optical Character Recognition (OCR). The go-to package is tesseract (Ooms 2019c).
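A minimal sketch of how that might look, assuming an image-based PDF at a hypothetical path: pdf_convert() from pdftools renders a page of the PDF to a PNG, and ocr() from tesseract then extracts the text from that image.

# A sketch only: the path and object names are illustrative
library(pdftools)
library(tesseract)

page_image <- pdf_convert("inputs/pdfs/scanned_example.pdf", dpi = 300, pages = 1)
page_text <- ocr(page_image, engine = tesseract("eng"))
cat(page_text)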

References

Bronte, Charlotte. 1847. Jane Eyre. https://www.gutenberg.org/files/1260/1260-h/1260-h.htm.

Firke, Sam. 2020. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

Gagolewski, Marek. 2020. R Package Stringi: Character String Processing Facilities. http://www.gagolewski.com/software/stringi/.

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Ooms, Jeroen. 2019a. Pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.

———. 2019b. Pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.

———. 2019c. Tesseract: Open Source OCR Engine. https://CRAN.R-project.org/package=tesseract.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.


  1. And if you’d like to know more about this then I’d recommend starting a PhD with Monica Alexander.
