Chapter 6 Static communication

Last updated: 25 April 2021.

Required reading

  • The Economist, 2013, ‘Johnson: Those six little rules,’ Prospero, 29 July, available at: https://www.economist.com/prospero/2013/07/29/johnson-those-six-little-rules.
  • Alexander, Monica, 2019, ‘The concentration and uniqueness of baby names in Australia and the US,’ https://www.monicaalexander.com/posts/2019-20-01-babynames/. (Look at how Monica explains concepts, especially the Gini coefficient, in a way that you can understand even if you’ve never heard of it before.)
  • Bronner, Laura, 2020, ‘Quant Editing,’ http://www.laurabronner.com/quant-editing. (Read these points and evaluate your own writing against them. It’s fine to not comply with them if you have a good reason, but you need to know that you’re not complying with them).
  • Girouard, Dave, 2020, ‘A Founder’s Guide to Writing Well,’ First Round Review, 4 August, https://firstround.com/review/a-founders-guide-to-writing-well/.
  • Graham, Paul, 2020, ‘How to Write Usefully,’ http://paulgraham.com/useful.html. (Graham writes well for a programmer, and if you have a similar background then you may like this.)
  • Healy, Kieran, 2019, Data Visualization: A Practical Introduction, Princeton University Press, Chapters 3, 4, and 7, https://socviz.co/.
  • Hodgetts, Paul, 2020, ‘The ggfortify Package,’ 31 December, https://www.hodgettsp.com/posts/r-ggfortify/.
  • Wickham, Hadley, and Garrett Grolemund, 2017, R for Data Science, Chapter 28, https://r4ds.had.co.nz/.
  • Zinsser, William, 1976 [2016], On Writing Well. (Any edition is fine. This book is included because if you’re serious about improving your writing then you should start with this book. It only takes a few hours to read. You’ll go on to other books, but start with this one.)
  • Zinsser, William, 2009, ‘Writing English as a Second Language,’ Lecture, Columbia Graduate School of Journalism, 11 August, https://theamericanscholar.org/writing-english-as-a-second-language/. (I’m realistic enough to realise that requiring a book, even though I’ve said it’s great and it’s short, is a bit of a stretch. If you really don’t want to commit to reading the Zinsser, then please at least read this ‘crib notes’ version of it.)

Required viewing

Recommended reading

Examples of well-written papers

  • Barron, Alexander T. J., Jenny Huang, Rebecca L. Spang, and Simon DeDeo, 2018, ‘Individuals, institutions, and innovation in the debates of the French Revolution,’ Proceedings of the National Academy of Sciences, 115, no. 18, pp. 4607-4612.
  • Chambliss, Daniel F., 1989, ‘The Mundanity of Excellence: An Ethnographic Report on Stratification and Olympic Swimmers,’ Sociological Theory, 7, no. 1, pp. 70-86, doi:10.2307/202063.
  • Joyner, Michael J., 1991, ‘Modeling: optimal marathon performance on the basis of physiological factors,’ Journal of Applied Physiology, 70, no. 2, pp. 683-687.
  • Kharecha, Pushker A., and James E. Hansen, 2013, ‘Prevented mortality and greenhouse gas emissions from historical and projected nuclear power,’ Environmental Science & Technology, 47, no. 9, pp. 4889-4895.
  • Samuel, Arthur L., 1959, ‘Some studies in machine learning using the game of checkers,’ IBM Journal of Research and Development, 3, no. 3, pp. 210-229.
  • Wardrop, Robert L., 1995, ‘Simpson’s paradox and the hot hand in basketball,’ The American Statistician, 49, no. 1, pp. 24-28.

Key concepts/skills/etc

  • Show the reader your raw data, or as close as you can come to it.
  • Use either geom_point() or geom_bar() initially.
  • Writing efficiently and effectively is a requirement if you want your work to be convincing.
  • Don’t waste your reader’s time.
  • A good title says what the paper is about, a great title says what the paper found.
  • For a six-page paper, a good abstract is a three to five sentence paragraph. For a longer paper your abstract can be slightly longer.
  • Thinking of maps as an (often fiddly, but strangely enjoyable) variant of a usual ggplot.
  • Broadening the data that we make available via interactive maps, while still telling a clear story.
  • Becoming comfortable with (and excited about) creating static maps.

Key libraries

  • ggplot2
  • patchwork
  • ggmap
  • maps

Key functions/etc

  • ggplot2::geom_point()
  • ggplot2::geom_bar()
  • canada.cities
  • geom_polygon()
  • ggmap()
  • map()
  • map_data()

Quiz

  1. I have a dataset that contains measurements of height (in cm) for a sample of 300 penguins, which are either the Adelie or Emperor species. I am interested in visualizing the distribution of heights by species. Please discuss whether a pie chart is an appropriate type of graph to use. What about a box and whisker plot? Finally, what are some considerations if you made a histogram? [Please write a paragraph or two for each aspect.]
  2. Assume the dataset and columns exist. Would this code work? data %>% ggplot(aes(x = col_one)) %>% geom_point() (pick one)?
    1. Yes
    2. No
  3. If I have categorical data, which geom should I use to plot it (pick one)?
    1. geom_bar()
    2. geom_point()
    3. geom_abline()
    4. geom_boxplot()
  4. Why are box plots often inappropriate (pick one)?
    1. They hide the full distribution of the data.
    2. They are hard to make.
    3. They are ugly.
    4. The mode is clearly displayed.
  5. Which of the following is the best title (pick one)?
    1. “Problem Set 1”
    2. “Unemployment”
    3. “Examining Canada’s Unemployment (2010-2020)”
    4. “Canada’s Unemployment Increased between 2010 and 2020”

6.1 Introduction

[T]he duty of a scientist is not only to find new things, but to communicate them successfully in at least three forms: 1) Writing papers and books. 2) Prepared public talks. 3) Impromptu talks.

Hamming (1996, 65)

In order to convince someone of your story, your paper must be well-written, well-organized, and easy to follow. It should flow easily from one point to the next. It should have proper sentence structure, spelling, vocabulary, and grammar. Each point should be articulated clearly and completely without being overly verbose. Papers should demonstrate your understanding of the topics you are writing about and your confidence in discussing the terms, techniques and issues that are relevant. References must be included and properly cited because this enhances your credibility.

People who need to write: founders, VCs, lawyers, software engineers, designers, painters, data scientists, musicians, filmmakers, creative directors, physical trainers, teachers, writers. Learn to write.

Sahil Lavingia.

This is great advice. Writing well has done just as much for me as knowing how to code. I’d add that if you’re intimidated by writing, start a blog and write often about something you’re interested in. You’ll get better. At least that’s what I’ve done for the past 10 years. :)

Vicki Boykis.

This chapter is about writing. By the end of it you will have a better idea of how to write short, detailed, quantitative papers that communicate exactly what you want them to and don’t waste the time of your reader.

One critical part of telling stories with data is that it’s ultimately the data that has to convince your reader. You’re the medium, but the data are the message. To that end, the easiest way to convince someone of your story is to show them the data that allowed you to come to that story. Plot your raw data, or as close to it as possible.

While ggplot is a fantastic tool for doing this, there is a lot to the package and so it can be difficult to know where to start. My recommendation is that you start with either a scatter plot or a bar chart. What is critical is that you show the reader your raw data. These notes run through how to do that, and then discuss some more advanced options, but the important thing is that you show the reader your raw data (or as close to it as you can). Students sometimes get confused about what ‘raw’ means; I’m using it to refer to data that are as close to the original dataset as possible, so no sums, averages, etc, if possible. Sometimes your data are too dispersed for that, or you’ve got other constraints, so there needs to be an element of manipulation. The main point is that you, at the very least, need to plot the data that you’re going to be modelling. If you are dealing with a larger dataset then just take a 10/1/0.1/etc per cent sample, as sketched below.
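For instance, here is a minimal sketch of plotting a one per cent sample, assuming a tibble called my_big_dataset with columns some_x and some_y (all three names are just for illustration):

set.seed(853) # For a reproducible sample

my_big_dataset %>% 
  slice_sample(prop = 0.01) %>% # Keep a random one per cent of the rows
  ggplot(aes(x = some_x, y = some_y)) +
  geom_point()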

FIGURE 6.1: Show me the data!

Source: YouTube screenshot.

6.2 Graphs

Graphs are critical for telling a compelling story. And the most important thing with your graphs is to plot your raw data. Again: Plot. Your. Raw. Data.

Figure 6.2 provides invaluable advice (thank you to Thomas William Rosenthal).

FIGURE 6.2: How do we get started with our data?

Let’s look at a somewhat fun example from the datasauRus package (Locke and D’Agostino McGowan 2018).

library(datasauRus)
library(tidyverse)

# Code from: https://juliasilge.com/blog/datasaurus-multiclass/
datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away", "bullseye")) %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)),
    x_y_cor = cor(x, y)
  ) %>% 
  ungroup()
## # A tibble: 4 x 6
##   dataset  x_mean  x_sd y_mean  y_sd x_y_cor
##   <chr>     <dbl> <dbl>  <dbl> <dbl>   <dbl>
## 1 away       54.3  16.8   47.8  26.9 -0.0641
## 2 bullseye   54.3  16.8   47.8  26.9 -0.0686
## 3 dino       54.3  16.8   47.8  26.9 -0.0645
## 4 star       54.3  16.8   47.8  26.9 -0.0630

And despite these similarities at a summary statistic level, they’re actually very different, well, beasts, when you plot the raw data.

datasaurus_dozen %>% 
  filter(dataset %in% c("dino", "star", "away", "bullseye")) %>%
  ggplot(aes(x=x, y=y, colour=dataset)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
  labs(colour = "Dataset")

6.2.1 Bar chart

Bar charts are useful when you have one variable that you want to focus on. Hint: you almost always have one variable that you want to focus on. Hence, you should almost always include at least one (and likely many) bar charts. Bar charts go by a variety of names, depending on their specifics. I recommend the RStudio Data Viz Cheat Sheet.

To get started, let’s simulate some data.

set.seed(853)

number_of_observation <- 10000

example_data <- tibble(person = c(1:number_of_observation),
                       smoker = sample(x = c("Smoker", "Non-smoker"),
                                       size = number_of_observation, 
                                       replace = TRUE),
                       age_died = runif(number_of_observation,
                                        min = 0,
                                        max = 100) %>% round(digits = 0),
                       height = sample(x = c(50:220), 
                                       size =  number_of_observation, 
                                       replace = TRUE),
                       num_children = sample(x = c(0:5),
                                             size = number_of_observation, 
                                             replace = TRUE,
                                             prob = c(0.1, 0.2, 0.40, 0.15, 0.1, 0.05))
                       )

First, let’s have a look at the data.

head(example_data)
## # A tibble: 6 x 5
##   person smoker     age_died height num_children
##    <int> <chr>         <dbl>  <int>        <int>
## 1      1 Smoker           55     80            3
## 2      2 Non-smoker       54     78            2
## 3      3 Non-smoker       84    109            1
## 4      4 Smoker           75    114            4
## 5      5 Smoker           32    135            1
## 6      6 Smoker           37    220            0

Now let’s plot the age distribution. Based on our simulated data, we’re expecting a fairly uniform plot.

example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar()

Now let’s make it look a little better. There are themes built into ggplot, you can install additional themes from packages such as ggthemes, or you can edit aspects yourself. I’d recommend starting with ggthemes for some fun ones, but I tend to just use theme_classic() or theme_minimal(). Remember that you must always refer to your graphs in your text (Figure 6.3).

example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")

FIGURE 6.3: Number of people who died at each age
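As a sketch of how easy it is to swap themes, here is the same graph with a theme from ggthemes (assuming you have run install.packages("ggthemes")):

library(ggthemes)

example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_economist() + # One of the more fun themes in ggthemes
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")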

We may want to facet by some variable, in this case whether the person is a smoker (Figure 6.4).

example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")

FIGURE 6.4: Number of people who died at each age, by whether they smoke

Alternatively, we may wish to colour by that variable instead (Figure 6.5). I’ll filter to just the youngest ages to keep it tractable.

example_data %>% 
  filter(age_died < 25) %>% 
  ggplot(mapping = aes(x = age_died, fill = smoker)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       fill = "Smoker",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")

FIGURE 6.5: Number of people who died at each age, by whether they smoke

It’s important to recognise that a box plot hides the full distribution of a variable. Unless you need to communicate the general distribution of many variables at once, you should not use box plots. The same box plot can apply to very different distributions.

6.2.2 Scatter plot

Often, we are also interested in the relationship between two series. We’ll do that with a scatter plot. In this case, let’s simulate some data, say years of education and income.

set.seed(853)

number_of_observation <- 500

scatter_data <- 
  tibble(years_of_education = runif(n = number_of_observation, min = 10, max = 25),
         error = rnorm(n = number_of_observation, mean = 0, sd = 10000)) %>% 
  mutate(income = years_of_education * 5000 + error,
         income = if_else(income < 0, 0, income))

head(scatter_data)
## # A tibble: 6 x 3
##   years_of_education   error income
##                <dbl>   <dbl>  <dbl>
## 1               15.4 -13782. 63180.
## 2               11.8   7977. 66985.
## 3               17.3  -9787. 76498.
## 4               14.7  12999. 86689.
## 5               10.6  -1500. 51302.
## 6               16.1   1911. 82202.

Now let’s look at income as a function of years of education (Figure 6.6).

scatter_data %>% 
  ggplot(mapping = aes(x = years_of_education, y = income)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Income",
       title = "Relationship between income and years of education",
       caption = "Source: Simulated data.")

FIGURE 6.6: Relationship between income and years of education

6.2.3 Never use box plots

Box plots are almost never appropriate because they hide the distribution of the data. To see this, consider some simulated data: one column is a bimodal mixture of two beta distributions, while the other is uniform (a beta(1, 1) distribution).

set.seed(853)

tricky_data <- 
  tibble(left_and_right = 
           c(
             rbeta(10000, 5, 2),
             rbeta(10000, 2, 5)
           ),
         middle = 
           rbeta(20000, 1, 1))

Then compare the box plots, which look broadly similar, with the histograms, which show just how different the two distributions actually are.

boxplot(tricky_data$left_and_right)

boxplot(tricky_data$middle)

hist(tricky_data$left_and_right)

hist(tricky_data$middle)
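If you prefer to stay within ggplot, here is a sketch of the same comparison, using the tricky_data tibble from above:

tricky_data_long <- 
  tricky_data %>% 
  pivot_longer(cols = everything(),
               names_to = "distribution",
               values_to = "value")

# The box plots look almost identical...
tricky_data_long %>% 
  ggplot(aes(x = distribution, y = value)) +
  geom_boxplot() +
  theme_minimal()

# ...but the histograms make the difference obvious.
tricky_data_long %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50) +
  facet_wrap(vars(distribution)) +
  theme_minimal()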

6.2.4 Other

6.2.4.1 Best fit

If we’re interested in quickly adding a line of best fit then, continuing with the earlier income example, we can do that with geom_smooth() (Figure 6.7).

scatter_data %>% 
  ggplot(mapping = aes(x = years_of_education, y = income)) +
  geom_point() +
  geom_smooth(method = lm, color = "black") +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Income",
       title = "Relationship between income and years of education",
       caption = "Source: Simulated data.")
## `geom_smooth()` using formula 'y ~ x'

FIGURE 6.7: Relationship between income and years of education

6.2.4.2 Histogram

Where a bar chart is good for counts of a discrete variable, a histogram bins a continuous variable and counts the observations in each bin. Figure 6.8 shows the distribution of our simulated incomes.

scatter_data %>% 
  ggplot(mapping = aes(x = income)) +
  geom_histogram() +
  theme_minimal() +
  labs(x = "Income",
       y = "Number",
       title = "Distribution of income",
       caption = "Source: Simulated data.")
## `stat_bin()` using `bins = 30`. Pick better value
## with `binwidth`.

FIGURE 6.8: Distribution of income
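That message from stat_bin() is a reminder to choose the number of bins deliberately, rather than accepting the default of 30. As a sketch:

scatter_data %>% 
  ggplot(mapping = aes(x = income)) +
  geom_histogram(bins = 50) + # Explicitly set the number of bins
  theme_minimal() +
  labs(x = "Income",
       y = "Number",
       title = "Distribution of income",
       caption = "Source: Simulated data.")

You could instead set binwidth, say binwidth = 5000, if a particular bin size is meaningful for your data.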

6.2.4.3 Multiple plots

Finally, let’s try putting them together. We’re going to use the patchwork package (Pedersen 2020) and the palmerpenguins package for data. Don’t forget to run install.packages("palmerpenguins") if this is the first time you’ve used the package.

library(patchwork)
library(palmerpenguins)

p1 <- 
  ggplot(palmerpenguins::penguins) + 
  geom_point(aes(bill_length_mm, bill_depth_mm)) +
  labs(x = "Bill length (mm)",
       y = "Bill depth (mm)")
p2 <- 
  ggplot(palmerpenguins::penguins) + 
  geom_bar(aes(species)) +
  labs(x = "Species",
       y = "Number")

p1 + p2

And we can make things fairly involved fairly quickly.

(p1 | p2) /
  p2
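As a sketch, patchwork also makes it easy to add an overall title and caption to the combined plot with plot_annotation():

(p1 | p2) /
  p2 +
  plot_annotation(title = "Three views of the Palmer penguins",
                  caption = "Data source: palmerpenguins.")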

6.3 Tables

Tables are also critical for telling a compelling story. We may prefer a table to a graph when there are only a few features that we want to focus on. We’ll use knitr::kable() alongside the kableExtra package, and also the gt package.

Let’s start with knitr::kable() and the summary dinosaur data from earlier.

example_data <- 
  datasaurus_dozen %>% 
  filter(dataset %in% c("dino", "star", "away")) %>% 
  group_by(dataset) %>% 
  summarize(
    Mean    = mean(x),
    Std_dev = sd(x),
    ) 

example_data %>% 
  knitr::kable()
dataset   Mean   Std_dev
away      54.27  16.77
dino      54.26  16.77
star      54.27  16.77

Even the defaults are pretty good, but we can add a few tweaks to make the table better. The first is that this many significant digits is inappropriate. We may also like to add a caption, make the column names consistent, and change the alignment.

example_data %>% 
  knitr::kable(digits = 2, 
               caption = "My first table.", 
               col.names = c("Dataset", "Mean", "Standard deviation"),
               align = c('l', 'l', 'l')
               )
TABLE 6.1: My first table.

Dataset   Mean   Standard deviation
away      54.27  16.77
dino      54.26  16.77
star      54.27  16.77

The kableExtra package builds in extra functionality (Zhu 2020).

The gt package (Iannone, Cheng, and Schloerke 2020) is a newer package that brings a lot of exciting features. However, being newer it sometimes has issues with PDF output.

library(gt)

example_data %>% 
  gt()
dataset   Mean   Std_dev
away      54.27  16.77
dino      54.26  16.77
star      54.27  16.77

We can add titles and sub-titles easily.

example_data %>% 
  gt() %>%
  tab_header(
    title = "Summary stats can be misleading",
    subtitle = "With an example from a dinosaur!"
  )
Summary stats can be misleading
With an example from a dinosaur!

dataset   Mean   Std_dev
away      54.27  16.77
dino      54.26  16.77
star      54.27  16.77

One common reason for needing a table is to report regression results. You should consider gtsummary, stargazer, and modelsummary. But at the moment, my favourite is modelsummary (Arel-Bundock 2021).

library(modelsummary)

mod <- lm(y ~ x, datasaurus_dozen)
modelsummary(mod)
              Model 1
(Intercept)   53.590
              (2.119)
x             -0.106
              (0.037)
Num.Obs.      1846
R2            0.004
R2 Adj.       0.004
AIC           17383.0
BIC           17399.6
Log.Lik.      -8688.506
F             8.072
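A particularly handy feature is that modelsummary() accepts a named list of models, which places them side by side in one table. As a sketch, assuming a second model mod_two (made up for illustration):

mod_two <- lm(y ~ x + dataset, data = datasaurus_dozen)

modelsummary(list("Model 1" = mod,
                  "Model 2" = mod_two))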

6.4 Maps

In many ways maps can be thought of as a fancy graph, where the x-axis is longitude, the y-axis is latitude, and there is some outline or background image. This puts us in a set-up that is quite familiar from ggplot. Static maps are useful for printed output, such as a PDF or Word report, or where there is something in particular that you want to illustrate.

ggplot() +
  geom_polygon( # First draw an outline
    data = some_data, 
    aes(x = longitude, 
        y = latitude,
        group = group
        )) +
  geom_point( # Then add points of interest
    data = some_other_data, 
    aes(x = longitude, 
        y = latitude)
    )

And while there are some small complications, for the most part it is as straightforward as that. The first step is to get some data. Helpfully, there is some geographic data built into ggplot, and there is other information built into the maps package.

library(maps)
library(tidyverse)

canada <- map_data(map = "world", region = "canada")
canadian_cities <- maps::canada.cities

head(canada)
##     long   lat group order region    subregion
## 1 -59.79 43.94     1     1 Canada Sable Island
## 2 -59.92 43.90     1     2 Canada Sable Island
## 3 -60.04 43.91     1     3 Canada Sable Island
## 4 -60.11 43.94     1     4 Canada Sable Island
## 5 -60.12 43.95     1     5 Canada Sable Island
## 6 -59.94 43.94     1     6 Canada Sable Island
head(canadian_cities)
##            name country.etc    pop   lat    long
## 1 Abbotsford BC          BC 157795 49.06 -122.30
## 2      Acton ON          ON   8308 43.63  -80.03
## 3 Acton Vale QC          QC   5153 45.63  -72.57
## 4    Airdrie AB          AB  25863 51.30 -114.02
## 5    Aklavik NT          NT    643 68.22 -135.00
## 6    Albanel QC          QC   1090 48.87  -72.42
##   capital
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0

With that information in hand we can then create a map of Canada that shows the cities with a population over 1,000. (The geom_polygon() function within ggplot draws shapes, by connecting points within groups. And the coord_map() function adjusts for the fact that we are using something that is 2D (a map) to represent something that is 3D (the Earth).)

ggplot() +
  geom_polygon(data = canada,
               aes(x = long,
                   y = lat,
                   group = group),
               fill = "white", 
               colour = "grey") +
  coord_map(ylim = c(40, 70)) +
  geom_point(data = canadian_cities,
             aes(x = long, 
                 y = lat),
             alpha = 0.3,
             colour = "black") +
  theme_classic() +
  labs(x = "Longitude",
       y = "Latitude")

# If I'm being honest, this 'simple example' took me six hours to work out. Firstly 
# to find Canada and then to find Canadian cities.

As is often the case with R, there are many different ways to create static maps. We’ve already seen how they can be built using just ggplot, but here we’ll explore one package that has a bunch of functionality built in that will make things easier: ggmap.

There are two essential components to a map: 1) some border or background image (also known as a tile); and 2) something of interest within that border or on top of that tile. In ggmap, we will use an open-source option for our tiles, Stamen Maps (maps.stamen.com), and we will plot points based on latitude and longitude.

6.4.1 Australian polling places

Like Canada, in Australia people go to specific locations, called booths, to vote. These booths have latitudes and longitudes and so we can plot these. One reason we may like to do this is to notice patterns over geographies.

To get started we need a tile. We are going to use ggmap to get a tile from Stamen Maps, which builds on OpenStreetMap (openstreetmap.org). The main argument to the relevant function, get_stamenmap(), is a bounding box. This requires two latitudes - one for the top of the box and one for the bottom - and two longitudes - one for the left of the box and one for the right. (It can be useful to use Google Maps, or an alternative, to find the values that you need.) The bounding box provides the coordinates of the edges of the area that you are interested in. In this case I have provided coordinates such that the map will be centred around Canberra, Australia (our equivalent of Ottawa - a small city that was created for the purpose of being the capital).

library(ggmap)

bbox <- c(left = 148.95, bottom = -35.5, right = 149.3, top = -35.1)

Once you have defined the bounding box, the function get_stamenmap() will get the tiles in that area. The number of tiles that it needs depends on the zoom, and the type of tiles that it gets depends on the maptype. I’ve chosen the maptype that I like here - the black and white option - but the help file lists a few others that you may prefer. At this point you can pass your map to ggmap() and it will plot the tile. Because it actively downloads the tiles, you need an internet connection.

canberra_stamen_map <- get_stamenmap(bbox, zoom = 11, maptype = "toner-lite")

ggmap(canberra_stamen_map)

Once we have the map we can use ggmap() to plot it. (That circle in the middle of the map is where the Australian Parliament House is. Yes, our parliament is surrounded by circular roads - we call them ‘roundabouts’ - and actually it’s surrounded by two of them.)

Now we want to get some data to plot on top of our tiles. We will just plot the location of the polling places, based on which ‘division’ (the Australian equivalent of a ‘riding’ in Canada) each belongs to. This is available here: https://results.aec.gov.au/20499/Website/Downloads/HouseTppByPollingPlaceDownload-20499.csv. (The Australian Electoral Commission (AEC) is the official government agency that is responsible for elections in Australia.)

# Read in the booths data for each year
booths <- readr::read_csv("https://results.aec.gov.au/24310/Website/Downloads/GeneralPollingPlacesDownload-24310.csv", 
                          skip = 1, 
                          guess_max = 10000)

head(booths)
## # A tibble: 6 x 15
##   State DivisionID DivisionNm PollingPlaceID
##   <chr>      <dbl> <chr>               <dbl>
## 1 ACT          318 Bean                93925
## 2 ACT          318 Bean                93927
## 3 ACT          318 Bean                11877
## 4 ACT          318 Bean                11452
## 5 ACT          318 Bean                 8761
## 6 ACT          318 Bean                 8763
## # … with 11 more variables: PollingPlaceTypeID <dbl>,
## #   PollingPlaceNm <chr>, PremisesNm <chr>,
## #   PremisesAddress1 <chr>, PremisesAddress2 <chr>,
## #   PremisesAddress3 <chr>, PremisesSuburb <chr>,
## #   PremisesStateAb <chr>, PremisesPostCode <chr>,
## #   Latitude <dbl>, Longitude <dbl>

This dataset is for the whole of Australia, but as we are just going to plot the area around Canberra, we will filter to that area, and to only those booths that are geographic (the AEC has various options for people who are in hospital, or unable to get to a booth, etc, and these are still ‘booths’ in this dataset).

# Reduce the booths data to only rows with that have latitude and longitude
booths_reduced <-
  booths %>%
  filter(State == "ACT") %>% 
  select(PollingPlaceID, DivisionNm, Latitude, Longitude) %>% 
  filter(!is.na(Longitude)) %>% # Remove rows that don't have a geography
  filter(Longitude < 165) # Remove Norfolk Island

Now we can use ggmap in the same way as before to plot our underlying tiles, and then build on that using geom_point() to add our points of interest.

ggmap(canberra_stamen_map, 
      extent = "normal", 
      maprange = FALSE) +
  geom_point(data = booths_reduced,
             aes(x = Longitude, 
                 y = Latitude, 
                 colour = DivisionNm)) +
  scale_color_brewer(name = "2019 Division", palette = "Set1") +
  coord_map(projection = "mercator",
            xlim = c(attr(canberra_stamen_map, "bb")$ll.lon, 
                     attr(canberra_stamen_map, "bb")$ur.lon),
            ylim = c(attr(canberra_stamen_map, "bb")$ll.lat, 
                     attr(canberra_stamen_map, "bb")$ur.lat)) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

We may like to save the map so that we don’t have to draw it every time, and we can do that in the same way as any other graph, using ggsave().

ggsave("outputs/figures/map.pdf", width = 20, height = 10, units = "cm")

Finally, the reason that I used Stamen Maps and OpenStreetMap is that they are open source. However, you can also use Google Maps if you want. This requires you to first register a credit card with Google and specify a key, but with low usage it should be free. The get_googlemap() function within ggmap brings some nice features that get_stamenmap() does not have. For instance, you can enter a place name and it will do its best to find it, rather than needing to specify a bounding box.
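As a sketch only - you would need to register your own key with Google first, and the key below is just a placeholder:

library(ggmap)

register_google(key = "PUT-YOUR-KEY-HERE") # Placeholder; use your own key

# Unlike get_stamenmap(), we can just name the place that we want
canberra_google_map <- get_googlemap(center = "Canberra, Australia", zoom = 11)

ggmap(canberra_google_map)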

6.4.2 Toronto bike parking

Let’s see another example of a static map, this time using Toronto data accessed via the opendatatoronto package. The dataset that we are going to plot is available here: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.

# This code is based on code from: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.
library(opendatatoronto)
# (The string identifies the package.)
resources <- list_package_resources("71e6c206-96e1-48f1-8f6f-0e804687e3be")
# In this case there is only one dataset within this resource so just need the first one    
raw_data <- filter(resources, row_number() == 1) %>% get_resource()
write_csv(raw_data, "inputs/data/bike_racks.csv")
head(raw_data)

Now that we’ve saved a copy of the data, we can use that local copy. First, we need to clean it up a bit. There are some clear errors in the ADDRESSNUMBERTEXT field, but not too many, so we’ll just ignore them.

raw_data <- read_csv("inputs/data/bike_racks.csv")
# We'll just focus on the data that we want
bike_data <- tibble(ward = raw_data$WARD,
                    id = raw_data$ID,
                    status = raw_data$STATUS,
                    street_address = paste(raw_data$ADDRESSNUMBERTEXT, raw_data$ADDRESSSTREET),
                    latitude = raw_data$LATITUDE,
                    longitude = raw_data$LONGITUDE)
rm(raw_data)

Some of the bike racks were temporary, so let’s remove them. Let’s also just look at the area around the university, which is Ward 11.

# Only keep ones that still exist
bike_data <- 
  bike_data %>%
  filter(status == "Existing") %>% 
  select(-status)

bike_data <- bike_data %>% 
  filter(ward == 11) %>% 
  select(-ward)

If you look at the dataset at this point, you’ll notice that there is a row for every bike parking spot. But we don’t need a separate row for each spot, because sometimes there are lots right next to each other. Instead, we’d just like one point per address (we’ll take advantage of this in an interactive graph in a moment). So, we want to create a count by address, and then keep just one instance per address.

bike_data <- 
  bike_data %>%
  group_by(street_address) %>% 
  mutate(number_of_spots = n(),
         running_total = row_number()
         ) %>% 
  ungroup() %>% 
  filter(running_total == 1) %>% 
  select(-id, -running_total)

head(bike_data)
## # A tibble: 6 x 4
##   street_address   latitude longitude number_of_spots
##   <chr>               <dbl>     <dbl>           <int>
## 1 8 Kensington Ave     43.7     -79.4               1
## 2 87 Avenue Rd         43.7     -79.4               4
## 3 162 Mc Caul St       43.7     -79.4               1
## 4 147 Baldwin St       43.7     -79.4               2
## 5 888 Yonge St         43.7     -79.4               1
## 6 180 Elizabeth St     43.7     -79.4              10
write_csv(bike_data, "outputs/data/bikes.csv")

Now we can grab our tile and add our bike rack data onto it.

bbox <- c(left = -79.420390, bottom = 43.642658, right = -79.383354, top = 43.672557)

toronto_stamen_map <- get_stamenmap(bbox, zoom = 14, maptype = "toner-lite")

ggmap(toronto_stamen_map,  maprange = FALSE) +
  geom_point(data = bike_data,
             aes(x = longitude, 
                 y = latitude),
             alpha = 0.3
             ) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal() 

6.4.3 Geocoding

To this point we have just assumed that we already had geocoded data. Places such as ‘Canberra, Australia,’ or ‘Ottawa, Canada,’ are just names; they don’t inherently have a location attached. In order to plot them we need a latitude and a longitude for each. The process of going from names to coordinates is called geocoding.

There are a range of options to geocode data in R, but one good package is tidygeocoder (Cambon and Belanger 2021). To get started using the package we need a dataframe of locations, so we’ll just quickly make one here.

some_locations <- 
  tibble(city = c('Canberra', 'Ottawa'),
         country = c('Australia', 'Canada'))
tidygeocoder::geo(city = some_locations$city, 
                  country = some_locations$country, 
                  method = 'osm')
## # A tibble: 2 x 4
##   city     country     lat  long
##   <chr>    <chr>     <dbl> <dbl>
## 1 Canberra Australia -35.3 149. 
## 2 Ottawa   Canada     45.4 -75.7
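There is also a dataframe-focused function, tidygeocoder::geocode(), which appends latitude and longitude columns to an existing dataframe and so fits naturally into a pipe. A sketch using the same locations:

some_locations %>% 
  tidygeocoder::geocode(city = city,
                        country = country,
                        method = 'osm')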

6.5 Writing

I had not indeed published anything before I commenced “The Professor,” but in many a crude effort, destroyed almost as soon as composed, I had got over any such taste as I might once have had for ornamented and redundant composition, and come to prefer what was plain and homely.

Currer Bell (aka Charlotte Brontë), The Professor.

6.5.1 Title, abstract, and introduction

A title is the first opportunity that you have to tell the reader your story. Ideally you will tell the reader exactly what you found. An effective title is critical in order to get your work read when there are other competing priorities. A title doesn’t have to be ‘cute’ to be great.

  • Good: ‘On the 2019 Canadian Federal Election.’ (At least the reader knows what the paper is about.)
  • Better: ‘The Liberal Party performance in the 2019 Canadian Federal Election.’ (The reader knows more specifically what the paper is about.)
  • Even better: ‘The Liberal Party did poorly in rural areas in the 2019 Canadian Federal Election.’ (The reader knows what the paper found.)

You should put your name and the date on the paper because these provide important context.

For a six-page paper, a good abstract is a three to five sentence paragraph. For a longer paper your abstract can be slightly longer. The abstract should say: What you did, what you found, and why the reader should care. Each of these should just be a sentence or two, so keep it very high level.

You should then have an introduction that tells the reader everything they need to know. You are not writing a mystery story - tell the reader the most important points in the introduction. For a six-page paper, your introduction may be two or three paragraphs. Four would likely be too much, but it depends on the context.

Your introduction should set the scene and give the reader some background. For instance, you may like to start off a little more broadly, to provide some context for your paper. You should then describe how your paper fits into that context. Then give some high-level results - provide more detail than you did in the abstract, but don’t get into the weeds - and finally broadly discuss next steps or glaring weaknesses. With regard to that high-level result: you need to pick one. If you have a bunch of interesting findings, then good for you, but pick one and write your introduction around that. If it’s compelling enough then the reader will end up reading your other interesting findings in the discussion/results sections. Finally, you should outline the structure of the remainder of the paper.

As an example:

The Canadian Liberal Party has always struggled in rural ridings. In the past 100 years they have never won more than 25 per cent of them. But even by those standards the 2019 Federal Election was a disappointment with the Liberal Party winning only 2 of the 40 rural ridings.

In this paper we look at why the performance of the Liberal Party in this most recent election was so poor. We construct a model in which whether the Liberal Party won the riding is explained by the number of farms in the riding, the average internet connectivity, and the median age. We find that as the median age of a riding increases, the likelihood that a riding was won by the Liberal Party decreases by 14 percentage points. Future work could expand the time horizon that is considered which would allow a more nuanced understanding of these effects.

The remainder of this paper is structured as follows: Section 2 discusses the data, Section 3 discusses the model, Section 4 presents the results, and finally Section 5 discusses our findings and some weaknesses.

The recommended readings provide some lovely examples of titles, abstracts, and introductions. Please take the time to briefly read these papers.

6.5.2 Figures, tables, equations, and technical terms

Figures and tables are a critical aspect of convincing people of your story. In a graph you can show your data and let people decide for themselves. And in a table you can more easily summarise your data.

Figures, tables, equations, etc, should be numbered and then referenced in the text, e.g. ‘Figure 1 shows…’, accompanied by an actual Figure 1.

You should make sure that all aspects of your graph are legible. Always label all of the axes. Your graphs should have titles, and the point that you want to communicate should be clear.
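For instance, if you are writing in R Markdown with a bookdown output format (which is what enables this cross-referencing syntax), you can label a chunk, give it a caption, and then refer to the figure in your text. A sketch, where the chunk label, caption, and dataset are all made up for illustration:

```{r unemployment, fig.cap = "Canada's unemployment rate, 2010 to 2020"}
unemployment_data %>% 
  ggplot(aes(x = year, y = rate)) +
  geom_line()
```

Figure \@ref(fig:unemployment) shows that unemployment rose over the decade.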

If you use a technical term, it should be briefly explained in plain language for readers who might not be familiar with it. A great example of this is the post by Monica Alexander, from the required readings, where she explains the Gini coefficient:

To look at the concentration of baby names, let’s calculate the Gini coefficient for each country, sex and year. The Gini coefficient measures dispersion or inequality among values of a frequency distribution. It can take any value between 0 and 1. In the case of income distributions, a Gini coefficient of 1 would mean one person has all the income. In this case, a Gini coefficient of 1 would mean that all babies have the same name. In contrast, a Gini coefficient of 0 would mean names are evenly distributed across all babies.

6.5.3 On brevity

FIGURE 6.9: ‘No more than four pages, or he’s never going to read it. Two pages is preferable.’

Source: Shipman, Tim, 2020, ‘The prime minister’s vanishing briefs,’ The Sunday Times, 23 February, available at: https://www.thetimes.co.uk/article/the-prime-ministers-vanishing-briefs-67mt0bg95, via Sarah Nickson.

Insisting on two page briefs is sensible - not ‘government by ADHD.’ PM has to be across lots of issues - cannot and should not be across (most of) them in the same depth as secretaries of state. Danger lies in PM trying to take on too much and getting bogged down in detail.

This might irk officials who lack a sense of where their issue sits within the PM’s list of priorities - or the writing skills to draft a succinct brief. But there’d be very few occasions when a brief to the PM warrants more than two pages.

This is not something peculiar to the current PM - other ministers have raised the same in interviews with @instituteforgov. Oliver Letwin complained of ‘huge amount of terrible guff, at huge, colossal, humungous length coming from some departments’ https://www.instituteforgovernment.org.uk/ministers-reflect/person/oliver-letwin/

Letwin sent briefs back and asked that they be re-drafted to one quarter of the length. ‘Somewhere along the line the Civil Service had got used to splurge of the meaningless kind.’ Similarly, Theresa Villiers talked about the civil service’s ‘frustrating tendency to produce six pages of obscure and rather impenetrable text’ and wishes she’d been firmer in sending documents back for re-drafting: https://www.instituteforgovernment.org.uk/ministers-reflect/person/theresa-villiers/

Sarah Nickson, 23 Feb 2020.

Brevity is important. Partly this is because you are writing for the reader, not yourself, and your reader has other priorities. But it is also because writing briefly forces you, as the writer, to consider what your most important points are, how you can best support them, and where your arguments are weakest.

If you don’t think that examples from government are persuasive, then please consider Amazon’s 2017 Letter to Shareholders, or other statements about Bezos and memo writing, for instance:

Well structured, narrative text is what we’re after rather than just text… The reason writing a 4 page memo is harder than “writing” a 20 page powerpoint is because the narrative structure of a good memo forces better thought and better understanding of what’s more important than what, and how things are related.

Jeff Bezos, 9 June 2004.

6.5.4 Other

Typos and other grammatical mistakes affect the credibility of your claims. If the reader can’t trust you to use a spell-checker, then why should they trust you to use logistic regression? Microsoft Word has a fantastic spell-checker that is much better than what is available for R Markdown: copy/paste your work into it, look for the red underlines and fix those issues in your R Markdown, then look for the green underlines and think about whether you need to fix those too. If you don’t have Word, then Google Docs is pretty good, and so is Apple’s Pages.

A few other general tips that I have stolen from various people including the Reserve Bank of Australia’s style guide:

  • Think about what you are writing. Aim to write everything as though it were on the front page of the newspaper, because one day it could be.
  • Be concise. Remove as many words as possible.
  • Be direct. Think about the structure of your story, identify the key pieces of information, and arrange them so that your paper flows logically from one point to the next. Use sub-headings if you need them.
  • Be precise. For instance, the stock market didn’t improve or worsen, it rose or fell. Distinguish levels from rates of change.
  • Be clear.
  • Write simply.
  • Use short sentences where possible.
  • Avoid jargon.

You should break these rules when you need to. But the only way to know whether you need to break a rule is to know the rules in the first instance.

References

Arel-Bundock, Vincent. 2021. Modelsummary: Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready. https://CRAN.R-project.org/package=modelsummary.
Cambon, Jesse, and Christopher Belanger. 2021. “Tidygeocoder: Geocoding Made Easy.” Zenodo. https://doi.org/10.5281/zenodo.3981510.
Hamming, Richard W. 1996. The Art of Doing Science and Engineering. Stripe Press.
Iannone, Richard, Joe Cheng, and Barret Schloerke. 2020. Gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
Locke, Steph, and Lucy D’Agostino McGowan. 2018. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.
Pedersen, Thomas Lin. 2020. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
Zhu, Hao. 2020. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.