6 Static communication

Required material

  • Read R for Data Science, Chapter 28 ‘Graphics for communication,’ (Wickham and Grolemund 2017).
  • Read Data Visualization: A Practical Introduction, Chapters 3 ‘Make a plot,’ 4 ‘Show the right numbers,’ and 5 ‘Graph tables, add labels, make notes,’ (Healy 2018).
  • Read Testing Statistical Charts: What Makes a Good Graph?, (Vanderplas, Cook, and Hofmann 2020).
  • Read Data Feminism, Chapter 3 ‘On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints,’ (D’Ignazio and Klein 2020).

Key concepts and skills

  • Knowing the importance of showing the reader the actual dataset, or as close as is possible.
  • Using a variety of different graph options, including bar charts, scatterplots, line plots, and histograms.
  • Knowing how to use tables to show part of a dataset, communicate summary statistics, and display regression results.
  • Approaching maps as a type of a graph.
  • Comfort with geocoding places.

Key libraries

Key functions

6.1 Introduction

When telling stories with data, we are trying to allow the data to convince our reader of something. The paper is the medium, and the data are the message. To that end, we want to try to show our reader the data that allowed us to come to our understanding of the story. We use graphs, tables, and maps to help achieve this.

In the first instance, we must show the actual data that underpin our analysis, or as close to it as we can. For instance, if our dataset consists of 2,500 responses to a survey, then at some point in our paper we would expect a graph that contains 2,500 points or sums to 2,500. To do this we build graphs using ggplot2 (Wickham 2016). We will go through a variety of different options here including bar charts, scatterplots, histograms, and line charts.

In contrast to the role of graphs, which is to show the actual data, or as close to it as possible, the role of tables is typically to convey various summary statistics. We will build tables using knitr (Xie 2021) and kableExtra (Zhu 2020) in the first instance, and then gt (Iannone, Cheng, and Schloerke 2020) and modelsummary (Arel-Bundock 2021a).

Finally, we cover maps as a variant of graphs that are used to show a particular type of data. We will build static maps using ggmap (Kahle and Wickham 2013), having obtained the geocoded data that we need using tidygeocoder (Cambon and Belanger 2021).

6.2 Graphs

Graphs are a critical aspect of compelling stories told with data.

Graphs allow us to explore data to see overall patterns and to see detailed behavior; no other approach can compete in revealing the structure of data so thoroughly. Graphs allow us to view complex mathematical models fitted to data, and they allow us to assess the validity of such models.

Cleveland (1994, 5)

In a way, the graphing of data is an information coding process where we create a glyph, or purposeful mark, that we mean to convey information to our audience. The audience must decode our glyph. The success of our graph turns on how much data are lost in this process. And the decoding is the critical aspect (Cleveland 1994, 221), which means that we are creating graphs for the audience. If nothing else is possible, the most important feature is to convey as much of the actual data as possible.

To see why this is important we begin by using the dataset ‘datasaurus_dozen’ from datasauRus (Locke and D’Agostino McGowan 2018). First, we can take a quick look at the dataset.

install.packages('datasauRus')
library(tidyverse)
library(datasauRus)

head(datasaurus_dozen)
#> # A tibble: 6 × 3
#>   dataset     x     y
#>   <chr>   <dbl> <dbl>
#> 1 dino     55.4  97.2
#> 2 dino     51.5  96.0
#> 3 dino     46.2  94.5
#> 4 dino     42.8  91.4
#> 5 dino     40.8  88.3
#> 6 dino     38.7  84.9
datasaurus_dozen |> count(dataset)
#> # A tibble: 13 × 2
#>    dataset        n
#>    <chr>      <int>
#>  1 away         142
#>  2 bullseye     142
#>  3 circle       142
#>  4 dino         142
#>  5 dots         142
#>  6 h_lines      142
#>  7 high_lines   142
#>  8 slant_down   142
#>  9 slant_up     142
#> 10 star         142
#> 11 v_lines      142
#> 12 wide_lines   142
#> 13 x_shape      142

We can see that the dataset consists of values for ‘x’ and ‘y,’ which should be plotted on the x-axis and y-axis, respectively. We can further see that there are thirteen different values in the variable ‘dataset’ including: “dino,” “star,” “away,” and “bullseye.” We will focus on those four and generate some summary statistics for each (Table 6.1).

# Code from Julia Silge: https://juliasilge.com/blog/datasaurus-multiclass/
datasaurus_dozen |>
  filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
  group_by(dataset) |>
  summarise(across(c(x, y), 
                   list(mean = mean, sd = sd)),
            x_y_cor = cor(x, y)) |>
  knitr::kable(
    caption = "Mean and standard deviation for four 'datasaurus' datasets",
    col.names = c("Dataset", "x mean", "x sd", "y mean", "y sd", "correlation"),
    digits = 1,
    booktabs = TRUE,
    linesep = ""
  )
Table 6.1: Mean and standard deviation for four ‘datasaurus’ datasets
Dataset x mean x sd y mean y sd correlation
away 54.3 16.8 47.8 26.9 -0.1
bullseye 54.3 16.8 47.8 26.9 -0.1
dino 54.3 16.8 47.8 26.9 -0.1
star 54.3 16.8 47.8 26.9 -0.1

Despite the similarities of the summary statistics, it turns out the different ‘datasets’ are actually very different beasts when we graph the actual data (Figure 6.1).

datasaurus_dozen |> 
  filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
  ggplot(aes(x=x, y=y, colour=dataset)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
  labs(colour = "Dataset")
Graph of four 'datasaurus' datasets

Figure 6.1: Graph of four ‘datasaurus’ datasets

This is a variant of the famous ‘Anscombe’s Quartet.’ That plot conveys the same message about the importance of plotting the actual data and not relying on summary statistics. The ‘anscombe’ dataset is built into R.

head(anscombe)
#>   x1 x2 x3 x4   y1   y2    y3   y4
#> 1 10 10 10  8 8.04 9.14  7.46 6.58
#> 2  8  8  8  8 6.95 8.14  6.77 5.76
#> 3 13 13 13  8 7.58 8.74 12.74 7.71
#> 4  9  9  9  8 8.81 8.77  7.11 8.84
#> 5 11 11 11  8 8.33 9.26  7.81 8.47
#> 6 14 14 14  8 9.96 8.10  8.84 7.04

It consists of six observations for four different datasets, again with x and y values for each observation. We need to manipulate this dataset with pivot_longer() to get it into a ‘tidy format.’

# From Nick Tierney: https://www.njtierney.com/post/2020/06/01/tidy-anscombe/
# Code from pivot_longer() vignette.
tidy_anscombe <- 
  anscombe |>
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)"
               )

We can again first create some summary statistics (Table 6.2) and then graph the data (Figure 6.2).

tidy_anscombe |>
  group_by(set) |>
  summarise(across(c(x, y),
                   list(mean = mean, sd = sd)),
            x_y_cor = cor(x, y)) |>
  knitr::kable(
    caption = "Mean and standard deviation for Anscombe",
    col.names = c("Dataset", "x mean", "x sd", "y mean", "y sd", "correlation"),
    digits = 1,
    booktabs = TRUE,
    linesep = ""
  )
Table 6.2: Mean and standard deviation for Anscombe
Dataset x mean x sd y mean y sd correlation
1 9 3.3 7.5 2 0.8
2 9 3.3 7.5 2 0.8
3 9 3.3 7.5 2 0.8
4 9 3.3 7.5 2 0.8
tidy_anscombe |> 
  ggplot(aes(x = x, y = y, colour = set)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(set), nrow = 2, ncol = 2) +
  labs(colour = "Dataset")
Recreation of Anscombe's Quartet

Figure 6.2: Recreation of Anscombe’s Quartet

6.2.1 Bar charts

We typically use a bar chart when we have a categorical variable that we want to focus on. We saw an example of this in Chapter 2 where we constructed a graph of the number of occupied beds. The geom that we primarily use is geom_bar(), but there are many variants to cater for the specific situation.

We will use a dataset from the 1997-2001 British Election Panel Study that was put together by Fox and Andersen (2006).

# Vincent Arel Bundock provides access to this dataset.
beps <- 
  read_csv(
    file = 
      "https://vincentarelbundock.github.io/Rdatasets/csv/carData/BEPS.csv"
    )
head(beps)
#> # A tibble: 6 × 11
#>    ...1 vote     age economic.cond.n… economic.cond.h… Blair
#>   <dbl> <chr>  <dbl>            <dbl>            <dbl> <dbl>
#> 1     1 Liber…    43                3                3     4
#> 2     2 Labour    36                4                4     4
#> 3     3 Labour    35                4                4     5
#> 4     4 Labour    24                4                2     2
#> 5     5 Labour    41                2                2     1
#> 6     6 Labour    47                3                4     4
#> # … with 5 more variables: Hague <dbl>, Kennedy <dbl>,
#> #   Europe <dbl>, political.knowledge <dbl>, gender <chr>

We have some data on the age of the respondents. We could begin by making a graph of this age distribution (Figure 6.3).

beps |> 
  ggplot(mapping = aes(x = age)) +
  geom_bar()
Distribution of ages in the 1997-2001 British Election Panel Study

Figure 6.3: Distribution of ages in the 1997-2001 British Election Panel Study

We can see that by default, geom_bar() has created a count of the number of times each age appears in the dataset. It does this because the default ‘stat’ for geom_bar() is ‘count.’ This saves us from having to create that ourselves. But if we had already constructed a count (for instance, with beps |> count(age)), then we could also specify a column of values for the y-axis and then use ‘stat = “identity”.’

We may also like to consider different groupings of the data, for instance, ‘vote’ (Figure 6.4).

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar()
Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.4: Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study

The default is that these different groups are stacked, but they can be placed side-by-side with ‘position = “dodge”’ (Figure 6.5).

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar(position = "dodge")
Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.5: Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study

At this point, we may like to address the general look of the graph. There are various themes that are built into ggplot2. Some of these include theme_bw(), theme_classic(), theme_linedraw(), and theme_minimal(). A full list is available at the ggplot2 cheatsheet: https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf. We can use these themes by adding them as a layer (Figure 6.6). We can use patchwork (Pedersen 2020) to bring together multiple graphs. To do this we assign the graph to a name, and then use ‘+’ to signal which should be next to each other, ‘/’ to signal which would be on top, and brackets for precedence.

library(patchwork)

theme_bw <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar(position = "dodge") +
  theme_bw()

theme_classic <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar(position = "dodge") +
  theme_classic()

theme_linedraw <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar(position = "dodge") +
  theme_linedraw()

theme_minimal <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar(position = "dodge") +
  theme_minimal()

(theme_bw + theme_classic) / (theme_linedraw + theme_minimal)
Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study, illustrating different themes

Figure 6.6: Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study, illustrating different themes

We can install themes from other packages, including ggthemes (Arnold 2021), and hrbrthemes (Rudis 2020), and we can even build our own.

The default labels are from the name of the relevant variable, and it is often useful to add more detail. We can add a title and caption at this point. A caption can be useful to add information about the source of the dataset. A title can be useful when the graph is going to be considered outside of the context of our paper. But the need to cross-reference all graphs that are in a paper means that included a title within labs() is unnecessary (Figure 6.7).

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for",
       title = "Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study",
       caption = "Source: 1997-2001 British Election Panel Study.")
Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.7: Distribution of ages, and vote preference, in the 1997-2001 British Election Panel Study

We use facets to create ‘many little graphics that are variations of a single graphic’ (L. Wilkinson 2005, 219). They are especially useful when we want to specifically compare some outcomes across some variable. For instance, we may be interested to explain vote, by age and gender (Figure 6.8).

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") +
  facet_wrap(vars(gender))
Distribution of age by gender, and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.8: Distribution of age by gender, and vote preference, in the 1997-2001 British Election Panel Study

We could change facet_wrap() to wrap vertically instead of horizontally with ‘dir = “v”.’ Alternatively, we could specify a number of rows, say ‘nrow = 2,’ or a number of columns, say ‘ncol = 2.’ Additionally, by default, both facets will have the same scales. We could enable both facets to have different scales with ‘scales = “free”,’ or just the x-axis ‘scales = “free_x”,’ or just the y-axis ‘scales = “free_y”’ (Figure 6.9).

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") +
  facet_wrap(vars(gender),
             dir = "v",
             scales = "free")
Distribution of age by gender, and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.9: Distribution of age by gender, and vote preference, in the 1997-2001 British Election Panel Study

Finally, we can change the labels of the facets (Figure 6.10).

new_labels <- c(female = "Female", male = "Male")

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") +
  facet_wrap(vars(gender),
             dir = "v",
             scales = "free",
             labeller = labeller(gender = new_labels))
Distribution of age by gender, and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.10: Distribution of age by gender, and vote preference, in the 1997-2001 British Election Panel Study

There are a variety of different ways to change the colors, and many palettes are available including from RColorBrewer (Neuwirth 2014), which we specify with scale_fill_brewer(), viridis (Garnier et al. 2021), which we specify with scale_fill_viridis() and is particularly focused on color-blind palettes (Figure 6.11).

library(viridis)
library(patchwork)

RColorBrewerBrBG <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") + 
  scale_fill_brewer(palette = "Blues")

RColorBrewerSet2 <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") +
  scale_fill_brewer(palette = "Set1")

viridis <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") + 
  scale_fill_viridis(discrete = TRUE)

viridismagma <- 
  beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent",
       y = "Number of respondents",
       fill = "Voted for") +
   scale_fill_viridis(discrete = TRUE, 
                      option = "magma")

(RColorBrewerBrBG + RColorBrewerSet2) /
 (viridis + viridismagma)
Distribution of age and vote preference, in the 1997-2001 British Election Panel Study

Figure 6.11: Distribution of age and vote preference, in the 1997-2001 British Election Panel Study

Details of the variety of palettes available in RColorBrewer and viridis are available in their help files. Many different palettes are available, and we can also build our own. That said, color is something to be considered with a great deal of care because the test of the success of a graph is the amount of information that is communicated (Cleveland 1994). Colors should not be added to graphs unnecessarily—that is to say, they must play some role. Typically, that role is to distinguish different groups, and that implies making the colors dissimilar. Colors may also be appropriate if there is some relationship between the color and the variable, for instance if making a graph of ‘lemons and cherries’ it would help the reader if the colors were ‘yellow and red’ respectively (Franconeri et al. 2021, 121).

6.2.2 Scatterplots

We are often interested in the relationship between two variables. We can use scatterplots to show this information. Unless there is a good reason to move to a different option, a scatterplot is almost always the best choice (Weissgerber et al. 2015). Indeed, ‘among all forms of statistical graphics, the scatterplot may be considered the most versatile and generally useful invention in the entire history of statistical graphics.’ (Friendly and Wainer 2021, 121) To illustrate scatterplots, we use WDI (Arel-Bundock 2021b) to download some economic indicators from the World Bank.

Oh, you think we have good data on that! Gross Domestic Product (GDP) ‘combines in a single figure, and with no double counting, all the output (or production) carried out by all the firms, non-profit institutions, government bodies and households in a given country during a given period, regardless of the type of goods and services produced, provided that the production takes place within the country’s economic territory. (OECD (2014), p. 15). The modern concept was developed by Simon Kuznets and is widely used and reported. There is a certain comfort in having a definitive and concrete single number to describe something as complicated as the entire economic activity of a country. And it is crucial that we have such summary statistics. But as with any summary statistic, its strength is also its weakness. A single number necessarily loses information about constituent components, and these distributional differences are critical. It highlights short term economic progress over longer term improvements. And ’the quantitative definiteness of the estimates makes it easy to forget their dependence upon imperfect data and the consequently wide margins of possible error to which both totals and components are liable’ (Kuznets 1941, xxvi). Reliance on any one summary measure of economic performance presents a misguided picture not only of a country’s economy, but also of its peoples.

library(tidyverse)
library(WDI)
WDIsearch("gdp growth")
#>      indicator             
#> [1,] "5.51.01.10.gdp"      
#> [2,] "6.0.GDP_growth"      
#> [3,] "NV.AGR.TOTL.ZG"      
#> [4,] "NY.GDP.MKTP.KD.ZG"   
#> [5,] "NY.GDP.MKTP.KN.87.ZG"
#>      name                                    
#> [1,] "Per capita GDP growth"                 
#> [2,] "GDP growth (annual %)"                 
#> [3,] "Real agricultural GDP growth rates (%)"
#> [4,] "GDP growth (annual %)"                 
#> [5,] "GDP growth (annual %)"
WDIsearch("inflation")
#>      indicator             
#> [1,] "FP.CPI.TOTL.ZG"      
#> [2,] "FP.FPI.TOTL.ZG"      
#> [3,] "FP.WPI.TOTL.ZG"      
#> [4,] "NY.GDP.DEFL.87.ZG"   
#> [5,] "NY.GDP.DEFL.KD.ZG"   
#> [6,] "NY.GDP.DEFL.KD.ZG.AD"
#>      name                                               
#> [1,] "Inflation, consumer prices (annual %)"            
#> [2,] "Inflation, food prices (annual %)"                
#> [3,] "Inflation, wholesale prices (annual %)"           
#> [4,] "Inflation, GDP deflator (annual %)"               
#> [5,] "Inflation, GDP deflator (annual %)"               
#> [6,] "Inflation, GDP deflator: linked series (annual %)"
WDIsearch("population, total")
#>           indicator                name 
#>       "SP.POP.TOTL" "Population, total"
WDIsearch("Unemployment, total")
#>      indicator          
#> [1,] "SL.UEM.TOTL.NE.ZS"
#> [2,] "SL.UEM.TOTL.ZS"   
#>      name                                                                 
#> [1,] "Unemployment, total (% of total labor force) (national estimate)"   
#> [2,] "Unemployment, total (% of total labor force) (modeled ILO estimate)"
world_bank_data <- 
  WDI(indicator = c("FP.CPI.TOTL.ZG",
                    "NY.GDP.MKTP.KD.ZG",
                    "SP.POP.TOTL",
                    "SL.UEM.TOTL.NE.ZS"
                    ),
      country = c("AU", "ET", "IN", "US")
      )

At this point we may like to change the names to be more meaningful and only keep the columns that we need.

world_bank_data <- 
  world_bank_data |> 
  rename(inflation = FP.CPI.TOTL.ZG,
         gdp_growth = NY.GDP.MKTP.KD.ZG,
         population = SP.POP.TOTL,
         unemployment_rate = SL.UEM.TOTL.NE.ZS
         ) |> 
  select(-iso2c)

head(world_bank_data)
#> # A tibble: 6 × 6
#>   country    year inflation gdp_growth population
#>   <chr>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 Australia  1960     3.73       NA      10276477
#> 2 Australia  1961     2.29        2.48   10483000
#> 3 Australia  1962    -0.319       1.29   10742000
#> 4 Australia  1963     0.641       6.21   10950000
#> 5 Australia  1964     2.87        6.98   11167000
#> 6 Australia  1965     3.41        5.98   11388000
#> # … with 1 more variable: unemployment_rate <dbl>

Now let us look at income as a function of years of education (Figure 6.12).

world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point()
#> Warning: Removed 26 rows containing missing values
#> (geom_point).
Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

Figure 6.12: Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

As with the bar plots, we change the theme, and update the labels (Figure 6.13).

world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.")
#> Warning: Removed 26 rows containing missing values
#> (geom_point).
Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

Figure 6.13: Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

We use ‘color’ instead of ‘fill’ because we are using dots rather than bars. This also then slightly affects how we change the palette (Figure 6.14).

library(patchwork)

RColorBrewerBrBG <-
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.") +
  scale_color_brewer(palette = "Blues")

RColorBrewerSet2 <- 
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.") +
  scale_color_brewer(palette = "Set1")

viridis <- 
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.") +
  scale_colour_viridis_d()

viridismagma <- 
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.") +
  scale_colour_viridis_d(option = "magma")

(RColorBrewerBrBG + RColorBrewerSet2) /
 (viridis + viridismagma)
Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

Figure 6.14: Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

The dots of a dot plot often overlap. We can address this situation in one of two ways: adding a degree of transparency to our dots with ‘alpha’ (Figure 6.15). The value for ‘alpha’ can vary between 0, which is fully transparent, and 1, which is completely opaque. We can also specify a small amount by which we are comfortable if the points move with geom_jitter() (Figure 6.16). We can specify which direction movement occurs with ‘width’ or ‘height.’ The decision between these two options turns on the degree to which exact accuracy matters, and the number of points.

world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.")
#> Warning: Removed 26 rows containing missing values
#> (geom_point).
Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

Figure 6.15: Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.")
#> Warning: Removed 26 rows containing missing values
#> (geom_point).
Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

Figure 6.16: Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

A common use case for a scatterplot is to illustrate a relationship between two variables. It can be useful to add a line of best fit using geom_smooth() (Figure 6.17). By default geom_smooth() will impose a X relationship. By default, loess smoothing is used for datasets with less than 1,000 observations, but we can specify the relationship using ‘method,’ change the color with ‘color’ and remove standard errors with ‘se.’ We use geom_smooth() to add a layer to the graph, and so it inherits all the settings that it can from ggplot(). For instance, that is why here we have one line for each country. We could overwrite that by specifying a particular color, in which case we would only have one line.

defaults <- 
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  geom_smooth() +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.")

straightline <- 
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  geom_smooth(method = lm, se = FALSE) +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.")

onestraightline <- 
  world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  geom_smooth(method = lm, color = "black", se = FALSE) +
  theme_minimal() +
  labs(x = "Inflation",
       y = "GDP growth",
       color = "Country",
       title = "Relationship between inflation and GDP growth",
       caption = "Data source: World Bank.")

(defaults + straightline + onestraightline)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#> Warning: Removed 26 rows containing non-finite values
#> (stat_smooth).
#> Warning: Removed 26 rows containing missing values
#> (geom_point).
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 26 rows containing non-finite values
#> (stat_smooth).

#> Warning: Removed 26 rows containing missing values (geom_point).
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 26 rows containing non-finite values
#> (stat_smooth).

#> Warning: Removed 26 rows containing missing values (geom_point).
Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

Figure 6.17: Relationship between inflation and GDP for Australia, Ethiopia, India, and the US

6.2.3 Line plots

We can use a line plot when we have variables that should be joined together, for instance, economic time series. We will continue with the dataset from the World Bank and focus on initially on GDP (Figure 6.18).

world_bank_data |>
  ggplot(mapping = aes(x = year, y = gdp_growth, color = country)) +
  geom_line()
#> Warning: Removed 25 row(s) containing missing values
#> (geom_path).
GDP over time for Australia, Ethiopia, India, and the US

Figure 6.18: GDP over time for Australia, Ethiopia, India, and the US

As before, we can adjust the theme and labels (Figure 6.19).

world_bank_data |>
  ggplot(mapping = aes(x = year, y = gdp_growth, color = country)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Year",
       y = "GDP",
       color = "Country",
       title = "GDP over time",
       caption = "Data source: World Bank.")
#> Warning: Removed 25 row(s) containing missing values
#> (geom_path).
GDP over time for Australia, Ethiopia, India, and the US

Figure 6.19: GDP over time for Australia, Ethiopia, India, and the US

We can use a slight variant, geom_step() to focus attention on the change from year to year (Figure 6.20).

world_bank_data |>
  ggplot(mapping = aes(x = year, y = gdp_growth, color = country)) +
  geom_step() +
  theme_minimal() +
  labs(x = "Year",
       y = "GDP",
       color = "Country",
       title = "GDP over time",
       caption = "Data source: World Bank.")
#> Warning: Removed 25 row(s) containing missing values
#> (geom_path).
GDP over time for Australia, Ethiopia, India, and the US

Figure 6.20: GDP over time for Australia, Ethiopia, India, and the US

The Phillips curve is the name given to plot of the relationship between unemployment and inflation over time. An inverse relationship is sometimes found in the data, for instance in the UK between 1861 and 1957 (Phillips 1958). We have a variety of ways to investigate this including geom_path() which links values in the order they appear in the datset. In Figure 6.21 we show a Phillips curve for the US between 1960 and 2020, and it is clear that any relationship that once existed between these variables does not appear to still exist.

world_bank_data |>
  filter(country == "United States") |>
  ggplot(mapping = aes(x = unemployment_rate, y = inflation)) +
  geom_path() +
  theme_minimal() +
  labs(x = "Unemployment rate",
       y = "Inflation",
       caption = "Data source: World Bank.")
Phillips curve for the US (1960-2020)

Figure 6.21: Phillips curve for the US (1960-2020)

6.2.4 Histograms

A histogram is useful to show the shape of a continuous variable and works by constructing counts of the number of observations in different subsets of the support, called ‘bins.’ In Figure 6.22 we examine the distribution of GDP in the US.

world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram()
Distribution of income

Figure 6.22: Distribution of income

And again we can add a theme and labels (Figure 6.23.

world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram() +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
#> Warning: Removed 1 rows containing non-finite values
#> (stat_bin).
Distribution of GDP in the United States

Figure 6.23: Distribution of GDP in the United States

The key component determining the shape of a histogram is the number of bins. This can be specified in one of two ways: 1) specifying the number of ‘bins’ to include, or 2) specifying how wide they should be with ‘binwidth’ (Figure 6.24).


twobins <- 
  world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram(bins = 2) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")

fivebins <- 
  world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram(bins = 5) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")

twentybins <- 
  world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram(bins = 20) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")

halfbinwidth <- 
  world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram(binwidth = 0.5) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")

twobinwidth <- 
  world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram(binwidth = 2) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")

fivebinwidth <- 
  world_bank_data |> 
  filter(country == "United States") |> 
  ggplot(mapping = aes(x = gdp_growth)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       caption = "Data source: World Bank.")

(twobins + fivebins + twentybins) / (halfbinwidth + twobinwidth + fivebinwidth)
Distribution of GDP in the United States

Figure 6.24: Distribution of GDP in the United States

The histogram is smoothing the data, and the number of bins affects how much smoothing occurs. When there are only two bins then the data are very smooth, but we have lost a great deal of accuracy. More specifically, ‘the histogram estimator is a piecewise constant function where the height of the function is proportional to the number of observations in each bin’ (Wasserman 2005, 303). Too few bins result in a biased estimator, while too many bins results in an estimator with high variance. Our decision as to the number of bins, or their width, is concerned with trying to balance bias and variance. This will depend on a variety of concerns including the subject matter and the goal (Cleveland 1994, 135).

Finally, while we can use ‘fill’ to distinguish between different types of observations, it can get quite messy. It is usually better to give away showing the distribution with columns and instead trace the outline of the distribution, using geom_freqpoly() (Figure 6.25) or to build it up using dots with geom_dotplot() (Figure 6.26) .

world_bank_data |> 
  ggplot(mapping = aes(x = gdp_growth, color = country)) +
  geom_freqpoly() +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       color = "Country",
       caption = "Data source: World Bank.") +
  scale_color_brewer(palette = "Set1") 
Distribution of GDP in the United States

Figure 6.25: Distribution of GDP in the United States

world_bank_data |> 
  ggplot(mapping = aes(x = gdp_growth, group = country, fill = country)) +
  geom_dotplot(method = 'histodot', alpha = 0.4) +
  theme_minimal() +
  labs(x = "GDP",
       y = "Number of occurrences",
       fill = "Country",
       caption = "Data source: World Bank.") +
  scale_color_brewer(palette = "Set1") 
Distribution of GDP in the United States

Figure 6.26: Distribution of GDP in the United States

6.2.5 Boxplots

Boxplots are almost never an appropriate choice because they hide the distribution of data, rather than show it. Unless we need to compare the summary statistics of many variables at once, then they should almost never be used. This is because the same boxplot can apply to very different distributions. To see this, consider some simulated data from the beta distribution of two types. One type of data contains draws from two beta distributions: one that is right skewed and another that is left skewed. The other type of data contains draws from a beta distribution with no skew.

set.seed(853)
both_left_and_right_skew <- 
  c(
    rbeta(500, 5, 2),
    rbeta(500, 2, 5)
    )

no_skew <- 
  rbeta(1000, 1, 1)

beta_distributions <- 
  tibble(
    observation = c(both_left_and_right_skew, no_skew),
    source = c(rep("Left and right skew", 1000),
               rep("No skew", 1000)
               )
  )

We can first compare the boxplots of the two series (Figure 6.27.

beta_distributions |> 
  ggplot(aes(x = source, y = observation)) +
  geom_boxplot() +
  theme_classic()
Data drawn from beta distributions with different parameters

Figure 6.27: Data drawn from beta distributions with different parameters

But if we plot the actual data then we can see how different they are (Figure 6.28).

beta_distributions |> 
  ggplot(aes(x = observation, color = source)) +
  geom_freqpoly(binwidth = 0.05) +
  theme_classic()
Data drawn from beta distributions with different parameters

Figure 6.28: Data drawn from beta distributions with different parameters

One way forward, if a boxplot must be included, is to include the actual data as a layer on top of the boxplot (Figure 6.29.

beta_distributions |> 
  ggplot(aes(x = source, y = observation)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3) +
  theme_classic()
Data drawn from beta distributions with different parameters

Figure 6.29: Data drawn from beta distributions with different parameters

An even better solution is to graph the quantiles of each distribution against (Figure 6.30. The Q-Q plot was developed by Wilk and Gnanadesikan (1968) and requires us to plot the distributions against each other.

beta_distributions |> 
  ggplot(aes(sample = observation, color = source)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line() +
  theme_classic() +
  scale_color_brewer(palette = "Set1") 
Data drawn from beta distributions with different parameters

Figure 6.30: Data drawn from beta distributions with different parameters

6.3 Tables

Tables are critical for telling a compelling story. Tables can communicate less information than a graph, but they can do so at a high fidelity. We primarily use tables in three ways:

  1. To show some of our actual dataset, for which we use kable() from knitr (Xie 2021), alongside kableExtra (Zhu 2020).
  2. To communicate summary statistics, for which we use gt (Iannone, Cheng, and Schloerke 2020) and modelsummary (Arel-Bundock 2021a).
  3. To display regression results, for which we use modelsummary (Arel-Bundock 2021a).

6.3.1 Showing part of a dataset

We will illustrate showing part of a dataset using kable() from knitr and drawing on kableExtra for enhancement. We will use the World Bank dataset on inflation and GDP from earlier.

library(knitr)
head(world_bank_data)
#> # A tibble: 6 × 6
#>   country    year inflation gdp_growth population
#>   <chr>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 Australia  1960     3.73       NA      10276477
#> 2 Australia  1961     2.29        2.48   10483000
#> 3 Australia  1962    -0.319       1.29   10742000
#> 4 Australia  1963     0.641       6.21   10950000
#> 5 Australia  1964     2.87        6.98   11167000
#> 6 Australia  1965     3.41        5.98   11388000
#> # … with 1 more variable: unemployment_rate <dbl>

To begin, we can display the first ten rows with the default kable() settings.

world_bank_data |> 
  slice(1:10) |> 
  kable() 
country year inflation gdp_growth population unemployment_rate
Australia 1960 3.7288136 NA 10276477 NA
Australia 1961 2.2875817 2.483271 10483000 NA
Australia 1962 -0.3194888 1.294468 10742000 NA
Australia 1963 0.6410256 6.214949 10950000 NA
Australia 1964 2.8662420 6.978540 11167000 NA
Australia 1965 3.4055728 5.980893 11388000 NA
Australia 1966 3.2934132 2.381966 11651000 NA
Australia 1967 3.4782609 6.303650 11799000 NA
Australia 1968 2.5210084 5.095103 12009000 NA
Australia 1969 3.2786885 7.043526 12263000 NA

In order to be able to cross-reference it in text, we need to add a caption with ‘caption.’ We can also make the column names more information with ‘col.names’ and specify the number of digits to be displayed (Table 6.3).

world_bank_data |> 
  slice(1:10) |> 
  kable(
    caption = "First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US",
    col.names = c("Country", "Year", "Inflation", "GDP growth", "Population", "Unemployment rate"),
    digits = 1
  )
Table 6.3: First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US
Country Year Inflation GDP growth Population Unemployment rate
Australia 1960 3.7 NA 10276477 NA
Australia 1961 2.3 2.5 10483000 NA
Australia 1962 -0.3 1.3 10742000 NA
Australia 1963 0.6 6.2 10950000 NA
Australia 1964 2.9 7.0 11167000 NA
Australia 1965 3.4 6.0 11388000 NA
Australia 1966 3.3 2.4 11651000 NA
Australia 1967 3.5 6.3 11799000 NA
Australia 1968 2.5 5.1 12009000 NA
Australia 1969 3.3 7.0 12263000 NA

When producing PDFs, the ‘booktabs’ option makes a host of small changes to the default display and results in tables that look better (Table 6.4). When using ‘booktabs’ we additionally should specify ‘linesep’ otherwise kable() adds a small space every five lines. (None of this will show up for html output.)

world_bank_data |> 
  slice(1:10) |> 
  kable(
    caption = "First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US",
    col.names = c("Country", "Year", "Inflation", "GDP growth", "Population", "Unemployment rate"),
    digits = 1,
    booktabs = TRUE, 
    linesep = ""
  )
Table 6.4: First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US
Country Year Inflation GDP growth Population Unemployment rate
Australia 1960 3.7 NA 10276477 NA
Australia 1961 2.3 2.5 10483000 NA
Australia 1962 -0.3 1.3 10742000 NA
Australia 1963 0.6 6.2 10950000 NA
Australia 1964 2.9 7.0 11167000 NA
Australia 1965 3.4 6.0 11388000 NA
Australia 1966 3.3 2.4 11651000 NA
Australia 1967 3.5 6.3 11799000 NA
Australia 1968 2.5 5.1 12009000 NA
Australia 1969 3.3 7.0 12263000 NA
world_bank_data |> 
  slice(1:10) |> 
  kable(
    caption = "First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US",
    col.names = c("Country", "Year", "Inflation", "GDP growth", "Population", "Unemployment rate"),
    digits = 1,
    booktabs = TRUE
  )
Table 6.5: First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US
Country Year Inflation GDP growth Population Unemployment rate
Australia 1960 3.7 NA 10276477 NA
Australia 1961 2.3 2.5 10483000 NA
Australia 1962 -0.3 1.3 10742000 NA
Australia 1963 0.6 6.2 10950000 NA
Australia 1964 2.9 7.0 11167000 NA
Australia 1965 3.4 6.0 11388000 NA
Australia 1966 3.3 2.4 11651000 NA
Australia 1967 3.5 6.3 11799000 NA
Australia 1968 2.5 5.1 12009000 NA
Australia 1969 3.3 7.0 12263000 NA

We can specify the alignment of the columns using a character vector of ‘l’ (left), ‘c’ (centre), and ‘r’ (right) (Table 6.6). Additionally, (and this is not relevant for this table), we could specify groupings for numbers that are at least one thousand using ‘format.args = list(big.mark = “,”).’

world_bank_data |> 
  slice(1:10) |> 
  kable(
    caption = "First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US",
    col.names = c("Country", "Year", "Inflation", "GDP growth", "Population", "Unemployment rate"),
    digits = 1,
    booktabs = TRUE, 
    linesep = "",
    align = c('l', 'l', 'c', 'c', 'r', 'r'),
  )
Table 6.6: First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US
Country Year Inflation GDP growth Population Unemployment rate
Australia 1960 3.7 NA 10276477 NA
Australia 1961 2.3 2.5 10483000 NA
Australia 1962 -0.3 1.3 10742000 NA
Australia 1963 0.6 6.2 10950000 NA
Australia 1964 2.9 7.0 11167000 NA
Australia 1965 3.4 6.0 11388000 NA
Australia 1966 3.3 2.4 11651000 NA
Australia 1967 3.5 6.3 11799000 NA
Australia 1968 2.5 5.1 12009000 NA
Australia 1969 3.3 7.0 12263000 NA

We can use kableExtra (Zhu 2020) to add extra functionality to kable. For instance, we could add a row that groups some of the columns (Table 6.6).

library(kableExtra)

world_bank_data |> 
  slice(1:10) |> 
  kable(
    caption = "First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US",
    col.names = c("Country", "Year", "Inflation", "GDP growth", "Population", "Unemployment rate"),
    digits = 1,
    booktabs = TRUE, 
    linesep = "",
    align = c('l', 'l', 'c', 'c', 'r', 'r'),
  ) |> 
  add_header_above(c(" " = 2, "Economic indicators" = 4))
Table 6.7: First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the US
Economic indicators
Country Year Inflation GDP growth Population Unemployment rate
Australia 1960 3.7 NA 10276477 NA
Australia 1961 2.3 2.5 10483000 NA
Australia 1962 -0.3 1.3 10742000 NA
Australia 1963 0.6 6.2 10950000 NA
Australia 1964 2.9 7.0 11167000 NA
Australia 1965 3.4 6.0 11388000 NA
Australia 1966 3.3 2.4 11651000 NA
Australia 1967 3.5 6.3 11799000 NA
Australia 1968 2.5 5.1 12009000 NA
Australia 1969 3.3 7.0 12263000 NA

6.3.2 Communicating summary statistics

We can use datasummary() from modelsummary to create tables of summary statistics from our dataset.

library(modelsummary)

world_bank_data |> 
  datasummary_skim()
Unique (#) Missing (%) Mean SD Min Median Max
year 61 0 1990.0 17.6 1960.0 1990.0 2020.0
inflation 238 3 6.1 6.3 −9.8 4.3 44.4
gdp_growth 220 10 4.2 3.7 −11.1 3.9 13.9
population 244 0 304177482.9 380093166.9 10276477.0 147817291.5 1380004385.0
unemployment_rate 104 52 6.0 1.9 1.2 5.7 10.9

By default it summarizes the ‘numeric’ variables, but we can ask for the ‘categorical’ variables (Table 6.8). Additionally we can add cross-references in the same way as kable(), that is, include a title and then cross-reference the name of the R chunk.

world_bank_data |> 
  datasummary_skim(type = "categorical",
                   title = "Summary of categorical variables for the inflation and GDP dataset")
Table 6.8: Summary of categorical variables for the inflation and GDP dataset
country N %
Australia 61 25.0
Ethiopia 61 25.0
India 61 25.0
United States 61 25.0

We can create a table that shows the correlation between variables using datasummary_correlation() (Table 6.9).

world_bank_data |> 
  datasummary_correlation(
    title = "Correlation between the variables for the inflation and GDP dataset"
    )
Table 6.9: Correlation between the variables for the inflation and GDP dataset
year inflation gdp_growth population unemployment_rate
year 1 . . . .
inflation .00 1 . . .
gdp_growth .10 .00 1 . .
population .24 .07 .15 1 .
unemployment_rate −.13 −.14 −.31 −.35 1

We typically need a table of descriptive statistics that we could add to our paper (Table 6.10). This is in contrast to Table 6.8 which would likely not be included in a paper. We can add a note about the source of the data using ‘notes.’

datasummary_balance(formula = ~country,
                    data = world_bank_data,
                    title = "Descriptive statistics for the inflation and GDP dataset",
                    notes = "Data source: World Bank.")
Table 6.10: Descriptive statistics for the inflation and GDP dataset
Australia (N=61)
Ethiopia (N=61)
India (N=61)
United States (N=61)
Mean Std. Dev. Mean Std. Dev. Mean Std. Dev. Mean Std. Dev.
year 1990.0 17.8 1990.0 17.8 1990.0 17.8 1990.0 17.8
inflation 4.7 3.8 8.7 10.4 7.4 5.0 3.7 2.8
gdp_growth 3.4 1.8 5.9 6.4 5.0 3.3 2.9 2.2
population 17244215.9 4328625.6 55662437.9 27626912.1 888774544.9 292997809.4 255028733.1 45603604.8
unemployment_rate 6.8 1.7 2.6 0.9 3.5 1.4 6.0 1.6
Data source: World Bank.

6.3.3 Display regression results

Finally, one common reason for needing a table is to report regression results. We will do this using modelsummary() from modelsummary (Arel-Bundock 2021a).

first_model <- lm(formula = gdp_growth ~ inflation, 
                  data = world_bank_data)

modelsummary(first_model)
Model 1
(Intercept) 4.157
(0.352)
inflation −0.002
(0.041)
Num.Obs. 218
R2 0.000
R2 Adj. −0.005
AIC 1195.1
BIC 1205.3
Log.Lik. −594.554
F 0.002

We can put a variety of different of different models together (Table 6.11).

second_model <- lm(formula = gdp_growth ~ inflation + country, 
                  data = world_bank_data)

third_model <- lm(formula = gdp_growth ~ inflation + country + population, 
                  data = world_bank_data)

modelsummary(list(first_model, second_model, third_model),
             title = "Explaining GDP as a function of inflation")
Table 6.11: Explaining GDP as a function of inflation
Model 1 Model 2 Model 3
(Intercept) 4.157 3.728 3.668
(0.352) (0.495) (0.494)
inflation −0.002 −0.075 −0.072
(0.041) (0.041) (0.041)
countryEthiopia 2.872 2.716
(0.757) (0.758)
countryIndia 1.854 −0.561
(0.655) (1.520)
countryUnited States −0.524 −1.176
(0.646) (0.742)
population 0.000
(0.000)
Num.Obs. 218 218 218
R2 0.000 0.110 0.123
R2 Adj. −0.005 0.093 0.102
AIC 1195.1 1175.7 1174.5
BIC 1205.3 1196.0 1198.2
Log.Lik. −594.554 −581.844 −580.266
F 0.002 6.587 5.939

We can adjust the number of significant digits (Table 6.12).

modelsummary(list(first_model, second_model, third_model),
             fmt = 1,
             title = "Two models of GDP as a function of inflation")
Table 6.12: Two models of GDP as a function of inflation
Model 1 Model 2 Model 3
(Intercept) 4.2 3.7 3.7
(0.4) (0.5) (0.5)
inflation 0.0 −0.1 −0.1
(0.0) (0.0) (0.0)
countryEthiopia 2.9 2.7
(0.8) (0.8)
countryIndia 1.9 −0.6
(0.7) (1.5)
countryUnited States −0.5 −1.2
(0.6) (0.7)
population 0.0
(0.0)
Num.Obs. 218 218 218
R2 0.000 0.110 0.123
R2 Adj. −0.005 0.093 0.102
AIC 1195.1 1175.7 1174.5
BIC 1205.3 1196.0 1198.2
Log.Lik. −594.554 −581.844 −580.266
F 0.002 6.587 5.939

6.4 Maps

In many ways maps can be thought of as another type of graph, where the x-axis is latitude, the y-axis is longitude, and there is some outline or a background image. We have seen this type of set-up are used to this type of set-up, for instance, in the ggplot2 setting, this is quite familiar.

ggplot() +
  geom_polygon( # First draw an outline
    data = some_data, 
    aes(x = latitude, 
        y = longitude,
        group = group
        )) +
  geom_point( # Then add points of interest
    data = some_other_data, 
    aes(x = latitude, 
        y = longitude)
    )

And while there are some small complications, for the most part it is as straight-forward as that. The first step is to get some data. There is some geographic data built into ggplot2, and there is additional information in maps.

library(maps)

canada <- map_data(database = "world", regions = "canada")
canadian_cities <- maps::canada.cities

head(canada)
#>        long      lat group order region    subregion
#> 1 -59.78760 43.93960     1     1 Canada Sable Island
#> 2 -59.92227 43.90391     1     2 Canada Sable Island
#> 3 -60.03775 43.90664     1     3 Canada Sable Island
#> 4 -60.11426 43.93911     1     4 Canada Sable Island
#> 5 -60.11748 43.95337     1     5 Canada Sable Island
#> 6 -59.93604 43.93960     1     6 Canada Sable Island

head(canadian_cities)
#>            name country.etc    pop   lat    long capital
#> 1 Abbotsford BC          BC 157795 49.06 -122.30       0
#> 2      Acton ON          ON   8308 43.63  -80.03       0
#> 3 Acton Vale QC          QC   5153 45.63  -72.57       0
#> 4    Airdrie AB          AB  25863 51.30 -114.02       0
#> 5    Aklavik NT          NT    643 68.22 -135.00       0
#> 6    Albanel QC          QC   1090 48.87  -72.42       0

With that information in hand we can then create a map of Canada that shows the cities with a population over 1,000. (The geom_polygon() function within ggplot2 draws shapes, by connecting points within groups. And the coord_map() function adjusts for the fact that we are making something that is 2D map to represent something that is 3D.)

ggplot() +
  geom_polygon(data = canada,
               aes(x = long,
                   y = lat,
                   group = group),
               fill = "white", 
               colour = "grey") +
  coord_map(ylim = c(40, 70)) +
  geom_point(aes(x = canadian_cities$long, 
                 y = canadian_cities$lat),
             alpha = 0.3,
             color = "black") +
  theme_classic() +
  labs(x = "Longitude",
       y = "Latitude")

As is often the case with R, there are many different ways to get started creating static maps. We have already seen how they can be built using only ggplot2, but ggmap brings additional functionality (Kahle and Wickham 2013).

There are two essential components to a map:

  1. some border or background image (also known as a tile); and
  2. something of interest within that border or on top of that tile.

In ggmap, we use an open-source option for our tile, Stamen Maps: maps.stamen.com. And we use plot points based on latitude and longitude.

6.4.1 Australian polling places

In Australia people go to specific locations, called booths, to vote. These booths have latitudes and longitudes and so we can plot these. One reason we may like to do this is to notice patterns over geographies.

To get started we need to get a tile. We are going to use ggmap to get a tile from Stamen Maps, which builds on OpenStreetMap (openstreetmap.org). The main argument to this function is to specify a bounding box. This requires two latitudes - one for the top of the box and one for the bottom of the box - and two longitudes - one for the left of the box and one for the right of the box. It can be useful to use Google Maps, or an alternative, to find the values of these that you need. The bounding box provides the coordinates of the edges that you are interested in. In this case we have provided it with coordinates such that it will be centered around Canberra, Australia, which is a small city that was created for the purposes of being the capital.

library(ggmap)

bbox <- c(left = 148.95, bottom = -35.5, right = 149.3, top = -35.1)

Once you have defined the bounding box, then the function get_stamenmap() will get the tiles in that area. The number of tiles that it needs to get depends on the zoom, and the type of tiles that it gets depends on the maptype. We have used a black-and-white type of map but the helpfile specifies others. At this point we can the map to maps to ggmap() and it will plot the tile! It will be actively downloading these tiles, and so it needs an internet connection.

canberra_stamen_map <- get_stamenmap(bbox, zoom = 11, maptype = "toner-lite")

ggmap(canberra_stamen_map)

Once we have a map then we can use ggmap() to plot it. (That circle in the middle of the map is where the Australian Parliament House is. Yes, the Australian parliament is surrounded by circular roads, which Australians call ‘roundabouts,’ actually Australians thought this was such a great idea, that we surrounded it by two of them.)

Now we want to get some data that we plot on top of our tiles. We will just plot the location of the polling places, based on which ‘division’ (the Australian equivalent to ‘ridings’ in Canada) it is. This is available here: https://results.aec.gov.au/20499/Website/Downloads/HouseTppByPollingPlaceDownload-20499.csv. The Australian Electoral Commission (AEC) is the official government agency that is responsible for elections in Australia.

# Read in the booths data for each year
booths <-
  readr::read_csv(
    "https://results.aec.gov.au/24310/Website/Downloads/GeneralPollingPlacesDownload-24310.csv",
    skip = 1,
    guess_max = 10000
  )

head(booths)
#> # A tibble: 6 × 15
#>   State DivisionID DivisionNm PollingPlaceID
#>   <chr>      <dbl> <chr>               <dbl>
#> 1 ACT          318 Bean                93925
#> 2 ACT          318 Bean                93927
#> 3 ACT          318 Bean                11877
#> 4 ACT          318 Bean                11452
#> 5 ACT          318 Bean                 8761
#> 6 ACT          318 Bean                 8763
#> # … with 11 more variables: PollingPlaceTypeID <dbl>,
#> #   PollingPlaceNm <chr>, PremisesNm <chr>,
#> #   PremisesAddress1 <chr>, PremisesAddress2 <chr>,
#> #   PremisesAddress3 <chr>, PremisesSuburb <chr>,
#> #   PremisesStateAb <chr>, PremisesPostCode <chr>,
#> #   Latitude <dbl>, Longitude <dbl>

This dataset is for the whole of Australia, but as we are just going to plot the area around Canberra we filter to that and only to booths that are geographic (the AEC has various options for people who are in hospital, or not able to get to a booth, etc, and these are still ‘booths’ in this dataset).

# Reduce the booths data to only rows with that have latitude and longitude
booths_reduced <-
  booths |>
  filter(State == "ACT") |> 
  select(PollingPlaceID, DivisionNm, Latitude, Longitude) |> 
  filter(!is.na(Longitude)) |> # Remove rows that do not have a geography
  filter(Longitude < 165) # Remove Norfolk Island

Now we can use ggmap in the same way as before to plot our underlying tiles, and then build on that using geom_point() to add our points of interest.

ggmap(canberra_stamen_map,
      extent = "normal",
      maprange = FALSE) +
  geom_point(data = booths_reduced,
             aes(x = Longitude,
                 y = Latitude,
                 colour = DivisionNm),) +
  scale_color_brewer(name = "2019 Division", palette = "Set1") +
  coord_map(
    projection = "mercator",
    xlim = c(attr(map, "bb")$ll.lon, attr(map, "bb")$ur.lon),
    ylim = c(attr(map, "bb")$ll.lat, attr(map, "bb")$ur.lat)
  ) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

We may like to save the map so that we do not have to draw it every time, and we can do that in the same way as any other graph, using ggsave().

ggsave("map.pdf", width = 20, height = 10, units = "cm")

Finally, the reason that we used Stamen Maps and OpenStreetMap is because it is open source, but you can also use Google Maps. This requires you to first register a credit card with Google, and specify a key, but with low usage should be free. Using Google Maps, get_googlemap(), brings some advantages over get_stamenmap(), for instance it will attempt to find a placename, rather than needing to specify a bounding box.

6.4.2 Toronto bike parking

Let us see another example of a static map, this time using Toronto data accessed using opendatatoronto (Gelfand 2020). The dataset that we are going to plot is available here: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.

# Based on: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.
library(opendatatoronto)
# (The string identifies the package.)
resources <- list_package_resources("71e6c206-96e1-48f1-8f6f-0e804687e3be")
# In this case there is only one dataset within this resource so just need the first one    
bike_parking_locations <- filter(resources, row_number()==1) |> get_resource()
head(bike_parking_locations)
#> # A tibble: 6 × 20
#>     `_id` OBJECTID ID       ADDRESSNUMBERTEXT ADDRESSSTREET 
#>     <dbl>    <dbl> <chr>    <chr>             <chr>         
#> 1 3613143        4 BP-11699 70                The Pond Rd   
#> 2 3613144        9 BP-11900 8                 Assiniboine Rd
#> 3 3613145       56 BP-11338 1495              Queen St W    
#> 4 3613146       65 BP-03501 8                 Kensington Ave
#> 5 3613147      100 BP-03280 87                Avenue Rd     
#> 6 3613148      109 BP-12883 21                Canniff St    
#> # … with 15 more variables: FRONTINGSTREET <chr>,
#> #   SIDE <chr>, FROMSTREET <chr>, DIRECTION <chr>,
#> #   SITEID <lgl>, WARD <chr>, BIA <chr>, ASSETTYPE <chr>,
#> #   STATUS <chr>, SDE_STATE_ID <dbl>, X <dbl>, Y <dbl>,
#> #   LONGITUDE <dbl>, LATITUDE <dbl>, geometry <chr>

First, we need to clean the data are little. There are some clear errors in the ADDRESSNUMBERTEXT field, but not too many, so we just ignore it and focus on the data that we are interested in.

bike_data <- tibble(
  ward = bike_parking_locations$WARD,
  id = bike_parking_locations$ID,
  status = bike_parking_locations$STATUS,
  street_address = paste(
    bike_parking_locations$ADDRESSNUMBERTEXT,
    bike_parking_locations$ADDRESSSTREET
  ),
  latitude = bike_parking_locations$LATITUDE,
  longitude = bike_parking_locations$LONGITUDE
)
rm(bike_parking_locations)

Some of the bike racks were temporary so remove them and also let us just look at the area around the University of Toronto, which is Ward 11

# Only keep ones that still exist
bike_data <- 
  bike_data |>
  filter(status == "Existing") |> 
  select(-status)

bike_data <- bike_data |> 
  filter(ward == 11) |> 
  select(-ward)

If you look at the dataset at this point, then you will notice that there is a row for every bike parking spot. But we do not really need to know that, because sometimes there are lots right next to each other. Instead, we’d just like the one point. So, we want to create a count by address, and then just get one instance per address.

bike_data <- 
  bike_data |>
  group_by(street_address) |> 
  mutate(number_of_spots = n(),
         running_total = row_number()
         ) |> 
  ungroup() |> 
  filter(running_total == 1) |> 
  select(-id, -running_total)

head(bike_data)
#> # A tibble: 6 × 4
#>   street_address   latitude longitude number_of_spots
#>   <chr>               <dbl>     <dbl>           <int>
#> 1 8 Kensington Ave     43.7     -79.4               1
#> 2 87 Avenue Rd         43.7     -79.4               4
#> 3 162 Mc Caul St       43.7     -79.4               1
#> 4 147 Baldwin St       43.7     -79.4               2
#> 5 888 Yonge St         43.7     -79.4               1
#> 6 180 Elizabeth St     43.7     -79.4              10

Now we can grab our tile and add our bike rack data onto it.

bbox <-
  c(
    left = -79.420390,
    bottom = 43.642658,
    right = -79.383354,
    top = 43.672557
  )

toronto_stamen_map <-
  get_stamenmap(bbox, zoom = 14, maptype = "toner-lite")

ggmap(toronto_stamen_map,  maprange = FALSE) +
  geom_point(data = bike_data,
             aes(x = longitude,
                 y = latitude),
             alpha = 0.3) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal() 

6.4.3 Geocoding

To this point we assumed that we already had geocoded data. But the places ‘Canberra, Australia,’ or ‘Ottawa, Canada,’ are just names, they do not actually inherently have a location. In order to plot them we need to get a latitude and longitude for them. The process of going from names to coordinates is called geocoding.

There are a range of options to geocode data in R, but tidygeocoder is especially useful (Cambon and Belanger 2021). To get started using the package we need a dataframe of locations.

some_locations <- 
  tibble(city = c('Canberra', 'Ottawa'),
         country = c('Australia', 'Canada'))
tidygeocoder::geo(city = some_locations$city, 
                  country = some_locations$country, 
                  method = 'osm')
#> Passing 2 addresses to the Nominatim single address geocoder
#> Query completed in: 2 seconds
#> # A tibble: 2 × 4
#>   city     country     lat  long
#>   <chr>    <chr>     <dbl> <dbl>
#> 1 Canberra Australia -35.3 149. 
#> 2 Ottawa   Canada     45.4 -75.7

6.5 Exercises and tutorial

6.5.1 Exercises

  1. I have a dataset that contains measurements of height (in cm) for a sample of 300 penguins, who are either the Adeline or Emperor species. I am interested in visualizing the distribution of heights by species in a graphical way. Please discuss whether a pie chart is an appropriate type of graph to use. What about a box and whisker plot? Finally, what are some considerations if you made a histogram? [Please write a paragraph or two for each aspect.]
  2. Assume the dataset and columns exist. Would this code work? data |> ggplot(aes(x = col_one)) |> geom_point() (pick one)?
    1. Yes
    2. No
  3. If I have categorical data, which geom should I use to plot it (pick one)?
    1. geom_bar()
    2. geom_point()
    3. geom_abline()
    4. geom_boxplot()
  4. Why are boxplots often inappropriate (pick one)?
    1. They hide the full distribution of the data.
    2. They are hard to make.
    3. They are ugly.
    4. The mode is clearly displayed.
  5. Which of the following, if any, are elements of the layered grammar of graphics (Wickham 2010) (select all that apply)?
    1. A default dataset and set of mappings from variables to aesthetics.
    2. One or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings.
    3. Colors that enable the reader to understand the main point.
    4. A coordinate system.
    5. The facet specification.
    6. One scale for each aesthetic mapping used.

6.5.2 Tutorial

Using R Markdown, please create a graph using ggplot2 and a map using ggmap and add explanatory text to accompany both. This should take one or two pages for each of them.

For the graph, please reflect on Vanderplas, Cook, and Hofmann (2020) and add a few paragraphs about the different options that you considered that the graph more effective.

For the map, please reflect on the following quote from Heather Krause: ‘maps only show people who aren’t invisible to the makers’ as well as Chapter 3 from D’Ignazio and Klein (2020) and add a few paragraphs related to this.