# 6 Static communication

STATUS: Under construction.

Required reading

• (The) Economist, 2013, ‘Johnson: Those six little rules,’ Prospero, 29 July 2013, available at: https://www.economist.com/prospero/2013/07/29/johnson-those-six-little-rules.
• Alexander, Monica, 2019, ‘The concentration and uniqueness of baby names in Australia and the US,’ https://www.monicaalexander.com/posts/2019-20-01-babynames/. (Look at how Monica explains concepts, especially the Gini coefficient, in a way that you can understand even if you’ve never heard of it before.)
• Bronner, Laura, 2020, ‘Quant Editing,’ http://www.laurabronner.com/quant-editing. (Read these points and evaluate your own writing against them. It’s fine to not comply with them if you have a good reason, but you need to know that you’re not complying with them).
• Girouard, Dave, 2020, ‘A Founder’s Guide to Writing Well,’ First Round Review, 4 August, https://firstround.com/review/a-founders-guide-to-writing-well/.
• Graham, Paul, 2020, ‘How to Write Usefully,’ http://paulgraham.com/useful.html. (Graham is good at writing for a programmer, but if you have a similar background then you may like this.)
• Healy, Kieran, 2019, Data Visualization: A Practical Introduction, Princeton University Press, Chapters 3, 4, and 7, https://socviz.co/.
• Hodgetts, Paul, 2020, ‘The ggfortify Package,’ 31 December, https://www.hodgettsp.com/posts/r-ggfortify/.
• Wickham, Hadley, and Garrett Grolemund, 2017, R for Data Science, Chapter 28, https://r4ds.had.co.nz/.
• Zinsser, William, 1976 [2016], On Writing Well. (Any edition is fine. This book is included because if you’re serious about improving your writing then you should start with this book. It only takes a few hours to read. You’ll go on to other books, but start with this one.)
• Zinsser, William, 2009, ‘Writing English as a Second Language,’ Lecture, Columbia Graduate School of Journalism, 11 August, https://theamericanscholar.org/writing-english-as-a-second-language/. (I’m realistic enough to realise that requiring a book, even though I’ve said it’s great and it’s short, is a bit of a stretch. If you really don’t want to commit to reading the Zinsser, then please at least read this ‘crib notes’ version of it.)

Required viewing

Examples of well-written papers

• Barron, Alexander TJ, Jenny Huang, Rebecca L. Spang, and Simon DeDeo. “Individuals, institutions, and innovation in the debates of the French Revolution.” Proceedings of the National Academy of Sciences 115, no. 18 (2018): 4607-4612.
• Chambliss, Daniel F. “The Mundanity of Excellence: An Ethnographic Report on Stratification and Olympic Swimmers.” Sociological Theory 7, no. 1 (1989): 70-86. doi:10.2307/202063.
• Joyner, Michael J. “Modeling: optimal marathon performance on the basis of physiological factors.” Journal of Applied Physiology, 70, no. 2 (1991): 683-687.
• Kharecha, Pushker A., and James E. Hansen, 2013, ‘Prevented mortality and greenhouse gas emissions from historical and projected nuclear power,’ Environmental science & technology, 47, no. 9, pp. 4889-4895.
• Samuel, Arthur L., 1959, ‘Some studies in machine learning using the game of checkers,’ IBM Journal of research and development, 3, no. 3, pp. 210-229.
• Wardrop, Robert L., 1995, ‘Simpson’s paradox and the hot hand in basketball,’ The American Statistician, 49, no. 1, 24-28.

Key concepts/skills/etc

• Show the reader your raw data, or as close as you can come to it.
• Use either geom_point or geom_bar initially.
• Writing efficiently and effectively is a requirement if you want your work to be convincing.
• A good title says what the paper is about, a great title says what the paper found.
• For a six-page paper, a good abstract is a three to five sentence paragraph. For a longer paper your abstract can be slightly longer.
• Thinking of maps as a (often fiddly, but strangely enjoyable) variant of a usual ggplot.
• Broadening the data that we make available via interactive maps, while still telling a clear story.
• Becoming comfortable with (and excited about) creating static maps.

Key libraries

• ggplot2
• patchwork
• ggmap
• maps

Key functions/etc

• ggplot2::geom_point()
• ggplot2::geom_bar()
• canada.cities
• geom_polygon()
• ggmap()
• map()
• map_data()

## 6.1 Introduction

[T]he duty of a scientist is not only to find new things, but to communicate them successfully in at least three forms: 1) Writing papers and books. 2) Prepared public talks. 3) Impromptu talks.

In order to convince someone of your story, your paper must be well-written, well-organized, and easy to follow. It should flow easily from one point to the next. It should have proper sentence structure, spelling, vocabulary, and grammar. Each point should be articulated clearly and completely without being overly verbose. Papers should demonstrate your understanding of the topics you are writing about and your confidence in discussing the terms, techniques and issues that are relevant. References must be included and properly cited because this enhances your credibility.

People who need to write: founders, VCs, lawyers, software engineers, designers, painters, data scientists, musicians, filmmakers, creative directors, physical trainers, teachers, writers. Learn to write.

This is great advice. Writing well has done just as much for me as knowing how to code. I’d add that if you’re intimidated by writing, start a blog and write often about something you’re interested in. You’ll get better. At least that’s what I’ve done for the past 10 years. :)

This chapter is about writing. By the end of it you will have a better idea of how to write short, detailed, quantitative papers that communicate exactly what you want them to and don’t waste the time of your reader.

One critical part of telling stories with data is that it’s ultimately the data that has to convince your reader. You’re the medium, but the data are the message. To that end, the easiest way to try to convince someone of your story is to show them the data that allowed you to come to that story. Plot your raw data, or as close to it as possible.

While ggplot is a fantastic tool for doing this, there is a lot to that package and so it can be difficult to know where to start. My recommendation is that you start with either a scatter plot or a bar chart. What is critical is that you show the reader your raw data. These notes run through how to do that. They then discuss some more advanced options, but the important thing is that you show the reader your raw data (or as close to it as you can). Students seem to get confused about what ‘raw’ means; I’m using it to refer to as close to the original dataset as possible, so no sums, or averages, etc, if possible. Sometimes your data are too dispersed for that or you’ve got other constraints, so there needs to be an element of manipulation. The main point is that you, at the very least, need to plot the data that you’re going to be modelling. If you are dealing with larger datasets then just take a 10/1/0.1/etc per cent sample.
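
As a rough sketch of that sampling advice, the following simulates a large dataset and plots a one per cent random sample of it. The dataset `big_data` and its columns are made up purely for illustration; `slice_sample()` comes from dplyr.

```r
library(dplyr)
library(ggplot2)

set.seed(853)

# Hypothetical large dataset, simulated purely for illustration
big_data <- tibble(
  x = rnorm(n = 100000),
  y = x + rnorm(n = 100000)
)

# Keep a one per cent random sample of the rows
sampled_data <- big_data %>%
  slice_sample(prop = 0.01)

# Plot the sampled raw data
sample_plot <- sampled_data %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 0.3) +
  theme_minimal()

sample_plot
```

Changing `prop` to 0.1 or 0.001 gives the other sample sizes mentioned above. Setting a seed means the same rows are drawn each time the code is run.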

## 6.2 Graphs

Graphs are critical to tell a compelling story. And the most important thing with your graphs is to plot your raw data. Again: Plot. Your. Raw. Data.

Figure 6.1 provides invaluable advice (thank you to Thomas William Rosenthal).

Let’s look at a somewhat fun example from the datasauRus package.

library(tidyverse)
library(datasauRus)

# Code from: https://juliasilge.com/blog/datasaurus-multiclass/
# Calculate the mean, standard deviation, and correlation for four of the datasets
datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away", "bullseye")) %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)),
            x_y_cor = cor(x, y)
            ) %>%
  ungroup()
#> # A tibble: 4 × 6
#>   dataset  x_mean  x_sd y_mean  y_sd x_y_cor
#>   <chr>     <dbl> <dbl>  <dbl> <dbl>   <dbl>
#> 1 away       54.3  16.8   47.8  26.9 -0.0641
#> 2 bullseye   54.3  16.8   47.8  26.9 -0.0686
#> 3 dino       54.3  16.8   47.8  26.9 -0.0645
#> 4 star       54.3  16.8   47.8  26.9 -0.0630

And despite these similarities at a summary statistic level, they’re actually very different, well, beasts, when you plot the raw data.

datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away", "bullseye")) %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
  labs(colour = "Dataset")

### 6.2.1 Bar chart

Bar charts are useful when you have one variable that you want to focus on. Hint: you almost always have one variable that you want to focus on. Hence, you should almost always include at least one (and likely many) bar charts. Bar charts go by a variety of names, depending on their specifics. I recommend the RStudio Data Viz Cheat Sheet.

To get started, let’s simulate some data.

set.seed(853)

number_of_observation <- 10000

example_data <-
  tibble(person = c(1:number_of_observation),
         smoker = sample(x = c("Smoker", "Non-smoker"),
                         size = number_of_observation,
                         replace = TRUE),
         age_died = runif(number_of_observation,
                          min = 0,
                          max = 100) %>% round(digits = 0),
         height = sample(x = c(50:220),
                         size = number_of_observation,
                         replace = TRUE),
         num_children = sample(x = c(0:5),
                               size = number_of_observation,
                               replace = TRUE,
                               prob = c(0.1, 0.2, 0.40, 0.15, 0.1, 0.05))
         )

First, let’s have a look at the data.

head(example_data)
#> # A tibble: 6 × 5
#>   person smoker     age_died height num_children
#>    <int> <chr>         <dbl>  <int>        <int>
#> 1      1 Smoker           55     80            3
#> 2      2 Non-smoker       54     78            2
#> 3      3 Non-smoker       84    109            1
#> 4      4 Smoker           75    114            4
#> 5      5 Smoker           32    135            1
#> 6      6 Smoker           37    220            0

Now let’s plot the age distribution. Based on our simulated data, we’re expecting a fairly uniform plot.

example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar()

Now let’s make it look a little better. There are themes that are built into ggplot, or you can install other themes from other packages, or you can edit aspects yourself. I’d recommend starting with the ggthemes package for some fun ones, but I tend to just use classic or minimal. Remember that you must always refer to your graphs in your text (Figure 6.2).

example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")

We may want to facet by some variable, in this case whether the person is a smoker (Figure 6.3).

example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")

Alternatively, we may wish to colour by that instead (Figure 6.4). I’ll filter to just a handful of age-groups to keep it tractable.

example_data %>%
  filter(age_died < 25) %>%
  ggplot(mapping = aes(x = age_died, fill = smoker)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       fill = "Smoker",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")

It’s important to recognise that a boxplot hides the full distribution of a variable. Unless you need to communicate the general distribution of many variables at once then you should not use them. The same box plot can apply to very different distributions.

### 6.2.2 Scatter plot

Often, we are also interested in the relationship between two series. We’ll do that with a scatter plot. A scatter plot is almost always your best choice. In this case, let’s simulate some data, say years of education and income.

set.seed(853)

number_of_observation <- 500

scatter_data <-
  tibble(years_of_education = runif(n = number_of_observation, min = 10, max = 25),
         error = rnorm(n = number_of_observation, mean = 0, sd = 10000)
         ) %>%
  mutate(income = years_of_education * 5000 + error,
         income = if_else(income < 0, 0, income)) # Income cannot be negative

head(scatter_data)
#> # A tibble: 6 × 3
#>   years_of_education   error income
#>                <dbl>   <dbl>  <dbl>
#> 1               15.4 -13782. 63180.
#> 2               11.8   7977. 66985.
#> 3               17.3  -9787. 76498.
#> 4               14.7  12999. 86689.
#> 5               10.6  -1500. 51302.
#> 6               16.1   1911. 82202.

Now let’s look at income as a function of years of education (Figure 6.5).

scatter_data %>%
  ggplot(mapping = aes(x = years_of_education, y = income)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Income",
       title = "Relationship between income and years of education",
       caption = "Source: Simulated data.")

### 6.2.3 Never use box plots

Box plots are almost never appropriate because they hide the distribution of data. To see this, consider some data from a beta distribution.

# Two skewed samples that together form a distribution with two peaks
left <- rbeta(10000, 5, 2)
right <- rbeta(10000, 2, 5)

tricky_data <-
  tibble(left_and_right = c(left, right),
         # Beta(1, 1) is uniform, so this variable has no peaks at all
         middle = rbeta(20000, 1, 1))

Then compare the box plots.

boxplot(tricky_data$left_and_right)
boxplot(tricky_data$middle)

hist(tricky_data$left_and_right)
hist(tricky_data$middle)
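
To see why the box plots look alike, it can help to compare the quantiles that a box plot draws. This is a minimal sketch in base R that re-simulates two variables with the same structure as those above; the seed is my own choice, so the draws will not match the ones used for the figures.

```r
set.seed(853)

# Two peaks, one near each end of the unit interval
left_and_right <- c(rbeta(10000, 5, 2), rbeta(10000, 2, 5))
# Beta(1, 1) is the uniform distribution on [0, 1]
middle <- rbeta(20000, 1, 1)

# The minimum, quartiles, and maximum are what a box plot is built from
five_number_summary <- rbind(
  left_and_right = quantile(left_and_right),
  middle = quantile(middle)
)

round(five_number_summary, 2)
```

The two rows should come out close to each other, even though one variable has two peaks and the other is flat. That is the information that the box plot throws away and the histogram keeps.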

### 6.2.4 Other

#### 6.2.4.1 Best fit

If we’re interested in quickly adding a line of best fit then, continuing with the earlier income example, we can do that with geom_smooth() (Figure 6.6).

scatter_data %>%
  ggplot(mapping = aes(x = years_of_education, y = income)) +
  geom_point() +
  geom_smooth(method = lm, color = "black") +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Income",
       title = "Relationship between income and years of education",
       caption = "Source: Simulated data.")
#> geom_smooth() using formula 'y ~ x'
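
The line that geom_smooth(method = lm) draws is a fitted linear model, so we can also fit it ourselves and look at the numbers behind it. This is a sketch in base R that re-simulates data with the same structure as the income example; it is not guaranteed to reproduce the exact draws used above.

```r
set.seed(853)

number_of_observation <- 500

# Re-simulate data with the same structure as the income example
years_of_education <- runif(n = number_of_observation, min = 10, max = 25)
error <- rnorm(n = number_of_observation, mean = 0, sd = 10000)
income <- pmax(years_of_education * 5000 + error, 0) # Income cannot be negative

# Fit the same kind of linear model that geom_smooth(method = lm) draws
income_model <- lm(income ~ years_of_education)

coef(income_model)
```

Because we simulated income as 5,000 for each year of education plus noise, the estimated slope should be somewhere near 5,000.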

#### 6.2.4.2 Histogram

If we want to see how many observations fall within each range of a continuous variable, then we may want to use a histogram. Figure 6.7 shows the counts for our simulated incomes.

scatter_data %>%
  ggplot(mapping = aes(x = income)) +
  geom_histogram() +
  theme_minimal() +
  labs(x = "Income",
       y = "Number",
       title = "Distribution of income",
       caption = "Source: Simulated data.")
#> stat_bin() using bins = 30. Pick better value with
#> binwidth.
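
That message is ggplot asking us to choose the bins ourselves rather than accept the default of 30. A minimal sketch of doing that follows. The incomes here are simulated from a normal distribution with a mean and standard deviation that I picked arbitrarily, so they only stand in for the income variable above.

```r
library(ggplot2)

set.seed(853)

# Simulated incomes; the mean and standard deviation are arbitrary choices
income_data <- data.frame(
  income = pmax(rnorm(n = 500, mean = 87500, sd = 24000), 0)
)

income_histogram <-
  ggplot(income_data, aes(x = income)) +
  geom_histogram(binwidth = 5000) + # Each bar covers 5,000 units of income
  theme_minimal() +
  labs(x = "Income",
       y = "Number")

income_histogram
```

Try a few different values of binwidth. The story that a histogram tells can change quite a bit with the bin size, so it is worth checking that yours does not depend on one particular choice.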

#### 6.2.4.3 Multiple plots

Finally, let’s try putting them together. We’re going to use the patchwork package and the palmerpenguins package for data. Don’t forget install.packages("palmerpenguins") as this is probably the first time you’ve used the package.

library(patchwork)
library(palmerpenguins)

p1 <-
  ggplot(palmerpenguins::penguins) +
  geom_point(aes(bill_length_mm, bill_depth_mm)) +
  labs(x = "Bill length (mm)",
       y = "Bill depth (mm)")
p2 <-
  ggplot(palmerpenguins::penguins) +
  geom_bar(aes(species)) +
  labs(x = "Species",
       y = "Number")

p1 + p2

And we can make things fairly involved fairly quickly.


(p1 | p2) /
p2
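
Beyond arranging plots with `+`, `|` and `/`, patchwork can also control relative sizes and add a shared title. The following is a small sketch of that using plot_layout() and plot_annotation(). It uses the built-in mtcars data rather than the penguins only so that it runs without installing anything else, and the plot names are mine.

```r
library(ggplot2)
library(patchwork)

# Built-in mtcars data, used here only so the example is self-contained
scatter <-
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon")

bars <-
  ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number")

# Stack the plots, make the top one twice as tall,
# then add a shared title and tag the panels A and B
combined <-
  (scatter / bars) +
  plot_layout(heights = c(2, 1)) +
  plot_annotation(title = "Two views of the same cars",
                  tag_levels = "A")

combined
```

The panel tags are handy because they give you something to refer to in your text, for instance ‘panel A’.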

## 6.3 Tables

Tables are also critical to tell a compelling story. We may prefer a table to a graph when there are only a few features that we want to focus on. We’ll use knitr::kable() alongside the ‘kableExtra’ package and also the gt package.

Let’s start with knitr::kable() and the summary dinosaur data from earlier.

example_data <-
  datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away")) %>%
  group_by(dataset) %>%
  summarize(
    Mean    = mean(x),
    Std_dev = sd(x)
    )

example_data %>%
knitr::kable()
| dataset | Mean | Std_dev |
|:--------|---------:|---------:|
| away | 54.26610 | 16.76983 |
| dino | 54.26327 | 16.76514 |
| star | 54.26734 | 16.76896 |

Even the defaults are pretty good, but we can add a few tweaks to make the table better. The first is that this many significant digits is inappropriate. We may also like to add a caption, make the column names consistent, and change the alignment.

example_data %>%
  knitr::kable(digits = 2,
               caption = "My first table.",
               col.names = c("Dataset", "Mean", "Standard deviation"),
               align = c('l', 'l', 'l')
               )

Table 6.1: My first table.

| Dataset | Mean | Standard deviation |
|:--------|:------|:-------------------|
| away | 54.27 | 16.77 |
| dino | 54.26 | 16.77 |
| star | 54.27 | 16.77 |

The ‘kableExtra’ package builds extra functionality (Zhu 2020).
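
As one hedged illustration of that extra functionality, the sketch below styles a small table with kable_styling() and adds a header that spans two columns with add_header_above(). The values are hard-coded, rounded from Table 6.1, so that the example runs on its own; the object names are mine, and I ask for HTML output explicitly.

```r
library(kableExtra)

# Hard-coded summary values, rounded from Table 6.1
summary_table <- data.frame(
  dataset = c("away", "dino", "star"),
  mean = c(54.27, 54.26, 54.27),
  std_dev = c(16.77, 16.77, 16.77)
)

styled_table <-
  kbl(summary_table,
      format = "html",
      col.names = c("Dataset", "Mean", "Standard deviation"),
      align = c("l", "r", "r")) %>%
  kable_styling(full_width = FALSE) %>% # Do not stretch the table across the page
  add_header_above(c(" " = 1, "x" = 2)) # Group the two summary columns under 'x'

styled_table
```

I have right-aligned the numeric columns here, which tends to make numbers easier to compare down a column, but that is a matter of taste.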

The gt package is a newer package that brings a lot of exciting features. However, being newer it sometimes has issues with PDF output.

library(gt)

example_data %>%
gt()
| dataset | Mean | Std_dev |
|:--------|---------:|---------:|
| away | 54.26610 | 16.76982 |
| dino | 54.26327 | 16.76514 |
| star | 54.26734 | 16.76896 |

We could add sub-titles easily.

example_data %>%
  gt() %>%
  tab_header(
    title = "Summary stats can be misleading",
    subtitle = "With an example from a dinosaur!"
    )

Summary stats can be misleading
With an example from a dinosaur!

| dataset | Mean | Std_dev |
|:--------|---------:|---------:|
| away | 54.26610 | 16.76982 |
| dino | 54.26327 | 16.76514 |
| star | 54.26734 | 16.76896 |

One common reason for needing a table is to report regression results. You should consider gtsummary, stargazer, and modelsummary. But at the moment, my favourite is modelsummary.

library(modelsummary)

mod <- lm(y ~ x, datasaurus_dozen)
modelsummary(mod)
| | Model 1 |
|:------------|----------:|
| (Intercept) | 53.590 |
| | (2.119) |
| x | -0.106 |
| | (0.037) |
| Num.Obs. | 1846 |
| R2 | 0.004 |
| AIC | 17383.0 |
| BIC | 17399.6 |
| Log.Lik. | -8688.506 |
| F | 8.072 |
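
We often want to compare a few models side by side, and modelsummary() accepts a named list of models for that. The following is a sketch that fits the same regression separately for two of the dinosaur datasets; the object names are mine.

```r
library(datasauRus)
library(modelsummary)

# Fit the same regression separately for two of the datasets
dino_data <- subset(datasaurus_dozen, dataset == "dino")
star_data <- subset(datasaurus_dozen, dataset == "star")

models <- list(
  "Dino" = lm(y ~ x, data = dino_data),
  "Star" = lm(y ~ x, data = star_data)
)

# One column per model; the names of the list become the column headers
modelsummary(models)
```

Given what we saw earlier, the two columns should look very alike even though the underlying scatter plots are nothing like each other, which is one more reason to plot your raw data before reaching for a regression table.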

## 6.4 Maps

In many ways maps can be thought of as a fancy graph, where the x-axis is longitude, the y-axis is latitude, and there is some outline or a background image. This type of set-up is familiar from ggplot. Static maps will be useful for printed output, such as a PDF or Word report, or where there is something in particular that you want to illustrate.

ggplot() +
  geom_polygon( # First draw an outline
    data = some_data,
    aes(x = longitude,
        y = latitude,
        group = group
        )) +
  geom_point( # Then add points of interest
    data = some_other_data,
    aes(x = longitude,
        y = latitude)
    )

And while there are some small complications, for the most part it is as straightforward as that. The first step is to get some data. And helpfully, there is some geographic data built into ggplot, and there is some other information built into a package called maps.

library(maps)
library(tidyverse)

canada <- map_data(database = "world", regions = "canada")
canadian_cities <- maps::canada.cities

head(canada)
#>        long      lat group order region    subregion
#> 1 -59.78760 43.93960     1     1 Canada Sable Island
#> 2 -59.92227 43.90391     1     2 Canada Sable Island
#> 3 -60.03775 43.90664     1     3 Canada Sable Island
#> 4 -60.11426 43.93911     1     4 Canada Sable Island
#> 5 -60.11748 43.95337     1     5 Canada Sable Island
#> 6 -59.93604 43.93960     1     6 Canada Sable Island

head(canadian_cities)
#>            name country.etc    pop   lat    long capital
#> 1 Abbotsford BC          BC 157795 49.06 -122.30       0
#> 2      Acton ON          ON   8308 43.63  -80.03       0
#> 3 Acton Vale QC          QC   5153 45.63  -72.57       0
#> 4    Airdrie AB          AB  25863 51.30 -114.02       0
#> 5    Aklavik NT          NT    643 68.22 -135.00       0
#> 6    Albanel QC          QC   1090 48.87  -72.42       0

With that information in hand we can then create a map of Canada that shows the cities with a population over 1,000. (The geom_polygon() function within ggplot draws shapes, by connecting points within groups. And the coord_map() function adjusts for the fact that we are making a 2D map to represent something that is 3D.)

ggplot() +
  geom_polygon(data = canada, # Draw the outline of Canada
               aes(x = long,
                   y = lat,
                   group = group),
               fill = "white",
               colour = "grey") +
  coord_map(ylim = c(40, 70)) +
  geom_point(data = canadian_cities, # Add a point for each city
             aes(x = long, y = lat),
             alpha = 0.3,
             color = "black") +
  theme_classic() +
  labs(x = "Longitude",
       y = "Latitude")
# If I'm being honest, this 'simple example' took me six hours to work out. Firstly
# to find Canada and then to find Canadian cities.
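
The role of the group aesthetic in geom_polygon() is easier to see with a toy example than with a whole country. This is a minimal sketch with two made-up squares, where the group column says which corners belong to which shape.

```r
library(ggplot2)

# Two hypothetical squares; 'group' says which corners belong together
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = c(1, 1, 1, 1,  2, 2, 2, 2)
)

two_squares <-
  ggplot(squares, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "grey") +
  theme_classic()

two_squares
```

If you remove group = group then ggplot connects all eight points into one shape. That is the same thing that goes wrong when a map of a country with many islands comes out as a tangle of lines.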

As is often the case with R, there are many different ways to get started creating static maps. We’ve already seen how they can be built using simply ggplot, but here we’ll explore one package that has a bunch of functionalities built in that will make things easier: ggmap.

There are two essential components to a map: 1) some border or background image (also known as a tile); and 2) something of interest within that border or on top of that tile. In ggmap, we will use an open-source option for our tile, Stamen Maps (maps.stamen.com), and we will plot points based on latitude and longitude.

### 6.4.1 Australian polling places

As in Canada, in Australia people go to specific locations, called booths, to vote. These booths have latitudes and longitudes and so we can plot these. One reason we may like to do this is to notice patterns over geographies.

To get started we need to get a tile. We are going to use ggmap to get a tile from Stamen Maps, which builds on OpenStreetMap (openstreetmap.org). The main argument to this function is to specify a bounding box. This requires two latitudes - one for the top of the box and one for the bottom of the box - and two longitudes - one for the left of the box and one for the right of the box. (It can be useful to use Google Maps, or an alternative, to find the values of these that you need.) The bounding box provides the coordinates of the edges that you are interested in. In this case I have provided it with coordinates such that it will be centered around Canberra, Australia (our equivalent of Ottawa - a small city that was created for the purposes of being the capital).

library(ggmap)

bbox <- c(left = 148.95, bottom = -35.5, right = 149.3, top = -35.1)
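
If you find it easier to think in terms of a centre point and how far to go in each direction, then a small helper can build the bounding box for you. This is only a sketch: make_bbox() is a hypothetical function that I am defining here, not part of ggmap, and the centre and half-widths are values I chose so that it roughly reproduces the box above.

```r
# Hypothetical helper: build a bounding box from a centre point and half-widths
make_bbox <- function(centre_long, centre_lat, half_width, half_height) {
  c(
    left = centre_long - half_width,
    bottom = centre_lat - half_height,
    right = centre_long + half_width,
    top = centre_lat + half_height
  )
}

# Roughly reproduces the Canberra box used above
canberra_bbox <- make_bbox(
  centre_long = 149.125,
  centre_lat = -35.3,
  half_width = 0.175,
  half_height = 0.2
)

canberra_bbox
```

The names left, bottom, right, and top match those used in the bounding box above, so the result can be used in the same way.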

Once you have defined the bounding box, then the function get_stamenmap() will get the tiles in that area. The number of tiles that it needs to get depends on the zoom, and the type of tiles that it gets depends on the maptype. I’ve chosen the maptype that I like here - the black and white option - but the helpfile specifies a few others that you may like. At this point you can pass your maps to ggmap and it will plot the tile! It will be actively downloading these tiles, so you need an internet connection.

canberra_stamen_map <- get_stamenmap(bbox, zoom = 11, maptype = "toner-lite")

ggmap(canberra_stamen_map)

Once we have a map then we can use ggmap() to plot it. (That circle in the middle of the map is where the Australian Parliament House is… yes, our parliament is surrounded by circular roads (we call them ‘roundabouts’), actually it’s surrounded by two of them.)

Now we want to get some data that we will plot on top of our tiles. We will just plot the location of the polling places, based on which ‘division’ (the Australian equivalent to ‘ridings’ in Canada) it is. This is available here: https://results.aec.gov.au/20499/Website/Downloads/HouseTppByPollingPlaceDownload-20499.csv. (The Australian Electoral Commission (AEC) is the official government agency that is responsible for elections in Australia.)

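# Read in the booths data for each year
booths <-
  readr::read_csv("https://results.aec.gov.au/20499/Website/Downloads/HouseTppByPollingPlaceDownload-20499.csv",
                  skip = 1,
                  guess_max = 10000)

head(booths)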
#> # A tibble: 6 × 15
#>   State DivisionID DivisionNm PollingPlaceID
#>   <chr>      <dbl> <chr>               <dbl>
#> 1 ACT          318 Bean                93925
#> 2 ACT          318 Bean                93927
#> 3 ACT          318 Bean                11877
#> 4 ACT          318 Bean                11452
#> 5 ACT          318 Bean                 8761
#> 6 ACT          318 Bean                 8763
#> # … with 11 more variables: PollingPlaceTypeID <dbl>,
#> #   PollingPlaceNm <chr>, PremisesNm <chr>,
#> #   PremisesAddress3 <chr>, PremisesSuburb <chr>,
#> #   PremisesStateAb <chr>, PremisesPostCode <chr>,
#> #   Latitude <dbl>, Longitude <dbl>

This dataset is for the whole of Australia, but as we are just going to plot the area around Canberra we will filter to that and only to booths that are geographic (the AEC has various options for people who are in hospital, or not able to get to a booth, etc, and these are still ‘booths’ in this dataset).

# Reduce the booths data to only rows that have latitude and longitude
booths_reduced <-
  booths %>%
  filter(State == "ACT") %>%
  select(PollingPlaceID, DivisionNm, Latitude, Longitude) %>%
  filter(!is.na(Longitude)) %>% # Remove rows that don't have a geography
  filter(Longitude < 165) # Remove Norfolk Island

Now we can use ggmap in the same way as before to plot our underlying tiles, and then build on that using geom_point() to add our points of interest.

ggmap(canberra_stamen_map,
      extent = "normal",
      maprange = FALSE) +
  geom_point(data = booths_reduced,
             aes(x = Longitude,
                 y = Latitude,
                 colour = DivisionNm)) +
  scale_color_brewer(name = "2019 Division", palette = "Set1") +
  # Use the bounding box stored with the tiles to set the limits
  coord_map(projection = "mercator",
            xlim = c(attr(canberra_stamen_map, "bb")$ll.lon, attr(canberra_stamen_map, "bb")$ur.lon),
            ylim = c(attr(canberra_stamen_map, "bb")$ll.lat, attr(canberra_stamen_map, "bb")$ur.lat)) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

We may like to save the map so that we don’t have to draw it every time, and we can do that in the same way as any other graph, using ggsave().

ggsave("outputs/figures/map.pdf", width = 20, height = 10, units = "cm")
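
By default ggsave() saves the last plot that was displayed, which is easy to get wrong in a long script. A slightly more defensive sketch is to give the plot a name and pass it in explicitly. Here I use a stand-in plot of the built-in mtcars data and a temporary folder, both of which are just for illustration.

```r
library(ggplot2)

# A small stand-in plot using built-in data
example_plot <-
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

# Save into a temporary folder so the sketch does not touch your project
output_dir <- file.path(tempdir(), "figures")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, "example_plot.pdf")

ggsave(filename = output_file,
       plot = example_plot, # Name the plot rather than rely on the last one drawn
       width = 20,
       height = 10,
       units = "cm")
```

In your own project you would swap the temporary folder for something like the outputs/figures folder used above, making sure that the folder exists first.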

Finally, the reason that I used Stamen Maps and OpenStreetMap is that they are open source; however, you can also use Google Maps if you want. This requires you to first register a credit card with Google, and specify a key, but with low usage it should be free. The get_googlemap() function within ggmap brings some nice features that get_stamenmap() does not have. For instance, you can enter a placename and it’ll do its best to find it rather than needing to specify a bounding box.

### 6.4.2 Toronto bike parking

Let’s see another example of a static map, this time using Toronto data accessed via the opendatatoronto package. The dataset that we are going to plot is available here: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.

# This code is based on code from: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.
library(opendatatoronto)
# (The string identifies the package.)
resources <- list_package_resources("71e6c206-96e1-48f1-8f6f-0e804687e3be")
# In this case there is only one dataset within this resource so just need the first one
raw_data <- filter(resources, row_number()==1) %>% get_resource()
write_csv(raw_data, "inputs/data/bike_racks.csv")
head(raw_data)

Now that we’ve saved a copy of the data, we can use that one. First, we need to clean it up a bit. There are some clear errors in the ADDRESSNUMBERTEXT field, but not too many, so we’ll just ignore it.

raw_data <- read_csv("inputs/data/bike_racks.csv")
# We'll just focus on the data that we want
bike_data <-
  tibble(ward = raw_data$WARD,
         id = raw_data$ID,
         status = raw_data$STATUS,
         street_address = paste(raw_data$ADDRESSNUMBERTEXT, raw_data$ADDRESSSTREET),
         latitude = raw_data$LATITUDE,
         longitude = raw_data$LONGITUDE)
rm(raw_data)

Some of the bike racks were temporary so remove them and also let’s just look at the area around the university, which is Ward 11.

# Only keep ones that still exist
bike_data <-
  bike_data %>%
  filter(status == "Existing") %>%
  select(-status)

bike_data <-
  bike_data %>%
  filter(ward == 11) %>%
  select(-ward)

If you look at the dataset at this point, then you’ll notice that there is a row for every bike parking spot. But we don’t really need to know that, because sometimes there are lots right next to each other. Instead, we’d just like the one point (we’ll take advantage of this in an interactive graph in a moment). So, we want to create a count by address, and then just get one instance per address.

bike_data <-
  bike_data %>%
  group_by(street_address) %>%
  mutate(number_of_spots = n(),
         running_total = row_number()
         ) %>%
  ungroup() %>%
  filter(running_total == 1) %>%
  select(-id, -running_total)

head(bike_data)
#> # A tibble: 6 × 4
#>   street_address   latitude longitude number_of_spots
#>   <chr>               <dbl>     <dbl>           <int>
#> 1 8 Kensington Ave     43.7     -79.4               1
#> 2 87 Avenue Rd         43.7     -79.4               4
#> 3 162 Mc Caul St       43.7     -79.4               1
#> 4 147 Baldwin St       43.7     -79.4               2
#> 5 888 Yonge St         43.7     -79.4               1
#> 6 180 Elizabeth St     43.7     -79.4              10

write_csv(bike_data, "outputs/data/bikes.csv")

Now we can grab our tile and add our bike rack data onto it.

bbox <- c(left = -79.420390, bottom = 43.642658, right = -79.383354, top = 43.672557)

toronto_stamen_map <- get_stamenmap(bbox, zoom = 14, maptype = "toner-lite")

ggmap(toronto_stamen_map, maprange = FALSE) +
  geom_point(data = bike_data,
             aes(x = longitude,
                 y = latitude),
             alpha = 0.3
             ) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal()

### 6.4.3 Geocoding

To this point we just assumed that we already had geocoded data. The places ‘Canberra, Australia,’ or ‘Ottawa, Canada,’ are just names, they don’t actually inherently have a location. In order to plot them we need to get a latitude and longitude for them.

The process of going from names to coordinates is called geocoding. There are a range of options to geocode data in R, but one good package is tidygeocoder. To get started using the package we need a dataframe of locations. So we’ll just quickly make one here.

some_locations <-
  tibble(city = c('Canberra', 'Ottawa'),
         country = c('Australia', 'Canada'))

tidygeocoder::geo(city = some_locations$city,
                  country = some_locations$country,
                  method = 'osm')
#> # A tibble: 2 × 4
#>   city     country     lat  long
#>   <chr>    <chr>     <dbl> <dbl>
#> 1 Canberra Australia -35.3 149.
#> 2 Ottawa   Canada     45.4 -75.7

## 6.5 Exercises and tutorial

### 6.5.1 Exercises

1. I have a dataset that contains measurements of height (in cm) for a sample of 300 penguins, who are either the Adélie or Emperor species. I am interested in visualizing the distribution of heights by species in a graphical way. Please discuss whether a pie chart is an appropriate type of graph to use. What about a box and whisker plot? Finally, what are some considerations if you made a histogram? [Please write a paragraph or two for each aspect.]
2. Assume the dataset and columns exist. Would this code work (pick one)? data %>% ggplot(aes(x = col_one)) %>% geom_point()
    1. Yes
    2. No
3. If I have categorical data, which geom should I use to plot it (pick one)?
    1. geom_bar()
    2. geom_point()
    3. geom_abline()
    4. geom_boxplot()
4. Why are box plots often inappropriate (pick one)?
    1. They hide the full distribution of the data.
    2. They are hard to make.
    3. They are ugly.
    4. The mode is clearly displayed.
5. Which of the following, if any, are elements of the layered grammar of graphics (select all that apply)?
    1. A default dataset and set of mappings from variables to aesthetics.
    2. One or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings.
    3. Colours that enable the reader to understand the main point.
    4. A coordinate system.
    5. The facet specification.
    6. One scale for each aesthetic mapping used.

### 6.5.2 Tutorial

Discuss, in a page or two, the layered grammar of graphics and how it relates to telling stories with data.