# Graphs

• Healy, Kieran, 2019, Data Visualization: A Practical Introduction, Princeton University Press, https://socviz.co/.
• Wickham, Hadley, and Garrett Grolemund, 2017, ‘R for Data Science’, Chapter 28, freely available at: https://r4ds.had.co.nz/.
• Vanderplas, Susan, Dianne Cook, and Heike Hofmann, 2020, ‘Testing Statistical Charts: What Makes a Good Graph?’, Annual Review of Statistics and Its Application.

Key concepts/skills/etc

• Show the reader your raw data, or as close as you can come to it.
• Use either `geom_point()` or `geom_bar()` initially.

Key libraries

• `ggplot2`

Key functions/etc

• `geom_point()`
• `geom_bar()`

# Introduction

The most essential task when trying to convince someone of your story is to show them the data that allowed you to come to that story. But while `ggplot2` is a fantastic tool for doing this, there is so much going on that people sometimes don’t know where to start. My recommendation is that you start with either a scatter plot or a bar chart. These notes run through how to do that. They then discuss some more advanced options, but the important thing is that you show the reader your raw data (or as close to it as you can).

Plot. Your. Raw. Data.

Read Chapter 3 of Kieran Healy’s book: https://socviz.co/makeplot.html#makeplot.

# Bars

Bar charts are useful when you have one variable that you want to focus on. Hint: you almost always have one variable that you want to focus on. Hence, you should almost always include at least one bar chart (and likely many). Bar charts go by a variety of names, depending on their specifics. I recommend the RStudio Data Viz Cheat Sheet as a reference.

To get started, let’s simulate some data.

``````
library(tidyverse)

set.seed(853)

number_of_observation <- 10000

# Simulate one row per person; each variable is drawn independently
example_data <- tibble(
  person = c(1:number_of_observation),
  smoker = sample(x = c("Smoker", "Non-smoker"),
                  size = number_of_observation,
                  replace = TRUE),
  age_died = runif(number_of_observation,
                   min = 0,
                   max = 100) %>% round(digits = 0),
  height = sample(x = c(50:220),
                  size = number_of_observation,
                  replace = TRUE),
  number_of_children = sample(x = c(0:5),
                              size = number_of_observation,
                              replace = TRUE,
                              prob = c(0.1, 0.2, 0.40, 0.15, 0.1, 0.05))
)
``````

First, let’s have a look at the data.

``````
head(example_data)
``````

``````
# A tibble: 6 x 5
  person smoker     age_died height number_of_children
   <int> <chr>         <dbl>  <int>              <int>
1      1 Smoker           55     80                  3
2      2 Non-smoker       54     78                  2
3      3 Non-smoker       84    109                  1
4      4 Smoker           75    114                  4
5      5 Smoker           32    135                  1
6      6 Smoker           37    220                  0
``````

Now let’s plot the age distribution. Based on our simulated data, we’re expecting a fairly uniform plot.

``````
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar()
``````
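`geom_bar()` counts the rows at each value of `x` before drawing the bars. As a minimal sketch of what that means, the code below does the counting explicitly with `count()` and then draws the result with `geom_col()`, which plots the values it is given. The small `toy_data` tibble and the names `age_counts` and `counts_plot` are made up for illustration and are not part of the simulated data above.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical toy data, for illustration only
toy_data <- tibble(age_died = c(55, 54, 84, 55, 32, 54, 55))

# Count the rows at each age; these are the same counts geom_bar() computes
age_counts <- toy_data %>%
  count(age_died, name = "number")

age_counts

# geom_col() takes the bar heights directly from the 'number' column
counts_plot <- age_counts %>%
  ggplot(mapping = aes(x = age_died, y = number)) +
  geom_col()

counts_plot
```

Printing `age_counts` before plotting is a quick way to check that the bars show what you expect.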

Now let’s make it look a little better. There are themes that are built into `ggplot2`, or you can install other themes from other packages, or you can edit aspects yourself. I’d recommend starting with the `ggthemes` package for some fun ones, but I tend to just use classic or minimal. Remember that you must always refer to your graphs in your text (Figure 2).

``````
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")
``````
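The text above mentions classic as an alternative to minimal. As a rough sketch, a plot can be stored in an object and different built-in themes added to it for comparison. Here `toy_data`, `base_plot`, `minimal_plot`, and `classic_plot` are made-up names for illustration, and the toy values are not from the simulated data.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data, for illustration only
toy_data <- tibble(age_died = c(55, 54, 84, 55, 32, 54, 55))

# Build the plot once, without a theme
base_plot <- ggplot(toy_data, mapping = aes(x = age_died)) +
  geom_bar() +
  labs(x = "Age died",
       y = "Number",
       caption = "Source: Made-up toy data.")

# Add a different built-in theme to the same base plot
minimal_plot <- base_plot + theme_minimal()
classic_plot <- base_plot + theme_classic()

classic_plot
```

Keeping the theme as the last step makes it easy to try several and pick one.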

Finally, we may want to facet by some variable, in this case whether the person is a smoker (Figure 3).

``````
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")
``````

# Points

Often we are also interested in the relationship between two series. We’ll do that with a scatter plot. In this case, let’s try height by age died, again faceted by whether the person is a smoker (Figure 4).

``````
example_data %>%
  ggplot(mapping = aes(x = age_died, y = height)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Height",
       title = "Relationship between height and age of death, by smoking status",
       caption = "Source: Simulated data.")
``````
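With thousands of points, markers can sit on top of one another and hide how dense each region is. One common option, sketched here as a suggestion rather than as part of the notes above, is to make the points semi-transparent with the `alpha` argument of `geom_point()`. The `toy_data` tibble is simulated only to keep the block self-contained, and the value `0.2` is an arbitrary choice.

```r
library(ggplot2)
library(tibble)

set.seed(853)

# Hypothetical stand-in for the simulated data, for illustration only
toy_data <- tibble(
  age_died = round(runif(2000, min = 0, max = 100)),
  height = sample(x = c(50:220), size = 2000, replace = TRUE)
)

# alpha = 0.2 makes each point mostly transparent,
# so places where points overlap appear darker
alpha_plot <- ggplot(toy_data, mapping = aes(x = age_died, y = height)) +
  geom_point(alpha = 0.2) +
  theme_minimal() +
  labs(x = "Age died",
       y = "Height",
       caption = "Source: Made-up toy data.")

alpha_plot
```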

# Other

### Patchwork

Finally, let’s try putting plots together. We’re going to use the `patchwork` package and the `palmerpenguins` package for data.

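``````
library(ggplot2)
library(patchwork)
library(palmerpenguins)

# Scatter plot of bill length against bill depth
p1 <- ggplot(palmerpenguins::penguins) +
  geom_point(aes(bill_length_mm, bill_depth_mm))
# Bar chart of the number of penguins of each species
p2 <- ggplot(palmerpenguins::penguins) +
  geom_bar(aes(species))

# With patchwork loaded, `+` places the two plots side by side
p1 + p2
``````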
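`patchwork` supports other arrangements too. As a sketch, `/` stacks plots vertically instead of placing them side by side, and `plot_annotation()` adds a title for the combined figure. The title text and the name `stacked` are made up for illustration, and the block assumes the `patchwork` and `palmerpenguins` packages are installed.

```r
library(ggplot2)
library(patchwork)
library(palmerpenguins)

p1 <- ggplot(palmerpenguins::penguins) +
  geom_point(aes(bill_length_mm, bill_depth_mm))
p2 <- ggplot(palmerpenguins::penguins) +
  geom_bar(aes(species))

# `/` stacks p1 above p2; plot_annotation() adds an overall title
stacked <- p1 / p2 +
  plot_annotation(title = "Palmer penguins: bill dimensions and species counts")

stacked
```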