**Required reading**

- Healy, Kieran, 2019, *Data Visualization: A Practical Introduction*, Princeton University Press, https://socviz.co/.
- Wickham, Hadley, and Garrett Grolemund, 2017, ‘R for Data Science’, Chapter 28, freely available at: https://r4ds.had.co.nz/.
- Vanderplas, Susan, Dianne Cook, and Heike Hofmann, 2020, ‘Testing Statistical Charts: What Makes a Good Graph?’, *Annual Review of Statistics and Its Application*.

**Recommended reading**

- Patrick, Cameron, 2020, ‘Making beautiful bar charts with ggplot’, 15 March, freely available at: https://cameronpatrick.com/post/2020/03/beautiful-bar-charts-ggplot/.
- Patrick, Cameron, 2019, ‘Plotting multiple variables at once using ggplot2 and tidyr’, 26 November, freely available at: https://cameronpatrick.com/post/2019/11/plotting-multiple-variables-ggplot2-tidyr/.

**Key concepts/skills/etc**

- Show the reader your raw data, or as close as you can come to it.
- Use either `geom_point` or `geom_bar` initially.

**Key libraries**

`ggplot2`

**Key functions/etc**

`geom_point()`

`geom_bar()`

The most essential task when trying to convince someone of your story is to show them the data that allowed you to come to that story. `ggplot2` is a fantastic tool for doing this, but because there is so much going on, sometimes people don’t know where to start. My recommendation is that you start with either a scatter plot or a bar chart. These notes run through how to do that. They then discuss some more advanced options, but the important thing is that you show the reader your raw data (or as close to it as you can).

Source: YouTube screenshot.

Plot. Your. Raw. Data.

Read Chapter 3 of Kieran Healy’s book: https://socviz.co/makeplot.html#makeplot.

Bar charts are useful when you have one variable that you want to focus on. Hint: you almost always have one variable that you want to focus on. Hence, you should almost always include at least one (and likely many) bar charts. Bar charts go by a variety of names, depending on their specifics. For an overview of the options, I recommend the RStudio Data Viz Cheat Sheet.

To get started, let’s simulate some data.

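```
library(tidyverse) # Provides tibble(), %>%, and ggplot2

set.seed(853) # Make the simulation reproducible

number_of_observation <- 10000

example_data <- tibble(
  person = 1:number_of_observation,
  # Smoking status, drawn with equal probability
  smoker = sample(x = c("Smoker", "Non-smoker"),
                  size = number_of_observation,
                  replace = TRUE),
  # Age at death, uniform between 0 and 100, rounded to whole years
  age_died = runif(number_of_observation,
                   min = 0,
                   max = 100) %>% round(digits = 0),
  # Height, drawn uniformly from the integers 50 to 220
  height = sample(x = 50:220,
                  size = number_of_observation,
                  replace = TRUE),
  # Number of children, from 0 to 5, with 2 the most likely
  number_of_children = sample(x = 0:5,
                              size = number_of_observation,
                              replace = TRUE,
                              prob = c(0.1, 0.2, 0.40, 0.15, 0.1, 0.05))
)
```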

First, let’s have a look at the data.

```
head(example_data)
```

```
# A tibble: 6 x 5
person smoker age_died height number_of_children
<int> <chr> <dbl> <int> <int>
1 1 Smoker 55 80 3
2 2 Non-smoker 54 78 2
3 3 Non-smoker 84 109 1
4 4 Smoker 75 114 4
5 5 Smoker 32 135 1
6 6 Smoker 37 220 0
```

Now let’s plot the age distribution. Based on our simulated data, we’re expecting a fairly uniform plot.

```
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar()
```
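To see what "fairly uniform" should look like in numbers, here is a minimal sketch that counts the simulated ages directly. It re-creates only the `age_died` column rather than the full dataset, and the names `ages` and `age_counts` are introduced here for illustration. Because `runif()` draws between 0 and 100 and the values are then rounded, each interior age collects draws from an interval one year wide, so we would expect roughly 10000 / 100 = 100 people at each. Ages 0 and 100 only collect draws from an interval half a year wide, so we would expect roughly half as many there.

```r
library(tidyverse)

set.seed(853)

n <- 10000

# Simulate ages in the same way as the notes: uniform draws rounded to whole years
ages <- tibble(age_died = round(runif(n, min = 0, max = 100), digits = 0))

# Count how many people died at each age; this is what geom_bar() draws
age_counts <- ages %>%
  count(age_died)

# Compare the two end points with a few interior ages
age_counts %>%
  filter(age_died %in% c(0, 1, 50, 99, 100))
```

If the bars at the two ends of the plot look shorter than the rest, this rounding effect is the likely reason rather than anything wrong with the plot.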

Now let’s make it look a little better. There are themes that are built into `ggplot2`, or you can install themes from other packages, or you can edit aspects yourself. I’d recommend starting with the `ggthemes` package for some fun ones, but I tend to just use classic or minimal. Remember that you must always refer to your graphs in your text (Figure 2).

```
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")
```
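For comparison, here is a rough sketch of the same kind of bar chart using `theme_classic()`, the other built-in theme mentioned above. It uses a small stand-in dataset so that it runs on its own; the names `theme_demo_data` and `classic_plot` are made up for this example.

```r
library(tidyverse)

set.seed(853)

# A small stand-in dataset (an assumption for this sketch, not the full example_data)
theme_demo_data <- tibble(age_died = round(runif(1000, min = 0, max = 100), digits = 0))

# Same bar chart as before, but with theme_classic() instead of theme_minimal()
classic_plot <- theme_demo_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_classic() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")

classic_plot
```

Swapping themes only changes the non-data elements, such as the background, gridlines, and axis lines. The bars themselves are unchanged.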

Finally, we may want to facet by some variable, in this case whether the person is a smoker (Figure 3).

```
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")
```

Often we are also interested in the relationship between two variables. We’ll look at that with a scatter plot. In this case, let’s try age died by height (Figure 4).

```
example_data %>%
  ggplot(mapping = aes(x = age_died, y = height)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Height",
       title = "Relationship between height and age of death, by smoking status",
       caption = "Source: Simulated data.")
```
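With 10,000 points, many will sit on top of each other, which can make a scatter plot hard to read. One common option, sketched below, is to make the points partly transparent with the `alpha` argument of `geom_point()`. The value 0.1 is just a starting guess, and the names `scatter_data` and `faded_scatter` are introduced here so that the example runs on its own.

```r
library(tidyverse)

set.seed(853)

n <- 10000

# Re-create only the columns this plot needs (a stand-in for example_data)
scatter_data <- tibble(
  smoker = sample(x = c("Smoker", "Non-smoker"), size = n, replace = TRUE),
  age_died = round(runif(n, min = 0, max = 100), digits = 0),
  height = sample(x = 50:220, size = n, replace = TRUE)
)

# alpha = 0.1 makes each point 90 per cent transparent, so dense areas look darker
faded_scatter <- scatter_data %>%
  ggplot(mapping = aes(x = age_died, y = height)) +
  geom_point(alpha = 0.1) +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Height",
       title = "Relationship between height and age of death, by smoking status",
       caption = "Source: Simulated data.")

faded_scatter
```

Every observation is still drawn, so this stays close to the raw data while making it easier to see where points pile up.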

Finally, let’s try putting them together. We’re going to use the `patchwork` package to combine plots and the `palmerpenguins` package for data.

```
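library(ggplot2)
library(patchwork)
library(palmerpenguins)

# A scatter plot of two variables
p1 <- ggplot(palmerpenguins::penguins) +
  geom_point(aes(bill_length_mm, bill_depth_mm))

# A bar chart of one variable
p2 <- ggplot(palmerpenguins::penguins) +
  geom_bar(aes(species))

# patchwork lets us place the two plots side by side with +
p1 + p2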
```
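`patchwork` also supports other layouts. As a rough sketch, the `/` operator stacks one plot on top of another, and `plot_annotation()` adds a title and caption to the combined figure. The object names and the title text below are made up for this example. Note that `penguins` has a couple of rows with missing measurements, so `geom_point()` will warn that it removed them.

```r
library(ggplot2)
library(patchwork)
library(palmerpenguins)

# The same two plots as above, built from the penguins data
scatter <- ggplot(palmerpenguins::penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  theme_minimal()

bars <- ggplot(palmerpenguins::penguins, aes(x = species)) +
  geom_bar() +
  theme_minimal()

# `/` stacks the plots vertically; plot_annotation() labels the whole figure
stacked <- (scatter / bars) +
  plot_annotation(title = "Bill measurements and species counts",
                  caption = "Source: palmerpenguins.")

stacked
```

Whether side by side or stacked works better tends to depend on the shape of the page and on which comparison you want the reader to make.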