Graphs


Required reading

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions/etc

Introduction

The most essential task when trying to convince someone of your story is to show them the data that led you to it. ggplot is a fantastic tool for doing this, but because there is so much going on, people sometimes don’t know where to start. My recommendation is that you start with either a scatter plot or a bar chart. These notes run through how to do that. They then discuss some more advanced options, but the important thing is that you show the reader your raw data (or as close to it as you can).


Figure 1: Show me the data!

Source: YouTube screenshot.

Plot. Your. Raw. Data.

Read Chapter 3 of Kieran Healy’s book: https://socviz.co/makeplot.html#makeplot.

Bars

Bar charts are useful when you have one variable that you want to focus on. Hint: you almost always have one variable that you want to focus on. Hence, you should almost always include at least one bar chart (and likely many). Bar charts go by a variety of names, depending on their specifics. For an overview of the options, I recommend the RStudio Data Viz Cheat Sheet.

To get started, let’s simulate some data.


library(tidyverse)  # provides tibble(), %>% and ggplot()

set.seed(853)

number_of_observation <- 10000

example_data <- tibble(person = c(1:number_of_observation),
                       smoker = sample(x = c("Smoker", "Non-smoker"),
                                       size = number_of_observation, 
                                       replace = TRUE),
                       age_died = runif(number_of_observation,
                                        min = 0,
                                        max = 100) %>% round(digits = 0),
                       height = sample(x = c(50:220), 
                                       size =  number_of_observation, 
                                       replace = TRUE),
                       number_of_children = sample(x = c(0:5),
                                                   size = number_of_observation, 
                                                   replace = TRUE,
                                                   prob = c(0.1, 0.2, 0.40, 0.15, 0.1, 0.05))
                       )

First, let’s have a look at the data.


head(example_data)

# A tibble: 6 x 5
  person smoker     age_died height number_of_children
   <int> <chr>         <dbl>  <int>              <int>
1      1 Smoker           55     80                  3
2      2 Non-smoker       54     78                  2
3      3 Non-smoker       84    109                  1
4      4 Smoker           75    114                  4
5      5 Smoker           32    135                  1
6      6 Smoker           37    220                  0

Now let’s plot the age distribution. Based on our simulated data, we’re expecting a fairly uniform plot.


example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar()
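
One way to check the "fairly uniform" expectation is to count the draws at each age directly. The sketch below is self-contained and uses a hypothetical `check_data` with more draws than `example_data` so that the pattern is easier to see; that name and the sample size are assumptions for illustration. Because the `runif()` output is rounded to whole numbers, ages 0 and 100 each capture only half a unit of the range, so they should appear roughly half as often as the ages in between.

```r
library(dplyr)
library(tibble)

set.seed(853)

# Hypothetical check data: same recipe as age_died, but with more draws
check_data <- tibble(
  age_died = round(runif(100000, min = 0, max = 100), digits = 0)
)

# Number of simulated deaths at each whole-number age
age_counts <- check_data %>%
  count(age_died)

# Compare the two end points with a few interior ages
age_counts %>%
  filter(age_died %in% c(0, 1, 50, 99, 100))
```

If the counts behave as described, the same dip should be visible at the two ends of the bar chart.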

Now let’s make it look a little better. There are themes that are built into ggplot, or you can install themes from other packages, or you can edit aspects yourself. I’d recommend starting with the ggthemes package for some fun ones, but I tend to just use classic or minimal. Remember that you must always refer to your graphs in your text (Figure 2).


example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")

Figure 2: Number of people who died at each age
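
As a rough illustration of the other two routes mentioned above, the sketch below swaps in the built-in classic theme and then edits two aspects by hand with `theme()`. The small `theme_data` tibble is a hypothetical stand-in for `example_data`, and the particular tweaks (a bold title and a left-aligned caption) are arbitrary choices rather than recommendations.

```r
library(ggplot2)
library(tibble)

set.seed(853)

# Hypothetical stand-in for example_data, with only the column plotted here
theme_data <- tibble(
  age_died = round(runif(1000, min = 0, max = 100), digits = 0)
)

themed_plot <- ggplot(theme_data, mapping = aes(x = age_died)) +
  geom_bar() +
  theme_classic() +                                  # built-in complete theme
  theme(plot.title = element_text(face = "bold"),    # hand edit: bold title
        plot.caption = element_text(hjust = 0)) +    # hand edit: left-align caption
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")

themed_plot
```

The `theme()` call comes after `theme_classic()` because a complete theme would otherwise overwrite the hand edits.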

Finally, we may want to facet by some variable, in this case whether the person is a smoker (Figure 3).


example_data %>% 
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")

Figure 3: Number of people who died at each age, by whether they smoke
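
By default the facets sit side by side. If the aim is to compare the two groups along a shared age axis, stacking the panels can be easier to read. A minimal sketch, assuming a hypothetical `facet_data` tibble built the same way as the relevant columns of `example_data`:

```r
library(ggplot2)
library(tibble)

set.seed(853)

n <- 1000

# Hypothetical stand-in for the smoker and age_died columns of example_data
facet_data <- tibble(
  smoker = sample(x = c("Smoker", "Non-smoker"), size = n, replace = TRUE),
  age_died = round(runif(n, min = 0, max = 100), digits = 0)
)

stacked_facets <- ggplot(facet_data, mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker), ncol = 1) +   # one column, so panels stack vertically
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")

stacked_facets
```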

Points

Often we are also interested in the relationship between two variables. We’ll show that with a scatter plot. In this case, let’s look at age died by height, again faceted by smoking status (Figure 4).


example_data %>% 
  ggplot(mapping = aes(x = age_died, y = height)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Height",
       title = "Relationship between height and age of death, by smoking status",
       caption = "Source: Simulated data.")

Figure 4: Relationship between height and age of death, by smoking status
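
With 10,000 points, many markers sit on top of each other, which can hide how dense each region really is. One common response is to make the points partly transparent. The sketch below does that on a hypothetical `scatter_data` tibble that mimics the two relevant columns of `example_data`; the `alpha` value of 0.2 is an arbitrary starting point. Because the two variables are drawn independently, the correlation should be close to zero, and the plot should show no pattern.

```r
library(ggplot2)
library(tibble)

set.seed(853)

n <- 10000

# Hypothetical stand-in for the age_died and height columns of example_data
scatter_data <- tibble(
  age_died = round(runif(n, min = 0, max = 100), digits = 0),
  height = sample(x = c(50:220), size = n, replace = TRUE)
)

# The variables were simulated independently, so expect a value near zero
height_age_cor <- cor(scatter_data$age_died, scatter_data$height)
height_age_cor

faint_points <- ggplot(scatter_data, mapping = aes(x = age_died, y = height)) +
  geom_point(alpha = 0.2) +   # partly transparent points reveal overlap
  theme_minimal() +
  labs(x = "Age died",
       y = "Height",
       title = "Relationship between height and age of death",
       caption = "Source: Simulated data.")

faint_points
```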

Other

Patchwork

Finally, let’s try putting plots together. We’re going to use the patchwork package to combine them and the palmerpenguins package for data.


library(patchwork)
library(palmerpenguins)


p1 <- ggplot(palmerpenguins::penguins) + geom_point(aes(bill_length_mm, bill_depth_mm))
p2 <- ggplot(palmerpenguins::penguins) + geom_bar(aes(species))

p1 + p2
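
patchwork also supports other layouts. As a sketch, `/` stacks plots vertically and `plot_annotation()` adds a title to the combined figure. The `penguins_complete` name and the title text are assumptions for illustration; rows with missing bill measurements are dropped first so that ggplot does not warn about removed points.

```r
library(ggplot2)
library(patchwork)
library(palmerpenguins)

# Keep only rows where both bill measurements are present
penguins_complete <- subset(palmerpenguins::penguins,
                            !is.na(bill_length_mm) & !is.na(bill_depth_mm))

p1 <- ggplot(penguins_complete) +
  geom_point(aes(bill_length_mm, bill_depth_mm))
p2 <- ggplot(penguins_complete) +
  geom_bar(aes(species))

# `/` places p1 above p2; the annotation applies to the whole figure
stacked <- p1 / p2 +
  plot_annotation(title = "Palmer penguins: bill dimensions and species counts")

stacked
```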