R Essentials


Aspects of the ‘Tidyverse essentials’ section were written with Monica Alexander.

Required reading

Alternative reading

There are a lot of great alternative ‘getting started with R’ type materials. Depending on your background and interests you may find some of the following useful:

Recommended reading

Key libraries

Key functions

Pre-quiz

  1. In your own words, what is data science?
  2. What do you see as the role of causality in data science?
  3. Imagine that you have a job in which including race as an explanatory variable improves the performance of your model. What types of issues would you consider when deciding whether to include this variable in production? What if the variable was sexuality?
  4. What role do you see for reproducibility in data science? How about replicability?
  5. What are three advantages of R? What are three disadvantages?

Introduction

This chapter covers the basics of using R. Some of it may not make sense at first, but these are commands that we will come back to throughout these notes. You should initially just go through this chapter quickly, noting aspects that you don’t understand. Then start to play around with some of the initial case studies, and then maybe come back to this chapter. That way you will see how the various bits fit into context, and hopefully be more motivated to pick up various aspects. We will come back to everything in this chapter in more detail at some point in these notes.

R is an open source language that is useful for statistical programming.

You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download R Studio Desktop for free here: https://rstudio.com/products/rstudio/download/#download.

When you are using R you will run into trouble at some point. To work through that trouble:

  1. Look at the help file for the function by putting ? before the function name, e.g. ?pivot_wider.
  2. Check the class of your data using class(data_set$data_column).
  3. Check for typos.
  4. Google the error.
  5. Google what you are trying to do.
  6. Restart R (Session -> Restart R and Clear Output).
  7. Try to make a small example and see if you have the same issues.
  8. Restart your computer.
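
Steps 2 and 7 often go together. As a rough sketch, the code below builds a small made-up data frame in which numbers were accidentally stored as text, a common cause of confusing errors. The names data_set and data_column are placeholders, not data from these notes.

```r
# A small, made-up example: numbers that were accidentally stored as text
data_set <- data.frame(
  data_column = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Check the class: it is "character", so sum() would fail on it
class_before <- class(data_set$data_column)
class_before

# Convert to integers and check the class again
data_set$data_column <- as.integer(data_set$data_column)
class_after <- class(data_set$data_column)
class_after

# Now arithmetic works
total <- sum(data_set$data_column)
total
```

A small example like this is also easier to share when you ask someone else for help.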

The general workflow that we will use involves:

  1. Import
  2. Tidy
  3. Transform and describe
  4. Plot
  5. Model
  6. Repeat 3/4

R, R Studio, and R Studio Cloud

R

R - https://www.r-project.org/ - is an open source and free programming language that is focused on general statistics. (Free in this context refers to ‘freedom’ rather than to a price of zero, although R does also have a price of zero.) This is in contrast with an open source programming language that is designed for general purposes, such as Python, or an open source programming language that is focused on probability, such as Stan. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. It is maintained by the R Core Team, and changes to this ‘base’ of code occur methodically and with concern given to a variety of different priorities.

If you are in Canada then you can download R here: http://cran.utstat.utoronto.ca/, if you are in Australia then you can download R here: https://cran.csiro.au/, otherwise you should go here - https://cran.r-project.org/mirrors.html - and find a location that suits you. (It doesn’t really matter where you get it from, it’s just that it may be slightly faster to use a closer option.)

Many people build on this stable base to extend the capabilities of R to better and more quickly suit their needs. They do this by creating packages. Typically, although not always, a package is a collection of R code that allows you to more easily do things that you want to do. These packages are managed by the Comprehensive R Archive Network (CRAN) - https://cran.r-project.org/ - and other repositories. CRAN is built into the download of R that you just got, so you can use it straight away.

If you want to use a package then you first need to install it on your computer, and then you need to load it when you want to use it. Di Cook, who is a Professor of Business Analytics at Monash University in Australia, describes this as analogous to a lightbulb: if you want light in your house, first you need to screw in the lightbulb, and then you need to turn the switch on. You only need to screw in the lightbulb once per house, but you need to turn the switch on every time you want to use the light.

To install a package on your computer (again, you’ll need to do this only once per computer) you use the code:


install.packages("tidyverse")

Then when you want to use a package, you need to load it with this code:


library(tidyverse)
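
Because installing only needs to happen once, one common pattern is to install a package only if it is missing, and then load it. This is a sketch rather than a required step: needs_install() is a helper name made up for this example, and the repos value is the CRAN address given earlier.

```r
# Hypothetical helper: TRUE if the package is not yet installed
needs_install <- function(package_name) {
  !requireNamespace(package_name, quietly = TRUE)
}

# Screw in the lightbulb, but only if it is not already there
if (needs_install("tidyverse")) {
  install.packages("tidyverse", repos = "https://cran.r-project.org")
}

# Turn the switch on (this is needed in every new session)
library(tidyverse)
```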

You can open R and use it on your computer. It is primarily designed to be interacted with through the command line. This is how I had to start with R, and it’s fine, but it can be useful to have a richer environment than the command line provides. In particular, it can be useful to install an Integrated Development Environment (IDE), which is an application that brings together various bits and pieces that you’ll use all the time. The one that we will use is R Studio.

R Studio

R Studio is distinct from R: they are different entities. R Studio builds on top of R to make it easier for you to use R. This is in the same way that you could use the internet from the command line, but most of us use a browser such as Chrome, Firefox, or Safari.

R Studio is free in the sense that you don’t pay anything for it. It is also free in the sense of being able to take the code, modify it, and distribute that code provided others are similarly allowed to take your code and modify it and distribute, etc. However, it is important to recognise that R Studio is an entity and so it is possible that in the future the current situation could change.

You can download R Studio here: https://rstudio.com/products/rstudio/download/#download.

When you open R Studio it will look like Figure 1.

Figure 1: Opening R Studio for the first time

The left pane is a console in which you can type and execute R code line by line. Try it with 2+2 by clicking next to the prompt ‘>’ and typing that out then pressing enter. The code that you type should be:


2 + 2

[1] 4

And hopefully you get the same answer printed in the console.

The pane on the top right has information about your environment. For instance, when we create variables a list of their names and some properties will appear there. Try to type the following code, replacing my name with your name, next to the prompt, and again press enter:


my_name <- "Rohan"

You should notice a new value in the environment pane with the variable name and its value.

The pane in the bottom right is a file manager. At the moment it should just have two files - an R History file and an R Project file. We’ll get to what these are later, but for now we will create and save a file.

Type out the following code (don’t worry too much about the details for now):


saveRDS(object = my_name, file = "my_first_file.rds")

And you should see a new ‘.rds’ file in your list of files.
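
To check that the file contains what we expect, we can read it back in with readRDS(), which is the base R counterpart to saveRDS(). This is a minimal sketch; the name my_name_from_file is just for illustration.

```r
# Create and save the object, as in the code above
my_name <- "Rohan"
saveRDS(object = my_name, file = "my_first_file.rds")

# Read the object back in from the file
my_name_from_file <- readRDS(file = "my_first_file.rds")

# Check that what we read is the same as what we saved
identical(my_name, my_name_from_file)
```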

R Studio Cloud

While you can download R Studio to your own computer, initially we will use R Studio Cloud, which is an online version that is provided by R Studio. We will use this so that you can focus on getting comfortable with R and R Studio in an environment that is consistent. This way you don’t have to worry about what computer you have or installation permissions while you are still getting used to the basics.

The R Studio Cloud - https://rstudio.cloud/ - is as easy as it gets in terms of moving to the cloud. The trade-off is that it is not very powerful and it is sometimes slow, but for the purposes of the initial sections of these notes that will be fine.

To get started, go to https://rstudio.cloud/ and create an account. If you are going to be a student for a while then it might be worthwhile using a university email account: although they don’t yet charge for it, they will probably start charging soon, and with some luck they will offer education discounts.

Once you have an account and log in, then it should look something like Figure 2.

Figure 2: Opening R Studio Cloud for the first time

(You’ll be in ‘Your Workspace’, and you won’t have an ‘Example Workspace’.) From here you should start a ‘New Project’. You can give the project a name by clicking on ‘Untitled Project’ and replacing it. We can now use R Studio in the cloud.

While working line-by-line in the console is fine, it is easier to write out a whole script that can then be executed. We will do this by making an R Script. To do this go to: File -> New File -> R Script, or use the shortcut Command + Shift + N. The console pane will fall to the bottom left and an R Script will open in the top left. Let’s write some code that will grab all of the Australian politicians and then construct a small table about the genders of the prime ministers.

(Some of this code won’t make sense at this stage, but just type it all out to get in the habit and then run it, by selecting all of the code and clicking ‘Run’, or using the keyboard shortcut: Command + Return.)


# Install the packages that we'll need
install.packages("devtools")
install.packages("tidyverse")

# Load the packages that we need to use this time
library(devtools)
library(tidyverse)

# Grab the data on Australian politicians
install_github("RohanAlexander/AustralianPoliticians")

# Make a table of the counts of genders of the prime ministers
AustralianPoliticians::all %>% 
  as_tibble() %>% 
  count(gender, wasPrimeMinister)

# A tibble: 4 x 3
  gender wasPrimeMinister     n
  <chr>             <int> <int>
1 female                1     1
2 female               NA   235
3 male                  1    29
4 male                 NA  1511

You can save your R Script as ‘my_first_r_script.R’ using File -> Save As (or the keyboard shortcut: Command + S). When you’re done your workspace should look something like Figure 3.

Figure 3: After running an R Script

One thing to be aware of is that each R Studio Cloud workspace is essentially a new computer. Because of this, you’ll need to install any package that you want to use for each workspace. For instance, before you can use the tidyverse, you need to run install.packages("tidyverse"). This is in contrast to when you use your own computer.

A few final notes on R Studio Cloud for you to keep in the back of your mind:

  1. In the Australian politicians example we got our data from the website GitHub, but you can get data into your workspace from your local computer in a variety of ways. One way is to use the ‘upload’ button in the Files panel.
  2. R Studio Cloud allows some degree of collaboration. For instance, you can give someone else access to a workspace that you create. This could be useful for collaborating on an assignment, although it is not quite full featured yet and you cannot both be in the workspace at the same time (in contrast to, say, Google Docs).
  3. There are a variety of weaknesses of R Studio Cloud, in particular at the moment there is a 1GB limit on RAM. Additionally, it is still under-developed and things break from time to time. The R Studio Community page that is focused on R Studio Cloud can be helpful sometimes: https://community.rstudio.com/c/rstudio-cloud.

RStan

Please see https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started for getting Stan installed.

Tidyverse essentials

One of the key packages that we use in these notes is the tidyverse (Wickham et al. 2019). The tidyverse is actually a package of packages (i.e. when you install tidyverse, you are actually installing a whole bunch of different packages). The key package in the tidyverse in terms of manipulating data is dplyr (Wickham et al. 2020), and the key package in the tidyverse in terms of creating graphs is ggplot2 (Wickham 2016).

In this section we are going to cycle through some essentials from the tidyverse. You’ll come back to the functions in this section regularly.

I want to keep this section self-contained, so let’s start by installing the tidyverse (again, to use Di Cook’s analogy, this is the equivalent of screwing in the light-bulb). If you just did it, then you don’t need to do it again.


install.packages("tidyverse")

Now we can load the tidyverse (again, to use Di Cook’s analogy, the equivalent of turning on the light-switch).


library(tidyverse)

Here we are going to download the data about Australian politicians using the function read_csv().


australian_politicians <- 
  read_csv(
    file = 
      "https://raw.githubusercontent.com/RohanAlexander/telling_stories_with_data/master/inputs/data/australian_politicians.csv"
    )

We will now cover the pipe and six functions that are useful to know and that we will use all the time:

The pipe

One key tidyverse helper is the ‘pipe’: %>%. Read it as “and then” (keyboard shortcut: Command + Shift + M). This takes the output of a line of code and uses it as an input to the next line of code. You don’t have to use it, but it tends to make your code more readable.

The idea of the pipe is that you take your dataset, and then, do something to it. In this case, we will look at the first few lines of our dataset by piping australian_politicians through to the head() function.


australian_politicians %>% 
  head()

# A tibble: 6 x 20
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Abbott1… Abbott  Richard Hart… Richard   <NA>       Abbott, Ri…
2 Abbott1… Abbott  Percy Phipps  Percy     <NA>       Abbott, Pe…
3 Abbott1… Abbott  Macartney     Macartney Mac        Abbott, Mac
4 Abbott1… Abbott  Charles Lydi… Charles   Aubrey     Abbott, Au…
5 Abbott1… Abbott  Joseph Palmer Joseph    <NA>       Abbott, Jo…
6 Abbott1… Abbott  Anthony John  Anthony   Tony       Abbott, To…
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>
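
To see what the pipe is doing, it can help to write the same steps with and without it. The sketch below uses a small made-up dataset, toy_politicians, so that it runs without downloading anything, and loads dplyr on its own (loading the whole tidyverse would work too).

```r
library(dplyr)

# A small, made-up dataset for illustration
toy_politicians <- tibble(
  firstName = c("Richard", "Percy", "Macartney"),
  birthYear = c(1859, 1869, 1877)
)

# Without the pipe: the code is read from the inside out
without_pipe <- nrow(head(toy_politicians, 2))

# With the pipe: take the dataset, and then head(), and then nrow()
with_pipe <- toy_politicians %>% 
  head(2) %>% 
  nrow()

# Both approaches give the same answer
without_pipe == with_pipe
```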

Selecting

The select() function is used to get a particular column of a dataset. For instance, we might like to select the first names column.


australian_politicians %>% 
  select(firstName) %>% 
  head()

# A tibble: 6 x 1
  firstName
  <chr>    
1 Richard  
2 Percy    
3 Macartney
4 Charles  
5 Joseph   
6 Anthony  

In R, there are many ways to do things. Another way to get a particular column of a dataset is to use the dollar sign. This is from base R (as opposed to select(), which is from the tidyverse).


australian_politicians$firstName %>% 
  head()

[1] "Richard"   "Percy"     "Macartney" "Charles"   "Joseph"   
[6] "Anthony"  

The two are almost equivalent and differ only in the class of what they return (we’ll talk more about class later in the notes).

For the sake of completeness, if you combine select() with pull() then you will get the same class of output as if you use the dollar sign.


australian_politicians %>% 
  select(firstName) %>% 
  pull() %>% 
  head()

[1] "Richard"   "Percy"     "Macartney" "Charles"   "Joseph"   
[6] "Anthony"  
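
The difference in class can be checked directly. This is a minimal sketch on a made-up dataset (toy_politicians is not the real data): select() returns a dataset with one column, whereas the dollar sign and pull() return a plain vector.

```r
library(dplyr)

# A small, made-up dataset for illustration
toy_politicians <- tibble(
  firstName = c("Richard", "Percy", "Macartney"),
  surname = c("Abbott", "Abbott", "Abbott")
)

# select() keeps the result as a dataset (a tibble) with one column
as_column <- toy_politicians %>% 
  select(firstName)
class(as_column)

# The dollar sign and pull() both return a character vector
as_vector <- toy_politicians$firstName
as_pulled <- toy_politicians %>% 
  pull(firstName)
class(as_vector)

# The dollar sign and pull() give exactly the same thing
identical(as_vector, as_pulled)
```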

You can also use select() to get rid of columns, by selecting in a negative sense.


australian_politicians %>% 
  select(-firstName)

# A tibble: 1,776 x 19
   uniqueID surname allOtherNames commonName displayName
   <chr>    <chr>   <chr>         <chr>      <chr>      
 1 Abbott1… Abbott  Richard Hart… <NA>       Abbott, Ri…
 2 Abbott1… Abbott  Percy Phipps  <NA>       Abbott, Pe…
 3 Abbott1… Abbott  Macartney     Mac        Abbott, Mac
 4 Abbott1… Abbott  Charles Lydi… Aubrey     Abbott, Au…
 5 Abbott1… Abbott  Joseph Palmer <NA>       Abbott, Jo…
 6 Abbott1… Abbott  Anthony John  Tony       Abbott, To…
 7 Abel1939 Abel    John Arthur   <NA>       Abel, John 
 8 Abetz19… Abetz   Eric          <NA>       Abetz, Eric
 9 Adams19… Adams   Judith Anne   <NA>       Adams, Jud…
10 Adams19… Adams   Dick Godfrey… <NA>       Adams, Dick
# … with 1,766 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

Finally, you can select based on conditions. For instance, we can select all of the columns that start with ‘birth’.


australian_politicians %>% 
  select(starts_with("birth"))

# A tibble: 1,776 x 3
   birthDate  birthYear birthPlace  
   <date>         <dbl> <chr>       
 1 NA              1859 Bendigo     
 2 1869-05-14      1869 Hobart      
 3 1877-07-03      1877 Murrurundi  
 4 1886-01-04      1886 St Leonards 
 5 1891-10-18      1891 North Sydney
 6 1957-11-04      1957 London      
 7 1939-06-25      1939 Sydney      
 8 1958-01-25      1958 Stuttgart   
 9 1943-04-11      1943 Picton      
10 1951-04-29      1951 Launceston  
# … with 1,766 more rows
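
There are other helpers that work in the same way as starts_with(), such as ends_with() and contains(). The following is a sketch on a made-up dataset whose column names mimic those of the real data.

```r
library(dplyr)

# A small, made-up dataset with column names like those in the real data
toy_politicians <- tibble(
  surname = c("Abbott", "Abel"),
  birthDate = as.Date(c("1869-05-14", "1939-06-25")),
  birthYear = c(1869, 1939),
  deathDate = as.Date(c("1940-09-09", NA))
)

# Columns whose names end with 'Date'
date_columns <- toy_politicians %>% 
  select(ends_with("Date"))
names(date_columns)

# Columns whose names contain 'Year'
year_columns <- toy_politicians %>% 
  select(contains("Year"))
names(year_columns)
```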

Filtering

The filter() function is used to get particular rows from a dataset. For instance, we might like to filter to only politicians that became prime minister.


australian_politicians %>% 
  filter(wasPrimeMinister == 1) 

# A tibble: 30 x 20
   uniqueID surname allOtherNames firstName commonName displayName
   <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
 1 Abbott1… Abbott  Anthony John  Anthony   Tony       Abbott, To…
 2 Barton1… Barton  Edmund        Edmund    <NA>       Barton, Ed…
 3 Bruce18… Bruce   Stanley Melb… Stanley   <NA>       Bruce, Sta…
 4 Chifley… Chifley Joseph Bened… Joseph    Ben        Chifley, B…
 5 Cook1860 Cook    Joseph        Joseph    <NA>       Cook, Jose…
 6 Curtin1… Curtin  John Joseph … John      <NA>       Curtin, Jo…
 7 Deakin1… Deakin  Alfred        Alfred    <NA>       Deakin, Al…
 8 Fadden1… Fadden  Arthur Willi… Arthur    Arthur     Fadden, Ar…
 9 Fisher1… Fisher  Andrew        Andrew    <NA>       Fisher, An…
10 Forde18… Forde   Francis Mich… Francis   Frank      Forde, Fra…
# … with 20 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>
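
One thing to keep in mind is that filter() only keeps rows where the condition is TRUE, so rows where the condition is NA are dropped. That matters here because wasPrimeMinister is NA, rather than 0, for politicians who were not prime minister. A sketch on a made-up dataset:

```r
library(dplyr)

# A small, made-up dataset: NA (not 0) means 'was not prime minister'
toy_politicians <- tibble(
  surname = c("Abbott", "Abel", "Barton", "Adams"),
  wasPrimeMinister = c(1, NA, 1, NA)
)

# Keeps the two rows where the condition is TRUE
prime_ministers <- toy_politicians %>% 
  filter(wasPrimeMinister == 1)

# NA != 1 is NA, not TRUE, so this keeps no rows at all
not_equal_to_one <- toy_politicians %>% 
  filter(wasPrimeMinister != 1)

# To get those who were not prime minister, test for NA explicitly
not_prime_ministers <- toy_politicians %>% 
  filter(is.na(wasPrimeMinister))

nrow(prime_ministers)
nrow(not_equal_to_one)
nrow(not_prime_ministers)
```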

The filter() function also accepts more than one condition. For instance, we can look at politicians who were prime minister and were named Joseph.


australian_politicians %>% 
  filter(wasPrimeMinister == 1 & firstName == "Joseph")

# A tibble: 3 x 20
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Chifley… Chifley Joseph Bened… Joseph    Ben        Chifley, B…
2 Cook1860 Cook    Joseph        Joseph    <NA>       Cook, Jose…
3 Lyons18… Lyons   Joseph Aloys… Joseph    <NA>       Lyons, Jos…
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

We would get the same result if we use a comma instead of an ampersand.


australian_politicians %>% 
  filter(wasPrimeMinister == 1, firstName == "Joseph")

# A tibble: 3 x 20
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Chifley… Chifley Joseph Bened… Joseph    Ben        Chifley, B…
2 Cook1860 Cook    Joseph        Joseph    <NA>       Cook, Jose…
3 Lyons18… Lyons   Joseph Aloys… Joseph    <NA>       Lyons, Jos…
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

Similarly, we can look at politicians who were named Myles or Ruth.


australian_politicians %>% 
  filter(firstName == "Ruth" | firstName == "Myles")

# A tibble: 3 x 20
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Coleman… Coleman Ruth Nancy    Ruth      <NA>       Coleman, R…
2 Ferrick… Ferric… Myles Aloysi… Myles     <NA>       Ferricks, …
3 Webber1… Webber  Ruth Stephan… Ruth      <NA>       Webber, Ru…
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

We can also pipe the results, for instance, from filter() to select().


australian_politicians %>% 
  filter(firstName == "Ruth" | firstName == "Myles") %>% 
  select(firstName, surname)

# A tibble: 3 x 2
  firstName surname 
  <chr>     <chr>   
1 Ruth      Coleman 
2 Myles     Ferricks
3 Ruth      Webber  
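
When there are many values to match, chaining ‘or’ conditions gets long. One alternative, sketched here on a made-up dataset, is the base R operator %in%, which checks whether each value is in a set.

```r
library(dplyr)

# A small, made-up dataset for illustration
toy_politicians <- tibble(
  firstName = c("Ruth", "Myles", "Alan", "Ruth"),
  surname = c("Coleman", "Ferricks", "Jarman", "Webber")
)

# Using 'or'
with_or <- toy_politicians %>% 
  filter(firstName == "Ruth" | firstName == "Myles")

# Using %in% with a vector of the names that we want
with_in <- toy_politicians %>% 
  filter(firstName %in% c("Ruth", "Myles"))

# Both keep the same three rows
identical(with_or, with_in)
```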

Finally, we can filter() to a particular row number, for instance, in this case row 853.


australian_politicians %>% 
  filter(row_number() == 853)

# A tibble: 1 x 20
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Jarman1… Jarman  Alan William  Alan      <NA>       Jarman, Al…
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

But there is also a dedicated function to do this, which is slice().


australian_politicians %>% 
  slice(853)

# A tibble: 1 x 20
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Jarman1… Jarman  Alan William  Alan      <NA>       Jarman, Al…
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>
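
The slice() function also accepts a range of row numbers. As a small sketch on a made-up dataset:

```r
library(dplyr)

# A small, made-up dataset for illustration
toy_politicians <- tibble(
  surname = c("Abbott", "Abel", "Abetz", "Adams")
)

# A single row, in two equivalent ways
row_two_filter <- toy_politicians %>% 
  filter(row_number() == 2)
row_two_slice <- toy_politicians %>% 
  slice(2)

# A range of rows: here the second and third
rows_two_to_three <- toy_politicians %>% 
  slice(2:3)

row_two_slice$surname
rows_two_to_three$surname
```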

Arranging

We can change the order of the dataset based on the values in a particular column using the arrange() function. For instance, we may like to arrange the data by surname.


australian_politicians %>% 
  arrange(surname)

# A tibble: 1,776 x 20
   uniqueID surname allOtherNames firstName commonName displayName
   <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
 1 Abbott1… Abbott  Richard Hart… Richard   <NA>       Abbott, Ri…
 2 Abbott1… Abbott  Percy Phipps  Percy     <NA>       Abbott, Pe…
 3 Abbott1… Abbott  Macartney     Macartney Mac        Abbott, Mac
 4 Abbott1… Abbott  Charles Lydi… Charles   Aubrey     Abbott, Au…
 5 Abbott1… Abbott  Joseph Palmer Joseph    <NA>       Abbott, Jo…
 6 Abbott1… Abbott  Anthony John  Anthony   Tony       Abbott, To…
 7 Abel1939 Abel    John Arthur   John      <NA>       Abel, John 
 8 Abetz19… Abetz   Eric          Eric      <NA>       Abetz, Eric
 9 Adams19… Adams   Judith Anne   Judith    <NA>       Adams, Jud…
10 Adams19… Adams   Dick Godfrey… Dick      <NA>       Adams, Dick
# … with 1,766 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

We can also use the desc() function to arrange in descending order.


australian_politicians %>% 
  arrange(desc(surname))

# A tibble: 1,776 x 20
   uniqueID surname allOtherNames firstName commonName displayName
   <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
 1 Zimmerm… Zimmer… Trent Moir    Trent     <NA>       Zimmerman,…
 2 Zeal1830 Zeal    William Aust… William   <NA>       Zeal, Will…
 3 Zappia1… Zappia  Antonio       Antonio   Tony       Zappia, To…
 4 Zammit1… Zammit  Paul John     Paul      <NA>       Zammit, Pa…
 5 Zakharo… Zakhar… Alice Olive   Alice     Olive      Zakharov, …
 6 Zahra19… Zahra   Christian Jo… Christian <NA>       Zahra, Chr…
 7 Young19… Young   Harold Willi… Harold    <NA>       Young, Har…
 8 Young19… Young   Michael Jero… Michael   Mick       Young, Mick
 9 Young19… Young   Terry James   Terry     <NA>       Young, Ter…
10 Yates18… Yates   George Edwin  George    Gunner     Yates, Gun…
# … with 1,766 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

We can also arrange based on more than one column.


australian_politicians %>% 
  arrange(firstName, surname)

# A tibble: 1,776 x 20
   uniqueID surname allOtherNames firstName commonName displayName
   <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
 1 Blain18… Blain   Adair Macali… Adair     <NA>       Blain, Ada…
 2 Armstro… Armstr… Adam Alexand… Adam      Bill       Armstrong,…
 3 Bandt19… Bandt   Adam Paul     Adam      <NA>       Bandt, Adam
 4 Dein1889 Dein    Adam Kemball  Adam      Dick       Dein, Dick 
 5 Ridgewa… Ridgew… Aden Derek    Aden      <NA>       Ridgeway, …
 6 Bennett… Bennett Adrian Frank  Adrian    <NA>       Bennett, A…
 7 Gibson1… Gibson  Adrian        Adrian    <NA>       Gibson, Ad…
 8 Wynne18… Wynne   Agar          Agar      <NA>       Wynne, Agar
 9 Roberts… Robert… Agnes Robert… Agnes     <NA>       Robertson,…
10 Bird1906 Bird    Alan Charles  Alan      <NA>       Bird, Alan 
# … with 1,766 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

We can pipe arrange() to another arrange().


australian_politicians %>% 
  arrange(firstName) %>% 
  arrange(surname)

# A tibble: 1,776 x 20
   uniqueID surname allOtherNames firstName commonName displayName
   <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
 1 Abbott1… Abbott  Anthony John  Anthony   Tony       Abbott, To…
 2 Abbott1… Abbott  Charles Lydi… Charles   Aubrey     Abbott, Au…
 3 Abbott1… Abbott  Joseph Palmer Joseph    <NA>       Abbott, Jo…
 4 Abbott1… Abbott  Macartney     Macartney Mac        Abbott, Mac
 5 Abbott1… Abbott  Percy Phipps  Percy     <NA>       Abbott, Pe…
 6 Abbott1… Abbott  Richard Hart… Richard   <NA>       Abbott, Ri…
 7 Abel1939 Abel    John Arthur   John      <NA>       Abel, John 
 8 Abetz19… Abetz   Eric          Eric      <NA>       Abetz, Eric
 9 Adams19… Adams   Dick Godfrey… Dick      <NA>       Adams, Dick
10 Adams19… Adams   Judith Anne   Judith    <NA>       Adams, Jud…
# … with 1,766 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>

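It is just important to be clear about the precedence of each. When arrange() is piped to another arrange(), the last one takes precedence and the earlier one only breaks ties, so the code above gives the same order as arranging by surname and then first name in a single call.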


australian_politicians %>% 
  arrange(surname, firstName)

# A tibble: 1,776 x 20
   uniqueID surname allOtherNames firstName commonName displayName
   <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
 1 Abbott1… Abbott  Anthony John  Anthony   Tony       Abbott, To…
 2 Abbott1… Abbott  Charles Lydi… Charles   Aubrey     Abbott, Au…
 3 Abbott1… Abbott  Joseph Palmer Joseph    <NA>       Abbott, Jo…
 4 Abbott1… Abbott  Macartney     Macartney Mac        Abbott, Mac
 5 Abbott1… Abbott  Percy Phipps  Percy     <NA>       Abbott, Pe…
 6 Abbott1… Abbott  Richard Hart… Richard   <NA>       Abbott, Ri…
 7 Abel1939 Abel    John Arthur   John      <NA>       Abel, John 
 8 Abetz19… Abetz   Eric          Eric      <NA>       Abetz, Eric
 9 Adams19… Adams   Dick Godfrey… Dick      <NA>       Adams, Dick
10 Adams19… Adams   Judith Anne   Judith    <NA>       Adams, Jud…
# … with 1,766 more rows, and 14 more variables:
#   earlierOrLaterNames <chr>, title <chr>, gender <chr>,
#   birthDate <date>, birthYear <dbl>, birthPlace <chr>,
#   deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>
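
The effect of precedence is easier to see on a handful of rows. This sketch uses a made-up dataset and compares the three approaches; it relies on arrange() leaving tied rows in their existing order, which is the behaviour seen in the output above.

```r
library(dplyr)

# A small, made-up dataset for illustration
toy_politicians <- tibble(
  firstName = c("Harold", "Percy", "Terry", "Anthony"),
  surname = c("Young", "Abbott", "Young", "Abbott")
)

# First name takes precedence; surname only breaks ties
by_first_name <- toy_politicians %>% 
  arrange(firstName, surname)

# Surname takes precedence; first name only breaks ties
by_surname <- toy_politicians %>% 
  arrange(surname, firstName)

# Piping: the last arrange() takes precedence
piped <- toy_politicians %>% 
  arrange(firstName) %>% 
  arrange(surname)

by_first_name$firstName
by_surname$firstName
piped$firstName
```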

Grouping

We can group rows using the function group_by() and then apply some other function within those groups. For instance, we could arrange by first name and then get the first three results within each gender.


australian_politicians %>% 
  group_by(gender) %>% 
  arrange(firstName) %>% 
  slice(1:3)

# A tibble: 6 x 20
# Groups:   gender [2]
  uniqueID surname allOtherNames firstName commonName displayName
  <chr>    <chr>   <chr>         <chr>     <chr>      <chr>      
1 Roberts… Robert… Agnes Robert… Agnes     <NA>       Robertson,…
2 MacTier… MacTie… Alannah Joan… Alannah   <NA>       MacTiernan…
3 Zakharo… Zakhar… Alice Olive   Alice     Olive      Zakharov, …
4 Blain18… Blain   Adair Macali… Adair     <NA>       Blain, Ada…
5 Armstro… Armstr… Adam Alexand… Adam      Bill       Armstrong,…
6 Bandt19… Bandt   Adam Paul     Adam      <NA>       Bandt, Adam
# … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
#   gender <chr>, birthDate <date>, birthYear <dbl>,
#   birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
#   wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>,
#   adb <chr>, comments <chr>
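
The same pattern on a made-up dataset makes it easier to check by eye what is happening: slice() is applied separately within each group, so we get one row per gender.

```r
library(dplyr)

# A small, made-up dataset for illustration
toy_politicians <- tibble(
  firstName = c("Ruth", "Myles", "Agnes", "Adam"),
  gender = c("female", "male", "female", "male")
)

# Alphabetically first name within each gender
first_in_each_group <- toy_politicians %>% 
  group_by(gender) %>% 
  arrange(firstName) %>% 
  slice(1)

first_in_each_group
```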

Mutating

The mutate() function is used to make a new column. For instance, perhaps we want to make a new column that is 1 if a person was a member and a senator and 0 otherwise.


australian_politicians <- 
  australian_politicians %>% 
  mutate(was_both = if_else(member == 1 & senator == 1, 1, 0))

australian_politicians %>% 
  select(member, senator, was_both) %>% 
  head()

# A tibble: 6 x 3
  member senator was_both
   <dbl>   <dbl>    <dbl>
1      0       1        0
2      1       1        1
3      0       1        0
4      1       0        0
5      1       0        0
6      1       0        0
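
It is worth knowing what if_else() does when the condition cannot be evaluated because of missing data. In this sketch, on a made-up dataset, the third row has a missing value for member, so the condition is NA and so is the new column.

```r
library(dplyr)

# A small, made-up dataset with one missing value
toy_politicians <- tibble(
  member = c(1, 0, NA),
  senator = c(1, 1, 1)
)

# NA & TRUE is NA, so if_else() returns NA for the third row
toy_politicians <- toy_politicians %>% 
  mutate(was_both = if_else(member == 1 & senator == 1, 1, 0))

toy_politicians
```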

Summarise

The function summarise() is used to create new summary variables. For instance, we can look at the maximum birth year to find when the most recently born politician was born.


australian_politicians %>% 
  summarise(youngest_politicians_birth_year = max(birthYear, na.rm = TRUE))

# A tibble: 1 x 1
  youngest_politicians_birth_year
                            <dbl>
1                            1994

And we can check that using arrange().


australian_politicians %>% 
  arrange(-birthYear) %>% 
  select(uniqueID, surname, allOtherNames, birthYear) %>% 
  slice(1:3)

# A tibble: 3 x 4
  uniqueID       surname     allOtherNames    birthYear
  <chr>          <chr>       <chr>                <dbl>
1 SteeleJohn1994 Steele-John Jordon Alexander      1994
2 Chandler1990   Chandler    Claire                1990
3 Roy1990        Roy         Wyatt Beau            1990
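
The na.rm = TRUE argument matters when a column has missing values. As a sketch on a made-up dataset, without it the maximum of a column that contains an NA is itself NA.

```r
library(dplyr)

# A small, made-up dataset with one missing birth year
toy_politicians <- tibble(
  birthYear = c(1859, NA, 1994)
)

# Without na.rm, the missing value makes the result NA
without_na_rm <- toy_politicians %>% 
  summarise(max_birth_year = max(birthYear))

# With na.rm = TRUE, missing values are removed first
with_na_rm <- toy_politicians %>% 
  summarise(max_birth_year = max(birthYear, na.rm = TRUE))

without_na_rm$max_birth_year
with_na_rm$max_birth_year
```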

The summarise() function is particularly powerful in conjunction with group_by(). For instance, let’s look at the year of birth of the youngest by gender.


australian_politicians %>%
  group_by(gender) %>% 
  summarise(youngest_politician_birth_year = max(birthYear, na.rm = TRUE))

# A tibble: 2 x 2
  gender youngest_politician_birth_year
  <chr>                           <dbl>
1 female                           1990
2 male                             1994

Let’s look at the mean age at death, in days, by gender.


australian_politicians %>%
  mutate(days_lived = deathDate - birthDate) %>% 
  filter(!is.na(days_lived)) %>% 
  group_by(gender) %>% 
  summarise(mean_days_lived = round(mean(days_lived), 2)) %>% 
  arrange(-mean_days_lived)

# A tibble: 2 x 2
  gender mean_days_lived
  <chr>  <drtn>         
1 female 28857.30 days  
2 male   27372.89 days  

We can use group_by() with more than one grouping variable. For instance, let’s look again at the average number of days lived, this time by gender and by whether the person was prime minister.


australian_politicians %>%
  mutate(days_lived = deathDate - birthDate) %>% 
  filter(!is.na(days_lived)) %>% 
  group_by(gender, wasPrimeMinister) %>% 
  summarise(mean_days_lived = round(mean(days_lived), 2)) %>% 
  arrange(-mean_days_lived)

# A tibble: 3 x 3
# Groups:   gender [2]
  gender wasPrimeMinister mean_days_lived
  <chr>             <dbl> <drtn>         
1 female               NA 28857.30 days  
2 male                  1 28446.61 days  
3 male                 NA 27345.20 days  
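
Notice that the result above is still grouped by gender. When summarising over more than one grouping variable, summarise() removes only the last one. As a minimal sketch, assuming dplyr 1.0.0 or later and using made-up data (toy_lives and its values are assumptions for illustration), the .groups argument can be used to return an ungrouped result:

```r
library(dplyr)

# Hypothetical data for illustration only
toy_lives <- tibble(
  gender = c("female", "female", "male", "male", "male"),
  was_pm = c(0, 0, 1, 0, 0),
  days_lived = c(30000, 28000, 29500, 27000, 26000)
)

mean_days <-
  toy_lives %>%
  group_by(gender, was_pm) %>%
  # .groups = "drop" removes all grouping from the result
  summarise(mean_days_lived = mean(days_lived), .groups = "drop") %>%
  arrange(desc(mean_days_lived))

mean_days
```

Dropping the groups explicitly means that any later mutate() or summarise() applies to the whole table rather than within each gender.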

Counting

We can use the function count() to create counts by groups. For instance, the number of politicians by gender.


australian_politicians %>% 
  group_by(gender) %>% 
  count()

# A tibble: 2 x 2
# Groups:   gender [2]
  gender     n
  <chr>  <int>
1 female   236
2 male    1540
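
The grouping variable can also be passed directly to count(), which saves a line and returns an ungrouped result. A small sketch with made-up data (toy_genders is an assumption for illustration):

```r
library(dplyr)

# Hypothetical data for illustration only
toy_genders <- tibble(gender = c("male", "female", "male", "male"))

gender_counts <-
  toy_genders %>%
  # sort = TRUE puts the largest group first
  count(gender, sort = TRUE)

gender_counts
```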

Proportions

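Finally, calculating proportions is often a combination of counting (or summarise()) and mutate(), along with group_by().

Let’s calculate the proportion of politicians of each gender.

Note that we need to ungroup() the data before mutating. Otherwise sum(n) would be computed within each gender group and every proportion would be 1.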


australian_politicians %>% 
  group_by(gender) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(prop = n/(sum(n)))

# A tibble: 2 x 3
  gender     n  prop
  <chr>  <int> <dbl>
1 female   236 0.133
2 male    1540 0.867
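
To see why ungroup() matters, here is a minimal sketch with made-up data (toy_counts and its values are assumptions for illustration) that computes the proportion both with and without ungrouping:

```r
library(dplyr)

# Hypothetical data for illustration only
toy_counts <-
  tibble(gender = c("female", "male", "male", "male")) %>%
  group_by(gender) %>%
  count()

# Still grouped: sum(n) is computed within each gender
grouped_props <-
  toy_counts %>%
  mutate(prop = n / sum(n))

# Ungrouped: sum(n) is the total across all rows
ungrouped_props <-
  toy_counts %>%
  ungroup() %>%
  mutate(prop = n / sum(n))
```

In the grouped version each row is divided by its own count, so the column is not a proportion of the total.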

ggplot essentials

The ggplot2 package is the plotting package that is part of the tidyverse collection of packages.

In a similar way to piping, it works in layers, but instead of the pipe (%>%), ggplot2 uses + to add each layer.

Main features

There are three key aspects:

  1. data;
  2. aesthetics / mapping; and
  3. type (the geom).

For instance, let’s build up a histogram of age at death with increasing complexity.

It starts with a grey box, because we have specified the data and the aesthetics but not yet the type of plot:


australian_politicians %>% 
  mutate(days_lived = as.integer(deathDate - birthDate)) %>% 
  filter(!is.na(days_lived)) %>% 
  ggplot(mapping = aes(x = days_lived))

We need to tell it what we want to plot. This is where the geom comes in. Here we use geom_histogram() with binwidth = 365, so that each bar covers roughly one year.


australian_politicians %>% 
  mutate(days_lived = as.integer(deathDate - birthDate)) %>% 
  filter(!is.na(days_lived)) %>% 
  ggplot(mapping = aes(x = days_lived)) +
  geom_histogram(binwidth = 365)
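
Because ggplot2 works in layers, a plot can also be stored in a variable and built up one piece at a time. The following is a minimal sketch using a small made-up tibble (toy_lives, its values, and the names base_plot and hist_plot are assumptions for illustration, not the real dataset):

```r
library(dplyr)
library(ggplot2)

# Hypothetical data for illustration only
toy_lives <- tibble(
  days_lived = c(25000, 26000, 27500, 28000, 29000, 31000)
)

# Data and aesthetics only: this draws the axes but no bars
base_plot <-
  toy_lives %>%
  ggplot(mapping = aes(x = days_lived))

# Adding a geom with + gives a new plot that has a histogram layer
hist_plot <- base_plot + geom_histogram(binwidth = 365)

hist_plot
```

Storing the base plot like this can be handy when trying several geoms on the same data and aesthetics.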

Now let’s color the bars by gender, which means adding an aesthetic.


australian_politicians %>% 
  mutate(days_lived = as.integer(deathDate - birthDate)) %>% 
  filter(!is.na(days_lived)) %>% 
  ggplot(mapping = aes(x = days_lived, fill = gender)) +
  geom_histogram(binwidth = 365)

We can add some labels, and change the colors and the background.


australian_politicians %>% 
  mutate(days_lived = as.integer(deathDate - birthDate)) %>% 
  filter(!is.na(days_lived)) %>% 
  ggplot(mapping = aes(x = days_lived, fill = gender)) +
  geom_histogram(binwidth = 365) +
  labs(title = "Length of life of Australian politicians", 
       x = "Age at death (days)", 
       y = "Number") +
  theme_classic() +
  scale_fill_brewer(palette = "Set1")

Facets

Facets are subplots and are invaluable because they allow you to add another variable to your plot without having to make a 3D plot.


australian_politicians %>% 
  mutate(days_lived = as.integer(deathDate - birthDate)) %>% 
  filter(!is.na(days_lived)) %>% 
  ggplot(mapping = aes(x = days_lived)) +
  geom_histogram(binwidth = 365) +
  labs(title = "Length of life of Australian politicians", 
       x = "Age at death (days)", 
       y = "Number") +
  theme_classic() +
  facet_wrap(~gender)
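
By default every facet shares the same axes, so a panel for a much smaller group, such as the women in this dataset, can be hard to read. One option, sketched below with made-up data (toy_lives and its values are assumptions for illustration), is to let the y-axis vary by panel using the scales argument of facet_wrap():

```r
library(dplyr)
library(ggplot2)

# Hypothetical data for illustration only: 3 women and 12 men
toy_lives <- tibble(
  gender = c(rep("female", 3), rep("male", 12)),
  days_lived = c(27000, 28500, 30000, seq(24000, 35000, by = 1000))
)

faceted_plot <-
  toy_lives %>%
  ggplot(mapping = aes(x = days_lived)) +
  geom_histogram(binwidth = 365) +
  # scales = "free_y" gives each panel its own y-axis range
  facet_wrap(~gender, scales = "free_y") +
  theme_classic()

faceted_plot
```

The trade-off is that bar heights are then no longer directly comparable between panels, so whether to free the scales depends on what the plot is meant to show.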

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.