Hello world

This hello world is roughly based on code originally written by Sharla Gelfand and broadly follows Sharla’s original presentation which is available here.

Required reading


In this hello world we will get some data from the wild, make a graph with it, and then use this to tell a story. Some of the code may be a bit unfamiliar to you if it’s your first time using R. It’ll all soon be familiar. But the only way to learn how to code is to code. Please try to get this working on your own computer, typing out (not copy/pasting) all the code that you need.

One of the great things about graphs is that sometimes a graph is all you need to tell a convincing story, as Figure 1, from Snow (1854), and Figure 2, from Ganesh (2020), show.

Figure 1: A famous example where a map helped identify the cause of Cholera

Figure 2: How footballers measure against athletes in major US sports

In this chapter we are going to focus on making a table and a graph from our data. You will be guided through each step. Hopefully, seeing the power of quantitative analysis with R will motivate you to stick with it when you run into difficulties later on.

Getting started

To get started you should open a new R Markdown file (File -> New File -> R Markdown). As this is our first attempt at using R in the wild, we will just have everything in the one R Markdown document. (In later projects we will move to a more robust set-up.) Then you should create a new R code chunk (keyboard shortcut: Command + Option + I) and add some preamble documentation. I like to specify the purpose of the document, who the author is and their contact details, when the file was written or last updated, and pre-requisites that the file relies on. You may also like to include a license, and list outstanding issues or todos. Remember that in R, lines that start with ‘#’ are comments - they won’t run.

#### Preamble ####
# Purpose: Read in voting data from the 2019 Canadian Election and output a 
# dataset that can be used for analysis.
# Author: Rohan Alexander
# Email: rohan.alexander@utoronto.ca
# Date: 9 January 2019
# Prerequisites: Need the text file from the Canadian elections website
# Issues: 
# To do:

After this I typically set up my workspace. This usually involves installing and/or loading any packages, and possibly updating them. Remember that you only need to install a package once for each computer, but you need to load it every time you want to use it. (Here I’ve added excessive comments so that you know what is going on and why - in general I wouldn’t explain what tidyverse is.)

#### Workspace set-up ####
install.packages("tidyverse") # Only need to do this once
install.packages("janitor") # Only need to do this once
install.packages("here") # Only need to do this once

In this case we are going to use tidyverse (Wickham 2017), janitor (Firke 2020), and here (Müller 2017).

#### Workspace set-up ####
# tidyverse is a collection of packages
# Try ?tidyverse to see more
library(tidyverse) # Calls the tidyverse - need to do this each time.
library(janitor) # janitor helps us clean datasets
library(here) # here helps us to know where files are
# update.packages() # You can uncomment this if you want to update your packages. 
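
Because a package only needs to be installed once, one option is to check which packages are missing before calling install.packages(). The following is a minimal sketch of that idea and is not part of the original workflow: find_missing_packages() is a hypothetical helper name, and it relies only on requireNamespace() from base R.

```r
# Hypothetical helper: return the names of packages that are not yet installed.
find_missing_packages <- function(packages) {
  is_installed <- vapply(
    packages,
    function(package) requireNamespace(package, quietly = TRUE),
    logical(1)
  )
  packages[!is_installed]
}

needed_packages <- c("tidyverse", "janitor", "here")
missing_packages <- find_missing_packages(needed_packages)

# Only install what is missing (uncomment to actually install).
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

You would still need to call library() for each package every time you start a new session.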

Get the data

We read in the dataset from the Elections Canada website. We can actually pass a website to the read_tsv() function, which saves a lot of time.

#### Read in the data ####
# Read in the data using read_tsv from the readr package (part of the tidyverse)
# The <- assigns the output of readr::read_tsv to an object called 
# raw_2019_elections_data. 
raw_2019_elections_data <- readr::read_tsv(file = "http://enr.elections.ca/DownloadResults.aspx",
                            skip = 1) 
# There is some debris on the first line so we skip it.
# We have read the data from the Elections Canada website. We may like to save 
# it just in case something happens and they move it. 
write_csv(raw_2019_elections_data, here("inputs/data/canadian_2019_voting.csv"))
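
To see what skip = 1 does, here is a small sketch that uses made-up tab-separated text rather than the Elections Canada file. The contents are invented for illustration, and wrapping the text in I() to mark it as literal data assumes readr 2.0.0 or later.

```r
library(readr)

# Made-up example: the first line is debris, the second line holds the column names.
example_text <- "Some debris that is not data\nriding\tvotes\nA\t10.5\nB\t20.5\n"

# skip = 1 drops the first line, so the column names come from the second line.
example_data <- read_tsv(I(example_text), skip = 1, show_col_types = FALSE)

example_data
```

Without skip = 1 the debris line would be treated as the header, which is why it is worth opening a raw file and looking at its first few lines before reading it in.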

Clean the data

Now we’d like to clean the data so that we can use it.

#### Basic cleaning ####
raw_2019_elections_data <- read_csv(here("inputs/data/canadian_2019_voting.csv"))
# If you loaded the package with library() (as we did) then you don't need the 
# janitor:: prefix and could just use clean_names(). I'm making it explicit 
# here, but won't in the future.
cleaned_2019_elections_data <- janitor::clean_names(raw_2019_elections_data)
# One thing to notice for those who have a Stata background is that we just 
# overwrote the name - that's fine in R.

# The pipe operator - %>% - pushes the output from one line to be an input to the 
# next line.
cleaned_2019_elections_data <- 
  cleaned_2019_elections_data %>% 
  # Filter to only have certain rows
  filter(type_of_results == "validated") %>% 
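  # Select only certain columns
  select(electoral_district_number_numero_de_la_circonscription,
         electoral_district_name,
         political_affiliation,
         surname_nom_de_famille,
         percent_votes_obtained_votes_obtenus_percent
         ) %>% 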
  # Rename the columns to be a bit shorter
  rename(riding_number = electoral_district_number_numero_de_la_circonscription,
         riding = electoral_district_name,
         party = political_affiliation,
         surname = surname_nom_de_famille,
         votes = percent_votes_obtained_votes_obtenus_percent)

head(cleaned_2019_elections_data)
# A tibble: 6 x 5
  riding_number riding               party               surname votes
          <dbl> <chr>                <chr>               <chr>   <dbl>
1         10001 Avalon               Conservative        Chapman  31.1
2         10001 Avalon               Green Party         Malone    5.4
3         10001 Avalon               Liberal             McDona…  46.3
4         10001 Avalon               NDP-New Democratic… Movelle  17.3
5         10002 Bonavista--Burin--T… NDP-New Democratic… Cooper   12  
6         10002 Bonavista--Burin--T… Green Party         Reichel   2.9
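
If the cleaning steps feel like a lot at once, it can help to try them on a tiny dataset first. The sketch below uses made-up data with column names loosely modelled on the ones above (the values and the toy_ object names are invented for illustration). It shows what clean_names() does to messy column names, and that a pipeline is just another way of writing nested function calls.

```r
library(dplyr)
library(janitor)

# Made-up data with messy, bilingual-style column names (for illustration only).
toy_raw <- data.frame(
  "Surname/Nom de famille" = c("Smith", "Jones", "Lee"),
  "Type of results" = c("validated", "preliminary", "validated"),
  "% Votes Obtained/Votes obtenus %" = c(40.5, 30.0, 29.5),
  check.names = FALSE
)

# clean_names() converts the column names to snake_case.
toy_clean <- clean_names(toy_raw)
names(toy_clean)

# With the pipe: each step's output becomes the first argument of the next step.
toy_piped <- toy_clean %>%
  filter(type_of_results == "validated") %>%
  select(surname_nom_de_famille, percent_votes_obtained_votes_obtenus_percent) %>%
  rename(surname = surname_nom_de_famille,
         votes = percent_votes_obtained_votes_obtenus_percent)

# Without the pipe: the same steps as nested calls, which have to be read inside-out.
toy_nested <- rename(
  select(
    filter(toy_clean, type_of_results == "validated"),
    surname_nom_de_famille, percent_votes_obtained_votes_obtenus_percent
  ),
  surname = surname_nom_de_famille,
  votes = percent_votes_obtained_votes_obtenus_percent
)

# Both versions give the same result.
identical(toy_piped, toy_nested)
```

The piped version reads top to bottom in the order the steps happen, which is the main reason to prefer it.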

Finally we may like to save our cleaned dataset.

#### Save ####
readr::write_csv(cleaned_2019_elections_data, "outputs/data/cleaned_canadian_2019_voting.csv")

Make a graph

First we need to read in the dataset. We then filter to a smaller number of parties, and filter to only ridings in Ontario, and then only those in Toronto.

#### Read in the data ####
cleaned_2019_elections_data <- 
  readr::read_csv("outputs/data/cleaned_canadian_2019_voting.csv")

# Make a graph that considers only Toronto ridings
toronto_ridings <- c("Beaches--East York", "Davenport", "Don Valley East", 
                     "Don Valley North", "Don Valley West", "Eglinton--Lawrence", 
                     "Etobicoke Centre", "Etobicoke North", "Etobicoke--Lakeshore", 
                     "Humber River--Black Creek", "Parkdale--High Park", "Scarborough Centre", 
                     "Scarborough North", "Scarborough Southwest", "Scarborough--Agincourt", 
                     "Scarborough--Guildwood", "Scarborough--Rouge Park", "Spadina--Fort York", 
                     "Toronto Centre", "Toronto--Danforth", "Toronto--St. Paul's", 
                     "University--Rosedale", "Willowdale", "York Centre", "York South--Weston")

cleaned_2019_elections_data %>% 
  filter(party %in% c("Bloc Québécois", 
                      "NDP-New Democratic Party")
  ) %>% 
  filter(riding_number > 35000,
         riding_number < 36000,
  ) %>% 
  filter(riding %in% toronto_ridings) %>% 
  ggplot(aes(x = riding, y = votes, color = party)) +
  geom_point() +
  theme_minimal() + # Make the theme neater
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + # Change the angle
  labs(x = "Toronto riding",
       y = "Votes (%)",
       color = "Party")
# Save the graph
ggsave("outputs/figures/toronto_results.pdf", width = 40, height = 20, units = "cm")
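
The two kinds of filter used above (a numeric range and %in% with a vector of names) can be checked on a small made-up dataset before trusting them on the real one. The sketch below is an illustration only: the data and the toy_ names are invented. It also stores the plot in an object and passes it to ggsave() through the plot argument, rather than relying on ggsave() picking up the most recent plot, and it saves to a temporary file.

```r
library(dplyr)
library(ggplot2)

# Made-up data (for illustration only).
toy_results <- tibble(
  riding_number = c(10001, 35001, 35002, 35003, 59001),
  riding = c("A", "B", "C", "D", "E"),
  votes = c(10, 20, 30, 40, 50)
)
ridings_of_interest <- c("B", "D", "E")

toy_subset <- toy_results %>%
  # Keep riding numbers strictly between 35000 and 36000
  filter(riding_number > 35000, riding_number < 36000) %>%
  # Keep only rows whose riding appears in the vector
  filter(riding %in% ridings_of_interest)

# Store the plot in an object so that ggsave() can be told exactly what to save.
toy_plot <- ggplot(toy_subset, aes(x = riding, y = votes)) +
  geom_point() +
  theme_minimal()

# Save to a temporary file here; in a project you would use a path in outputs/.
toy_file <- tempfile(fileext = ".pdf")
ggsave(toy_file, plot = toy_plot, width = 10, height = 10, units = "cm")
```

Riding "E" is in the vector of interest but is dropped by the range filter, which is a reminder that the filters apply one after the other.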

Make a table

There are an awful lot of ways to make a table in R. First we’ll try the built-in function summary().

#### Read in the data ####
cleaned_2019_elections_data <- 
  readr::read_csv("outputs/data/cleaned_canadian_2019_voting.csv")

#### Make some tables ####
# Try the default summary table
summary(cleaned_2019_elections_data)
 riding_number      riding             party          
 Min.   :10001   Length:2146        Length:2146       
 1st Qu.:24051   Class :character   Class :character  
 Median :35056   Mode  :character   Mode  :character  
 Mean   :35584                                        
 3rd Qu.:47007                                        
 Max.   :62001                                        
   surname              votes      
 Length:2146        Min.   : 0.00  
 Class :character   1st Qu.: 1.50  
 Mode  :character   Median : 7.80  
                    Mean   :15.75  
                    3rd Qu.:26.48  
                    Max.   :85.50  
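
Notice that summary() treats each column according to its type: numeric columns get a minimum, quartiles, mean, and maximum, while character columns only get a length and class. A small sketch with made-up data (invented for illustration) makes this easier to see, and shows that summary() also works on a single column.

```r
# Made-up data (for illustration only).
toy_results <- data.frame(
  party = c("A", "A", "B", "B"),
  votes = c(10, 20, 30, 40)
)

# For the data frame: one block of output per column.
summary(toy_results)

# For a single numeric column: minimum, quartiles, mean, and maximum.
votes_summary <- summary(toy_results$votes)
votes_summary
```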

Now we can try a group_by() and summarise().

# Make our own
cleaned_2019_elections_data %>% 
  # Using group_by and summarise means that whatever summary statistics we 
  # construct will be on a party basis. We could group_by multiple variables and
  # similarly, we could create a bunch of different other summary statistics.
  group_by(party) %>% 
  summarise(min = min(votes),
            mean = mean(votes),
            max = max(votes))
# A tibble: 23 x 4
   party                         min   mean   max
   <chr>                       <dbl>  <dbl> <dbl>
 1 Animal Protection Party       0.3  0.447   0.7
 2 Bloc Québécois                4   32.1    58.2
 3 CFF - Canada's Fourth Front   0.1  0.186   0.3
 4 Christian Heritage Party      0.2  0.737   3.3
 5 Communist                     0.1  0.24    0.7
 6 Conservative                  2.3 34.0    85.5
 7 Green Party                   1.3  6.46   49.1
 8 Independent                   0.1  1.16   32.6
 9 Liberal                       4.1 33.5    62.2
10 Libertarian                   0.3  0.604   2.4
# … with 13 more rows
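
As the comment in the code mentions, we could group by more than one variable and construct other summary statistics. Here is a minimal sketch of that using made-up data (the province column, the values, and the toy_ names are invented for illustration). The .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up data (for illustration only).
toy_results <- tibble(
  province = c("ON", "ON", "ON", "QC", "QC", "QC"),
  party    = c("A", "A", "B", "A", "B", "B"),
  votes    = c(10, 20, 30, 40, 50, 70)
)

toy_summary <- toy_results %>%
  # Grouping by two variables gives one row per province-party combination
  group_by(province, party) %>%
  summarise(n_candidates = n(),
            median = median(votes),
            sd = sd(votes),
            # Drop the grouping afterwards so later steps work on the whole table
            .groups = "drop")

toy_summary
```

Groups with a single row get NA for the standard deviation, which is worth keeping in mind when some parties only run one candidate.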

Firke, Sam. 2020. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

Ganesh, Janan. 2020. “Will Liverpool’s Machine Football Conquer America?” Financial Times. https://www.ft.com/content/d61f94ba-53cb-11ea-8841-482eed0038b1.

Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Snow, John. 1854. On the Mode of Communication of Cholera. C.F. Cheffins.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.