Chapter 3 R Essentials
Required reading
- Bryan, Jennifer and Jim Hester, 2020, What they Forgot to Teach You About R, Chapters 1 to 5, https://rstats.wtf/debugging-r-code.html.
- Wickham, Hadley, and Garrett Grolemund, 2017, R for Data Science, Chapters 3 - 6, 8, 10, 11, 13, 14, 15, and 18, https://r4ds.had.co.nz/.
Required viewing
- Kuriwaki, Shiro, 2020, ‘Defining Custom Functions in R’, Vimeo, 2 February, https://vimeo.com/388825332.
Alternative reading
There are a lot of great alternative ‘getting started with R’ type materials. Depending on your background and interests you may find some of the following useful:
- Arnold, Taylor, and Lauren Tilton, 2015, Humanities Data in R, Springer, Chapters 1 to 5.
- Hall, Megan, 2019, ‘An Introduction to R With Hockey Data’, https://hockey-graphs.com/2019/12/11/an-introduction-to-r-with-hockey-data/.
- Hanretty, Chris, 2020, ‘ConveRt’, slides http://chrishanretty.co.uk/conveRt/#1.
- Phillips, Nathaniel D., 2018, YaRrr! The Pirate’s Guide to R, Chapter 2, https://bookdown.org/ndphillips/YaRrr/started.html.
Recommended reading
- Alexander, Monica, 2019, ‘The concentration and uniqueness of baby names in Australia and the US’, https://www.monicaalexander.com/posts/2019-20-01-babynames/.
- Hvitfeldt, Emil, 2020, ‘Emoji in ggplot2’, https://www.hvitfeldt.me/blog/real-emojis-in-ggplot2/.
- Pavlik, Kaylin, 2018, ‘Dairy Queen Deserts in Minnesota’, https://www.kaylinpavlik.com/dairy-queen-deserts/.
- ‘R Studio Cloud Guide’, https://rstudio.cloud/learn/guide.
- Scherer, Cédric, 2019, ‘Best TidyTuesday 2019’, https://cedricscherer.netlify.com/2019/12/30/best-tidytuesday-2019/.
- Silge, Julia, 2019, ‘Reordering and facetting for ggplot2’, https://juliasilge.com/blog/reorder-within/.
- Smale, David, 2019, ‘Happy Days’, https://davidsmale.netlify.com/portfolio/happy-days/.
Key libraries
ggplot2
tidyverse
Key concepts/skills/etc
- Tibbles
- Importing data
- Joining data
- Strings
- Factors
- Dates
- Pivot
Key functions
class()
dplyr::case_when()
ggplot2::facet_wrap()
ggplot2::geom_density()
ggplot2::geom_histogram()
ggplot2::geom_point()
janitor::clean_names()
skimr::skim()
tidyr::pivot_longer()
tidyr::pivot_wider()
Quiz
- If I had a dataset with the following columns: name, age and wanted to focus on name, then which verb should I use (pick one)?
- dplyr::select().
- dplyr::mutate().
- dplyr::filter().
- dplyr::rename().
- If I want to cite R then how do I find a recommended citation (pick one)?
- cite('R').
- cite().
- citation('R').
- citation().
- What are three advantages of R? What are three disadvantages?
- What is R Studio?
- An integrated development environment (IDE)
- A closed source paid program
- A programming language created by Guido van Rossum
- A statistical programming language
- What is R?
- An open source statistical programming language
- A programming language created by Guido van Rossum
- A closed source statistical programming language
- An integrated development environment (IDE)
- Which of the following are not tidyverse verbs (pick one)?
- select().
- filter().
- arrange().
- mutate().
- visualize().
- If I wanted to make a new column which verb should I use (pick one)?
- select().
- filter().
- arrange().
- mutate().
- visualize().
- If I wanted to focus on particular rows which verb should I use (pick one)?
- select().
- filter().
- arrange().
- mutate().
- summarise()
- If I wanted a summary of the data that gave me the mean by sex, which two verbs should I use (pick two)?
- summarise().
- filter().
- arrange().
- mutate().
- group_by().
- What are the three key aspects of the grammar of graphics (select all)?
- data.
- aesthetics.
- type.
- geom_histogram().
- What is not one of the four challenges for mitigating bias mentioned in Hao 2019 (pick one)?
- Unknown unknowns.
- Imperfect processes.
- The definitions of fairness.
- Lack of social context.
- Disinterest given profit considerations.
- What would be the output of class('edward') (pick one)?
- “numeric”.
- “character”.
- “data.frame”.
- “vector”.
- How can I simulate 10,000 draws from a normal distribution with a mean of 27 and a standard deviation of 3 (pick one)?
- rnorm(10000, mean = 27, sd = 3).
- rnorm(27, mean = 10000, sd = 3).
- rnorm(3, mean = 10000, sd = 27).
- rnorm(27, mean = 3, sd = 1000).
3.1 R essentials
This section covers the basics of using R. Some of it may not make sense at first, but these are commands that we will come back to throughout these notes. You should initially just go through this chapter quickly, noting aspects that you don’t understand. Then start to play around with some of the initial case studies. Then maybe come back to this chapter. That way you will see how the various bits fit into context, and hopefully be more motivated to pick up various aspects. We will come back to everything in this chapter in more detail at some point in these notes.
R is an open source language that is useful for statistical programming.
You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download R Studio Desktop for free here: https://rstudio.com/products/rstudio/download/#download.
When you are using R you will run into trouble at some point. To work through that trouble:
- Look at the help file for the function by putting ? before the function e.g. ?pivot_wider.
- Check the class of your data, by class(data_set$data_column).
- Check for typos.
- Google the error.
- Google what you are trying to do.
- Restart R (Session -> Restart R and Clear Output).
- Try to make a small example and see if you have the same issues.
- Restart your computer.
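As a small illustration of checking the class of your data, the sketch below uses a made-up data frame. The names data_set and data_column are placeholders for illustration, not objects from these notes.

```r
# A made-up data frame; the names are placeholders for illustration
data_set <- data.frame(
  data_column = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# The numbers were stored as text, which is a common source of errors
column_class <- class(data_set$data_column)
column_class

# Converting the column to numeric changes its class
data_set$data_column <- as.numeric(data_set$data_column)
class(data_set$data_column)

# To open the help file for a function, run this interactively:
# ?mean
```

Checking the class first often explains why a function such as mean() returns a warning or an NA instead of a number.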
The past ten years or so of R have been characterised by the rise of the tidyverse. This is ‘… an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.’ Wickham (2020b). There are three distinctions here: the original R language, typically referred to as ‘base’; the ‘tidyverse’ which is a collection of packages that build on top of the R language; and other packages.
Pretty much everything that you can do in the tidyverse, you can also do in base. However, as the tidyverse was built especially for modern data science it is usually easier to use the tidyverse, especially when you are setting out. Additionally, pretty much everything that you can do in the tidyverse, you can also do with other packages. However, as the tidyverse is a coherent collection of packages, it is often easier to use the tidyverse, especially when you are setting out. Eventually you will start to see cases where it makes sense to trade off the convenience and coherence of the tidyverse for some features of base or other packages. Indeed you’ll see that at various points in these notes. For instance, the tidyverse can be slow, and so if you need to import thousands of CSVs then it can make sense to switch away from read_csv(). That is great, and the appropriate use of base and non-tidyverse packages, rather than dogmatic insistence on a solution, is a sign of your development as an applied statistician.
Get started by loading the tidyverse package.
The general workflow that we will use involves:
1. Import
2. Tidy
3. Transform and describe
4. Plot
5. Model
6. Repeat steps 3 and 4
People like Keyes have tried to tell us this for a long time, but COVID-19 made it very clear to everyone: most of the data that we use will have humans at the heart of it. It’s vitally important that you keep that in mind and grapple with it in everything that you do with R. It can be really easy to forget that almost every point in our dataset is likely a person.
3.3 R, R Studio, and R Studio Cloud
My colleague Liza Bolton has a lovely analogy here on the relationship between R and R Studio which I really like. R is like a car engine and R Studio is like the car. Although some of us can use a car engine directly, most of us use a car to interact with the engine.
3.3.1 R
R - https://www.r-project.org/ - is an open source and free programming language that is focused on general statistics. (Free in this context doesn’t refer to a price of zero, but instead to ‘freedom’, although it also has a price of zero.) This is in contrast with an open source programming language that is designed for general-purpose use, such as Python, or an open source programming language that is focused on probability, such as Stan. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. It is maintained by the R Core Team and changes to this ‘base’ of code occur methodically and with concern given to a variety of different priorities.
If you are in Canada then you can download R here: http://cran.utstat.utoronto.ca/, if you are in Australia then you can download R here: https://cran.csiro.au/, otherwise you should go here - https://cran.r-project.org/mirrors.html - and find a location that suits you. (It doesn’t really matter where you get it from, it’s just that it may be slightly faster to use a closer option.)
Many people build on this stable base, to extend the capabilities of R to better and more quickly suit their needs. They do this by creating packages. Typically, although not always, a package is a collection of R code, and this allows you to more easily do things that you want to do. These packages are managed by the Comprehensive R Archive Network (CRAN) - https://cran.r-project.org/, and other repositories. CRAN is built into the download of R that you just got, so you can use it straight away.
If you want to use a package then you first need to install it on your computer, and then you need to load it when you want to use it. Di Cook, who is a Professor of Business Analytics at Monash University in Australia, describes this as analogous to a lightbulb: if you want light in your house, first you need to screw in the lightbulb, and then you need to turn the switch on. You only need to screw in the lightbulb once per house, but you need to turn the switch on every time you want to use the light.
To install a package on your computer (again, you’ll need to do this only once per computer) you use install.packages().
Then when you want to use a package, you need to load it with library().
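A minimal sketch of those two steps, using the tidyverse as the example package. The check with requireNamespace() and the explicit repos argument are additions for illustration, so that the install only runs when it is needed; interactively you would usually just type install.packages("tidyverse") once.

```r
# Install once per computer (screwing in the lightbulb).
# The check means the install is skipped if the package is already there.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load in every new R session (turning on the switch)
library(tidyverse)
```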
You can open R and use it on your computer. It is primarily designed to be interacted with through the command line. This is how I had to start with R, and it’s fine, but it can be useful to have a richer environment than the command line provides. In particular, it can be useful to install an Integrated Development Environment (IDE), which is an application that brings together various bits and pieces that you’ll use all the time. The one that we will use is R Studio.
3.3.2 R Studio
R Studio is distinct from R; they are different entities. R Studio builds on top of R to make it easier for you to use R. This is in the same way that you can use the internet from the command line, but most of us use a browser such as Chrome, Firefox, or Safari.
R Studio is free in the sense that you don’t pay anything for it. It is also free in the sense of being able to take the code, modify it, and distribute that code provided others are similarly allowed to take your code and modify it and distribute, etc. However, it is important to recognise that R Studio is an entity and so it is possible that in the future the current situation could change.
You can download R Studio here: https://rstudio.com/products/rstudio/download/#download.
When you open R Studio it will look like Figure 3.2.

Figure 3.2: Opening R Studio for the first time
The left pane is a console in which you can type and execute R code line by line. Try it by clicking next to the prompt ‘>’, typing 2 + 2, and then pressing enter. The console should print:
## [1] 4
And hopefully you get the same answer printed in the console.
The pane on the top right has information about your environment. For instance, when we create variables a list of their names and some properties will appear there. Try to type the following code, replacing my name with your name, next to the prompt, and again press enter:
You should notice a new value in the environment pane with the variable name and its value.
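The assignment described above might look like the following sketch. The variable name my_name and the text "Your Name" are placeholders chosen for illustration.

```r
# Assign a character value to a variable; replace the text with your own name
my_name <- "Your Name"

# Typing the variable name prints its value in the console
my_name
```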
The pane in the bottom right is a file manager. At the moment it should just have two files - an R History file and a R Project file. We’ll get to what these are later, but for now we will create and save a file.
Type out the following code (don’t worry too much about the details for now):
And you should see a new ‘.rds’ file in your list of files.
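One way to create such a file is sketched below. The data frame and the file name my_first_file.rds are made up for illustration and are not necessarily what the original example used.

```r
# A small made-up data frame to save
my_data <- data.frame(name = c("a", "b"), value = c(1, 2))

# Save it as an .rds file in the current working directory
saveRDS(object = my_data, file = "my_first_file.rds")

# Read it back in to confirm that nothing changed
my_data_again <- readRDS(file = "my_first_file.rds")
identical(my_data, my_data_again)
```

An .rds file stores a single R object, including its classes, which is why the object read back in matches the original.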
3.3.3 R Studio Cloud
While you can download R Studio to your own computer, initially we will use R Studio Cloud, which is an online version that is provided by R Studio. We will use this so that you can focus on getting comfortable with R and R Studio in an environment that is consistent. This way you don’t have to worry about what computer you have or installation permissions while you are still getting used to the basics.
The R Studio Cloud - https://rstudio.cloud/ - is as easy as it gets in terms of moving to the cloud. The trade-off is that it is not very powerful and it is sometimes slow, but for the purposes of the initial sections of these notes that will be fine.
To get started, go to https://rstudio.cloud/ and create an account. If you are going to be a student for a while then it might be worthwhile using a university email account: although they don’t yet charge for it, they will probably start charging soon, and with some luck they will offer education discounts.
Once you have an account and log in, then it should look something like Figure 3.3.

Figure 3.3: Opening R Studio Cloud for the first time
(You’ll be in ‘Your Workspace’, and you won’t have an ‘Example Workspace’.) From here you should start a ‘New Project’. You can give the project a name by clicking on ‘Untitled Project’ and replacing it. We can now use R Studio in the cloud.
While working line-by-line in the console is fine, it is easier to write out a whole script that can then be executed. We will do this by making an R Script. To do this go to: File -> New File -> R Script, or use the shortcut Command + Shift + N. The console pane will fall to the bottom left and an R Script will open in the top left. Let’s write some code that will grab all of the Australian politicians and then construct a small table about the genders of the prime ministers.
(Some of this code won’t make sense at this stage, but just type it all out to get in the habit and then run it, by selecting all of the code and clicking ‘Run’ (or using the keyboard shortcut: Command + Return).)
# Load the packages that we need to use this time
library(devtools)
library(tidyverse)
# Grab the data on Australian politicians
install_github("RohanAlexander/AustralianPoliticians")
# Make a table of the counts of genders of the prime ministers
AustralianPoliticians::all %>%
  as_tibble() %>%
  count(gender, wasPrimeMinister)
## # A tibble: 4 x 3
## gender wasPrimeMinister n
## <chr> <int> <int>
## 1 female 1 1
## 2 female NA 235
## 3 male 1 29
## 4 male NA 1511
You can save your R Script as ‘my_first_r_script.R’ using File -> Save As (or the keyboard shortcut: Command + S). When you’re done your workspace should look something like Figure 3.4.

Figure 3.4: After running an R Script
One thing to be aware of is that each R Studio Cloud workspace is essentially a new computer. Because of this, you’ll need to install any package that you want to use for each workspace. For instance, before you can use the tidyverse, you need to run install.packages("tidyverse"). This is in contrast to when you use your own computer.
A few final notes on R Studio Cloud for you to keep in the back of your mind:
- In the Australian politicians example we got our data from the website GitHub, but you can get data into your workspace from your local computer in a variety of ways. One way is to use the ‘upload’ button in the Files panel.
- R Studio Cloud allows some degree of collaboration. For instance, you can give someone else access to a workspace that you create. This could be useful for collaborating on an assignment, although it is not quite full featured yet and you cannot both be in the workspace at the same time (in contrast to, say, Google Docs).
- There are a variety of weaknesses of R Studio Cloud, in particular at the moment there is a 1GB limit on RAM. Additionally, it is still under-developed and things break from time to time. The R Studio Community page that is focused on R Studio Cloud can be helpful sometimes: https://community.rstudio.com/c/rstudio-cloud.
3.4 Tidyverse I
Aspects of ‘Tidyverse I’ were written with Monica Alexander.
One of the key packages that we use in these notes is the tidyverse (Wickham, Averick, et al. 2019a). The tidyverse is actually a package of packages (i.e. when you install tidyverse, you are actually installing a whole bunch of different packages). The key package in the tidyverse in terms of manipulating data is dplyr (Wickham, François, et al. 2020), and the key package in the tidyverse in terms of creating graphs is ggplot2 (Wickham 2016).
In this section we are going to cycle through some essentials from the Tidyverse. You’ll come back to the functions in this section regularly.
I want to keep this section self-contained, so let’s start by installing the tidyverse (again, to use Di Cook’s analogy, this is the equivalent of screwing in the light-bulb). If you just did it, then you don’t need to do it again.
Now we can load the tidyverse (again, to use Di Cook’s analogy, the equivalent of turning on the light-switch).
Here we are going to download the data about Australian politicians using the function read_csv().
australian_politicians <-
  read_csv(
    file =
      "https://raw.githubusercontent.com/RohanAlexander/telling_stories_with_data/master/inputs/data/australian_politicians.csv"
  )
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## birthDate = col_date(format = ""),
## birthYear = col_double(),
## deathDate = col_date(format = ""),
## member = col_double(),
## senator = col_double(),
## wasPrimeMinister = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
We will now cover the pipe and six functions that are useful to know and that we will use all the time:
select()
filter()
arrange()
mutate()
summarise()/summarize()
group_by()
3.4.1 The pipe
One key tidyverse helper is the ‘pipe’: %>%. Read it as “and then” (keyboard shortcut: Command + Shift + M). This takes the output of a line of code and uses it as an input to the next line of code. You don’t have to use it, but it tends to make your code more readable.
The idea of the pipe is that you take your dataset, and then do something to it. In this case, we will look at the first few lines of our dataset by piping australian_politicians through to the head() function.
## # A tibble: 6 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abbott1… Abbott Richard Hart… Richard <NA> Abbott, Ri…
## 2 Abbott1… Abbott Percy Phipps Percy <NA> Abbott, Pe…
## 3 Abbott1… Abbott Macartney Macartney Mac Abbott, Mac
## 4 Abbott1… Abbott Charles Lydi… Charles Aubrey Abbott, Au…
## 5 Abbott1… Abbott Joseph Palmer Joseph <NA> Abbott, Jo…
## 6 Abbott1… Abbott Anthony John Anthony Tony Abbott, To…
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
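Output like that above comes from piping the dataset into head(). The sketch below shows the pattern on a small toy tibble, toy_politicians, which is a made-up stand-in built from a few of the names shown above so that it runs without downloading anything.

```r
library(dplyr)  # part of the tidyverse; provides %>% and tibble()

# Toy stand-in for australian_politicians, using names from the output above
toy_politicians <- tibble(
  firstName = c("Richard", "Percy", "Macartney", "Charles",
                "Joseph", "Anthony", "John", "Eric"),
  surname = c("Abbott", "Abbott", "Abbott", "Abbott",
              "Abbott", "Abbott", "Abel", "Abetz")
)

# Take the dataset, and then look at its first six rows
first_rows <- toy_politicians %>%
  head()
first_rows

# The pipe passes the left-hand side in as the first argument,
# so this is the same as calling head() directly
identical(first_rows, head(toy_politicians))
```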
3.4.2 Selecting
The select() function is used to get a particular column of a dataset. For instance, we might like to select the first names column.
## # A tibble: 6 x 1
## firstName
## <chr>
## 1 Richard
## 2 Percy
## 3 Macartney
## 4 Charles
## 5 Joseph
## 6 Anthony
In R, there are many ways to do things. Another way to get a particular column of a dataset is to use the dollar sign. This is from base R (as opposed to select() which is from the tidyverse package).
## [1] "Richard" "Percy" "Macartney" "Charles" "Joseph" "Anthony"
The two are almost equivalent and differ only in the class of what they return (we’ll talk more about class later in the notes).
For the sake of completeness, if you combine select() with pull() then you will get the same class of output as if you use the dollar sign.
## [1] "Richard" "Percy" "Macartney" "Charles" "Joseph" "Anthony"
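To make the difference in class concrete, here is a sketch that compares the three approaches on a toy tibble (toy_politicians is a made-up stand-in for the full dataset).

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  firstName = c("Richard", "Percy", "Macartney"),
  surname = c("Abbott", "Abbott", "Abbott")
)

# select() returns a tibble with one column
selected <- toy_politicians %>%
  select(firstName)
class(selected)

# The dollar sign returns the column itself, here a character vector
dollar <- toy_politicians$firstName
class(dollar)

# select() followed by pull() also returns the character vector
pulled <- toy_politicians %>%
  select(firstName) %>%
  pull()
identical(dollar, pulled)
```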
You can also use select() to get rid of columns, by selecting in a negative sense.
## # A tibble: 1,776 x 19
## uniqueID surname allOtherNames commonName displayName earlierOrLaterN… title
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abbott1… Abbott Richard Hart… <NA> Abbott, Ri… <NA> <NA>
## 2 Abbott1… Abbott Percy Phipps <NA> Abbott, Pe… <NA> <NA>
## 3 Abbott1… Abbott Macartney Mac Abbott, Mac <NA> <NA>
## 4 Abbott1… Abbott Charles Lydi… Aubrey Abbott, Au… <NA> <NA>
## 5 Abbott1… Abbott Joseph Palmer <NA> Abbott, Jo… <NA> <NA>
## 6 Abbott1… Abbott Anthony John Tony Abbott, To… <NA> <NA>
## 7 Abel1939 Abel John Arthur <NA> Abel, John <NA> <NA>
## 8 Abetz19… Abetz Eric <NA> Abetz, Eric <NA> <NA>
## 9 Adams19… Adams Judith Anne <NA> Adams, Jud… nee Bird <NA>
## 10 Adams19… Adams Dick Godfrey… <NA> Adams, Dick <NA> <NA>
## # … with 1,766 more rows, and 12 more variables: gender <chr>,
## # birthDate <date>, birthYear <dbl>, birthPlace <chr>, deathDate <date>,
## # member <dbl>, senator <dbl>, wasPrimeMinister <dbl>, wikidataID <chr>,
## # wikipedia <chr>, adb <chr>, comments <chr>
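Selecting in a negative sense means putting a minus sign in front of the column name. A sketch on a toy tibble (a made-up stand-in for the full dataset) follows.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  firstName = c("Richard", "Percy"),
  surname = c("Abbott", "Abbott"),
  birthYear = c(1859, 1869)
)

# Keep every column except firstName
without_first_name <- toy_politicians %>%
  select(-firstName)
names(without_first_name)
```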
Finally, you can select based on conditions. For instance, you can select all of the columns whose names start with something, such as ‘birth’.
## # A tibble: 1,776 x 3
## birthDate birthYear birthPlace
## <date> <dbl> <chr>
## 1 NA 1859 Bendigo
## 2 1869-05-14 1869 Hobart
## 3 1877-07-03 1877 Murrurundi
## 4 1886-01-04 1886 St Leonards
## 5 1891-10-18 1891 North Sydney
## 6 1957-11-04 1957 London
## 7 1939-06-25 1939 Sydney
## 8 1958-01-25 1958 Stuttgart
## 9 1943-04-11 1943 Picton
## 10 1951-04-29 1951 Launceston
## # … with 1,766 more rows
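One way to select on a condition like this is the starts_with() helper. The sketch below applies it to a toy tibble that stands in for the full dataset.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  surname = c("Abbott", "Abbott"),
  birthYear = c(1859, 1869),
  birthPlace = c("Bendigo", "Hobart")
)

# Keep only the columns whose names start with "birth"
birth_columns <- toy_politicians %>%
  select(starts_with("birth"))
names(birth_columns)
```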
3.4.3 Filtering
The filter() function is used to get particular rows from a dataset. For instance, we might like to filter to only politicians that became prime minister.
## # A tibble: 30 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abbott1… Abbott Anthony John Anthony Tony Abbott, To…
## 2 Barton1… Barton Edmund Edmund <NA> Barton, Ed…
## 3 Bruce18… Bruce Stanley Melb… Stanley <NA> Bruce, Sta…
## 4 Chifley… Chifley Joseph Bened… Joseph Ben Chifley, B…
## 5 Cook1860 Cook Joseph Joseph <NA> Cook, Jose…
## 6 Curtin1… Curtin John Joseph … John <NA> Curtin, Jo…
## 7 Deakin1… Deakin Alfred Alfred <NA> Deakin, Al…
## 8 Fadden1… Fadden Arthur Willi… Arthur Arthur Fadden, Ar…
## 9 Fisher1… Fisher Andrew Andrew <NA> Fisher, An…
## 10 Forde18… Forde Francis Mich… Francis Frank Forde, Fra…
## # … with 20 more rows, and 14 more variables: earlierOrLaterNames <chr>,
## # title <chr>, gender <chr>, birthDate <date>, birthYear <dbl>,
## # birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
## # wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>, adb <chr>,
## # comments <chr>
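A sketch of this kind of filter on a toy tibble (a made-up stand-in for the full dataset) is below. As in the full dataset, wasPrimeMinister is 1 for prime ministers and NA otherwise, and filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  surname = c("Abbott", "Abel", "Barton"),
  firstName = c("Anthony", "John", "Edmund"),
  wasPrimeMinister = c(1, NA, 1)
)

# Keep only the rows where wasPrimeMinister is 1
prime_ministers <- toy_politicians %>%
  filter(wasPrimeMinister == 1)
prime_ministers
```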
The filter() function also accepts more than one condition. For instance, we can look at politicians who were prime minister and were named Joseph.
## # A tibble: 3 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Chifley… Chifley Joseph Bened… Joseph Ben Chifley, B…
## 2 Cook1860 Cook Joseph Joseph <NA> Cook, Jose…
## 3 Lyons18… Lyons Joseph Aloys… Joseph <NA> Lyons, Jos…
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
We would get the same result if we use a comma instead of an ampersand.
## # A tibble: 3 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Chifley… Chifley Joseph Bened… Joseph Ben Chifley, B…
## 2 Cook1860 Cook Joseph Joseph <NA> Cook, Jose…
## 3 Lyons18… Lyons Joseph Aloys… Joseph <NA> Lyons, Jos…
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
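The equivalence between the ampersand and the comma can be checked directly. This is a sketch on a toy tibble that stands in for the full dataset.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  surname = c("Chifley", "Cook", "Abbott", "Barton"),
  firstName = c("Joseph", "Joseph", "Joseph", "Edmund"),
  wasPrimeMinister = c(1, 1, NA, 1)
)

# Both conditions joined with an ampersand
with_ampersand <- toy_politicians %>%
  filter(wasPrimeMinister == 1 & firstName == "Joseph")

# The same conditions separated by a comma
with_comma <- toy_politicians %>%
  filter(wasPrimeMinister == 1, firstName == "Joseph")

identical(with_ampersand, with_comma)
```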
Similarly, we can look at politicians who were named Myles or Ruth.
## # A tibble: 3 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Coleman… Coleman Ruth Nancy Ruth <NA> Coleman, R…
## 2 Ferrick… Ferric… Myles Aloysi… Myles <NA> Ferricks, …
## 3 Webber1… Webber Ruth Stephan… Ruth <NA> Webber, Ru…
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
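An ‘or’ condition uses the vertical bar. The sketch below, on a toy tibble that stands in for the full dataset, also shows %in% as an alternative way of writing the same condition; that alternative is an addition for illustration and is not used elsewhere in this section.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  surname = c("Coleman", "Ferricks", "Webber", "Abel"),
  firstName = c("Ruth", "Myles", "Ruth", "John")
)

# Keep rows where the first name is Ruth or Myles
with_or <- toy_politicians %>%
  filter(firstName == "Ruth" | firstName == "Myles")

# The same condition written with %in%
with_in <- toy_politicians %>%
  filter(firstName %in% c("Ruth", "Myles"))

identical(with_or, with_in)
```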
We can also pipe the results, for instance, from filter() to select().
australian_politicians %>%
  filter(firstName == "Ruth" | firstName == "Myles") %>%
  select(firstName, surname)
## # A tibble: 3 x 2
## firstName surname
## <chr> <chr>
## 1 Ruth Coleman
## 2 Myles Ferricks
## 3 Ruth Webber
Finally, we can filter() to a particular row number, for instance, in this case row 853.
## # A tibble: 1 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Jarman1… Jarman Alan William Alan <NA> Jarman, Al…
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
But there is also a dedicated function to do this, which is slice().
## # A tibble: 1 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Jarman1… Jarman Alan William Alan <NA> Jarman, Al…
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
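Both approaches are sketched below on a toy tibble (a made-up stand-in for the full dataset), picking row 2 rather than row 853. Using row_number() inside filter() is one way to filter on position; it is an assumption that the original example did the same.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  surname = c("Abbott", "Abel", "Abetz"),
  firstName = c("Anthony", "John", "Eric")
)

# Filter to the second row using its row number
by_filter <- toy_politicians %>%
  filter(row_number() == 2)

# slice() does the same thing more directly
by_slice <- toy_politicians %>%
  slice(2)

by_filter
by_slice
```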
3.4.4 Arranging
We can change the order of the dataset based on the values in a particular column using the arrange() function. For instance, we may like to arrange the data by year of birth.
## # A tibble: 1,776 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abbott1… Abbott Richard Hart… Richard <NA> Abbott, Ri…
## 2 Abbott1… Abbott Percy Phipps Percy <NA> Abbott, Pe…
## 3 Abbott1… Abbott Macartney Macartney Mac Abbott, Mac
## 4 Abbott1… Abbott Charles Lydi… Charles Aubrey Abbott, Au…
## 5 Abbott1… Abbott Joseph Palmer Joseph <NA> Abbott, Jo…
## 6 Abbott1… Abbott Anthony John Anthony Tony Abbott, To…
## 7 Abel1939 Abel John Arthur John <NA> Abel, John
## 8 Abetz19… Abetz Eric Eric <NA> Abetz, Eric
## 9 Adams19… Adams Judith Anne Judith <NA> Adams, Jud…
## 10 Adams19… Adams Dick Godfrey… Dick <NA> Adams, Dick
## # … with 1,766 more rows, and 14 more variables: earlierOrLaterNames <chr>,
## # title <chr>, gender <chr>, birthDate <date>, birthYear <dbl>,
## # birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
## # wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>, adb <chr>,
## # comments <chr>
We can also use the desc() function to arrange in descending order.
## # A tibble: 1,776 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Zimmerm… Zimmer… Trent Moir Trent <NA> Zimmerman,…
## 2 Zeal1830 Zeal William Aust… William <NA> Zeal, Will…
## 3 Zappia1… Zappia Antonio Antonio Tony Zappia, To…
## 4 Zammit1… Zammit Paul John Paul <NA> Zammit, Pa…
## 5 Zakharo… Zakhar… Alice Olive Alice Olive Zakharov, …
## 6 Zahra19… Zahra Christian Jo… Christian <NA> Zahra, Chr…
## 7 Young19… Young Harold Willi… Harold <NA> Young, Har…
## 8 Young19… Young Michael Jero… Michael Mick Young, Mick
## 9 Young19… Young Terry James Terry <NA> Young, Ter…
## 10 Yates18… Yates George Edwin George Gunner Yates, Gun…
## # … with 1,766 more rows, and 14 more variables: earlierOrLaterNames <chr>,
## # title <chr>, gender <chr>, birthDate <date>, birthYear <dbl>,
## # birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
## # wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>, adb <chr>,
## # comments <chr>
We can also arrange based on more than one column.
## # A tibble: 1,776 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Blain18… Blain Adair Macali… Adair <NA> Blain, Ada…
## 2 Armstro… Armstr… Adam Alexand… Adam Bill Armstrong,…
## 3 Bandt19… Bandt Adam Paul Adam <NA> Bandt, Adam
## 4 Dein1889 Dein Adam Kemball Adam Dick Dein, Dick
## 5 Ridgewa… Ridgew… Aden Derek Aden <NA> Ridgeway, …
## 6 Bennett… Bennett Adrian Frank Adrian <NA> Bennett, A…
## 7 Gibson1… Gibson Adrian Adrian <NA> Gibson, Ad…
## 8 Wynne18… Wynne Agar Agar <NA> Wynne, Agar
## 9 Roberts… Robert… Agnes Robert… Agnes <NA> Robertson,…
## 10 Bird1906 Bird Alan Charles Alan <NA> Bird, Alan
## # … with 1,766 more rows, and 14 more variables: earlierOrLaterNames <chr>,
## # title <chr>, gender <chr>, birthDate <date>, birthYear <dbl>,
## # birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
## # wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>, adb <chr>,
## # comments <chr>
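The three uses of arrange() described so far are sketched below on a toy tibble, a made-up stand-in for the full dataset that is small enough for the ordering to be checked by eye.

```r
library(dplyr)

# Toy stand-in for australian_politicians
toy_politicians <- tibble(
  firstName = c("Joseph", "Adam", "Joseph", "Adam"),
  surname = c("Cook", "Bandt", "Abbott", "Dein"),
  birthYear = c(1860, 1972, 1891, 1889)
)

# Ascending order of birth year
by_year <- toy_politicians %>%
  arrange(birthYear)

# Descending order of birth year
by_year_desc <- toy_politicians %>%
  arrange(desc(birthYear))

# First name first, with ties broken by birth year
by_name_then_year <- toy_politicians %>%
  arrange(firstName, birthYear)

by_year
by_year_desc
by_name_then_year
```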
We can pipe arrange() to another arrange().
## # A tibble: 1,776 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abbott1… Abbott Anthony John Anthony Tony Abbott, To…
## 2 Abbott1… Abbott Charles Lydi… Charles Aubrey Abbott, Au…
## 3 Abbott1… Abbott Joseph Palmer Joseph <NA> Abbott, Jo…
## 4 Abbott1… Abbott Macartney Macartney Mac Abbott, Mac
## 5 Abbott1… Abbott Percy Phipps Percy <NA> Abbott, Pe…
## 6 Abbott1… Abbott Richard Hart… Richard <NA> Abbott, Ri…
## 7 Abel1939 Abel John Arthur John <NA> Abel, John
## 8 Abetz19… Abetz Eric Eric <NA> Abetz, Eric
## 9 Adams19… Adams Dick Godfrey… Dick <NA> Adams, Dick
## 10 Adams19… Adams Judith Anne Judith <NA> Adams, Jud…
## # … with 1,766 more rows, and 14 more variables: earlierOrLaterNames <chr>,
## # title <chr>, gender <chr>, birthDate <date>, birthYear <dbl>,
## # birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
## # wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>, adb <chr>,
## # comments <chr>
It is just important to be clear about the precedence of each.
## # A tibble: 1,776 x 20
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Abbott1… Abbott Anthony John Anthony Tony Abbott, To…
## 2 Abbott1… Abbott Charles Lydi… Charles Aubrey Abbott, Au…
## 3 Abbott1… Abbott Joseph Palmer Joseph <NA> Abbott, Jo…
## 4 Abbott1… Abbott Macartney Macartney Mac Abbott, Mac
## 5 Abbott1… Abbott Percy Phipps Percy <NA> Abbott, Pe…
## 6 Abbott1… Abbott Richard Hart… Richard <NA> Abbott, Ri…
## 7 Abel1939 Abel John Arthur John <NA> Abel, John
## 8 Abetz19… Abetz Eric Eric <NA> Abetz, Eric
## 9 Adams19… Adams Dick Godfrey… Dick <NA> Adams, Dick
## 10 Adams19… Adams Judith Anne Judith <NA> Adams, Jud…
## # … with 1,766 more rows, and 14 more variables: earlierOrLaterNames <chr>,
## # title <chr>, gender <chr>, birthDate <date>, birthYear <dbl>,
## # birthPlace <chr>, deathDate <date>, member <dbl>, senator <dbl>,
## # wasPrimeMinister <dbl>, wikidataID <chr>, wikipedia <chr>, adb <chr>,
## # comments <chr>
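To make the point about precedence concrete, here is a minimal sketch using a small made-up tibble. The name toy_politicians and its values are hypothetical and are not taken from the australian_politicians data. Listing two columns in one arrange() sorts by the first and uses the second to break ties, whereas piping one arrange() into another re-sorts by the column in the later call.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the real dataset
toy_politicians <- tibble(
  first_name = c("Bea", "Adam", "Adam", "Cara"),
  birth_year = c(1950, 1972, 1931, 1965)
)

# One call: first_name is the primary key, birth_year breaks ties
by_name_then_year <-
  toy_politicians %>%
  arrange(first_name, birth_year)

# Two piped calls: the later arrange() sets the primary ordering
by_year_last <-
  toy_politicians %>%
  arrange(first_name) %>%
  arrange(birth_year)

by_name_then_year
by_year_last
```

In the first result the two Adams sit together, ordered by birth year. In the second, the rows are ordered by birth year and the two Adams are separated.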
3.4.5 Grouping
We can group the rows of a dataset by the values of one or more variables using the function group_by() and then apply some other function within those groups. For instance, we could arrange by first name within gender, and then get the first three results.
## # A tibble: 6 x 20
## # Groups: gender [2]
## uniqueID surname allOtherNames firstName commonName displayName
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Roberts… Robert… Agnes Robert… Agnes <NA> Robertson,…
## 2 MacTier… MacTie… Alannah Joan… Alannah <NA> MacTiernan…
## 3 Zakharo… Zakhar… Alice Olive Alice Olive Zakharov, …
## 4 Blain18… Blain Adair Macali… Adair <NA> Blain, Ada…
## 5 Armstro… Armstr… Adam Alexand… Adam Bill Armstrong,…
## 6 Bandt19… Bandt Adam Paul Adam <NA> Bandt, Adam
## # … with 14 more variables: earlierOrLaterNames <chr>, title <chr>,
## # gender <chr>, birthDate <date>, birthYear <dbl>, birthPlace <chr>,
## # deathDate <date>, member <dbl>, senator <dbl>, wasPrimeMinister <dbl>,
## # wikidataID <chr>, wikipedia <chr>, adb <chr>, comments <chr>
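The code that produced this output is not shown, so the following is only a sketch of the idea on hypothetical toy data (toy_politicians and its values are made up). It assumes the pattern is group_by(), then arrange(), then slice(). Note that arrange() ignores the grouping by default, while slice() picks rows within each group.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the real dataset
toy_politicians <- tibble(
  first_name = c("Zoe", "Adam", "Alice", "Bob", "Agnes", "Carl"),
  gender = c("female", "male", "female", "male", "female", "male")
)

first_two_by_gender <-
  toy_politicians %>%
  group_by(gender) %>%    # later verbs that respect groups work within gender
  arrange(first_name) %>% # sorts all rows; grouping is ignored by default
  slice(1:2)              # first two rows within each gender

first_two_by_gender
```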
3.4.6 Mutating
The mutate() function is used to make a new column. For instance, perhaps we want to make a new column that is 1 if a person was both a member and a senator and 0 otherwise.
australian_politicians <-
australian_politicians %>%
mutate(was_both = if_else(member == 1 & senator == 1, 1, 0))
australian_politicians %>%
select(member, senator, was_both) %>%
head()
## # A tibble: 6 x 3
## member senator was_both
## <dbl> <dbl> <dbl>
## 1 0 1 0
## 2 1 1 1
## 3 0 1 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
3.4.7 Summarise
The function summarise() is used to create new summary variables. For instance, we can take the maximum of birth year to find the birth year of the most recently born politician.
australian_politicians %>%
summarise(youngest_politicians_birth_year = max(birthYear, na.rm = TRUE))
## # A tibble: 1 x 1
## youngest_politicians_birth_year
## <dbl>
## 1 1994
And we can check that using arrange().
australian_politicians %>%
arrange(-birthYear) %>%
select(uniqueID, surname, allOtherNames, birthYear) %>%
slice(1:3)
## # A tibble: 3 x 4
## uniqueID surname allOtherNames birthYear
## <chr> <chr> <chr> <dbl>
## 1 SteeleJohn1994 Steele-John Jordon Alexander 1994
## 2 Chandler1990 Chandler Claire 1990
## 3 Roy1990 Roy Wyatt Beau 1990
The summarise() function is particularly powerful in conjunction with group_by(). For instance, let’s look at the year of birth of the youngest politician by gender.
australian_politicians %>%
group_by(gender) %>%
summarise(youngest_politician_birth_year = max(birthYear, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## gender youngest_politician_birth_year
## <chr> <dbl>
## 1 female 1990
## 2 male 1994
Let’s look at the mean age at death, measured in days, by gender.
australian_politicians %>%
mutate(days_lived = deathDate - birthDate) %>%
filter(!is.na(days_lived)) %>%
group_by(gender) %>%
summarise(mean_days_lived = round(mean(days_lived), 2)) %>%
arrange(-mean_days_lived)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## gender mean_days_lived
## <chr> <drtn>
## 1 female 28857.30 days
## 2 male 27372.89 days
We can use group_by() with more than one grouping variable. For instance, let’s look again at the average number of days lived, this time by gender and by whether the person was prime minister.
australian_politicians %>%
mutate(days_lived = deathDate - birthDate) %>%
filter(!is.na(days_lived)) %>%
group_by(gender, wasPrimeMinister) %>%
summarise(mean_days_lived = round(mean(days_lived), 2)) %>%
arrange(-mean_days_lived)
## `summarise()` regrouping output by 'gender' (override with `.groups` argument)
## # A tibble: 3 x 3
## # Groups: gender [2]
## gender wasPrimeMinister mean_days_lived
## <chr> <dbl> <drtn>
## 1 female NA 28857.30 days
## 2 male 1 28446.61 days
## 3 male NA 27345.20 days
3.4.8 Counting
We can use the function count() to create counts by groups. For instance, the number of politicians by gender.
## # A tibble: 2 x 2
## # Groups: gender [2]
## gender n
## <chr> <int>
## 1 female 236
## 2 male 1540
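The code for this count is not shown. The `# Groups:` line in the output suggests it was a group_by() followed by count(), so the sketch below assumes that pattern and uses hypothetical toy data (toy_politicians and its values are made up).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the real dataset
toy_politicians <- tibble(
  surname = c("A", "B", "C", "D", "E"),
  gender = c("female", "male", "male", "female", "male")
)

# Group first, then count the rows in each group (result stays grouped)
counts_by_gender <-
  toy_politicians %>%
  group_by(gender) %>%
  count()

# A shortcut that gives the same counts as an ungrouped tibble
counts_shortcut <-
  toy_politicians %>%
  count(gender)

counts_by_gender
```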
3.4.9 Proportions
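Finally, calculating proportions is often a combination of summarise() and mutate() (and group_by()).
Let’s calculate the proportion of politicians of each gender.
Note that we need to ungroup() the data before mutating. Otherwise the total would be computed within each gender group and every proportion would be 1.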
## # A tibble: 2 x 3
## gender n prop
## <chr> <int> <dbl>
## 1 female 236 0.133
## 2 male 1540 0.867
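The code behind this table is not shown, so here is a minimal sketch of one way to get proportions, using hypothetical toy data (toy_politicians and its values are made up). It also shows what happens if the ungroup() step is left out.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the real dataset
toy_politicians <- tibble(
  surname = c("A", "B", "C", "D"),
  gender = c("female", "male", "male", "male")
)

# Count within gender, drop the grouping, then divide by the overall total
proportions_by_gender <-
  toy_politicians %>%
  group_by(gender) %>%
  count() %>%
  ungroup() %>%
  mutate(prop = n / sum(n))

# Without ungroup(), sum(n) is taken within each group, so prop is always 1
still_grouped <-
  toy_politicians %>%
  group_by(gender) %>%
  count() %>%
  mutate(prop = n / sum(n))

proportions_by_gender
still_grouped
```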
3.5 Base
3.5.1 Class
A class is the broader type of object that something is. For instance, your class is probably ‘human’, which is itself an ‘animal’. Similarly, if we create a number in R we can use class() to work out its class, which in this case will be numeric.
## [1] "numeric"
Or we could make it a character.
## [1] "character"
Finally, we can often coerce classes to be something else.
## [1] "character"
There are many ways for your code to not run, but an issue with classes is almost always the first thing to check.
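The calls that produced the three outputs above are not shown. The following sketch would produce the same three classes; the object names are hypothetical. It also shows a coercion that does not work cleanly, which is a common source of class-related problems.

```r
# A number is numeric by default
my_number <- 8
class(my_number)

# Anything in quotes is a character
my_name <- "rohan"
class(my_name)

# Coerce the number to a character
my_number_as_character <- as.character(my_number)
class(my_number_as_character)

# Coercion can fail: text that is not a number becomes NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("eight"))
is.na(not_a_number)
```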
3.5.2 Simulating data
Simulating data is a key skill for statistics. We will use the following functions all the time: rnorm(), sample(), and runif(). Arguably the most important function is set.seed(), which we need because while we want our data to be random, we want it to be repeatable.
Let’s get 10 observations from the standard normal.
set.seed(853)
number_of_observations <- 10
simulated_data <- tibble(person = c(1:number_of_observations),
observation = rnorm(number_of_observations,
mean = 0,
sd = 1)
)
Then let’s add 10 draws from the uniform distribution between 0 and 10.
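The line that adds these draws is not shown. A minimal sketch is below. It rebuilds the tibble so that it runs on its own, and it assumes the new column is called another_observation, which is the name that appears in the joined output later in this chapter.

```r
library(tibble)

set.seed(853)

number_of_observations <- 10

# Rebuild the tibble from above so that this sketch is self-contained
simulated_data <- tibble(person = 1:number_of_observations,
                         observation = rnorm(number_of_observations,
                                             mean = 0,
                                             sd = 1))

# Ten draws from the uniform distribution between 0 and 10
simulated_data$another_observation <- runif(n = number_of_observations,
                                            min = 0,
                                            max = 10)

simulated_data
```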
Finally, let’s use sample(), which allows us to pick from a list of items, to add a favourite colour to each observation.
simulated_data$fav_colour <- sample(x = c("blue", " white "),
size = number_of_observations,
replace = TRUE)
We set the option replace to TRUE because we are only choosing between two items, but we want ten outcomes. Depending on the simulation you should think about whether you need it TRUE or FALSE. Also, there is another useful option to adjust the probability with which each item is drawn. In particular, the default is that all items are equally likely, but perhaps we might like to have 10 per cent blue with 90 per cent white. The way to do this is to set the option prob. As always with functions, you can find more in the help with ?sample.
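As a small sketch of the prob option, the following draws 1,000 colours with a 10 per cent chance of blue and a 90 per cent chance of white. The object name weighted_colours and the sample size are arbitrary choices for illustration.

```r
set.seed(853)

# prob is matched to x by position: 0.1 for blue, 0.9 for white
weighted_colours <- sample(x = c("blue", "white"),
                           size = 1000,
                           replace = TRUE,
                           prob = c(0.1, 0.9))

# Share of each colour in the draws; blue should be close to 0.1
table(weighted_colours) / length(weighted_colours)
```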
3.5.3 Functions
There are a lot of functions in R, and almost any common task that you might need to do is likely already done. But you will need to write your own functions. The way to do this is to define a function and give it a name. Your function will probably have some inputs (note that these inputs can have default values). Your function will then do something with these inputs and then return something.
## [1] "rohan" "monica"
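The function that printed the output above is not shown, so the following is a sketch rather than the original. The function names and arguments are hypothetical. The first function has a default value for its only input; the second takes one required input and one input with a default, and returns a new character vector.

```r
# Hypothetical function with a default value for its input
print_names <- function(some_names = c("rohan", "monica")) {
  print(some_names)
}

# Uses the default, so it prints "rohan" "monica"
print_names()

# Hypothetical function with one required input and one default
add_greeting <- function(some_names, greeting = "Hello") {
  # paste() joins the greeting to each name and the result is returned
  paste(greeting, some_names)
}

add_greeting(c("rohan", "monica"))
add_greeting("rohan", greeting = "Hi")
```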
3.6 ggplot essentials
The ggplot2 package is the plotting package that is part of the tidyverse collection of packages.
In a similar way to piping, it works in layers. But instead of using the pipe (%>%), ggplot2 uses +.
3.6.1 Main features
There are three key aspects:
- data;
- aesthetics / mapping; and
- type.
For instance, let’s build up a histogram of age at death with increasing complexity.
We start with a grey box:
australian_politicians %>%
mutate(days_lived = as.integer(deathDate - birthDate)) %>%
filter(!is.na(days_lived)) %>%
ggplot(mapping = aes(x = days_lived))
We need to tell it what we want to plot. This is where a geom comes in.
australian_politicians %>%
mutate(days_lived = as.integer(deathDate - birthDate)) %>%
filter(!is.na(days_lived)) %>%
ggplot(mapping = aes(x = days_lived)) +
geom_histogram(binwidth = 365)
Now let’s color the bars by gender, which means adding an aesthetic.
australian_politicians %>%
mutate(days_lived = as.integer(deathDate - birthDate)) %>%
filter(!is.na(days_lived)) %>%
ggplot(mapping = aes(x = days_lived, fill = gender)) +
geom_histogram(binwidth = 365)
We can add some labels, and change the colors and the background.
australian_politicians %>%
mutate(days_lived = as.integer(deathDate - birthDate)) %>%
filter(!is.na(days_lived)) %>%
ggplot(mapping = aes(x = days_lived, fill = gender)) +
geom_histogram(binwidth = 365) +
labs(title = "Length of life of Australian politicians",
x = "Age at death (days)",
y = "Number") +
theme_classic() +
scale_fill_brewer(palette = "Set1")
I forget who said this but, ‘ggplot makes it so easy to have nicely labelled axes, there’s no real excuse not to’.
3.6.2 Facets
Facets are subplots and are invaluable because they allow you to add another variable to your plot without having to make a 3D plot.
australian_politicians %>%
mutate(days_lived = as.integer(deathDate - birthDate)) %>%
filter(!is.na(days_lived)) %>%
ggplot(mapping = aes(x = days_lived)) +
geom_histogram(binwidth = 365) +
labs(title = "Length of life of Australian politicians",
x = "Age at death (days)",
y = "Number") +
theme_classic() +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~gender)
3.7 Tidyverse II
3.7.1 Tibbles
A tibble is a data frame, but it is a data frame with some particular changes that make it easier to work with. You should read Chapter 10 of Wickham and Grolemund (2017) for more detail. The main differences are that, compared with a data frame, a tibble never converts strings to factors (base R data frames did this by default before R 4.0.0), and it prints nicely, including letting you know the class of each column.
You can make a tibble manually if you need, for instance this can be handy for simulating data, but usually we will just import data as a tibble.
people <-
tibble(names = c("rohan", "monica"),
website = c("rohanalexander.com", "monicaalexander.com"),
fav_colour = c("blue", " white "),
)
people
## # A tibble: 2 x 3
## names website fav_colour
## <chr> <chr> <chr>
## 1 rohan rohanalexander.com "blue"
## 2 monica monicaalexander.com " white "
3.7.2 Importing data
There are a variety of ways to import data. If you are dealing with CSV files then try read_csv() in the first instance. There were examples of that in earlier sections.
3.7.3 Joining data
We can join two datasets together in a variety of ways. The most common join that I use is left_join(), where I have one main dataset and I want to join another to it based on some common column names. Here we’ll join two datasets based on favourite colour.
## # A tibble: 10 x 6
## person observation another_observation fav_colour names website
## <int> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 -0.360 9.52 "blue" rohan rohanalexander.com
## 2 2 -0.0406 0.586 " white " monica monicaalexander.com
## 3 3 -1.78 2.48 "blue" rohan rohanalexander.com
## 4 4 -1.12 5.80 " white " monica monicaalexander.com
## 5 5 -1.00 5.26 "blue" rohan rohanalexander.com
## 6 6 1.78 4.09 "blue" rohan rohanalexander.com
## 7 7 -1.39 3.97 "blue" rohan rohanalexander.com
## 8 8 -0.497 2.52 " white " monica monicaalexander.com
## 9 9 -0.558 6.29 "blue" rohan rohanalexander.com
## 10 10 -0.824 8.57 "blue" rohan rohanalexander.com
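The join call itself is not shown, so here is a minimal sketch of left_join() on two hypothetical toy tibbles (toy_observations and toy_people are made-up names and values). Every row of the left-hand dataset is kept, and rows with no match in the right-hand dataset get NA in the new columns.

```r
library(dplyr)
library(tibble)

# Hypothetical main dataset
toy_observations <- tibble(
  person = 1:4,
  fav_colour = c("blue", "white", "blue", "green")
)

# Hypothetical dataset to join on, with one row per colour
toy_people <- tibble(
  names = c("rohan", "monica"),
  fav_colour = c("blue", "white")
)

# Keep all rows of toy_observations and add the matching names
joined <-
  toy_observations %>%
  left_join(toy_people, by = "fav_colour")

joined
```

Here person 4 likes green, which has no match in toy_people, so their names entry is NA.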
3.7.4 Strings
We’ve seen a string earlier, but it is an object that is created with single or double quotes. String manipulation is an entire book in itself, but you should start with the stringr package (Wickham 2019c).
I’ll just cover a few essentials: stringr::str_detect(), stringr::str_replace(), and stringr::str_squish().
## # A tibble: 2 x 3
## names website fav_colour
## <chr> <chr> <chr>
## 1 rohan rohanalexander.com "blue"
## 2 monica monicaalexander.com " white "
people <-
people %>%
mutate(is_rohan = stringr::str_detect(names, "rohan"),
make_howlett = stringr::str_replace(website, "alexander", "howlett"),
fav_colour_trim = stringr::str_squish(fav_colour)
)
head(people)
## # A tibble: 2 x 6
## names website fav_colour is_rohan make_howlett fav_colour_trim
## <chr> <chr> <chr> <lgl> <chr> <chr>
## 1 rohan rohanalexander.com "blue" TRUE rohanhowlett.com blue
## 2 monica monicaalexander.c… " white " FALSE monicahowlett.c… white
3.7.5 Pivot
Datasets tend to be either long or wide. Generally, in the tidyverse, and certainly for ggplot, we need long data. To go from one to the other you can use the pivot_longer() and pivot_wider() functions.
Let’s see an example with some data on whether red team or blue team won a competition in some year.
pivot_example_data <-
tibble(year = c(2019, 2020, 2021),
blue_team = c(1, 2, 1),
red_team = c(2, 1, 2))
head(pivot_example_data)
## # A tibble: 3 x 3
## year blue_team red_team
## <dbl> <dbl> <dbl>
## 1 2019 1 2
## 2 2020 2 1
## 3 2021 1 2
This dataset is in wide format at the moment. To get it into long format, what we’d like is to have a column that specifies the team, and then another that specifies the result. We’ll use tidyr::pivot_longer().
data_pivoted_longer <-
pivot_example_data %>%
tidyr::pivot_longer(cols = c("blue_team", "red_team"),
names_to = "team",
values_to = "position")
head(data_pivoted_longer)
## # A tibble: 6 x 3
## year team position
## <dbl> <chr> <dbl>
## 1 2019 blue_team 1
## 2 2019 red_team 2
## 3 2020 blue_team 2
## 4 2020 red_team 1
## 5 2021 blue_team 1
## 6 2021 red_team 2
Occasionally, you’ll need to go from long data to wide data. We accomplish this with tidyr::pivot_wider().
data_pivoted_wider <-
data_pivoted_longer %>%
tidyr::pivot_wider(id_cols = "year",
names_from = "team",
values_from = "position")
head(data_pivoted_wider)
## # A tibble: 3 x 3
## year blue_team red_team
## <dbl> <dbl> <dbl>
## 1 2019 1 2
## 2 2020 2 1
## 3 2021 1 2
3.7.6 Factors
A factor is a variable whose values are strings drawn from a fixed set of levels that has an inherent ordering. For instance, the days of the week have an order - Monday, Tuesday, Wednesday,… - which is not alphabetical. Factors feature prominently in base, but they often add more complication than they are worth and so the tidyverse gives them a less prominent role. Nonetheless taking advantage of factors is useful in certain circumstances, for instance when plotting the days of the week we probably want them in the usual ordering rather than in the alphabetical ordering that would result if we had them as characters. The package that we use to deal with factors is forcats (Wickham 2020a).
Sometimes you will have a character vector and you will want it ordered in a particular way. The default is that a character vector is ordered alphabetically, but you may not want that. For instance, the days of the week would look strange on a graph if they were alphabetically ordered: Friday, Monday, Saturday, Sunday, Thursday, Tuesday, Wednesday!
The way to change the ordering is to change the variable from a character to a factor. I would then use the forcats package to specify an ordering by hand. The help page is here: https://forcats.tidyverse.org/reference/fct_relevel.html.
Let’s look at a concrete example.
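The data for this example are not shown, so the following is a minimal sketch that assumes a tibble called my_data with a single character column, all_names, holding the three names that appear in the printed output in this section.

```r
library(tibble)

# Assumed example data, reconstructed from the printed output in this section
my_data <- tibble(all_names = c("Rohan", "Monica", "Edward"))

# Converting to a factor without specifying levels orders them alphabetically
levels(factor(my_data$all_names))
```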
If we plotted this then Edward would be first, because it would be alphabetical. But if instead I want to be first as I am the oldest then we could use forcats in the following way.
library(forcats) # (BTW you'll probably have to install that one)
library(tidyverse)
my_data <-
my_data %>%
mutate(all_names = factor(all_names), # Change to factor
all_names_releveled = fct_relevel(all_names, "Rohan", "Monica")) # Change the levels
# Then compare the two
my_data$all_names
## [1] Rohan Monica Edward
## Levels: Edward Monica Rohan
my_data$all_names_releveled
## [1] Rohan Monica Edward
## Levels: Rohan Monica Edward
3.7.7 Cases
If you need to write a few conditional statements then case_when() is the way to go.
Let’s start with a tibble of dates and pretend that we want to group them into ‘pre-1950’, ‘1950-2000’, and ‘2000-onwards’.
case_when_example <-
tibble(some_dates = c("1909-12-31", "1919-12-31", "1929-12-31", "1939-12-31",
"1949-12-31", "1959-12-31", "1969-12-31", "1979-12-31",
"1989-12-31", "1999-12-31", "2009-12-31")
)
case_when_example <-
case_when_example %>%
mutate(some_dates = lubridate::ymd(some_dates)
)
head(case_when_example)
## # A tibble: 6 x 1
## some_dates
## <date>
## 1 1909-12-31
## 2 1919-12-31
## 3 1929-12-31
## 4 1939-12-31
## 5 1949-12-31
## 6 1959-12-31
Now we’ll use dplyr::case_when() to group these.
case_when_example <-
case_when_example %>%
mutate(year_group =
case_when(
some_dates < lubridate::ymd("1950-01-01") ~ "pre-1950",
some_dates < lubridate::ymd("2000-01-01") ~ "1950-2000",
some_dates >= lubridate::ymd("2000-01-01") ~ "2000-onwards",
TRUE ~ "CHECK ME"
)
)
head(case_when_example)
## # A tibble: 6 x 2
## some_dates year_group
## <date> <chr>
## 1 1909-12-31 pre-1950
## 2 1919-12-31 pre-1950
## 3 1929-12-31 pre-1950
## 4 1939-12-31 pre-1950
## 5 1949-12-31 pre-1950
## 6 1959-12-31 1950-2000
We could accomplish this with a series of if_else() statements, but case_when() is just a bit cleaner. The only thing to be aware of is that statements are evaluated in order. So as soon as something matches it doesn’t continue down the list of conditions. That’s why we have that catch-all at the end - if the date doesn’t fit any of the earlier conditions then we’ve got a problem and want to know about it.
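To see why case_when() is a bit cleaner, here is a sketch that builds the same grouping both ways on a small hypothetical tibble (toy_dates and the column names are made up, and base as.Date() is used in place of lubridate::ymd() to keep the example short). The nested if_else() version has to be read from the inside out.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one date in each group
toy_dates <- tibble(
  some_dates = as.Date(c("1949-12-31", "1959-12-31", "2009-12-31"))
)

compared <-
  toy_dates %>%
  mutate(
    # Conditions are checked top to bottom; the first match wins
    with_case_when = case_when(
      some_dates < as.Date("1950-01-01") ~ "pre-1950",
      some_dates < as.Date("2000-01-01") ~ "1950-2000",
      TRUE ~ "2000-onwards"
    ),
    # The same logic as nested if_else() calls
    with_if_else = if_else(
      some_dates < as.Date("1950-01-01"),
      "pre-1950",
      if_else(some_dates < as.Date("2000-01-01"),
              "1950-2000",
              "2000-onwards")
    )
  )

compared
```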
References
Keyes, Os. 2019. “Counting the Countless.” Real Life. https://reallifemag.com/counting-the-countless/.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley. 2019c. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley. 2020a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
Wickham, Hadley. 2020b. Tidyverse. https://www.tidyverse.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019a. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. https://r4ds.had.co.nz/.
3.2 Social impact
Although the term ‘data science’ is ubiquitous in academia, industry, and even more generally, it is difficult to define. One deliberately antagonistic definition of data science is ‘[t]he inhumane reduction of humanity down to what can be counted’ (Keyes 2019). While purposefully controversial, this definition highlights one reason for the increased demand for data science and quantitative methods over the past decade—individuals and their behaviour are now at the heart of it. Many of the techniques have been around for many decades, but what makes them popular now is this human focus.
Unfortunately, even though much of the work may be focused on individuals, issues of privacy and consent, and ethical concerns more broadly, rarely seem front of mind. While there are some exceptions, in general, even at the same time as claiming that AI, machine learning, and data science are going to revolutionise society, consideration of these types of issues appears to have been largely treated as something that would be nice to have, rather than something that we may like to think of before we embrace the revolution.
For the most part, these are not new issues. In the sciences, there has been considerable recent ethical consideration around CRISPR technology and gene editing, but in an earlier time similar conversations were had, for instance, about Wernher von Braun being allowed to build rockets for the US. In medicine, of course, these concerns have been front-of-mind for some time. Data science seems determined to have its own Tuskegee syphilis experiment moment rather than think about and deal appropriately with these issues, based on the experiences of other fields, before they occur.
That said, there is some evidence that data scientists are beginning to be more concerned about the ethics surrounding the practice. For instance, NeurIPS, the most prestigious machine learning conference, now requires a statement on ethics to accompany all submissions.
Ethical considerations will be mentioned throughout these notes rather than clumped in one easily ignorable part that can be thrown away after ‘ethics week’. The purpose is not to prescriptively rule things in or out, but to provide an opportunity to raise some issues that should be front of mind. The variety of data science applications, the relative youth of the field, and the speed of change, mean that ethical considerations can sometimes be set aside when it comes to data science. This is in contrast to fields such as science, medicine, engineering, and accounting where there is a long history. Nonetheless it can be helpful to think through some ethical considerations that you may encounter in the context of a usual data science project.
Figure 3.1: Probability, from https://xkcd.com/881/.