R Desirables


Required reading

Required viewing

Key concepts/skills/etc

Key libraries/functions/etc

Introduction

This chapter builds on the ‘R Essentials’ chapter by covering some of the more specific functions and circumstances that we’ve seen or will see. As with that chapter, you should have a quick look at this one and note anything that doesn’t make sense, but not get too worried about it. After working through a few more case studies you should come back to this chapter and try to fill in the gaps in your knowledge.

The past ten years or so of R have been characterised by the rise of the tidyverse. This is ‘… an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.’ (???). There are three things to distinguish here: the original R language, typically referred to as ‘base’; the ‘tidyverse’, which is a collection of packages that build on top of the R language; and other packages.

Pretty much everything that you can do in the tidyverse, you can also do in base. However, as the tidyverse was built especially for modern data science, it is usually easier to use, especially when you are setting out. Similarly, pretty much everything that you can do in the tidyverse, you can also do with other packages. However, as the tidyverse is a coherent collection of packages, it is again often the easier choice when you are setting out. Eventually you will start to see cases where it makes sense to trade off the convenience and coherence of the tidyverse for some features of base or other packages. Indeed, you’ll see that at various points in these notes. For instance, the tidyverse can be slow, and so if you need to import thousands of CSVs then it can make sense to switch away from read_csv(). That is great: the appropriate use of base and non-tidyverse packages, rather than dogmatic insistence on one solution, is a sign of your development as a data scientist.


library(tidyverse)

Base desirables

Class

A class is the broader type of object that something is. For instance, your class is probably ‘human’, which is itself an ‘animal’. Similarly, if we create a number in R we can use class() to work out its class, which in this case will be numeric.


my_number <- 8
class(my_number)

[1] "numeric"

Or we could make it a character.


my_name <- "rohan"
class(my_name)

[1] "character"

Finally, we can often coerce classes to be something else.


my_number_as_character <- as.character(my_number)
class(my_number_as_character)

[1] "character"

There are many ways for your code to not run, but an issue with classes is almost always the first thing to check.
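As a minimal sketch of why class is worth checking first, consider a number that has ended up stored as a character. The object names here are made up for illustration. Arithmetic on the character version fails, and it only works once we coerce it back with as.numeric().

```r
# A number that has been stored as text (hypothetical example)
my_text_number <- "8"
class(my_text_number)

# Running my_text_number + 1 at this point would stop with an error,
# because R cannot add a number to a character.

# Coerce to numeric first, and then the arithmetic works
my_fixed_number <- as.numeric(my_text_number)
class(my_fixed_number)
my_fixed_number + 1
```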

Simulating data

Simulating data is a key skill for statistics. We will use the following functions all the time: rnorm(), sample(), and runif(). Arguably the most important function is set.seed(), which we need because while we want our data to be random, we also want it to be repeatable.

Let’s get 10 observations from the standard normal.


set.seed(853)

number_of_observations <- 10

simulated_data <- tibble(person = c(1:number_of_observations),
                         observation = rnorm(number_of_observations, 
                                             mean = 0, 
                                             sd = 1)
                         )

Then let’s add 10 draws from the uniform distribution between 0 and 10.


simulated_data$another_observation <- runif(number_of_observations, 
                                            min = 0, 
                                            max = 10)

Finally, let’s use sample(), which allows us to pick from a list of items, to add a favourite colour to each observation.


simulated_data$favourite_colour <- sample(x = c("blue", "white"), 
                                          size = number_of_observations,
                                          replace = TRUE)

We set the option replace to TRUE because we are only choosing between two items, but we want ten outcomes. Depending on the simulation you should think about whether you need it TRUE or FALSE. Also, there is another useful option to adjust the probability with which each item is drawn. In particular, the default is that both options are equally likely, but perhaps we might like to have 10 per cent blue with 90 per cent white. The way to do this is to set the option prob. As always with functions, you can find more in the help with ?sample.
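The prob option described above could be used roughly as follows. This is only a sketch: the object name weighted_colours is made up, and the probabilities are the 10/90 split mentioned in the text.

```r
set.seed(853)

# Draw ten colours where "blue" has probability 0.1 and "white" has 0.9.
# The entries of prob line up, in order, with the entries of x.
weighted_colours <- sample(x = c("blue", "white"),
                           size = 10,
                           replace = TRUE,
                           prob = c(0.1, 0.9))

# Count how many times each colour was drawn
table(weighted_colours)
```

With only ten draws the observed split will not necessarily be exactly one blue and nine white; the probabilities describe the long-run proportions.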

Functions

There are a lot of functions in R, and a function for almost any common task has likely already been written. But you will still need to write your own functions. The way to do this is to define a function and give it a name. Your function will probably have some inputs (note that these inputs can have default values). Your function will then do something with these inputs and return something.


my_function <- function(some_names) {
  print(some_names)
}

my_function(c("rohan", "monica"))

[1] "rohan"  "monica"
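To illustrate default values and returning something, here is a small sketch of a second function. The name greet_people and its greeting argument are invented for this example.

```r
# some_names is required; greeting has a default value of "hello"
greet_people <- function(some_names, greeting = "hello") {
  # paste() is vectorised, so each name gets its own greeting
  greetings <- paste(greeting, some_names)
  return(greetings)
}

# Uses the default greeting
greet_people(c("rohan", "monica"))

# Overrides the default greeting
greet_people(c("rohan", "monica"), greeting = "goodbye")
```

Because the function returns a value rather than just printing it, the result can be assigned to an object and used later.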

Tidyverse desirables

Tibbles

A tibble is a data frame, but it is a data frame with some particular changes that make it easier to work with. You should read Chapter 10 of Wickham and Grolemund (2017) for more detail. The main differences are that, compared with a data frame, a tibble never converts strings to factors (base data frames did so by default before R 4.0.0), and that it prints nicely, including letting you know the class of each column.

You can make a tibble manually if you need, for instance this can be handy for simulating data, but usually we will just import data as a tibble.


people <- 
  tibble(names = c("rohan", "monica"),
         website = c("rohanalexander.com", "monicaalexander.com"),
         favourite_colour = c("blue", "white")
         )
people

# A tibble: 2 x 3
  names  website             favourite_colour
  <chr>  <chr>               <chr>           
1 rohan  rohanalexander.com  blue            
2 monica monicaalexander.com white           

Importing data

There are a variety of ways to import data. If you are dealing with CSV files then try read_csv() in the first instance. There were examples of that in earlier sections.
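As a reminder of what that looks like, here is a minimal sketch. So that it is self-contained, it first writes a small CSV to a temporary file and then reads it back in; the object names example_file and imported_people are made up, and in practice you would pass the path to your own file.

```r
library(tidyverse)

# Write a small CSV to a temporary file so the example is self-contained
example_file <- tempfile(fileext = ".csv")
write_csv(tibble(names = c("rohan", "monica"),
                 favourite_colour = c("blue", "white")),
          example_file)

# read_csv() returns a tibble and guesses the class of each column
imported_people <- read_csv(example_file)
imported_people
```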

Joining data

We can join two datasets together in a variety of ways. The most common join that I use is left_join(), where I have one main dataset and I want to join another to it based on some common column names. Here we’ll join two datasets based on favourite colour.


both <- 
  simulated_data %>% 
  left_join(people, by = "favourite_colour")

both

# A tibble: 10 x 6
   person observation another_observat… favourite_colour names website
    <int>       <dbl>             <dbl> <chr>            <chr> <chr>  
 1      1     -0.360              9.52  blue             rohan rohana…
 2      2     -0.0406             0.586 white            moni… monica…
 3      3     -1.78               2.48  blue             rohan rohana…
 4      4     -1.12               5.80  white            moni… monica…
 5      5     -1.00               5.26  blue             rohan rohana…
 6      6      1.78               4.09  blue             rohan rohana…
 7      7     -1.39               3.97  blue             rohan rohana…
 8      8     -0.497              2.52  white            moni… monica…
 9      9     -0.558              6.29  blue             rohan rohana…
10     10     -0.824              8.57  blue             rohan rohana…
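In that example every favourite colour had a match. As a sketch of what happens when one does not, the following uses two small made-up datasets (colours_chosen and colour_owners are hypothetical names) and compares left_join() with inner_join().

```r
library(tidyverse)

colours_chosen <- tibble(person = 1:3,
                         favourite_colour = c("blue", "white", "green"))

colour_owners <- tibble(favourite_colour = c("blue", "white"),
                        names = c("rohan", "monica"))

# left_join() keeps all three rows of the main dataset;
# "green" has no match, so names is NA for that row
left_joined <- left_join(colours_chosen, colour_owners, by = "favourite_colour")
left_joined

# inner_join() keeps only the rows that have a match in both datasets
inner_joined <- inner_join(colours_chosen, colour_owners, by = "favourite_colour")
inner_joined
```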

Strings

We’ve seen strings earlier: a string is an object that is created with single or double quotes. String manipulation is an entire book in itself, but you should start with the stringr package (Wickham 2019).
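To give a flavour of stringr, here is a short sketch of a few of its functions applied to a made-up character vector. It is nowhere near a complete tour of the package.

```r
library(stringr)

some_names <- c("rohan", "monica")

str_to_title(some_names)           # capitalise the first letter of each name
str_detect(some_names, "an")       # does each name contain "an"?
str_replace(some_names, "a", "A")  # replace the first "a" in each name
str_length(some_names)             # number of characters in each name
```

The stringr functions all start with str_ and take the string as their first argument, which makes them easy to use with the pipe.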

Pivot

Datasets tend to be either long or wide. Generally, in the tidyverse, and certainly for ggplot, we need long data. To go from one to the other you can use the pivot_longer() and pivot_wider() functions.
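As a minimal sketch of going in each direction, the following starts with a small wide dataset and pivots it to long and back again. The dataset and its values are made up for illustration.

```r
library(tidyverse)

# Wide: one row per person, one column per year (made-up values)
wide_data <- tibble(names = c("rohan", "monica"),
                    `2019` = c(1, 2),
                    `2020` = c(3, 4))

# Long: one row per person-year combination
long_data <- wide_data %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "year",
               values_to = "value")
long_data

# And back to wide again
wide_again <- long_data %>%
  pivot_wider(names_from = year, values_from = value)
wide_again
```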

Factors

A factor is a vector of strings drawn from a fixed set of values, called levels, which have an ordering. For instance, the days of the week have an order - Monday, Tuesday, Wednesday,… - which is not alphabetical. Factors feature prominently in base, but they often add more complication than they are worth, and so the tidyverse gives them a less prominent role. Nonetheless, taking advantage of factors is useful in certain circumstances. For instance, when plotting the days of the week we probably want them in the usual ordering rather than in the alphabetical ordering that would result if we had them as characters. The package that we use to deal with factors is forcats (Wickham 2020).

Ordering factors

Sometimes you will have a character vector and you will want it ordered in a particular way. The default is that a character vector is ordered alphabetically, but you may not want that. For instance, the days of the week would look strange on a graph if they were alphabetically ordered: Friday, Monday, Saturday, Sunday, Thursday, Tuesday, Wednesday!

The way to change the ordering is to change the variable from a character to a factor. I would then use the forcats package to specify an ordering by hand. The help page is here: https://forcats.tidyverse.org/reference/fct_relevel.html.

Let’s look at a concrete example.


my_data <- tibble(all_names = c("Rohan", "Monica", "Edward"))

If we plotted this then Edward would be first, because the names would be in alphabetical order. But if I instead want to be first, as I am the oldest, then we could use forcats in the following way.


library(tidyverse) # forcats is part of the core tidyverse, so this loads it

my_data <-
  my_data %>%
  mutate(all_names = factor(all_names), # Change to factor
         all_names_releveled = fct_relevel(all_names, "Rohan", "Monica")) # Change the levels

# Then compare the two
my_data$all_names

[1] Rohan  Monica Edward
Levels: Edward Monica Rohan

my_data$all_names_releveled

[1] Rohan  Monica Edward
Levels: Rohan Monica Edward

Other desirables

case_when

If you need to write a few conditional statements then case_when(), which comes from dplyr, is the way to go.
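As a minimal sketch, suppose we want to turn a numeric score into a grade. The dataset, the cut-offs, and the grade labels are all made up for illustration.

```r
library(tidyverse)

scores <- tibble(person = 1:4,
                 score = c(35, 55, 75, 95))

scores <- scores %>%
  mutate(grade = case_when(
    score < 50 ~ "fail",        # conditions are checked in order...
    score < 80 ~ "pass",        # ...and the first one that is TRUE is used
    TRUE ~ "distinction"        # catch-all for everything else
  ))

scores
```

Each line has a condition on the left of the ~ and the value to use on the right. Because the conditions are checked in order, a score of 35 is a "fail" even though it is also less than 80.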

Wickham, Hadley. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

———. 2020. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. https://r4ds.had.co.nz/.