Chapter 4 Workflow

Last updated: 2 March 2021.

Required reading

Recommended reading

Required viewing

Recommended viewing

Recommended activity

Fun song

Key concepts/skills/etc

  • Restart R often (Session -> Restart R and Clear Output).
  • Debugging is a skill, and you will get better at it with time and practice.
  • Start with reading the error message.
  • Check the class.
  • You may get frustrated at times; this is normal.
  • There are various tools that can help. Google is your friend.
  • Make a small example and try to get the code running on that.
  • Cultivating a tenacious mentality may help.
  • Writing code that future-you can understand.
  • Developing important questions.
  • Reproducibility and replicability.
  • The importance of data and code access.
  • The importance of version control in a modern scientific workflow.
  • The basics of Git and GitHub, as a solo data scientist.

Key GitHub workflow with commands

  • Get the latest changes: git pull.
  • Add your updates: git add -A.
  • Check on everything: git status.
  • Commit your changes: git commit -m "Short description of changes".
  • Push your changes to GitHub: git push.

Quiz

  1. What are three features of a good research question (write a paragraph or two)?
  2. What is a counterfactual (pick one)?
    1. If-then statements in which the if didn’t happen.
    2. If-then statements in which the if happened.
    3. Statements that are either true or false.
    4. Statements that are neither true nor false.
  3. How do you hide the warnings in an R Markdown chunk (pick one)?
    1. echo = FALSE
    2. include = FALSE
    3. eval = FALSE
    4. warning = FALSE
    5. message = FALSE
  4. What is a reprex and why is it important to be able to make one (select all)?
    1. A reproducible example that enables your error to be reproduced.
    2. A reproducible example that helps others help you.
    3. A reproducible example during the construction of which you may solve your own problem.
    4. A reproducible example that demonstrates you’ve actually tried to help yourself.
  5. Why are R Projects important (select all)?
    1. They help with reproducibility.
    2. They make it easier to share code.
    3. They make your workspace more organized.
    4. They are all that needs to be done.
  6. Consider this sequence: ‘git pull, git status, ________, git status, git commit -m "My message", git push.’ What is the missing step (pick one)?
    1. git add -A.
    2. git status.
    3. git pull.
    4. git push.

4.1 Introduction

Suppose you have cancer and you have to choose between a black box AI surgeon that cannot explain how it works but has a 90% cure rate and a human surgeon with an 80% cure rate. Do you want the AI surgeon to be illegal?

Geoffrey Hinton, 20 February 2020.

The number one thing to keep in mind about machine learning is that performance is evaluated on samples from one dataset, but the model is used in production on samples that may not necessarily follow the same characteristics… The finance industry has a saying for this: “past performance is no guarantee of future results.” Your model scoring X on your test dataset doesn’t mean it will perform at level X on the next N situations it encounters in the real world. The future may not be like the past.

So when asking the question, “would you rather use a model that was evaluated as 90% accurate, or a human that was evaluated as 80% accurate,” the answer depends on whether your data is typical per the evaluation process. Humans are adaptable, models are not. If significant uncertainty is involved, go with the human. They may have inferior pattern recognition capabilities (versus models trained on enormous amounts of data), but they understand what they do, they can reason about it, and they can improvise when faced with novelty.

If every possible situation is known and you want to prioritize scalability and cost-reduction, go with the model. Models exist to encode and operationalize human cognition in well-understood situations. (“well understood” meaning either that it can be explicitly described by a programmer, or that you can amass a dataset that densely samples the distribution of possible situations – which must be static)

François Chollet, 20 February 2020.

If science is about systematically building and organising knowledge in terms of testable explanations and predictions, then data science takes this and focuses on data. Fundamentally data science is still science, and as such, building and organising knowledge is a critical aspect. Being able to do something yourself, once, does not achieve this. Hence, the focus on reproducibility and replicability.

M. Alexander (2019) says ‘Research is reproducible if it can be reproduced exactly, given all the materials used in the study.’ Hence, materials need to be provided! ‘[M]aterials’ usually means data, code and software. The minimum requirement is to be able to ‘reproduce the data, methods and results (including figures, tables).’

Similarly, from Gelman (2016) you should have noticed that this has been an issue in other sciences too, e.g. psychology. The issue with not being reproducible is that we are not contributing to knowledge: we no longer have any idea what is fact in the field of psychology. (This is coming for other fields too, including the field that I was trained in - economics.) Does this matter? Yes.

Some of the examples that Gelman (2016) cites (which turned out to be dodgy), e.g. ESP or the power pose, don’t really matter. But increasingly the same methods are being applied in areas where they do matter, e.g. ‘nudge’ units. Similarly, D. Simpson (2017) makes it clear that it’s a big problem in data science.

The authors of the ‘gay face’ paper that D. Simpson (2017) writes about have not released their dataset, so we have no way of knowing what is going on with it. They have found a certain set of results based on that dataset, their methods, and what they did, but we have no way of knowing how much that matters. As D. Simpson (2017) says, ‘the paper itself does some things right. It has a detailed discussion of the limitations of the data and the method.’ You must do this in everything that you write, but it is not enough.

Without the data, we don’t know what their results speak to as we don’t understand how representative the sample is. If the dataset is biased, then that undermines their claims. There’s a reason that while initial medical trials are done on mice, etc, eventually human trials are required.

In order to do the study they needed a training dataset, which they created using Mechanical Turk. Figure 4.1 is from Mattson (2017).

Figure 4.1: Instructions for workers to do classification piecework on the Amazon Mechanical Turk platform, p. 46. From Mattson.

Mattson (2017) comments:

The problems here are legion: Barack Obama is biracial but simply “Black” by American cultural norms. “Clearly Latino” begs the question “to whom?” Latino is an ethnic category, not a racial one: many Latinos already are Caucasian, and increasingly so. By training their workers according to stereotypical American categories, WAK’s algorithm can only spit out the garbage they put in.

‘WAK’s algorithm can only spit out the garbage they put in.’ I would encourage you to print out that statement and paste it somewhere that you will see it every time you get a data science result.

What steps can we take to make our work reproducible?

  1. Ensure your entire workflow is documented. How did you get the raw data? Can you save the raw data? Will the raw data always be available? Is the raw data available to others? What steps are you taking to transform the raw data into data that can be analysed? How are you analysing the data? How are you building the report?
  2. Try to improve each time. Can you run your entire workflow again? Can ‘another person’ run your entire workflow again? Can a future you run your entire workflow again? Can a future ‘another person’ run your entire workflow again? Each of these requirements is increasingly more onerous. We are going to start with worrying about the first. The way we are going to do this is by using R Markdown.

4.2 R Markdown

4.2.1 Getting started

R Markdown is a mark-up language, similar to html or LaTeX, in contrast to a WYSIWYG language, such as Word. This means that all of the aspects are consistent; for instance, all ‘main headings’ will look the same. However, it also means that you use symbols to designate how you would like certain aspects to appear, and it is only when you compile the document that you get to see the result.

R Markdown is a variant of regular markdown that is specifically designed to allow R code chunks to be included. The advantage is that you get a ‘live’ document in which the code executes and its output is included in the final document. The disadvantage is that it can take a while for the document to compile because all of the code needs to run.

You can create a new R Markdown document within R Studio (File -> New File -> R Markdown Document). Another advantage of R Markdown is that very similar code can compile into a variety of documents, including html pages and PDFs. R Markdown also has default options set up for including a title, author, and date sections.

4.2.2 Basic commands

If you ever need a reminder of the basics of R Markdown then this is built into R Studio (Help -> Markdown Quick Reference). This provides the code for commonly needed commands:

  • Emphasis: *italic*, **bold**, _italic_, __bold__
  • Headers (these need to go on their own line with a line before and after): # Header 1, ## Header 2, ### Header 3
  • Lists:
Unordered List
* Item 1
* Item 2
    + Item 2a
    + Item 2b
Ordered List
1. Item 1
2. Item 2
3. Item 3
    + Item 3a
    + Item 3b
  • URLs: Can just include an address: http://example.com, or can include a [linked phrase](http://example.com).
  • Basic images can just be included either from the internet: ![alt text](http://example.com/logo.png) or from a local file: ![alt text](figures/img.png).

In order to create an actual document, once you have these pieces set up, click ‘Knit.’

4.2.3 R chunks

You can include R (and a bunch of other languages) code in code chunks within your R Markdown document. Then when you knit your document, the R code will run and be included in your document.

To create an R chunk, start with three backticks and then, within curly braces, tell R Markdown that this is an R chunk. Anything inside this chunk will be treated as R code and run as such. For instance:

library(tidyverse)
ggplot(data = diamonds) + 
  geom_point(aes(x = price, y = carat))

There are various evaluation options that are available in chunks. You include these by putting a comma after r and then specifying any options before the closing curly brace. Helpful options include:

  • echo = FALSE: run the code and include the output, but don’t print the code in the document.
  • include = FALSE: run the code but don’t output anything and don’t print the code in the document.
  • eval = FALSE: don’t run the code, and hence don’t include the outputs, but do print the code in the document.
  • warning = FALSE: don’t display warnings.
  • message = FALSE: don’t display messages.
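
For instance, a chunk that runs the code and shows the resulting graph, but hides the code itself and suppresses any warnings and messages, could look like this (a minimal sketch, reusing the diamonds example from above):

```{r, echo = FALSE, warning = FALSE, message = FALSE}
library(tidyverse)
ggplot(data = diamonds) + 
  geom_point(aes(x = price, y = carat))
```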

4.2.4 Abstracts and PDF outputs

In the default header, you can add a section for an abstract, so that it would look like this:

---
title: My document
author: Rohan Alexander
date: 5 January 2020
output: html_document
abstract: "This is my abstract."
---

Similarly, you can change the output from html_document to pdf_document in order to produce a PDF. This uses LaTeX in the background so you may need to install a bunch of related packages.
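
For instance, the same header as above, set up to produce a PDF instead, would be:

---
title: My document
author: Rohan Alexander
date: 5 January 2020
output: pdf_document
abstract: "This is my abstract."
---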

4.2.5 References

You can reference a bibliography by including one in the preamble and then calling it in the text when you need.

---
title: My document
author: Rohan Alexander
date: 5 January 2020
output: html_document
abstract: "This is my abstract."
bibliography: bibliography.bib
---

You need to make a separate file called bibliography.bib. In that you need an entry for the item that you want to reference. R and R packages usually provide this for you. For instance, if you run citation() then it tells you the entry to put in your bibtex file:

@Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2020},
    url = {https://www.R-project.org/},
  }

You need to create a unique key that you’ll refer to in the text. This can be anything, as long as it’s unique, but I try to use meaningful ones, so that bibtex entry could become:

@Manual{citeR,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2020},
    url = {https://www.R-project.org/},
  }

And to cite R you’d then include the following: @citeR, which would put the brackets around the year, like this: R Core Team (2020) or [@citeR], which would put the brackets around the whole thing, like this: (R Core Team 2020).

4.2.6 Cross-references

Finally, it can be useful to cross-reference figures, tables and equations. This makes it easier to refer to them in the text. To do this for a figure you refer to the name of the R chunk that creates/contains the figure. For instance, (Figure \@ref(fig:my_unique_name)) will produce: (Figure @ref(fig:my_unique_name)), as the name of the R chunk is my_unique_name (don’t forget to add fig: in front of the chunk name). Also, super annoyingly, you need to have a ‘fig.cap’ in the R chunk, so it looks something like this:

```{r my_unique_name, fig.cap="More bills of penguins", echo = TRUE}
library(palmerpenguins)
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = FALSE) +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()
```

You can do similar for tables and equations, e.g. (Table \@ref(tab:penguinhead)) will produce: (Table 4.1) (again, don’t forget to add tab: in front).

penguins %>% 
  select(species, bill_length_mm, bill_depth_mm) %>% 
  slice(1:5) %>% 
  knitr::kable(caption = "A penguin table")
Table 4.1: A penguin table

| species | bill_length_mm | bill_depth_mm |
|---------|----------------|---------------|
| Adelie  | 39.1           | 18.7          |
| Adelie  | 39.5           | 17.4          |
| Adelie  | 40.3           | 18.0          |
| Adelie  | NA             | NA            |
| Adelie  | 36.7           | 19.3          |

And finally, you can, and should, cross-reference equations also, but this time you need to add a tag (\#eq:slope) and then reference that, e.g. use Equation \@ref(eq:slope) to produce Equation (4.1).

\begin{equation}
Y = a + b X (\#eq:slope)
\end{equation}


When you are using cross-references, it’s important that your R chunks have simple labels. For instance, no underbars. In general, try to keep the names simple, and if possible avoid punctuation and just stick to letters, e.g. prefer a chunk name like penguinsplot to penguins_plot.

4.3 R projects

RStudio has the option of creating a project, which allows you to keep all the files (data, analysis, report, etc) associated with a particular project together. To create a project, click File > New Project, then select empty project, name your project, and think about where you want to save it. For example, if you are creating a project for Problem Set 2, you might call it ps2 and save it in a sub-folder called PS2 in your INF2178 folder.

Once you have created a project, a new file with the extension .Rproj will appear in that folder. As an example, download the R help folder. Whenever I work on class materials, I open the project file and work from that.

The main advantage of projects is that you don’t have to set the working directory or type the whole file path to read in a file (for example, a data file). So instead of reading a csv from "~/Documents/toronto/teaching/INF2178/data/" you can just read it in from data/.

To meet even the minimal expected level of reproducibility, you must use R projects. You must not use setwd() because that ties your work to your computer.
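
For instance, the difference looks like this (a sketch; the file path is hypothetical):

# Fragile - tied to one particular computer, so don't do this:
# setwd("~/Documents/toronto/teaching/INF2178/")
# data <- readr::read_csv("data/my_data.csv")

# Portable - works for anyone who opens the project's .Rproj file:
data <- readr::read_csv("data/my_data.csv")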

4.4 Git and GitHub

4.4.1 Introduction

Here we introduce Git and GitHub. These are tools that:

  1. enhance the reproducibility of your work by making it easier to share your code and data;
  2. make it easier to show off your work;
  3. improve your workflow by encouraging you to think systematically about your approach; and
  4. (although we won’t take advantage of this) make it easier to work in teams.

Git is a version control system. One way you might be used to doing version control is to have various versions of your files: first_go.R, first_go-fixed.R, first_go-fixed-with-mons-edits.R. But this soon becomes cumbersome. Some of you may use dates, for instance: 2020-03-12-analysis.R, 2020-03-13-analysis.R, 2020-03-14-analysis.R, etc. While this keeps a record it can be difficult to search if you need to go back - will you really remember the date of some change in a week? How about a month or a year? It gets unwieldy fairly quickly.

Instead of this, Git allows you to only have one version of the file analysis.R and it keeps a record of the changes to that file, and a snapshot of that file at a given point in time. When it takes that snapshot is determined by you. When you want Git to take a snapshot you additionally include a message, saying what changed between this snapshot and the last. In that way, there is only ever one version of the file, but the history can be more easily searched.

The issue is that Git was designed for software developers. As such, while it works, it can be a little ungainly for non-developers (Figure 4.2).

Figure 4.2: An infamous response to the launch of Dropbox in 2007, trivialising the use-case for Dropbox, and while this actually would work, it wouldn’t for most of us.

Hence, GitHub, GitLab, and various other companies offer easier-to-use services that build on Git. GitHub used to be the weapon of choice, but it was sold to Microsoft in 2018 and since then other variants such as GitLab have risen in popularity. We will introduce GitHub here because it remains the most popular and is built into RStudio; however, you should feel free to explore other options.

One of the hardest aspects of Git, and the rest, for me was the terminology. Folders are called ‘repos.’ Saving is called a ‘commit.’ You’ll get used to it eventually, but just so you know - it’s not you, it’s Git - feeling confused is entirely normal.

These are brief notes and you should refer to Jenny Bryan’s book for further detail. Frankly, I can’t improve on Bryan’s book, and I use it regularly myself.

4.4.2 Git

Check if you have Git installed by opening R Studio, going to the Terminal, typing the following, and then pressing return.

git --version

If you get a version number, then you are done (Figure 4.3).

Figure 4.3: How to access the Terminal within R Studio

If you have a Mac then Git should come pre-installed, if you have Windows then there’s a chance, and if you have Linux then you probably don’t need this guide. If you don’t get a version number, then you need to install it. Please go to Chapter 5 of Jenny Bryan (2020) for some instructions based on your operating system.

After you have Git, then you need to tell it your username and email. You need this because Git adds this information whenever you take a ‘snapshot,’ or to use Git’s language whenever you make a commit.

Again, within the Terminal, type the following, replacing the details with yours, and press return after each line.

git config --global user.name 'Jane Doe'
git config --global user.email 'jane@example.com'
git config --global --list

The details that you enter here will be public (there are various ways to hide your email address if you need to do this and GitHub provides instructions about this).

Again, if you have issues or need more detailed instructions please go to Chapter 7 of Jenny Bryan (2020).

4.4.3 GitHub

The first step is to create an account on GitHub (https://github.com) (Figure 4.4).

Figure 4.4: Sign up screen at GitHub

GitHub doesn’t have the most intuitive user experience in the world, but we are now going to make a new folder (which is called a ‘repo’ in Git). You are looking for a plus sign in the top right, and then select ‘New Repository’ (Figure 4.5).

Figure 4.5: Create a new repository

At this point you can add a sensible name for your repo. Leave it as public (you can delete it later if you want). And check the box to initialize with a readme. In the ‘Add .gitignore’ option you can leave it for now, but if you start using GitHub more regularly then you may like to select the R option here. (That just tells Git to ignore various files.) After that, just click the button to create a new repository (Figure 4.6).

Figure 4.6: Create a new repository, really

You’ll now be taken to a screen that is fairly empty, but the details that you need are behind the green ‘Clone or Download’ button. Click it, and then click the clipboard icon to copy the repository’s URL (Figure 4.7).

Figure 4.7: Get the details of your new repository

Now you need to open Terminal, and use cd to get to where you want to save the folder, then:

git clone https://github.com/RohanAlexander/test.git

At this point, a new folder has been created. We can now interact with it.

The first step is almost always to pull the latest changes with git pull (this is slightly pointless in this example because it’s just us, but it’s a good habit to get into). We can then make a change to the folder, for instance, update the readme, and then save it as usual. Once this is done, we need to add, commit, and push. As before, use cd to navigate to your folder, then git status to see if there is anything going on (you should see some reference to the change you made). Then git add -A adds the changes to the staging area (this seems pointless here, but in other contexts it allows you to stage only specific files). Then git status to check what has happened. Then git commit -m "Minor update to readme", then git status to check on everything, and finally git push.

To summarise (assuming you are in the relevant folder):

git pull
git status
git add -A
git status
git commit -m "Short commit message"
git status
git push

4.4.4 Using Git within RStudio

I promised that GitHub was built into RStudio, but so far we’ve not really taken advantage of that. The way to do it is to create a new repo in GitHub, and copy the information, as before.

At this point, you open RStudio, select File, New Project, Version Control, Git, and paste the information for the repo. Go through the rest of it, saving the folder somewhere sensible, and clicking ‘Open in new session.’ This will then create a new folder on your computer which will be a Git folder that is linked to the GitHub repo that you created.

At this point, you’ll have a ‘Git’ tab (Figure 4.8).

Figure 4.8: The Git pane in R Studio

First pull (click the blue down arrow). Now you want to tick the ‘staged’ box against the files that you want to commit. Then click ‘Commit.’ Type a message in the ‘Commit message’ box and then click ‘Commit.’ Finally ‘Push.’ Again, the details are in Jenny Bryan (2020), especially Chapter 12.

4.4.5 Next steps

We haven’t really taken advantage of GitHub’s features in terms of teams or branches. That is certainly something that you should get into once you have more confidence with these basics. As always, when it comes to Git for data scientists who use R, you should go to the relevant sections of Jenny Bryan (2020).

4.5 Using R in practice

4.5.1 Introduction

This section covers what to do when your code doesn’t do what you want, discusses a mindset that may help when doing quantitative analysis with R, and finally offers some recommendations around how to write your code.

4.5.2 Getting help

Programming is hard and everyone struggles sometimes. At some point your code won’t run or will throw an error. This is normal, and it happens to everyone. It happens to me on a daily, sometimes hourly, basis. Everyone gets frustrated. There are a few steps that are worthwhile taking when this happens:

  • Sometimes the error messages in R are useful. Read them carefully and see if there’s anything of use in them. At the very least, if you get the same message in the future, hopefully you might remember how you solved the problem this time!
  • If you’re getting an error then try googling it (I find it can help to include the term ‘R’ or ‘tidyverse’ or the relevant package name).
  • If there’s a particular function that seems to be giving trouble, have a look at the help file for it. Sometimes you might be putting in the arguments in the wrong order. You can do this with ‘?function’ e.g. for help with select, you would type ‘?select’ and then run that line.
  • Check the class of the object. Sometimes R is a little fussy and converting the class can help.
  • If your code just isn’t running, then try searching for what you are trying to do, e.g. ‘save PDF of graph in R made using ggplot.’ Almost always there are relevant blog posts or Stack Overflow answers that will help.
  • Try to restart R and R Studio and load everything again.
  • Try to restart your computer.

There are a few small mistakes that I often make and may be worth checking in case you make them too:

  • check the class, e.g. class(my_dataset$its_column), to make sure it is what it should be;
  • when you’re using ggplot make sure you use ‘+’ not ‘%>%’; and
  • check whether you are using ‘.’ when you shouldn’t be, or vice versa.
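
For instance, a quick check and fix might look like this (a sketch; the dataset and column names are hypothetical):

# Check the class of a column that is misbehaving
class(my_dataset$its_column)

# If numbers were read in as character, convert them
my_dataset$its_column <- as.numeric(my_dataset$its_column)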

It’s almost always helpful to take a break and come back the next day.

Asking for help is a skill that you will get better at. Try not to say ‘this doesn’t work,’ or ‘I tried everything’ or ‘here’s the error message, what do I do?’ In general, it’s not possible to help based on that because there are too many possibilities. Instead:

  1. Provide a small example of your data, and code, and detail what is going wrong.
  2. Document what you have tried so far - what Stack Overflow pages have you looked at and why are they not quite what you’re after? What RStudio Community pages have you tried?
  3. Be clear about the outcome that you would like.

As the RStudio Community welcome page says, ‘your job is to make it as easy as possible for others to help you.’ To enable that to happen you need to ‘create a minimal reproducible example, or reprex for short.’ You can get more information about this here, but basically what is needed is that your code is reproducible (so include things like library(), etc) and that it is minimal - that means making a very simple smaller example and reproducing the error on that small example.

Usually doing this actually allows you to solve your own problem. If it doesn’t, then it’ll give someone else a fighting chance of being able to help you. I especially recommend the reprex package (Jennifer Bryan et al. 2019).
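
For instance, a minimal sketch of using the reprex package: wrap a small, self-contained example in reprex() and it renders the code and its output, ready to paste into Stack Overflow or RStudio Community.

library(reprex)

reprex({
  library(tidyverse)
  # A small, self-contained example that reproduces the behaviour
  mtcars %>% 
    summarise(mean_mpg = mean(mpg))
})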

There’s almost no chance that you’ve got a problem that someone hasn’t addressed before, it’s just a matter of finding the answer! Try to be tenacious with this and learn how to solve your own problems.

4.5.3 Mentality

(Y)ou are a real, valid, competent user and programmer no matter what IDE you develop in or what tools you use to make your work work for you

(L)et’s break down the gates, there’s enough room for everyone

Sharla Gelfand, 10 March 2020.

I’m a little hesitant to make suggestions with regard to mentality. If you write code, then you’re a coder, regardless of how you do it, what you’re using it for, or who you are. But I want to share a few traits that I have found useful to cultivate in myself. That said, whatever works for you is great, so take or leave this section.

  • Focused: I’ve found that having an aim to ‘learn R’ or something similar tends to be problematic, because there’s no real end point to that. Instead I would recommend smaller, more specific goals, such as ‘make a histogram about the 2019 Canadian Election with ggplot.’ That is something that you can focus on and achieve. With more nebulous goals it becomes easier to get lost on tangents, much more difficult to get help, and I’ve noticed that people who have nebulous goals seem to give up.
  • Curious: I’ve found that it’s useful to just have a go. In general, the worst that happens is that you waste your time and have to give up. You can rarely break something irreparably with code. If you want to know what happens if you pass a ‘vector’ instead of a ‘dataframe’ to ggplot then just try it.
  • Pragmatic: At the same time, I’ve found that it’s best to try to stick within the bounds of what I know and just make one small change each time. For instance, suppose you want to do some regression and are curious about the tidymodels package (Kuhn and Wickham 2020) instead of lm(). Perhaps you could just use one aspect of the tidymodels package initially and then make another change next time. In my opinion, ugly code that gets the job done is better than beautiful code that is never finished.
  • Tenacious: This is a balancing act. I always find there are unexpected problems and issues with every project. On the one hand, persevering despite these is a good tendency. But on the other hand, I’ve learned that sometimes I need to be prepared to give up on something if it doesn’t seem like a break-through is possible. This is where I have found that mentors can be useful, as they tend to have a better sense of which problems are worth persevering with. This is also where planning comes in.
  • Planned: I have found it is very useful to plan out what you are going to do. For instance, you may want to make a histogram of the 2019 Canadian Election. I find it useful to plan the steps that are needed and even to sketch out how I might implement each step. For instance, the first step is to get the data. What packages might be useful? Where might the data be? What is our back-up plan for if we can’t find the data in that initial spot?
  • Done is better than perfect: We all have various perfectionist tendencies to a certain extent, but I recommend that you try to turn them off to a certain extent when it comes to R. In the first instance just try to write code that works, especially in the early days. You can always come back and improve aspects of it. But it is important to actually ship.

4.5.4 Code comments

Comment your code.

There is no one way to write code, especially in R. However, there are some general guidelines that will make it easier for you even if you’re just working on your own.

Comment your code.

Comments in R can be added by including the # symbol. The shortcut in RStudio on a Mac is ‘Command + Shift + c.’ You don’t have to put a comment at the start of the line; it can be midway through. In general, you don’t need to comment what every aspect of your code is doing, but you should comment parts that are not obvious. For instance, if you read in some value then you may like to comment where it is coming from.

Comment your code.

You should comment why you are doing something. What are you trying to achieve?

Comment your code.

You must comment to explain weird things. Like if you’re removing some specific row, say row 27, then why are you removing that row? It may seem obvious in the moment, but future-you in six months won’t remember.
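
For instance (a sketch, with a hypothetical dataset and reason):

library(tidyverse)

# Remove row 27: it duplicates row 26 because of a data-entry error
my_data <- my_data %>% 
  slice(-27)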

Comment your code.

I like to break my code into sections. For instance, setting up my workspace, reading in datasets, manipulating and cleaning the dataset, analysing the datasets, and finally producing tables and figures. While it can be difficult to speak generally, I usually separate each of those certainly with comments explaining what is going on, and sometimes into separate files, depending on the length.

Comment your code.

Additionally, at the top of each file I put basic information, such as the purpose of the file, any pre-requisites or dependencies, the date, the author and contact information, and finally any red flags or todos.

Comment your code.

At the very least I recommend something like the following for every R script:

#### Preamble ####
# Purpose: Brief sentence about what this script does
# Author: Your name
# Date: The date it was written
# Contact: Add your email
# License: Think about how your code may be used
# Pre-requisites: 
# - Maybe you need some data or some other script to have been run?


#### Workspace setup ####
# Don't keep the install.packages line - just comment out if need be
# Load libraries
library(tidyverse)

# Read in the raw data. 
raw_data <- readr::read_csv("inputs/data/raw_data.csv")


#### Next section ####
...

4.5.5 Learning more

One of the great aspects of R is that there is a friendly community of people who use it. There are a variety of ways to learn about new tricks, functions, and packages.

Another great way to learn is by exchanging your code with others. Initially, just have them read it and give you feedback. But after you get a bit more confident, run each other’s code. The fastest I’ve ever improved in my R journey has been by having Monica try to run my code.

4.6 Developing research questions

Both qualitative and quantitative approaches have their place, but here we focus on quantitative approaches. (Qualitative research is important as well, and often the most interesting work has a little of both - ‘mixed methods.’) This means that we are subject to issues surrounding data quality, scales, measures, sources, etc. We are especially interested in trying to tease out causality.

Broadly there are two ways to go about research:

  1. data-first,
  2. question-first.

If you get a job somewhere, typically you will initially be data-first. This means that you will need to work out the questions that you can reasonably answer with the data available to you. After you show some promise, you may be given the latitude to explore specific questions, possibly even gathering data specifically for that purpose. Contrast this with the example of the Behavioural Insights Team (Gertler et al. 2016, 23), who got to design and then carry out experiments given the remit of the entire British government (as they were spun out of the prime minister’s office).

When deciding the questions that you can reasonably answer with the data that are available, you need to think about:

  1. Theory: Do you have a reasonable expectation that there is something causal that could be determined? Charting the stock market - maybe, but you might be better off with a haruspex, because at least that way you have something you could eat. You need a reasonable theory of how \(x\) may be affecting \(y\).
  2. Importance: There are plenty of trivial questions that you could ask, but it’s important to not waste your time. The importance of a question also helps with motivation when you are on your fourth straight week of cleaning data and de-bugging your code. It also (and this becomes important) makes it easier to get talented people to work with you, or similarly to convince people to fund you or allow you to work on this project.
  3. Availability: Can you reasonably expect to get more data about this research question in the future, or is this the extent of the data that could be gathered?
  4. Iteration: Is this something that you can come back to and run often or is this a once-off analysis?

The ‘FINER framework’ is a mnemonic device used in medicine.1 This framework reminds us to ask questions that are (Hulley 2007):

  • Feasible: Adequate number of subjects; adequate technical expertise; affordable in time and money; manageable in scope.
  • Interesting: Getting the answer intrigues investigator, peers and community.
  • Novel: Confirms, refutes or extends previous findings.
  • Ethical: Amenable to a study that institutional review board will approve.
  • Relevant: To scientific knowledge; to clinical and health policy; to future research.

Farrugia et al. (2010) build on this in terms of developing research questions and recommend ‘PICOT’:

  • Population: What specific population are you interested in?
  • Intervention: What is your investigational intervention?
  • Comparison group: What is the main alternative to compare with the intervention?
  • Outcome of interest: What do you intend to accomplish, measure, improve or affect?
  • Time: What is the appropriate follow-up time to assess outcomes?

Often time will be constrained, possibly in interesting ways, and these constraints can guide your research. If we are interested in the effect of Trump’s tweets on the stock market, then that can be done just by looking at the minutes (milliseconds?) after he tweets. But what if we are interested in the effect of a cancer drug on long-term outcomes? If the effect takes 20 years, then we either have to wait a while, or we need to look at people who were treated in 2000; but then we have selection effects and circumstances that differ from giving the drug today. Often the only reasonable thing to do is to build a statistical model, but then we need adequate sample sizes, etc.

Usually the creation of a counterfactual is crucial. We’ll discuss counterfactuals a lot more later, but briefly, a counterfactual is an if-then statement in which the ‘if’ is false. Consider the example of Humpty Dumpty from Lewis Carroll’s Through the Looking-Glass:

Figure 4.9: Humpty Dumpty example

Humpty is satisfied with what would happen if he were to fall off, even though he is similarly satisfied that this would never happen. (I won’t ruin the story for you.) The comparison group often determines your results, e.g. the relationship between VO2 and athletic outcomes depends on whether the comparison is with general outcomes or with elite athletic outcomes.

Finally, we can often dodge ethics boards in data science, especially once you leave university. Typically, ethics guides from medicine and other fields are focused on ethics boards. But we often don’t have those in data science applications. Even if your intentions are unimpeachable, I want to suggest one additional aspect to think about, and that is Bayes theorem: \[P(A|B) = \frac{P(B|A)\times P(A)}{P(B)}\] (The probability of A given B depends on the probability of B given A, the probability of A, and the probability of B.)

To see why this may be relevant, let’s go to the canonical Bayes example: There is some test for a disease that is 99 per cent accurate both ways (that is, if a person actually has the disease there is a 99 per cent chance the test says positive, and if a person does not have the disease then there is a 99 per cent chance the test says negative). Let’s say that only 0.005 of the population has the disease. If we randomly pick someone from the general population, then the chance that they have the disease is outstandingly low, and even if they test positive, the chance that they actually have the disease is only about one third: \[\frac{0.99\times0.005}{0.99\times0.005 + 0.01\times0.995} \approx 0.332\]
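
We can check this arithmetic in R (a minimal sketch of the calculation above):

# Bayes rule: P(disease | positive test)
sensitivity <- 0.99  # P(test positive | disease)
specificity <- 0.99  # P(test negative | no disease)
prevalence <- 0.005  # P(disease)

p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
sensitivity * prevalence / p_positive
# 0.3322148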

To see why this may be relevant, consider the example of Google’s AI cancer testing (Shetty and Tse 2020). Basically what they have done is to train a model that can identify breast cancer. They claim ‘greater accuracy, fewer false positives, and fewer false negatives than experts.’

I, and many others (Aschwanden 2020), would argue this is probably not where we would want these resources directed at this point. Even when perfectly healthy people get screened, the tests tend to find various things that are ‘wrong’ with them. The issue is that they’re perfectly healthy, and we’ve rarely got a good idea as to whether the aspect flagged by the test is a big deal or not.

Given low prevalence in the community, we probably don’t want widespread use of a particular testing regime that only looks at one aspect (i.e. the mammogram in this case). Bayes rule suggests that the danger caused by the unnecessary ‘treatment’ would probably outweigh the benefits. The authors of that Google blog post likely have unimpeachable ethics, but they may not understand Bayes rule.


  1. Thanks to Aaron Miller for pointing me to this.↩︎