Reproducibility


Introduction

Suppose you have cancer and you have to choose between a black box AI surgeon that cannot explain how it works but has a 90% cure rate and a human surgeon with an 80% cure rate. Do you want the AI surgeon to be illegal?

Geoffrey Hinton, 20 February 2020.

The number one thing to keep in mind about machine learning is that performance is evaluated on samples from one dataset, but the model is used in production on samples that may not necessarily follow the same characteristics…

The finance industry has a saying for this: “past performance is no guarantee of future results”. Your model scoring X on your test dataset doesn’t mean it will perform at level X on the next N situations it encounters in the real world. The future may not be like the past.

So when asking the question, “would you rather use a model that was evaluated as 90% accurate, or a human that was evaluated as 80% accurate”, the answer depends on whether your data is typical per the evaluation process. Humans are adaptable, models are not.

If significant uncertainty is involved, go with the human. They may have inferior pattern recognition capabilities (versus models trained on enormous amounts of data), but they understand what they do, they can reason about it, and they can improvise when faced with novelty.

If every possible situation is known and you want to prioritize scalability and cost-reduction, go with the model. Models exist to encode and operationalize human cognition in well-understood situations.

(“well understood” meaning either that it can be explicitly described by a programmer, or that you can amass a dataset that densely samples the distribution of possible situations – which must be static)

François Chollet, 20 February 2020.

If science is about systematically building and organising knowledge in terms of testable explanations and predictions, then data science takes this and focuses on data. Fundamentally data science is still science, and as such, building and organising knowledge is a critical aspect. Being able to do something yourself, once, does not achieve this. Hence, the focus on reproducibility and replicability.

From the readings (Alexander): ‘Research is reproducible if it can be reproduced exactly, given all the materials used in the study.’ ‘[Hence] materials need to be provided!’ Here, ‘[m]aterials’ usually means data, code, and software. The minimum requirement is to be able to ‘[r]eproduce the data, methods and results (including figures, tables)’.

Also from the readings (Gelman), you should have noticed that this has been an issue in other sciences, e.g. psychology. The issue with not being reproducible is that we are not contributing to knowledge: we no longer have any idea what is fact in the field of psychology. (This is coming for other fields too, including the field that I was trained in, economics.) Does this matter? Yes.

Some of the examples that Gelman cites (which turned out to be dodgy) don’t really matter, e.g. ESP or the power pose. But increasingly the same methods are being applied in areas where they do matter, e.g. ‘nudge’ units. Similarly, from the readings (Simpson), we can see that it’s a big problem in data science.

The authors of the ‘gay face’ paper that Simpson writes about have not released their dataset, so we have no way of knowing what is going on with it. They have found a certain set of results based on that dataset, their methods, and what they did, but we have no way of knowing how much that matters. As Simpson says, ‘the paper itself does some things right. It has a detailed discussion of the limitations of the data and the method’. You must do this in everything that you write, but it is not enough.

Without the data, we don’t know what their results speak to as we don’t understand how representative the sample is. If the dataset is biased, then that undermines their claims. There’s a reason that while initial medical trials are done on mice, etc, eventually human trials are required.

In order to do the study they needed a labelled training dataset, which they built using Amazon’s Mechanical Turk. The instructions given to the workers are shown in Figure 1.


Figure 1: Instructions for workers to do classification piecework on the Amazon Mechanical Turk platform, p. 46. From Mattson.

From Mattson:

The problems here are legion: Barack Obama is biracial but simply “Black” by American cultural norms. “Clearly Latino” begs the question “to whom?” Latino is an ethnic category, not a racial one: many Latinos already are Caucasian, and increasingly so. By training their workers according to stereotypical American categories, WAK’s algorithm can only spit out the garbage they put in.

‘WAK’s algorithm can only spit out the garbage they put in.’ I would encourage you to print out that statement and paste it somewhere that you will see it every time you get a data science result.

What steps can we take to make our work reproducible?

  1. Ensure your entire workflow is documented. How did you get the raw data? Can you save the raw data? Will the raw data always be available? Is the raw data available to others? What steps are you taking to transform the raw data into data that can be analysed? How are you analysing the data? How are you building the report?
  2. Try to improve each time. Can you run your entire workflow again? Can ‘another person’ run your entire workflow again? Can a future you run your entire workflow again? Can a future ‘another person’ run your entire workflow again? Each of these requirements is increasingly onerous. We are going to start by worrying about the first, and the way we are going to do this is by using R Markdown; a minimal sketch of what this can look like follows this list.
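
For instance, assuming the entire analysis lives in a single R Markdown file called analysis.Rmd (a hypothetical name for illustration), a future you, or a future ‘another person’, should be able to rebuild the whole report with one command in the Terminal:


# Re-run the entire workflow: knit analysis.Rmd into the finished report.
# (analysis.Rmd is a hypothetical file name.)
Rscript -e 'rmarkdown::render("analysis.Rmd")'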

Data access

From the readings:

Considering that any scientific study should be based on raw data, and that data storage space should no longer be a challenge, journals, in principle, should try to have their authors publicize raw data in a public database or journal site upon the publication of the paper to increase reproducibility of the published results and to increase public trust in science.

Tsuyoshi Miyakawa

On mistakes

Everyone makes mistakes. Own them, loudly. Fix them, quickly. Regardless of what you think about Nate Silver, that article in the recommended readings is impressive.

Git and GitHub

Introduction

In this lab we introduce Git and GitHub. These are tools that:

  1. enhance the reproducibility of your work by making it easier to share your code and data;
  2. make it easier to show off your work;
  3. improve your workflow by encouraging you to think systematically about your approach; and
  4. (although we won’t take advantage of this) make it easier to work in teams.

Git is a version control system. One way you might be used to doing version control is to have various versions of your files: first_go.R, first_go-fixed.R, first_go-fixed-with-mons-edits.R. But this soon becomes cumbersome. Some of you may use dates, for instance: 2020-03-12-analysis.R, 2020-03-13-analysis.R, 2020-03-14-analysis.R, etc. While this keeps a record, it can be difficult to search if you need to go back: will you really remember the date of some change in a week? How about in a month, or a year? It gets unwieldy fairly quickly.

Instead of this, Git allows you to have only one version of the file, analysis.R, and it keeps a record of the changes to that file, along with a snapshot of that file at given points in time. When it takes a snapshot is determined by you: when you want Git to take one, you additionally include a message saying what changed between this snapshot and the last. In that way, there is only ever one version of the file, but the history can be easily searched.
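
To make that concrete, here is a minimal sketch (the file name and message are hypothetical) of asking Git to take a snapshot of analysis.R, and of searching the history later:


# Take a snapshot of analysis.R, with a message describing what changed.
git add analysis.R
git commit -m "Add confidence intervals to the main figure"

# Later, search the history of that one file instead of your memory.
git log --oneline -- analysis.R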

The issue is that Git was designed for software developers. As such, while it works, it can be a little ungainly for non-developers (Figure 2).


Figure 2: An infamous response to the launch of Dropbox in 2007, trivialising the use-case for Dropbox.

Hence, GitHub, GitLab, and various other companies offer easier-to-use services that build on Git. GitHub used to be the weapon of choice, but it was sold to Microsoft in 2018 and since then other variants such as GitLab have risen in popularity. We will introduce GitHub here because it remains the most popular and is built into RStudio; however, you should feel free to explore other options.

One of the hardest aspects of Git, and the rest, for me was the terminology. Folders are called ‘repos’ (short for repositories). Saving is called ‘making a commit’.

These are brief notes and you should refer to Jenny Bryan’s book for further detail. Frankly, I can’t improve on Bryan’s book, and I use it regularly myself.

Git

Check whether you have Git installed by opening RStudio, going to the Terminal, typing the following, and pressing return.


git --version

If you get a version number, then you are done (Figure 3).


Figure 3: How to access the Terminal within RStudio

If you have a Mac then Git should come pre-installed, if you have Windows then there’s a chance, and if you have Linux then you probably don’t need this guide. If you don’t get a version number, then you need to install it. Please go to Jenny Bryan’s book for instructions based on your operating system: https://happygitwithr.com/install-git.html.
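
If you prefer the command line, the usual routes are something like the following. This is a sketch and assumes you have the relevant package manager (Homebrew on macOS, winget on Windows); Bryan’s book remains the authoritative guide.


# macOS: trigger Apple’s command line tools installer (includes Git)...
xcode-select --install
# ...or, if you use Homebrew:
brew install git

# Windows (PowerShell), if you use the winget package manager:
winget install --id Git.Git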

After you have Git, you need to tell it your username and email. You need this because Git adds this information whenever you take a ‘snapshot’, or, to use Git’s language, whenever you make a commit.

Again, within the Terminal, type the following, replacing the details with yours, and then return after each line.


git config --global user.name 'Jane Doe'
git config --global user.email 'jane@example.com'
git config --global --list

The details that you enter here will be public (there are various ways to hide your email address if you need to; GitHub provides instructions about this).
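
For instance, GitHub offers a ‘noreply’ address you can use instead of your real one; the exact address is shown in your GitHub email settings, and the one below is a made-up example of the general form.


# Use GitHub’s noreply address so your real email stays private.
# (12345678+janedoe is a made-up example; copy yours from GitHub’s settings.)
git config --global user.email '12345678+janedoe@users.noreply.github.com'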

Again, if you have issues or need more detailed instructions please go to the relevant section of Jenny Bryan’s book: https://happygitwithr.com/hello-git.html.

GitHub

The first step is to create an account on GitHub (https://github.com) (Figure 4).


Figure 4: Sign up screen at GitHub

GitHub doesn’t have the most intuitive user experience in the world, but we are now going to make a new folder (which is called a ‘repo’ in Git). You are looking for a plus sign in the top right, and then select ‘New Repository’ (Figure 5).


Figure 5: Create a new repository

At this point you can add a sensible name for your repo. Leave it as public (you can delete it later if you want), and check the box to initialize with a readme. You can leave the ‘Add .gitignore’ option alone for now, but if you start using GitHub more regularly then you may like to select the R option here (that just tells Git to ignore various files; a sketch of what it ignores follows the figure below). After that, just click the button to create a new repository (Figure 6).


Figure 6: Create a new repository, really
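
As an aside on the .gitignore option above: if you do pick the R template, the generated file contains entries along these lines (an abridged sketch based on GitHub’s R template), telling Git to skip the session files that R and RStudio create automatically.


# Session files that R and RStudio create automatically.
.Rhistory
.RData
.Rproj.user/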

You’ll now be taken to a screen that is fairly empty, but the details that you need are behind the green ‘Clone or Download’ button: click it, then click the clipboard icon to copy the repository’s URL (Figure 7).


Figure 7: Get the details of your new repository

Now you need to open Terminal, and use cd to get to where you want to save the folder, then:


git clone https://github.com/RohanAlexander/test.git

At this point, a new folder has been created. We can now interact with it.

The first step is almost always to pull the latest changes with git pull (this is slightly pointless in this example because it’s just us, but it’s a good habit to get into). We can then make a change to the folder, for instance update the readme, and save it as usual. Once this is done, we need to add, commit, and push. As before, use cd to navigate to your folder, then git status to see if there is anything going on (you should see some reference to the change you made). Then git add -A adds the changes to the staging area (this seems pointless, and in this context it is, but it allows you to stage particular files if needed). Then git status to check what has happened. Then git commit -m "Minor update to readme", then git status to check on everything, and finally git push.

To summarise (assuming you are in the relevant folder):


git pull
git status
git add -A
git status
git commit -m "Short commit message"
git status
git push
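
If you do want to stage only some of your changes, rather than everything with -A, you can name files explicitly. A small sketch, with hypothetical file names:


# Stage only the files you actually changed, then commit them together.
git add analysis.R README.md
git commit -m "Update analysis and document the data source"
git push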

Using Git within RStudio

I promised that GitHub was built into RStudio, but so far we’ve not really taken advantage of that. The way to do so is to create a new repo on GitHub and copy its details, as before. Then open RStudio, select File, New Project, Version Control, Git, and paste in the details of the repo. Go through the rest of it, saving the folder somewhere sensible and clicking ‘Open in new session’. This will create a new folder on your computer: a Git folder that is linked to the GitHub repo that you created.

There are more details in Jenny Bryan’s book: https://happygitwithr.com/rstudio-git-github.html.

Next steps

We haven’t really taken advantage of GitHub’s features in terms of teams or branches. That is certainly something that you should get into once you have more confidence with these basics. As always, when it comes to Git for data scientists who use R, you should go to the relevant sections of Jenny Bryan’s book.
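
When you do get there, a minimal branch workflow looks something like the following sketch (the branch name is hypothetical):


# Create and switch to a new branch for the change you are working on.
git checkout -b improve-figures

# Work as usual: edit, add, commit.
git add -A
git commit -m "Improve figure labels"

# Publish the branch to GitHub so that others (or a future you) can see it.
git push -u origin improve-figures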