Chapter 10 Cleaning and preparing data

Last updated: 16 May 2021.

Required reading

  • Data Feminism, Chapter 5.
  • R for Data Science, Chapter 9.

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions/etc


  1. With regard to Jordan (2019) and D’Ignazio and Klein (2020, Chapter 6), to what extent do you think we should let the data speak for themselves? [Please write a page or two.]

10.1 Introduction

In earlier chapters we’ve done data cleaning and preparation, but in this chapter we get into the weeds. For a long time, data cleaning and preparation were largely overlooked. We now realise that was a mistake: it has become difficult to fully trust many results in disciplines that apply statistics. The reproducibility crisis, which started in psychology but has now extended to many other fields in the physical and social sciences, has brought to light issues such as p-value ‘hacking,’ researcher degrees of freedom, file-drawer issues, and even data and results fabrication (Gelman and Loken 2013). Steps are now being put in place to address these issues. However, there has been relatively little focus on the data gathering, cleaning, and preparation aspects of applied statistics, despite evidence that decisions made during these steps greatly affect statistical results (Huntington-Klein et al. 2020). In this chapter we focus on these aspects.

While the statistical practices that underpin data science are themselves correct and robust when applied to ‘fake’ datasets created in a simulated environment, data science is typically not conducted with these types of datasets. For instance, data scientists are interested in ‘…the kind of messy, unfiltered, and possibly unclean data—tainted by heteroskedasticity, complex dependence and missingness patterns—that until recently were avoided in polite conversations between more traditional statisticians…’ (Craiu 2019). Big data does not resolve this issue, and may even exacerbate it, for instance ‘without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves’ (Meng 2018). It is important to note that the issues that are found in much applied statistics research are not necessarily associated with researcher quality, or their biases (Silberzahn et al. 2018). Instead, they are a result of the environment within which data science is conducted. This chapter aims to give you the tools to explicitly think about this work.

Gelman and Vehtari (2020), writing about the most important statistical ideas of the past 50 years, say that:

…each of them was not so much a method for solving an existing problem, as an opening to new ways of thinking about statistics and new ways of data analysis. To put it another way, each of these ideas was a codification, bringing inside the tent an approach that had been considered more a matter of taste or philosophy than statistics.

We see the focus on data cleaning and preparation in this chapter as analogous, insofar as it represents a codification, or bringing inside the tent, of aspects that are typically, and incorrectly, considered matters of taste rather than statistics.

The approach that I recommend you follow is:

  1. Plan the end state.
  2. Execute that plan on a tiny sample.
  3. Write tests and documentation.
  4. Iterate the plan.
  5. Generalize the execution.
  6. Update tests and documentation.
You will need to use all your skills to this point to be effective, but this is the very stuff of statistical sciences! Be dogged, but sensible. The best is the enemy of the good here. It’s better to have 90 per cent of the data cleaned and prepared, and to start exploring that, before deciding whether it’s worth the effort to clean and prepare the remaining 10 per cent, because that remainder will likely take an awful lot of time and effort.

10.2 Checks and tests

When cleaning data, look for anomalies. Two useful starting points are:

  • Plot the data.
  • Look for missingness.
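These checks might look like the following in R. This is a sketch using the built-in `airquality` dataset (daily air quality measurements for New York, with known missing values):

```r
# Sketch of basic anomaly checks: plot the data and count missing values.
data("airquality")

# Plot a variable to eyeball outliers and odd distributions.
hist(airquality$Ozone, main = "Ozone", xlab = "Ozone (ppb)")

# Look for missingness: count the NAs in each column.
missing_counts <- colSums(is.na(airquality))
print(missing_counts)
```

A histogram makes implausible values and strange clumping easy to spot, while the column-wise NA counts show immediately which variables have missingness worth investigating before any modelling.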

10.3 Documentation