12 Storing and retrieving data

STATUS: Under construction.

Required reading

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions/etc

12.1 Introduction

After you’ve put together a dataset, an important part of being responsible is storing it appropriately and making it easy to retrieve. Entire careers are built on the storage and retrieval of data, and it is possible to go a long way down this path, but the baseline is not onerous. If you can get the dataset off your own computer then you are half-way there! Confirming that someone else can retrieve it and use it puts you further ahead than most.

That said, the FAIR principles are a useful way to be more formal about data management. They say that a dataset should be (Wilkinson et al. 2016):

  1. Findable. This means that there is a single, unchanging identifier for the dataset and the dataset has high-quality descriptions and explanations.
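  2. Accessible. This means that the data and their metadata can be retrieved by their identifier using a standardized, open protocol.
  3. Interoperable. This means that the data and their metadata are described using a formal, shared, and broadly applicable language and vocabulary.
  4. Reusable. This means that the dataset is richly described, has a clear usage license, and comes with detailed provenance.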
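
As a rough illustration of the Findable principle, and of confirming that someone else has retrieved the same file that you stored, the following Python sketch writes a small metadata record next to a dataset and later checks the file against it. The field names, the placeholder identifier, and the file names are all hypothetical and chosen for illustration only. They are not part of the FAIR principles or of any particular repository's schema.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(data_path: Path, identifier: str, description: str) -> Path:
    """Write a JSON metadata record beside the dataset (hypothetical fields)."""
    record = {
        "identifier": identifier,  # assumed to stay fixed once published
        "description": description,
        "file_name": data_path.name,
        "sha256": sha256_of_file(data_path),
    }
    metadata_path = data_path.parent / (data_path.name + ".metadata.json")
    metadata_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return metadata_path


def matches_metadata(data_path: Path, metadata_path: Path) -> bool:
    """Check that a retrieved file has the checksum recorded in its metadata."""
    record = json.loads(metadata_path.read_text(encoding="utf-8"))
    return sha256_of_file(data_path) == record["sha256"]


# Small demonstration with a throwaway file and a made-up identifier.
with tempfile.TemporaryDirectory() as folder:
    data_path = Path(folder) / "example.csv"
    data_path.write_text("id,value\n1,10\n2,20\n", encoding="utf-8")
    metadata_path = write_metadata(
        data_path, "example-identifier-0001", "A two-row example dataset."
    )
    unchanged = matches_metadata(data_path, metadata_path)

    # Simulate the file being altered after the record was written.
    data_path.write_text("id,value\n1,10\n2,99\n", encoding="utf-8")
    changed = matches_metadata(data_path, metadata_path)

print(unchanged, changed)
```

A checksum only shows that the bytes are the same. Whether someone else can actually use the dataset still depends on the descriptions and explanations that accompany it.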

12.2 Plan

Michener (2015)

Information Science and libraries

Hart et al. (2016)

12.3 R Packages for data

12.4 Documentation

Datasheets (Gebru et al. 2020) are an increasingly important part of data science. A datasheet is essentially a nutrition label for a dataset. Creating one prompts us to think more carefully about what we will feed our model. More importantly, it enables others to better understand what we fed our model. For instance, researchers recently went back and wrote a datasheet for one of the most popular datasets in computer science, and found that around 30 per cent of the data were duplicated (Bandy and Vincent 2021).

Instead of telling you how unhealthy various foods are, a datasheet answers questions such as:

  • ‘Who created the dataset and on behalf of which entity?’
  • ‘Who funded the creation of the dataset?’
  • ‘Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?’
  • ‘Is any information missing from individual instances?’

If you have done a lot of work to create the dataset that you analyze, then it may make sense to publish and share it on its own. More typically, though, a datasheet lives in an appendix to the main work.
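
To make this more concrete, here is a minimal Python sketch, under stated assumptions, of two small helpers: one lays out the example questions above as a plain-text datasheet stub, and the other estimates the share of exactly duplicated rows in a dataset. The function names and the example rows are hypothetical. The definition of "duplicated" used here (a row identical to an earlier row) is only one of several possible definitions and is not necessarily the one used by Bandy and Vincent (2021).

```python
from collections import Counter

# The example questions quoted in the text; a full datasheet has many more.
DATASHEET_QUESTIONS = [
    "Who created the dataset and on behalf of which entity?",
    "Who funded the creation of the dataset?",
    "Does the dataset contain all possible instances or is it a sample "
    "(not necessarily random) of instances from a larger set?",
    "Is any information missing from individual instances?",
]


def datasheet_stub(answers):
    """Lay out each question with its answer, flagging unanswered ones."""
    lines = []
    for question in DATASHEET_QUESTIONS:
        answer = answers.get(question, "TODO: not yet answered")
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    return "\n".join(lines)


def duplicate_share(rows):
    """Return the share of rows that repeat an earlier, identical row."""
    rows = [tuple(row) for row in rows]
    if not rows:
        return 0.0
    counts = Counter(rows)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(rows)


# Made-up rows: the third row repeats the first.
example_rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3)]
share = duplicate_share(example_rows)

stub = datasheet_stub(
    {"Who funded the creation of the dataset?": "No external funding."}
)
print(f"Share of duplicated rows: {share:.2f}")
print(stub)
```

Writing the answers down in a structured way like this makes it obvious which questions remain unanswered, and the output can be pasted into an appendix.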

12.5 Exercises and tutorial

12.5.1 Exercises

  1. According to Gebru et al. (2020, 2), a datasheet should document a dataset’s (please select all that apply):
    1. composition.
    2. recommended uses.
    3. motivation.
    4. collection process.
  2. Following Wilkinson et al. (2016), which of the following are FAIR principles (please select all that apply)?
    1. Findable.
    2. Approachable.
    3. Interoperable.
    4. Reusable.
    5. Integrated.
    6. Fungible.
    7. Reduced.
    8. Accessible.

12.5.2 Tutorial

Look into how IQ tests are conducted and what goes into them. To what extent do you think they measure intelligence? Some aspects that you may like to think about in answering that question include: Who decides what is intelligence? How is this updated? What is missing from that definition? To what extent is this generalisable? You should write a page or two.