Chapter 1 Telling Stories with Data

1.1 On telling stories

Like many parents, when our children were born, one of the first things that my wife and I did regularly was read stories to them. In doing so we carried on a tradition that has occurred for millennia. Myths, fables, and fairy tales can be seen and heard all around us. Not only are they entertaining but they enable us to easily learn something about the world. While ‘The Very Hungry Caterpillar’ (Carle 1969) may seem quite far from the world of quantitative analysis, there are similarities. Both are trying to tell the reader a story.

When conducting quantitative analysis, we are trying to tell the reader a story that will convince them of something. It may be as exciting as predicting elections, as banal as increasing internet advertising click rates by 0.01 per cent, as serious as finding the cause of some disease, or as fun as forecasting the winner of a basketball game. In any case the key elements are the same. Wikipedia suggests that fiction has five key elements: character, plot, setting, theme, and style. When we are conducting quantitative analysis, we have analogous concerns:

  1. What is the data? Who generated it and how?
  2. What is the data trying to say? How can we let it say this?
  3. What is the broader context surrounding the data? Where and when was it generated? Could other data have been generated?
  4. What are we hoping others will see from this data?
  5. How can we convince them of this?

In the past, certain elements of telling stories with quantitative data were easier. For instance, experimental design has a long and robust tradition within traditional applications such as agricultural and medical sciences, physics, and chemistry. Student’s t-distribution was identified by a chemist, William Sealy Gosset, who was working at Guinness and needed to assess the quality of the beer (Raju 2005)! It would have been possible for him to randomly sample the beer and change one aspect at a time. Indeed, many of the fundamental statistical methods that we use today were developed in an agricultural setting. In the settings for which they were developed it was typically possible to establish control groups, randomize, and easily deal with any ethical concerns. In such a setting any subsequent story that is told with the resulting data is likely to be fairly convincing.

Unfortunately, such a set-up is rarely possible in modern applications of statistics. On the other hand, there are many aspects that are easier today. For instance, we have well-developed statistical techniques, easier access to larger datasets, and open-source statistical languages such as R (R Core Team 2020). But because we usually cannot conduct traditional experiments, we must turn to other aspects to tell a reader a convincing story about our data. It is these other aspects that allow us to be convincing even in the absence of a traditional experimental set-up.

1.2 Telling stories with data

The aim of this book is to equip you with everything you need to write short(ish) quantitative papers that convince a reader of the story you are telling. This book encourages research-based, independent learning. This means that you should develop your own questions and answer them to the extent that you can. I focus on methods that can provide convincing stories even when it is not possible to conduct traditional experiments. Importantly, these approaches do not rely on ‘big data’, which is widely known not to be a panacea (Meng 2018), but instead on better using the data that are available. The book blends theory, practice, and case studies to equip you with practical skills, a sophisticated workflow, and an appreciation for how more-advanced methods build on what is covered here.

What has become known as ‘data science’ is multi-disciplinary. It takes the ‘best’ bits from fields such as statistics, data visualisation, programming, and experimental design (to name a few). As such, data science projects require a blend of these skills. This is a hands-on book in which you will learn these skills by conducting research projects using real-world data. This means that you will:

  • obtain and clean relevant datasets;
  • develop your own research questions;
  • use statistical techniques to answer those questions; and
  • communicate your results in a meaningful way.

But you have to do the work. As King (2000) says ‘[a]mateurs sit and wait for inspiration, the rest of us just get up and go to work.’ Do not just passively read this book. My role is best described by Hamming (1996, 2–3):

I am, as it were, only a coach. I cannot run the mile for you; at best I can discuss styles and criticize yours. You know you must run the mile if the athletics course is to be of benefit to you—hence you must think carefully about what you hear and read in this book if it is to be effective in changing you—which must obviously be the purpose of any course. Again, you will get out of this course only as much as you put in, and if you put in little effort beyond sitting in the class or reading the book, then it is simply a waste of your time. You must also mull things over, compare what I say with your own experiences, talk with others, and make some of the points part of your way of doing things.

This book was developed in collaboration with statisticians, data scientists, computer scientists, sociologists, political scientists, economists, and information professionals. It is designed around approaches that are used extensively in academia, government, and industry. Furthermore, it includes many aspects, such as data cleaning, ethics, and communication, that are critical, but rarely taught. However, this book does not contain everything that you need. Your learning must be ‘active’ when using this book because that is the way you will continue to learn through the rest of your life and career. You need to seek out additional information, critically evaluate it, and apply it to your situation.

The key elements that we cover in this book are:

  1. Communication.
  2. Ethics.
  3. Reproducibility.
  4. Research question development.
  5. Data collection.
  6. Data cleaning.
  7. Data protection and dissemination.
  8. Exploratory data analysis.
  9. Statistical modelling.
  10. Scaling.

These aspects are critical to being able to convince a reader of your story. Your ability to convince them of your story depends on the quality that you bring to each.

If we were to expand on these elements then we would roughly get the chapters of this book, although they are re-ordered as necessary, and communication, ethics, and reproducibility are built in throughout. From the first chapter we will have a workflow (make a graph, then write about it convincingly) that allows us to tell a convincing (albeit likely basic) story. In each subsequent chapter we add aspects and depth to our workflow that will allow us to speak with increasing sophistication and credibility.

This workflow also aligns nicely with the skills that are sought in data scientists. For instance, Mango Solutions, a UK data science consultancy, describes ‘the six core capabilities of data scientists’ as: 1. communicator; 2. data-wrangler; 3. programmer; 4. technologist; 5. modeller; and 6. visualiser (“Data Science Radar: How to Identify World-Class Data Science Capabilities” 2020).

This book is also designed to enable you to build a portfolio of work that you could show to a potential employer. This is arguably the most important thing that you should be doing. Robinson and Nolis (2020, 55) describe a portfolio as ‘a set of data science projects that you can show to people so they can see what kind of data science work you can do.’ They describe this as a ‘step [that] can really help you be successful.’

1.3 How do our worlds become data?

1.4 What is data and how should we use it to learn about the world?

Exercises and tutorial

Exercises

TBD

Tutorial

The purpose of this tutorial is to clarify in your mind the difficulty of measurement in the real world, even of simple things, and hence the likelihood of measurement issues in complicated areas. Please obtain some seeds for a fast-growing plant. Options such as radishes, mustard greens, and arugula are great choices. Plant the seeds. Measure how much soil you used. Each day take a note of any changes, and record any measurements that you can.
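One way to keep the daily record is as a small, tidy log file. What follows is only a minimal sketch in Python, not a prescribed method: the field names (pot_id, height_mm, instrument, notes) and the function names are hypothetical choices made for illustration, and you should adapt them to whatever you find you can actually measure. The one deliberate design choice is that a measurement you could not take is stored as missing rather than as zero, because deciding what counts as 'nothing to measure' is itself part of the measurement problem.

```python
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path
from typing import List, Optional


@dataclass
class Observation:
    """One day's record for one pot. All field names are hypothetical."""
    day: str                    # ISO date of the observation, e.g. "2024-05-01"
    pot_id: str                 # label for the pot or tray that was observed
    height_mm: Optional[float]  # None if nothing could be measured that day
    instrument: str             # what was used to measure, e.g. "ruler"
    notes: str = ""             # free-text description of any changes


def append_observation(path: Path, obs: Observation) -> None:
    """Append one observation to a CSV file, writing a header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(Observation)])
        if is_new:
            writer.writeheader()
        row = asdict(obs)
        # Store a missing measurement as an empty cell, not as 0.
        if row["height_mm"] is None:
            row["height_mm"] = ""
        writer.writerow(row)


def read_observations(path: Path) -> List[Observation]:
    """Read the log back, turning empty height cells back into None."""
    with path.open(newline="", encoding="utf-8") as f:
        return [
            Observation(
                day=row["day"],
                pot_id=row["pot_id"],
                height_mm=float(row["height_mm"]) if row["height_mm"] else None,
                instrument=row["instrument"],
                notes=row["notes"],
            )
            for row in csv.DictReader(f)
        ]
```

Each day you would call append_observation once per pot, and read_observations when you want to look back over the record. Noting the instrument alongside each value makes it easier to notice later when a change in the numbers reflects a change in how you measured rather than a change in the plant.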

References

Carle, Eric. 1969. The Very Hungry Caterpillar. World Publishing Company.
“Data Science Radar: How to Identify World-Class Data Science Capabilities.” 2020. Mango Solutions. https://www.mango-solutions.com/data-science-radar-how-to-identify-world-class-data-science-capabilities/.
Hamming, Richard W. 1996. The Art of Doing Science and Engineering. Stripe Press.
King, Stephen. 2000. On Writing: A Memoir of the Craft. Scribner.
Meng, Xiao-Li. 2018. “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Raju, Tonse. 2005. “William Sealy Gosset and William A. Silverman: Two ‘Students’ of Science.” Pediatrics 116. https://doi.org/10.1542/peds.2005-1134.
Robinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data Science. https://livebook.manning.com/book/build-a-career-in-data-science?origin=product-look-inside.