Telling Stories With Data
Chapter 1 Introduction
These notes are being actively developed.
If you have comments or suggestions, or find mistakes, then please don’t hesitate to get in touch.
Hi, I’m Rohan Alexander. You can find out more about me here. These are notes that I wrote to support my teaching at the University of Toronto across the Faculty of Information and the Department of Statistical Sciences. The focus is on using quantitative methods to tell stories with data.
The parts of these notes are:
- Hello world
- R essentials
- Graphs, tables and text
- Hunting and gathering
- Sampling and survey essentials
- APIs, scraping, PDFs and text
- RCTs, A/B testing
- Implementing surveys
- Open data
- Cleaning and preparing
- Storage and retrieval
- Dissemination and protection
- Exploratory data analysis
- Regression essentials
- Matching and difference-in-differences
- Instrumental variables
- Regression discontinuity design
- Poll of polls
- Multilevel regression with post-stratification
1.3 On telling stories
Like many parents, when our child was born, one of the first things that my wife and I did regularly was read stories to him. In doing so we carried on a tradition that has occurred for millennia. Myths, fables, and fairy tales can be seen and heard all around us. Not only are they entertaining but they enable us to easily learn something about the world. While ‘The Very Hungry Caterpillar’ may seem quite far from the world of quantitative analysis, there are similarities. Both are trying to tell the reader a story.
When conducting quantitative analysis we are trying to tell the reader a story that will convince them of something. It may be as exciting as predicting elections, as banal as increasing internet advertising click rates by 0.01 per cent, as serious as finding the cause of some disease, or as fun as forecasting the winner of a basketball game. In any case the key elements are the same. Wikipedia suggests that fiction has five key elements: character, plot, setting, theme, and style. When we are conducting quantitative analysis we have analogous concerns:
- What is the data? Who generated it and how?
- What is the data trying to say? How can we let it say this?
- What is the broader context surrounding the data? Where and when was it generated? Could other data have been generated?
- What are we hoping others will see from this data?
- How can we convince them of this?
In the past, certain elements of telling stories with quantitative data were easier. For instance, experimental design has a long and robust tradition in fields such as the agricultural and medical sciences, physics, and chemistry. Student’s t-distribution was identified by a chemist, William Sealy Gosset, who was working at Guinness and needed to assess the quality of the beer (Raju 2005)! It would have been possible for him to randomly sample the beer and change one aspect at a time. Indeed, many of the fundamental statistical methods that we use today were developed in an agricultural setting. In the settings for which they were developed it was typically possible to establish control groups, randomize, and easily deal with any ethical concerns. In such a setting any subsequent story that is told with the resulting data is likely to be fairly convincing.
Unfortunately, such a set-up is rarely possible in modern applied statistics. On the other hand, there are many aspects that are easier today. For instance, we have well-developed statistical techniques, easier access to larger datasets, and open source statistical languages such as R. But because we can rarely conduct traditional experiments, we must turn to other aspects in order to tell a reader a convincing story about our data.
1.4 Telling stories with data
The aim of these notes is to equip you with everything you need to write short(ish), technical memos that convince a reader of the story you are telling. These notes encourage research-based, independent learning. This means that you should develop your own questions and answer them to the extent that you can. We focus on methods that can provide convincing stories even when we cannot conduct traditional experiments. Importantly, these approaches do not rely on ‘big data’, which practitioners widely recognise is not a panacea (Meng and others 2018), but instead on making better use of the data that are available. The notes blend theory and case studies to equip you with practical skills, a sophisticated workflow, and an appreciation for how more-advanced methods build on what is covered here.
Data science is multi-disciplinary. It takes the ‘best’ bits from fields such as statistics, data visualisation, programming, and experimental design (to name a few). As such, data science projects require a blend of these skills. These are hands-on notes in which you will learn these skills by conducting research projects using real-world data. This means that you will:
- obtain and clean relevant datasets;
- develop your own research questions;
- use statistical techniques to answer those questions; and
- communicate your results in a meaningful way.
These notes were developed in collaboration with professional data scientists as well as academics from a variety of fields. They are designed around approaches that are used extensively in academia, government, and industry. Furthermore, they include many aspects, such as data cleaning and communication, that are critical, but rarely taught. However, these notes do not contain everything that you need. Your learning must be ‘active’ when using these notes because that is the way you will continue to learn through the rest of your life and career. You need to seek out additional information, critically evaluate it, and apply it to your situation.
The workflow that we follow in these notes is:
- Research question development.
- Data collection.
- Data cleaning.
- Exploratory data analysis.
- Statistical modelling.
All of these aspects are critical: your ability to convince a reader of your story depends on the quality of every part of your workflow.
If we were to expand on this workflow then we roughly get the chapters that are covered in these notes, although they are re-ordered as necessary. From the first chapter we will have a workflow (make a graph then write about it convincingly) that allows us to tell a convincing (albeit likely basic) story. In each subsequent chapter we add aspects and depth to our workflow that will allow us to speak with increasing sophistication and credibility.
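To make the workflow concrete, here is a minimal sketch of the five steps in base R. It is only an illustration, not code from a later chapter: the research question, the use of the built-in `mtcars` dataset, and the names `raw_data`, `clean_data`, and `model` are all assumptions chosen so that the example runs without any extra packages.

```r
# 1. Research question (an assumed example):
#    do heavier cars tend to have lower fuel efficiency?

# 2. Data collection: use the `mtcars` dataset that ships with R.
raw_data <- mtcars

# 3. Data cleaning: keep only the variables we need and drop incomplete rows.
clean_data <- raw_data[, c("mpg", "wt")]
clean_data <- clean_data[complete.cases(clean_data), ]

# 4. Exploratory data analysis: summarise the data, then make a graph.
summary(clean_data)
plot(
  clean_data$wt, clean_data$mpg,
  xlab = "Weight (1,000 lbs)",
  ylab = "Fuel efficiency (miles per gallon)",
  main = "Fuel efficiency against weight"
)

# 5. Statistical modelling: a simple linear regression of mpg on weight.
model <- lm(mpg ~ wt, data = clean_data)
coef(model)
```

The graph and the fitted coefficients are the raw material for the story. The writing that explains what they show, and why a reader should believe it, still has to be done by you.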
This workflow also aligns nicely with the skills that are sought in data scientists. For instance, Mango Solutions, a UK data science consultancy, describes ‘the six core capabilities of data scientists’ as: 1. communicator; 2. data-wrangler; 3. programmer; 4. technologist; 5. modeller; and 6. visualiser (“Data Science Radar: How to Identify World-Class Data Science Capabilities” 2020).
These notes are also designed to enable you to build a portfolio of work that you could show to a potential employer. This is arguably the most important thing that you should be doing. E. Robinson and Nolis (2020, 55) describe a portfolio as ‘a set of data science projects that you can show to people so they can see what kind of data science work you can do’. They describe this as a ‘step [that] can really help you be successful’.
The software that we use in these notes is R (R Core Team 2020). This language was chosen because it is open source, widely used, general enough to cover the entire workflow, yet specific enough to have plenty of the tools that we need for statistical analysis built in. We do not assume that you have used R before, and so another reason for selecting R is its community of users, which is, in general, especially welcoming of newcomers and has produced a lot of great beginner-friendly materials.
If you don’t yet know a programming language, then R is a great one to start with. If you have a preferred programming language already, then it wouldn’t hurt to pick up R as well. That said, if you have a good reason to prefer another open source programming language (for instance you use Python daily at work) then you may wish to stick with that. However, all examples in these notes are in R.
Please download R and RStudio onto your own computer. You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download RStudio Desktop for free here: https://rstudio.com/products/rstudio/download/#download.
Please also create an account on RStudio Cloud: https://rstudio.cloud/. This will allow you to run R in the cloud, which will be helpful when we are getting started.
1.4.2 Assumed background
These notes assume familiarity with first-year statistics. For instance, if you have taken a course or two where you covered hypothesis testing and similar concepts then that should be enough. That said, enthusiasm and interest can take you pretty far, so if you’ve got those then don’t worry about too much else.
These notes are structured around a fairly dense 12-week course. Each chapter contains a list of required reading, as well as a list of recommended reading for those who are interested in the topic and want a starting place for further exploration. All chapters contain a summary of the key concepts and skills that are developed in that chapter. Code and technical chapters additionally contain a list of the main packages and functions that are used in the chapter. Many of the chapters also have a pre-quiz. This is a short quiz to test your knowledge, which you should complete after doing the required readings but before going through the chapter. After completing the chapter, you should go back through the lists and the pre-quiz to make sure that you understand each aspect.
There are problem sets throughout these notes. These are opportunities for you to conduct your own research on a topic that is of interest to you. The initial problem sets require you to use data from the Toronto Open Data Portal (https://open.toronto.ca/), but after those first few you are able to use any appropriate dataset. Although open-ended research may be new to you, the extent to which you are able to develop your own questions, use quantitative methods to explore them, and communicate your story to a reader is the true measure of the success of these notes.
Many people gave generously of their time, code, and data to help develop these notes.
Thank you to Kelly Lyons, Hareem Naveed, and Periklis Andritsos for helpful comments.
These notes have greatly benefited from the notes and teaching materials of others that are freely available online, especially:
- Chris Bail’s Text as Data;
- Andrew Heiss’s Program Evaluation for Public Service;
- Grant McDermott’s Data Science for Economists;
- David Mimno’s Text Mining for History and Literature; and
- Ed Rubin’s PhD Econometrics (III) and Introduction to Econometrics (II).
Thank you to the following students who identified specific improvements in these notes: Aaron Miller, Amy Farrow, Cesar Villarreal Guzman, Faria Khandaker, Hong Shi, Mounica Thanam, and Wijdan Tariq.
Finally, thank you to the Winter 2020 and 2021 INF2178 and Fall 2020 STA304 students at the University of Toronto, whose feedback greatly improved all aspects of these notes.
Any comments or suggestions on these notes would be welcomed. You can contact me: email@example.com.
“Data Science Radar: How to Identify World-Class Data Science Capabilities.” 2020. Mango Solutions. https://www.mango-solutions.com/data-science-radar-how-to-identify-world-class-data-science-capabilities/.
Meng, Xiao-Li, and others. 2018. “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726.
Raju, Tonse. 2005. William Sealy Gosset and William A. Silverman: Two "Students" of Science. Pediatrics. Vol. 116. https://doi.org/10.1542/peds.2005-1134.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data Science. https://livebook.manning.com/book/build-a-career-in-data-science?origin=product-look-inside.