This book will help you tell stories with data. It establishes a foundation on which you can build and share knowledge about an aspect of the world of interest to you based on data that you observe. Telling stories in small groups around a fire played a critical role in the development of humans and society (Wiessner 2014). Today our stories, based on data, can influence millions.
In this book we will explore, prod, push, manipulate, knead, and ultimately, try to understand the implications of data. A variety of features drive the choices in this book.
The motto of the university from which I took my PhD is naturam primum cognoscere rerum or roughly ‘learn the first nature of things.’ But the original quote continues temporis aeterni quoniam, or roughly ‘for eternal time.’ We will do both of these things. I focus on tools, approaches, and workflows that enable you to establish lasting and reproducible knowledge.
When I talk of data in this book, it will typically be related to humans. Humans will be at the centre of most of our stories, and we will tell social, cultural, and economic stories. In particular, throughout this book I will draw attention to inequity both in social phenomena and in data. Most data analysis reflects the world as it is. Many of the least well-off face a double burden in this regard: not only are they disadvantaged, but the extent is more difficult to measure. Respecting those whose data are in our dataset is a primary concern, and so is thinking of those who are systematically not in our dataset.
While data are often specific to various contexts and disciplines, the approaches used to understand them tend to be similar. Data are also increasingly global with resources and opportunities available from a variety of sources. Hence, I draw on examples from many disciplines and geographies.
To become knowledge, our findings must be communicated to, understood, and trusted by other people. Scientific and economic progress can only be made by building on the work of others. And this is only possible if we can understand what they did. Similarly, if we are to create knowledge about the world, then we must enable others to understand precisely what we did, what we found, and how we went about our tasks. As such, in this book I will be particularly prescriptive about communication and reproducibility.
Improving the quality of quantitative work is an enormous challenge, yet it is the challenge of our time. Data are all around us, but there is little enduring knowledge being created. This book hopes to contribute, in some small way, to changing that.
The typical person reading this book has some familiarity with first-year statistics, for instance they have run a regression. But it is not targeted at a particular level, instead providing aspects relevant to almost any quantitative course. I have taught from this book at high school, undergraduate, and graduate levels. Everyone has unique needs, but hopefully some aspect of this book speaks to you.
This book especially complements Statistical Rethinking (McElreath 2020), R for Data Science (Wickham and Grolemund 2017), An Introduction to Statistical Learning (James et al. 2017), Causal Inference: The Mixtape (Cunningham 2021), and Building Software Together (G. Wilson 2021). For instance, after this book you may be interested in learning more about Bayesian statistics, data science, statistical learning, causal inference, and building software.
All of that said, some of the most successful students have been those with no quantitative or coding background. Enthusiasm and interest have taken folks far. If you have those, then don’t worry about too much else.
This book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modelling, and VI) Enrichment.
Part I – Foundations – begins with Chapter 1, which provides an overview of what I am trying to achieve with this book and why you should read it. Chapter 2 provides some worked examples. The intention of these is that you can experience the full workflow recommended in this book without worrying too much about the specifics of what is happening. That workflow is: plan, simulate, acquire, model, and communicate. It is normal to not follow everything in this chapter, but you should go through it, typing out and executing the code yourself. If you only have time to read one chapter of this book, then I recommend that one. Chapter 3 goes through some essential tasks in R, which is the statistical programming language used in this book. It is more of a reference chapter, and you may find yourself returning to it from time to time. And Chapter 4 introduces some key tools used in the workflow that I advocate. These are things like using the command line, R Markdown, R Projects, Git and GitHub, using R in practice, and developing research questions.
Part II – Communication – considers three types of communication: written, static, and interactive. Chapter 5 details the features that quantitative writing should have and how to go about writing a crisp, technical, paper. Static communication in Chapter 6 introduces features like graphs, tables, and maps. Interactive communication in Chapter 7 covers aspects such as websites, web applications, and maps that can be manipulated.
Part III – Acquisition – focuses on three aspects: gathering data, hunting data, and farming data. Gathering data in Chapter 8 covers things like using Application Programming Interface (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Hunting data in Chapter 9 covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys. Finally, farming data in Chapter 10 covers datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, pre-packaged datasets.
Part IV – Preparation – covers how to respectfully transform raw data into something that can be explored and shared. Chapter 11 begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then goes through specific steps to take and checks to implement. Chapter 12 focuses on methods of storing and retrieving those datasets, including the use of R packages. Chapter 13 discusses considerations and steps to take when wanting to disseminate datasets as broadly as possible, while at the same time respecting those whose data they are based on.
Part V – Modelling – begins with exploratory data analysis in Chapter 14. This is the critical process of coming to understand the nature of a dataset, but not something that typically finds itself into the final product. In Chapter 15 the use of statistical models to explore data is introduced. Chapter 16 is the first of three applications of modelling. It focuses on attempts to make causal claims from observational data and covers approaches such as difference-in-differences, regression discontinuity, and instrumental variables. Chapter 17 is the second of the modelling applications chapters and focuses on multilevel regression with post-stratification where we use a statistical model to adjust a sample for known biases. Chapter 18 is the third and final modelling application and is focused on text-as-data.
Part VI – Enrichment – introduces various next steps that would improve aspects of the workflow and approaches introduced in previous chapters. Chapter 19 which goes through moving away from your own computer and toward using the cloud. Chapter 20 discusses deploying models through the use of packages, web applications, and APIs. Chapter 21 discusses various alternatives to the storage of data including feather and SQL; and also covers some ways to improve the performance of your code. Finally, Chapter 22 offers some concluding remarks, details some open problems, and suggests some next steps.
You have to do the work. You should actively go through material and code yourself. As King (2000) says ‘[a]mateurs sit and wait for inspiration, the rest of us just get up and go to work.’ Do not passively read this book. My role is best described by Hamming (1996, 2–3):
I am, as it were, only a coach. I cannot run the mile for you; at best I can discuss styles and criticize yours. You know you must run the mile if the athletics course is to be of benefit to you—hence you must think carefully about what you hear and read in this book if it is to be effective in changing you—which must obviously be the purpose…
This book is structured around a dense 12-week course. It provides enough material for advanced readers to be challenged, while establishing a core that all readers should master. Typically courses cover the material through to Chapter 15, and then pick another couple of chapters that are of particular interest.
From as early as Chapter 2 you will have a workflow—plan, simulate, acquire, model, and communicate—allowing you to tell a convincing story with data. In each subsequent chapter you will add aspects and depth to this workflow that will allow you to speak with increasing sophistication and credibility. As this workflow expands it addresses the skills that are typically sought in industry. For instance, features such as: communication, ethics, reproducibility, research question development, data collection, data cleaning, data protection and dissemination, exploratory data analysis, statistical modelling, and scaling.
One of the defining aspects of this book is that ethics and inequity concerns are integrated throughout, rather than being clustered in one, easily ignorable, chapter. These aspects are critical, yet it can be difficult to immediately see their value, hence their tight integration.
This book is also designed to enable you to build a portfolio of work that you could show to a potential employer. If you want an industry job, then this is arguably the most important thing that you should be doing. E. Robinson and Nolis (2020, 55) describe how a portfolio is a collection of projects that show what you can do and is something that can help be successful in a job search.
[A] scholar should be able to look at any word in a passage and instantly think of another passage where it occurred; … [so a] text was like a pack of icebergs each word a snowy peak with a huge frozen mass of cross-references beneath the surface.
In an analogous way, this book not only provides you with text and instruction that is self-contained, but also helps develop critical masses of knowledge on which expertise is built. No chapter positions itself as the last word, instead they are written in relation to other work.
Each chapter has the following features:
- A list of required materials that you should go through before you read that chapter. To be clear, you should first read that material and then return to this book. Each chapter also contains recommended materials for those who are particularly interested in the topic and want a starting place for further exploration.
- A summary of the key concepts and skills that are developed in that chapter. Technical chapters additionally contain a list of the main packages and functions that are used in the chapter. The combination of these features acts as a checklist for your learning, and you should return to them after completing the chapter.
- A section called ‘Oh, you think we have good data on that!’ which focuses on a particular setting, such as cause of death, in which it is often assumed that there is unimpeachable and unambiguous data but the reality tends to be quite far from that.
- A section called ‘Shoulders of giants,’ which focuses on some of those who created the intellectual foundation on which we build.
- A series of short exercises that you should complete after going through the required materials, but before going through the chapter, to test your knowledge. After completing the chapter, you should go back through the exercises to make sure that you understand each aspect.
- One or two tutorial questions are included at the end of each chapter to further encourage you to actively engage with the material. You could consider forming small groups to discuss your answers to these questions.
Finally, a set of six papers is included in the appendix. If you write these, you will be conducting original research on a topic that is of interest to you. Although open-ended research may be new to you, the extent to which you are able to: develop your own questions, use quantitative methods to explore them, and communicates your findings, is the measure of the success of this book.
The software that I use in this book is R (R Core Team 2020). This language was chosen because it is open source, widely used, general enough to cover the entire workflow, yet specific enough to have plenty of well-developed features. I do not assume that you have used R before, and so another reason for selecting R for this book is the community of R users. The community is especially welcoming of new-comers and there is a lot of complementary beginner-friendly material available. There is an R package,
DoSSToolkit (R. Alexander et al. 2021), that contains
learnr modules (Schloerke et al. 2021). This may be useful if you are newer to R and are especially complementary to this book.
If you don’t have a programming language, then R is a great one to start with. If you have a preferred programming language already, then it wouldn’t hurt to pick up R as well. That said, if you have a good reason to prefer another open-source programming language (for instance you use Python daily at work) then you may wish to stick with that. However, all examples in this book are in R.
Please download R and R Studio onto your own computer. You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download R Studio Desktop for free here: https://rstudio.com/products/rstudio/download/#download. Please also create an account on R Studio Cloud: https://rstudio.cloud/. This will allow you to run R in the cloud.
Packages are in typewriter text, for instance,
tidyverse, while functions are also in typewriter text, but include brackets, for instance
Many people generously gave code, data, examples, guidance, opportunities, thoughts, and time, that helped develop this book.
Thank you to David Grubbs and the team at CRC Press for taking a chance on me and providing invaluable support.
Thank you to Michael Chong and Sharla Gelfand for greatly helping to shape some of the approaches I advocate. However, both do much more than that and contribute in an enormous way to the spirit of generosity that characterises the R community.
Thank you to Kelly Lyons for her support, guidance, mentorship, and friendship. Every day she demonstrates what an academic should be, and more broadly, what we should all aspire to be as a person.
Thank you to Greg Wilson for providing a structure to think about teaching, for being the catalyst for this book, and for helpful comments on drafts. Every day he provides an example of how to contribute to the intellectual community.
Thank you to an anonymous reviewer and Isabella Ghement, who thoroughly went through an early draft of this book and provided detailed feedback that improved this book.
Thank you to Hareem Naveed for helpful feedback and encouragement. Her industry experience was an invaluable resource as I grappled with questions of coverage and focus.
Thank you to my PhD supervisory panel John Tang, Martine Mariotti, Tim Hatton, and Zach Ward who gave me the freedom to explore the intellectual space that was of interest to me, the support to follow through on those interests, and the guidance to ensure that it all resulted in something tangible.
Thank you to Elle Côtè for enabling this book to be written.
This book has greatly benefited from the notes and teaching materials of others that are freely available online, especially: Chris Bail, Scott Cunningham, Andrew Heiss, Lisa Lendway, Grant McDermott, Nathan Matias, David Mimno, and Ed Rubin. Thank you to these folks. The changed norm of academics making their materials freely available online is a great one and one that I hope the free online version of this book helps contribute to.
Thank you to those students who contributed substantially to the development of this book, including: A Mahfouz, Faria Khandaker, Keli Chiu, Paul Hodgetts, and Thomas William Rosenthal. I discussed most aspects of this book with them, and while they made specific contributions, they also changed and sharpened the way that I thought about almost everything covered here. Paul additionally made the art for this book.
Thank you to those students who identified specific improvements, including: Aaron Miller, Amy Farrow, Cesar Villarreal Guzman, Flavia López, Hong Shi, Laura Cline, Lorena Almaraz De La Garza, Mounica Thanam, Reem Alasadi, Wijdan Tariq, and Yang Wu.
Finally, thank you to Monica Alexander. Without you I would not have written a book; I would not have even thought it possible. Thank you for your inestimable help with writing this book, providing the base on which it builds (remember in the library showing me many times how to get certain rows in R!), giving me the time that I needed to write when that was the only thing I could think about, encouragement when it turned out that writing a book just meant endlessly re-writing that which was perfect the day before, reading everything in this book many times, and much more.
You can contact me at: email@example.com.