Preface

The aim of this book is to help you learn how to tell stories with data. It establishes a foundation on which you can build and share knowledge, based on data, about an aspect of the world of interest to you.

In this book we explore, prod, push, manipulate, knead, and ultimately, try to understand the implications of, data. The motto of the university from which I took my PhD is ‘Naturam primum cognoscere rerum’ or roughly ‘first to learn the nature of things,’ and we will indeed attempt to do that. But the original quote continues ‘temporis aeterni quoniam,’ or roughly ‘for eternal time,’ and in this book I focus on the tools, approaches, and workflows that enable you to establish lasting knowledge.

When I talk of data in this book, it will typically be related to humans. This means that humans are at the centre of our stories, and we must keep that front of mind at all times. Respecting those whose data are contained in our datasets is a primary concern, but so is thinking of those who are systematically not in our dataset. To become knowledge, our findings must be communicated to, understood, and trusted by, others. These factors drive the choices in this book.

Improving the quality of quantitative work is an enormous challenge, yet it is the challenge of our time. Data are all around us, but there is very little knowledge being created. This book hopes to contribute, in some small way, to changing that.

Audience and assumed background

The typical audience for this book has some familiarity with first-year statistics. For instance, if you have taken a course or two covering hypothesis testing and similar concepts, then that should be enough. It is not targeted specifically at the undergraduate or graduate level; rather, it provides essentials for any level of education. I have taught aspects of it to everyone from high schoolers through to professors.

This book especially complements books such as McElreath (2020), Wickham and Grolemund (2017), James et al. (2017), and Cunningham (2021). For instance, after taking a course based on this book, or concurrently, many students would likely also be interested in Bayesian statistics, data science, statistical learning, and causal inference.

All of that said, enthusiasm and interest have taken students very far in the past. Some of the most successful students have been those with no quantitative background at all. If you have enthusiasm and interest, then don’t worry about too much else.

Structure

This book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modelling, and VI) Enrichment.

Part I – Foundations – begins with Chapter 1, which provides an overview of what we are trying to achieve with this book and why you should read it. Chapter 2 then provides some worked examples. The intention of these is that you can run the R code without worrying too much about the specifics of what is happening. It is normal to not follow everything in this chapter, but you should go through it, typing out and executing the code. Chapter 3 goes through some essential functions of R, which is the statistical programming language used in this book. It is more of a reference chapter, and you will find yourself returning to it from time to time. Finally, Chapter 4 introduces the key aspects of the workflow that you should adopt. These include using R Markdown, R Projects, and Git and GitHub; using R in practice; and developing research questions.

Part II – Communication – considers three types of communication: writing, static, and interactive. Chapter 5 details the features that quantitative writing should have and how to go about writing a paper when you start with a blank page. Static communication in Chapter 6 introduces features like graphs, tables, and maps that are fixed. Interactive communication in Chapter 7 refers to things like websites, Shiny, and maps that a reader can manipulate.

Part III – Acquisition – focuses on three aspects. Gather data in Chapter 8 covers things like APIs, scraping data, getting data from PDFs, and OCR. The idea is that the data are available, albeit not necessarily designed to be data, and that we have to go and get them. Hunt data in Chapter 9 covers aspects where more is expected of the analyst. For instance, we may conduct an experiment, run an A/B test, or do some surveys. Finally, farm data in Chapter 10 covers datasets that are explicitly provided for us to use as data, for instance, censuses and other government statistics. These are clean, pre-packaged datasets that we can just use.

Part IV – Preparation – covers how to get data from a raw form into something that can be explored and shared with others in a respectful way. Chapter 11 begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then works through specific steps to take and checks to implement. Chapter 12 focuses on methods of storing and retrieving those datasets, including the use of R packages. Chapter 13 discusses considerations and steps to take when wanting to disseminate datasets correctly, while at the same time respecting those whose data they are based on.

Part V – Modelling – begins with exploratory data analysis in Chapter 14. This is the critical process of coming to understand a dataset, but it is not something that typically makes its way into the final product. Chapter 15 introduces the use of statistical models to explore data. Chapter 16 is the first of three applications of modelling, and focuses on attempts to make causal claims from observational data. Chapter 17 is the second of the modelling applications chapters and focuses on multilevel regression with post-stratification. Chapter 18 is the third and final modelling application and focuses on models of text datasets.

Part VI – Enrichment – goes through various next steps that you could take. It begins with Chapter 19, which covers moving away from your own computer and toward using other people’s computers. Chapter 20 discusses deploying models through the use of packages, Shiny, and plumber. Chapter 21 discusses various alternatives for the storage of data, including feather and SQL. Finally, Chapter 22 offers some concluding remarks, details some open problems, and suggests some next steps.

Pedagogy

This book is structured around a fairly dense 12-week course. It provides enough material for advanced readers to push themselves and pick projects and a focus of their own, while there is a core of essential material that all readers should master. Typically, courses cover all of the material through to Chapter 15, and then pick a couple of other chapters that are of particular interest. The key is that readers actively work through the material and code themselves, rather than being passively lectured at.

In DeWitt (2000, 326), a character says of another:

[A] scholar should be able to look at any word in a passage and instantly think of another passage where it occurred; … [a] text was like a pack of icebergs each word a snowy peak with a huge frozen mass of cross-references beneath the surface.

This book not only provides readers with snowy peaks, but also develops the masses of knowledge on which to build. No chapter contains the last word; instead, each is written in relation to other work. As such, each chapter contains a list of required materials that all readers should go through before they read that chapter. To be clear, readers should first read that material and then return to this book. Each chapter also contains recommended materials for those who are particularly interested in the topic and want a starting place for further exploration.

Each chapter also contains a short quiz that readers should complete after going through the required materials, but before going through the chapter, to test their knowledge. After completing the chapter, readers should go back through the lists and the pre-quiz to make sure that they understand each aspect.

All chapters contain a summary of the key concepts and skills that are developed in that chapter. Code and technical chapters additionally contain a list of the main packages and functions that are used in the chapter.

A set of tutorial questions is included at the end of each chapter to encourage readers to actively engage with the material. Readers could consider forming small groups to discuss their answers to these questions, or writing brief answers.

Finally, a set of six papers is included in the appendix. Readers who write these papers will be conducting their own research on a topic that is of interest to them. Although open-ended research may be new to readers, the extent to which a reader is able to develop their own questions, use quantitative methods to explore them, and communicate their findings is the measure of the success of this book.

Software information and conventions

The software that we use in this book is R (R Core Team 2020). This language was chosen because it is open-source, widely used, and general enough to cover the entire workflow, yet specific enough to have plenty of the tools that we need for statistical analysis built in. We do not assume that you have used R before, and so another reason for selecting R for this book is its community of users, which is, in general, especially welcoming of newcomers, and the many great beginner-friendly materials that are available.

If you don’t know a programming language, then R is a great one to start with. If you already have a preferred programming language, then it wouldn’t hurt to pick up R as well. That said, if you have a good reason to prefer another open-source programming language (for instance, you use Python daily at work), then you may wish to stick with that. However, all the examples in this book are in R.

Please download R and RStudio onto your own computer. You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download RStudio Desktop for free here: https://rstudio.com/products/rstudio/download/#download. Please also create an account on RStudio Cloud: https://rstudio.cloud/. This will allow you to run R in the cloud, which will be helpful when we are getting started.

Packages are in typewriter text, for instance, `tidyverse`, while functions are also in typewriter text but include brackets, for instance, `dplyr::filter()`.
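For example, a snippet in that style might look like the following (a minimal sketch; the built-in `mtcars` dataset and the condition are used purely for illustration):

```r
# Load the tidyverse suite of packages, which includes dplyr
library(tidyverse)

# Use dplyr::filter() to keep only the rows of the built-in mtcars
# dataset where the car has more than six cylinders
mtcars %>%
  dplyr::filter(cyl > 6)
```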

Acknowledgments

Many people gave generously of their code, data, examples, thoughts, and time, to help develop this book.

Thank you to Michael Chong, and Sharla Gelfand for allowing their code to be used, and for helping shape some of the approaches advocated in this book. However, both do much more than that and also contribute in an enormous way to the spirit of generosity that characterises the R community.

Thank you to Kelly Lyons for her incredible support, guidance, mentorship and friendship.

Thank you to Hareem Naveed, and Periklis Andritsos for helpful comments and encouragement.

Thank you to Greg Wilson for providing a structure to think about teaching and for being the catalyst for this book.

Thank you to my supervisory panel John Tang (chair), Martine Mariotti, Tim Hatton, and Zach Ward who gave me the freedom to explore the intellectual space that was of interest to me.

Thank you to Elle Côtè for enabling this book to be written.

This book has greatly benefited from the notes and teaching materials of others that are freely available online, especially: Chris Bail’s ‘Text as Data’; Scott Cunningham’s ‘Causal Inference: The Mixtape’; Andrew Heiss’ ‘Program Evaluation for Public Service’; Lisa Lendway’s ‘Advanced Data Science in R’; Grant McDermott’s ‘Data Science for Economists’; Nathan Matias’ ‘Designing Field Experiments at Scale’; David Mimno’s ‘Text Mining for History and Literature’; and Ed Rubin’s ‘PhD Econometrics (III)’ and ‘Introduction to Econometrics (II).’ The changing norm of scholars making the entirety of their materials freely available online is a great one, and one that I hope the free online version of this book helps contribute to.

Thank you to those who identified specific improvements in this book, including: A Mahfouz, Aaron Miller, Amy Farrow, Cesar Villarreal Guzman, Faria Khandaker, Flavia López, Hong Shi, Laura Cline, Lorena Almaraz De La Garza, Mounica Thanam, Reem Alasadi, Wijdan Tariq, and Yang Wu.

More broadly, thank you to the Winter 2020 and 2021 INF2178 and Fall 2020 Term STA304 students at the University of Toronto, whose feedback greatly improved all aspects of this book.

Finally, thank you to Monica Alexander.

You can contact me at: .

Rohan Alexander
Toronto, Canada

References

Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press.
DeWitt, Helen. 2000. The Last Samurai. Talk Miramax Books.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning with Applications in R. Springer.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. https://r4ds.had.co.nz/.