Chapter 1 Introduction

Last updated: 3 May 2021.

This book is being actively developed.

If you have comments or suggestions, or find mistakes, then please don’t hesitate to get in touch.

Version: 0.0.0.9000.

1.1 Welcome

Hi, I’m Rohan Alexander. You can find out more about me here. I wrote this book to support my teaching at the University of Toronto across the Faculty of Information and the Department of Statistical Sciences. The focus is on using quantitative methods to tell stories with data.

1.2 Structure

The parts of this book are:

  • Essentials
    • Introduction
    • Drinking from a fire hose
    • R Essentials
    • Workflow
  • Communicate
    • Static communication
    • Interactive communication
  • Hunt and gather
    • Gather data
    • Hunt data
    • Farm data
  • Clean
    • Cleaning and preparing
    • Storage and retrieval
    • Dissemination and protection
  • Model
    • Exploratory data analysis
    • It’s just a linear model
    • Causality from observational data
    • Multilevel regression with post-stratification
    • Text as data
  • Other
    • Cloud
    • Deploy
  • Assessment

1.3 On telling stories

Like many parents, when our child was born, one of the first things that my wife and I did regularly was read stories to him. In doing so we carried on a tradition that has occurred for millennia. Myths, fables, and fairy tales can be seen and heard all around us. Not only are they entertaining but they enable us to easily learn something about the world. While ‘The Very Hungry Caterpillar’ may seem quite far from the world of quantitative analysis, there are similarities. Both are trying to tell the reader a story.

When conducting quantitative analysis, we are trying to tell the reader a story that will convince them of something. It may be as exciting as predicting elections, as banal as increasing internet advertising click rates by 0.01 per cent, as serious as finding the cause of some disease, or as fun as forecasting the winner of a basketball game. In any case the key elements are the same. Wikipedia suggests that fiction has five key elements: character, plot, setting, theme, and style. When we are conducting quantitative analysis, we have analogous concerns:

  1. What is the data? Who generated it and how?
  2. What is the data trying to say? How can we let it say this?
  3. What is the broader context surrounding the data? Where and when was it generated? Could other data have been generated?
  4. What are we hoping others will see from this data?
  5. How can we convince them of this?

In the past, certain elements of telling stories with quantitative data were easier. For instance, experimental design has a long and robust tradition within traditional applications such as agricultural and medical sciences, physics, and chemistry. Student’s t-distribution was identified by a chemist, William Sealy Gosset, who was working at Guinness and needed to assess the quality of the beer (Raju 2005)! It would have been possible for him to randomly sample the beer and change one aspect at a time. Indeed, many of the fundamental statistical methods that we use today were developed in an agricultural setting. In the settings for which they were developed it was typically possible to establish control groups, randomize, and easily deal with any ethical concerns. In such a setting any subsequent story that is told with the resulting data is likely to be fairly convincing.

Unfortunately, such a set-up is rarely possible in modern applied statistics. On the other hand, many aspects are easier today. For instance, we have well-developed statistical techniques, easier access to larger datasets, and open-source statistical languages such as R. But because we often cannot conduct traditional experiments, we must turn to other aspects in order to tell a reader a convincing story about our data.

1.4 Telling stories with data

The aim of this book is to equip you with everything you need to write short(ish) technical memos that convince a reader of the story you are telling. This book encourages research-based, independent learning. This means that you should develop your own questions and answer them to the extent that you can. I focus on methods that can provide convincing stories even when it is not possible to conduct traditional experiments. Importantly, these approaches do not rely on ‘big data’, which practitioners widely recognise is not a panacea (Meng 2018), but instead on better using the data that are available. The book blends theory and case studies to equip you with practical skills, a sophisticated workflow, and an appreciation for how more-advanced methods build on what is covered here.

Data science is multi-disciplinary. It takes the ‘best’ bits from fields such as statistics, data visualisation, programming, and experimental design (to name a few). As such, data science projects require a blend of these skills. This is a hands-on book in which you will learn these skills by conducting research projects using real-world data. This means that you will:

  • obtain and clean relevant datasets;
  • develop your own research questions;
  • use statistical techniques to answer those questions; and
  • communicate your results in a meaningful way.

In the words of Hamming (1996, 2–3):

I am, as it were, only a coach. I cannot run the mile for you; at best I can discuss styles and criticize yours. You know you must run the mile if the athletics course is to be of benefit to you—hence you must think carefully about what you hear and read in this book if it is to be effective in changing you—which must obviously be the purpose of any course. Again, you will get out of this course only as much as you put in, and if you put in little effort beyond sitting in the class or reading the book, then it is simply a waste of your time. You must also mull things over, compare what I say with your own experiences, talk with others, and make some of the points part of your way of doing things.

Since the subject matter is “style,” I will use the comparison with teaching painting. Having learned the fundamentals of painting, you then study under a master you accept as being a great painter; but you know you must forge your own style out of the elements of various earlier painters plus your native abilities. You must also adapt your style to fit the future, since merely copying the past will not be enough if you aspire to future greatness—a matter I assume, and will talk about often in the book. I will show you my style as best I can, but, again, you must take those elements of it which seem to fit you, and you must finally create your own style. Either you will be a leader or a follower, and my goal is for you to be a leader. You cannot adopt every trait I discuss in what I have observed in myself and others; you must select and adapt, and make them your own if the course is to be effective.

Richard W. Hamming.

This book was developed in collaboration with professional data scientists as well as academics from a variety of fields. It is designed around approaches that are used extensively in academia, government, and industry. Furthermore, it includes many aspects, such as data cleaning and communication, that are critical but rarely taught. However, this book does not contain everything that you need. Your learning must be ‘active’ when using this book because that is the way you will continue to learn through the rest of your life and career. You need to seek out additional information, critically evaluate it, and apply it to your situation.

The workflow that we follow in this book is:

  1. Research question development.
  2. Data collection.
  3. Data cleaning.
  4. Exploratory data analysis.
  5. Statistical modelling.
  6. Evaluation.
  7. Communication.
  8. Reproduction.

All of these aspects are critical to being able to convince a reader of your story. Your ability to convince them of your story depends on the quality of all aspects of your workflow.

If we were to expand on this workflow then we roughly get the chapters that are covered in this book, although they are re-ordered as necessary. From the first chapter we will have a workflow (make a graph then write about it convincingly) that allows us to tell a convincing (albeit likely basic) story. In each subsequent chapter we add aspects and depth to our workflow that will allow us to speak with increasing sophistication and credibility.
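As a rough illustration of how these steps fit together, here is a minimal sketch in R. It assumes the `cars` dataset that ships with base R, and the research question is a hypothetical one chosen only for illustration. It is not an example taken from later chapters.

```r
# Research question (hypothetical): do faster cars take longer to stop?

# Data collection: `cars` ships with base R (speed in mph, distance in feet).
stopping <- datasets::cars

# Data cleaning: keep complete rows only (a no-op here, but a useful habit).
stopping <- stopping[complete.cases(stopping), ]

# Exploratory data analysis: summary statistics for each variable.
print(summary(stopping))

# Statistical modelling: a simple linear regression of distance on speed.
model <- lm(dist ~ speed, data = stopping)

# Evaluation: look at the estimated change in distance per unit of speed.
slope <- coef(model)[["speed"]]
print(slope)

# Communication: a graph with labelled axes and the fitted line.
plot(
  stopping$speed, stopping$dist,
  xlab = "Speed (miles per hour)",
  ylab = "Stopping distance (feet)",
  main = "Stopping distance and speed"
)
abline(model)
```

Because the script uses only a built-in dataset and runs from top to bottom without manual steps, anyone with R installed should be able to reproduce the graph, which is the final step of the workflow. The writing that accompanies the graph still has to be done by you.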

This workflow also aligns nicely with the skills that are sought in data scientists. For instance, Mango Solutions, a UK data science consultancy, describes ‘the six core capabilities of data scientists’ as: 1. communicator; 2. data-wrangler; 3. programmer; 4. technologist; 5. modeller; and 6. visualiser (“Data Science Radar: How to Identify World-Class Data Science Capabilities” 2020).

This book is also designed to enable you to build a portfolio of work that you could show to a potential employer. This is arguably the most important thing that you should be doing. E. Robinson and Nolis (2020, 55) describe a portfolio as ‘a set of data science projects that you can show to people so they can see what kind of data science work you can do.’ They describe this as a ‘step [that] can really help you be successful.’

1.4.1 Software

The software that we use in this book is R (R Core Team 2020). This language was chosen because it is open-source, widely used, general enough to cover the entire workflow, yet specific enough to have plenty of the tools that we need for statistical analysis built in. We do not assume that you have used R before. Another reason for selecting R is therefore its community of users, which is, in general, especially welcoming of newcomers, and the many great beginner-friendly materials that are available.

If you don’t already know a programming language, then R is a great one to start with. If you have a preferred programming language already, then it wouldn’t hurt to pick up R as well. That said, if you have a good reason to prefer another open-source programming language (for instance you use Python daily at work) then you may wish to stick with that. However, all examples in this book are in R.

Please download R and RStudio onto your own computer. You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download RStudio Desktop for free here: https://rstudio.com/products/rstudio/download/#download.

Please also create an account on RStudio Cloud: https://rstudio.cloud/. This will allow you to run R in the cloud, which will be helpful when we are getting started.
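Once R is installed, or once you have a cloud session open, one quick way to check that everything works is to run a couple of lines in the console. This is only a suggested check, and the variable name `check` is arbitrary.

```r
# Print the installed R version; any version string here means R is running.
print(R.version.string)

# A small calculation to confirm that code executes.
check <- mean(c(1, 2, 3))
print(check)
```

If both lines print a result without an error then you are ready to start.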

1.4.2 Assumed background

This book assumes familiarity with first-year statistics. For instance, if you have taken a course or two where you covered hypothesis testing and similar concepts then that should be enough. That said, enthusiasm and interest can take you pretty far, so if you’ve got those then don’t worry about too much else.

1.4.3 Structure

This book is structured around a fairly dense 12-week course. Each chapter contains a list of required reading, as well as a list of recommended reading for those who are interested in the topic and want a starting place for further exploration. All chapters contain a summary of the key concepts and skills that are developed in that chapter. Code and technical chapters additionally contain a list of the main packages and functions that are used in the chapter. Many of the chapters also have a pre-quiz. This is a short quiz to test your knowledge, which you should complete after doing the required readings but before going through the chapter. After completing the chapter, you should go back through the lists and the pre-quiz to make sure that you understand each aspect.

There are problem sets contained at the end of this book, which roughly correspond with the parts. These are opportunities for you to conduct your own research on a topic that is of interest to you. Although the initial problem set requires you to use data from the Toronto Open Data Portal (https://open.toronto.ca/), after that you are able to use any appropriate dataset. Although open-ended research may be new to you, the extent to which you are able to develop your own questions, use quantitative methods to explore them, and communicate your story to a reader, is the true measure of the success of this book.

1.5 Acknowledgements

Many people gave generously of their time, code, and data to help develop this book.

Thank you to Monica Alexander, Michael Chong, and Sharla Gelfand for allowing their code to be used.

Thank you to Kelly Lyons, Hareem Naveed, and Periklis Andritsos for helpful comments.

Thank you to Greg Wilson for providing a structure to think about teaching.

Thank you to Elle Côtè for enabling this book to be written.

This book has greatly benefited from the notes and teaching materials of others that are freely available online, especially:

Thank you to the following students who identified specific improvements in this book: A Mahfouz, Aaron Miller, Amy Farrow, Cesar Villarreal Guzman, Faria Khandaker, Flavia López, Hong Shi, Laura Cline, Lorena Almaraz De La Garza, Mounica Thanam, Reem Alasadi, Wijdan Tariq, and Yang Wu.

Finally, thank you to the Winter 2020 and 2021 INF2178 and Fall 2020 Term STA304 students at the University of Toronto, whose feedback greatly improved all aspects of this book.

1.6 Contact

Any comments or suggestions would be welcomed. You can contact me: .