Telling Stories With Data
Chapter 10 Cleaning and preparing data
Required reading
Recommended reading
Key concepts/skills/etc
Key libraries
Key functions/etc
Quiz