Required reading

Recommended reading

Required viewing

Recommended viewing

Key concepts/skills/etc


  1. What are three features of a good research question?
  2. What is a counterfactual?
  3. Name three features of R Markdown that might be useful.
  4. How do you hide the warnings in an R Markdown R chunk?
  5. What is a ReprEx and why is it important to be able to make one?


In this section….

R Markdown

Getting started

R Markdown is a mark-up language similar to HTML or LaTeX, in contrast to a WYSIWYG tool such as Word. This means that all of the aspects are consistent; for instance, all ‘main headings’ will look the same. However, it also means that you use symbols to designate how you would like certain aspects to appear, and it is only when you compile the document that you get to see the result.

R Markdown is a variant of regular markdown that is specifically designed to allow R code chunks to be included. The advantage is that you can get a ‘live’ document in which code executes and is then printed to a document. The disadvantage is that it can take a while for the document to compile because all of the code needs to run.

You can create a new R Markdown document within RStudio (File -> New File -> R Markdown Document). Another advantage of R Markdown is that very similar code can compile into a variety of documents, including HTML pages and PDFs. R Markdown also has default options set up for including title, author, and date sections.

Basic commands

If you ever need a reminder of the basics of R Markdown, one is built into RStudio (Help -> Markdown Quick Reference). This provides the code for commonly needed commands:

Unordered List
* Item 1
* Item 2
    + Item 2a
    + Item 2b

Ordered List
1. Item 1
2. Item 2
3. Item 3
    + Item 3a
    + Item 3b

In order to create an actual document, once you have these pieces set up, click ‘Knit’.

R chunks

You can include R (and a bunch of other languages) code in code chunks within your R Markdown document. Then when you knit your document, the R code will run and be included in your document.

To create an R chunk, start with three backticks followed by r in curly braces, which tells markdown that this is an R chunk, and end the chunk with another three backticks. Anything inside this chunk will be considered R code and run as such.

library(ggplot2)  # provides ggplot() and the diamonds dataset

ggplot(data = diamonds) +
  geom_point(aes(x = price, y = carat))

There are various evaluation options that are available in chunks. You include these by putting a comma after r and then specifying any options before the closing curly brace. Helpful options include:
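As a rough sketch of how such options are used: commonly used knitr options include `echo` (print the code), `eval` (run the code), `include` (show the code and its output), `warning`, and `message`. For a single chunk they go in the chunk header, for instance `{r, warning = FALSE, message = FALSE}`. The example below instead sets defaults for every chunk from a setup chunk; the particular values chosen are illustrative assumptions rather than a recommendation from these notes.

```r
# Set default options for every chunk in the document.
# Typically placed in a setup chunk near the top of the .Rmd file.
# The requireNamespace() check only makes this runnable outside of knitting.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = TRUE,      # print the code in the knitted document
    warning = FALSE,  # hide warnings produced by the code
    message = FALSE   # hide messages, such as package start-up notes
  )
}
```

Options given in an individual chunk header override these defaults for that chunk.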

Abstracts and PDF outputs

In the default header, you can add a section for an abstract, so that it would look like this:

---
title: My document
author: Rohan Alexander
date: 25 February 2020
output: html_document
abstract: "This is my abstract."
---

Similarly, you can change the output from html_document to pdf_document in order to produce a PDF. This uses LaTeX in the background so you may need to install a bunch of related packages.

R projects

RStudio has the option of creating a project, which allows you to keep all the files (data, analysis, report, etc.) associated with a particular project together. To create a project, click File > New Project, then select an empty project, name your project, and think about where you want to save it. For example, if you are creating a project for Problem Set 2, you might call it ps2 and save it in a sub-folder called PS2 in your INF2178 folder.

Once you have created a project, a new file with the extension .Rproj will appear in that folder. As an example, download the R help folder. Whenever I work on class materials, I open the project file and work from that.

The main advantage of projects is that you don’t have to set the working directory or type the whole file path to read in a file (for example, a data file). So instead of reading a csv from "~/Documents/toronto/teaching/INF2178/data/" you can just read it in from data/.
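A minimal sketch of the idea, under some assumptions: the folder `ps2` and the file `example.csv` are hypothetical, and a temporary folder stands in for the project so that the example runs on its own. In practice, opening the project file sets the working directory for you, so only the `read.csv()` line would be needed.

```r
# Stand-in for a project folder; in practice RStudio sets the working
# directory to the project folder when you open the project file.
project_dir <- file.path(tempdir(), "ps2")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical data file, written here only so the example is self-contained.
write.csv(
  data.frame(id = 1:3, score = c(10, 20, 30)),
  file.path(project_dir, "data", "example.csv"),
  row.names = FALSE
)

# Move into the project folder, as opening the project would do,
# and restore the previous working directory afterwards.
old_dir <- setwd(project_dir)
example_data <- read.csv("data/example.csv")  # relative path, no "~/Documents/..."
setwd(old_dir)

print(example_data)
```

Because the path is relative to the project folder, the same code should keep working if the whole folder is moved or shared with someone else.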

Using R in practice


This section covers what to do when your code doesn’t do what you want, discusses a mindset that may help when doing quantitative analysis with R, and finally makes some recommendations about how to write your code.

Getting help

Programming is hard and everyone struggles sometimes. At some point your code won’t run or will throw an error. This is normal, and it happens to everyone. It happens to me on a daily, sometimes hourly, basis. Everyone gets frustrated. There are a few steps that are worthwhile taking when this happens:

There are a few small mistakes that I often make and may be worth checking in case you make them too:

It’s almost always helpful to take a break and come back the next day.

Asking for help is a skill that you will get better at, but in general I recommend:

  1. Provide an example of your data and what is going wrong.
  2. Document what you have tried so far.
  3. Document what outcome you would like.
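One way those three steps might look in practice is sketched below. The data and the problem are made up for illustration; `dput()` is a base R function that prints code to recreate an object, which makes it easy to paste a small example of your data into a question.

```r
# 1. A small, self-contained example of the data (made up for illustration).
example_data <- data.frame(
  group = c("a", "a", "b"),
  value = c("1", "2", "three"),  # stored as text, which causes the trouble
  stringsAsFactors = FALSE
)

# dput() prints code that recreates the object, ready to paste into a question.
dput(example_data)

# 2. What was tried: converting to numbers gives NA (with a warning) for "three".
converted <- suppressWarnings(as.numeric(example_data$value))
print(converted)

# 3. The outcome wanted: a numeric column, with "three" handled explicitly.
```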

Reproducible examples

A Reproducible Example, or ReprEx,


(Y)ou are a real, valid, competent user and programmer no matter what IDE you develop in or what tools you use to make your work work for you

(L)et’s break down the gates, there’s enough room for everyone

Sharla Gelfand, 10 March 2020.

I’m a little hesitant to make suggestions with regard to mentality. If you write code, then you’re a coder regardless of how you do it, what you’re using it for, or who you are. But I want to share a few traits that I have found useful to cultivate in myself. That said, whatever works for you is great, so take or leave this section.


Comment your code.

There is no one way to write code, especially in R. However there are some general guidelines that will make it easier for you even if you’re just working on your own.

Comment your code.

Comments in R can be added by including the # symbol. In RStudio, the shortcut to comment or uncomment a line on a Mac is Command + Shift + C (Command + Shift + M inserts the pipe operator instead). You don’t have to put a comment at the start of the line; it can be midway through. In general you don’t need to comment what every aspect of your code is doing, but you should comment parts that are not obvious. For instance, if you read in some value then you may like to comment where it is coming from. You should also try to comment why you are doing something.
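A short illustration of these points, using made-up values: a comment on its own line, a comment that follows code on the same line, and a comment that explains why rather than what.

```r
# Heights of the survey respondents (made-up values, in centimetres).
heights_cm <- c(160, 172, 181, NA)

mean_height <- mean(heights_cm, na.rm = TRUE)  # a comment can also follow code

# Why, not just what: one respondent did not report a height, so the missing
# value is dropped rather than treated as zero.
print(mean_height)
```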

Comment your code.

I like to break my code into sections. For instance, setting up my workspace, reading in datasets, manipulating and cleaning the datasets, analysing the datasets, and finally producing tables and figures. While it can be difficult to speak generally, I usually separate each of those, at least with comments explaining what is going on, and sometimes into separate files, depending on the length.

Comment your code.

Additionally, at the top of each file I put basic information, such as the purpose of the file, any pre-requisites or dependencies, the date, the author and contact information, and finally any red-flags, bodies, or todos.
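As one possible layout (a sketch, not a required template), a file might start with a commented preamble and then use comment headings for each section. The author, contact details, date, and todo are hypothetical placeholders, and the built-in `mtcars` dataset stands in for real data.

```r
#### Preamble ####
# Purpose: Summarise fuel efficiency in the built-in mtcars dataset
# Author: Your Name (hypothetical)
# Contact: your.name@example.com (hypothetical)
# Date: 25 February 2020 (hypothetical)
# Pre-requisites: None beyond base R
# Todos: Check whether the results change when outliers are removed

#### Workspace set-up ####
# No packages are needed for this sketch.

#### Read in data ####
cars_data <- mtcars  # built-in dataset, standing in for a file read from data/

#### Clean ####
# Keep only the columns used below.
cars_data <- cars_data[, c("mpg", "cyl")]

#### Analyse ####
# Average miles per gallon for each number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars_data, FUN = mean)

#### Tables and figures ####
print(mpg_by_cyl)
```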

Comment your code.

Learning more

One of the great aspects of R is that there is a friendly community of people who use it. There are a variety of ways that I learn about new tricks, functions, and packages including:

Another great way to learn is by exchanging your code with others. Initially, just have them read it and give you feedback about it. But after you get a bit more confident, run each other’s code. The fastest I’ve ever improved in my R journey has been by having Monica try to run my code.

Developing research questions

Both qualitative and quantitative approaches have their place, but here we focus on quantitative approaches. (Qualitative research is important as well, and often the most interesting work has a little of both - ‘mixed methods’.) This means that we are subject to issues surrounding data quality, scales, measures, sources, etc. We are especially interested in trying to tease out causality.

Broadly there are two ways to go about research: 1) data-first, 2) question-first. If you get a job somewhere, typically you will initially be data-first. This means that you will need to work out the questions that you can reasonably answer with the data available to you. After you show some promise, you may be given the latitude to explore specific questions, possibly even gathering data specifically for that purpose. Contrast this with the example of the Behavioural Insights Team (IEP, p. 23), who got to design and then carry out experiments given the remit of the entire British government (as they were spun out of the prime minister’s office).

When deciding the questions that you can reasonably answer with the data that are available, you need to think about:

  1. Theory: Do you have a reasonable expectation that there is something causal that could be determined? Charting the stock market - maybe, but you might do better with haruspicy, because at least that way you have something you could eat. You need a reasonable theory of how \(x\) may be affecting \(y\).
  2. Importance: There are plenty of trivial questions that you could ask, but it’s important to not waste your time. The importance of a question also helps with motivation when you are on your fourth straight week of cleaning data and de-bugging your code. It also (and this becomes important) makes it easier to get talented people to work with you, or similarly to convince people to fund you or allow you to work on this project.
  3. Availability: Can you reasonably expect to get more data about this research question in the future, or is this the extent of the data that could be gathered?
  4. Iteration: Is this something that you can come back to and run often or is this a once-off analysis?

Aaron Miller points to the ‘FINER framework’ as a mnemonic device used in medicine. This framework reminds us to ask questions that are (quoting from Hulley S, Cummings S, Browner W, et al. Designing clinical research. 3rd ed. Philadelphia (PA): Lippincott Williams and Wilkins; 2007):

Farrugia P, Petrisor BA, Farrokhyar F, Bhandari M. (2010) build on this in terms of developing research questions and recommend PICOT (quoting from Farrugia, et al.):

I want to follow up on a couple of aspects here:

Time: Often this will be constrained, possibly in interesting ways. If we are interested in the effect of Trump’s tweets on the stock market, then that can be done just by looking at the minutes (milliseconds?) after he tweets. But what if we are interested in the effect of a cancer drug on long-term outcomes? If the effect takes 20 years then we either have to wait a while, or we need to look at people who were treated in 2000, but then we have selection effects and different circumstances compared with giving the drug today. Often the only reasonable thing to do is to build a statistical model, but then we need adequate sample sizes, etc.

Comparison: The creation of a counterfactual is crucial. A counterfactual is an if-then statement in which the if is false. Consider the example of Humpty Dumpty from Lewis Carroll’s Through the Looking-Glass:

Figure 1: Humpty Dumpty example

Humpty is satisfied with what would happen if he were to fall off, even though he is similarly satisfied that this would never happen. (I won’t ruin the story for you.) The comparison group often determines your results, e.g. the relationship between VO2 and athletic outcomes, cf. elite athletic outcomes.

Ethics: Often these guides are focused on ethics boards etc. But we often don’t have those in data science applications. Even if your intentions are unimpeachable, I want to suggest one additional aspect to think about, and that is Bayes’ theorem: \[P(A|B) = \frac{P(B|A)\times P(A)}{P(B)}\]

The probability of A given B depends on the probability of B given A, the probability of A, and the probability of B. To see why, let’s go to the canonical Bayes example: There is some test for a disease that is 99 per cent accurate both ways (that is, if a person actually has the disease there is a 99 per cent chance the test says positive, and if a person does not have the disease then there is a 99 per cent chance the test says negative). Let’s just say that only 0.5 per cent of the population has the disease. Then if we randomly pick someone from the general population, the chance that they have the disease is surprisingly low. EVEN IF THEY TEST POSITIVE!: \[\frac{0.99\times0.005}{0.99\times0.005 + 0.01\times0.995} \approx 0.332\] That is, only about 33.2 per cent.
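The arithmetic can be checked in a few lines of R. This is just a sketch of the calculation in the formula above, using its 0.99 accuracy and 0.005 prevalence; the variable names are my own.

```r
# Values from the example in the text.
sensitivity <- 0.99   # P(positive | disease)
specificity <- 0.99   # P(negative | no disease)
prevalence  <- 0.005  # P(disease), that is, 0.5 per cent

# P(positive): true positives plus false positives.
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(disease | positive).
p_disease_given_positive <- sensitivity * prevalence / p_positive

print(round(p_disease_given_positive, 3))
```

Changing `prevalence` shows how strongly the answer depends on how common the disease is in the population being tested.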

To see this in action, consider the example of Google’s cancer testing: And then consider the points raised by Vinay Prasad: