Workflow

Required viewing

Recommended viewing

Key concepts/skills/etc

• Restart R often (Session -> Restart R and Clear Output).
• Debugging is a skill, and you will get better at it with time and practice.
• Check the class.
• You may get frustrated at times; this is normal.
• There are various tools that can help. Google is your friend.
• Make a small example and try to get the code running on that.
• Cultivating a tenacious mentality may help.
• Write code that future-you can understand.
• Data-first research.
• Question-first research.
• Developing important questions.

Pre-quiz

1. What are three features of a good research question?
2. What is a counterfactual?
3. Name three features of R Markdown that might be useful.
4. How do you hide the warnings in an R Markdown R chunk?
5. What is a ReprEx and why is it important to be able to make one?

Introduction

In this section….

R Markdown

Getting started

R Markdown is a mark-up language similar to HTML or LaTeX, in contrast to a WYSIWYG editor such as Word. This means that all of the aspects are consistent, for instance, all ‘main headings’ will look the same. However, it also means that you use symbols to designate how you would like certain aspects to appear, and it is only when you compile the document that you get to see the result.

R Markdown is a variant of regular markdown that is specifically designed to allow R code chunks to be included. The advantage is that you can get a ‘live’ document in which code executes and is then printed to a document. The disadvantage is that it can take a while for the document to compile because all of the code needs to run.

You can create a new R Markdown document within RStudio (File -> New File -> R Markdown Document). Another advantage of R Markdown is that very similar code can compile into a variety of documents, including HTML pages and PDFs. R Markdown also has default options set up for including title, author, and date sections.

Basic commands

If you ever need a reminder of the basics of R Markdown then this is built into RStudio (Help -> Markdown Quick Reference). This provides the code for commonly needed commands:

• Emphasis: *italic*, **bold**, _italic_, __bold__
• Headers (these need to go on their own line with a line before and after): # Header 1, ## Header 2, ### Header 3
• Lists:

Unordered List
* Item 1
* Item 2
    + Item 2a
    + Item 2b

Ordered List
1. Item 1
2. Item 2
3. Item 3
    + Item 3a
    + Item 3b
• URLs: Can just include an address: http://example.com, or can include a [linked phrase](http://example.com).
• Basic images can just be included either from the internet: ![alt text](http://example.com/logo.png) or from a local file: ![alt text](figures/img.png).

In order to create an actual document, once you have these pieces set up, click ‘Knit’.

R chunks

You can include R (and a bunch of other languages) code in code chunks within your R Markdown document. Then when you knit your document, the R code will run and be included in your document.

To create an R chunk, start a line with three backticks followed by r in curly braces, {r}, which tells markdown that this is an R chunk, and close the chunk with another line of three backticks. Anything inside this chunk will be considered R code and run as such.


library(tidyverse)
ggplot(data = diamonds) +
  geom_point(aes(x = price, y = carat))

There are various evaluation options that are available in chunks. You include these by putting a comma after r and then specifying any options before the closing curly brace. Helpful options include:

• echo = FALSE: run the code and include the output, but don’t print the code in the document.
• include = FALSE: run the code but don’t output anything and don’t print the code in the document.
• eval = FALSE: don’t run the code, and hence don’t include the output, but do print the code in the document.
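The same kind of options can also be set once for the whole document rather than chunk by chunk. The following is a minimal sketch, assuming the knitr package (which R Markdown uses to run chunks) is installed; the particular combination of options is only an illustration, not a recommendation.

```r
# Typically placed in a set-up chunk near the top of the document.
# Assumes the knitr package is installed.
library(knitr)

opts_chunk$set(
  echo = FALSE,     # run the code and show its output, but hide the code
  warning = FALSE,  # do not print warnings in the knitted document
  message = FALSE   # do not print messages (e.g. package start-up notes)
)

# Check what the document-wide default for echo is now.
opts_chunk$get("echo")
```

Any option written in an individual chunk header still overrides these defaults for that chunk.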

Abstracts and PDF outputs

In the default header, you can add a section for an abstract, so that it would look like this:


---
title: My document
author: Rohan Alexander
date: 25 February 2020
output: html_document
abstract: "This is my abstract."
---

Similarly, you can change the output from html_document to pdf_document in order to produce a PDF. This uses LaTeX in the background so you may need to install a bunch of related packages.

R projects

RStudio has the option of creating a project, which allows you to keep all the files (data, analysis, report, etc.) associated with a particular project together. To create a project, click File -> New Project, then select empty project, name your project and think about where you want to save it. For example, if you are creating a project for Problem Set 2, you might call it ps2 and save it in a sub-folder called PS2 in your INF2178 folder.

Once you have created a project, a new file with the extension .Rproj will appear in that folder. As an example, download the R help folder. Whenever I work on class materials, I open the project file and work from that.

The main advantage of projects is that you don’t have to set the working directory or type the whole file path to read in a file (for example, a data file). So instead of reading a csv from "~/Documents/toronto/teaching/INF2178/data/" you can just read it in from data/.
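As a rough sketch of what this looks like in practice, the code below mimics a project folder inside a temporary directory and then reads a file using only a short relative path. The folder name ps2 follows the example above, while the file name example.csv and the use of the built-in mtcars data are assumptions made purely for illustration.

```r
# Simulate a project folder in a temporary directory. This is a stand-in
# for the folder that holds the .Rproj file; the names are hypothetical.
project_dir <- file.path(tempdir(), "ps2")
dir.create(file.path(project_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)
write.csv(head(mtcars),
          file.path(project_dir, "data", "example.csv"),
          row.names = FALSE)

# Opening the .Rproj file sets the working directory for you.
# setwd() is used here only to mimic that in a standalone script.
old_wd <- setwd(project_dir)

# Inside the project, a short relative path is enough.
example_data <- read.csv("data/example.csv")
nrow(example_data)

# Put the working directory back the way it was.
setwd(old_wd)
```

In a real project you would not call setwd() at all; opening the project file does that work.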

Using R in practice

Introduction

This section covers what to do when your code doesn’t do what you want, discusses a mindset that may help when doing quantitative analysis with R, and finally makes some recommendations about how to write your code.

Getting help

Programming is hard and everyone struggles sometimes. At some point your code won’t run or will throw an error. This is normal, and it happens to everyone. It happens to me on a daily, sometimes hourly, basis. Everyone gets frustrated. There are a few steps that are worthwhile taking when this happens:

• Sometimes the error messages in R are useful. Read them carefully and see if there’s anything of use in them. At the very least, if you get the same message in the future, hopefully you might remember how you solved the problem this time!
• If you’re getting an error then try googling it (I find it can help to include the term ‘R’ or ‘tidyverse’ or the relevant package name).
• If there’s a particular function that seems to be giving trouble, have a look at the help file for it. Sometimes you might be putting in the arguments in the wrong order. You can open the help file with ‘?function’, e.g. for help with select, you would type ‘?select’ and then run that line.
• Check the class of the object. Sometimes R is a little fussy and converting the class can help.
• If your code just isn’t running, then try searching for what you are trying to do, e.g. ‘save PDF of graph in R made using ggplot’. Almost always there are relevant blog posts or Stack Overflow answers that will help.
• Try to restart R and RStudio and load everything again.
• Try to restart your computer.

There are a few small mistakes that I often make and may be worth checking in case you make them too:

• check the class, e.g. class(my_dataset$its_column), to make sure that it is what it should be;
• when you’re using ggplot make sure you use ‘+’ not ‘%>%’; and
• check whether you are using ‘.’ when you shouldn’t be, or vice versa.
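To make the class check concrete, here is a small sketch using a made-up dataset in which a column of numbers has been read in as text. The names my_dataset and its_column follow the example above; the values are invented for illustration.

```r
# A small made-up dataset where a numeric column was read in as text.
my_dataset <- data.frame(
  its_column = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Check the class: it is text, not numbers.
class(my_dataset$its_column)   # "character"

# Functions such as mean() need numbers, so convert the column first.
my_dataset$its_column <- as.numeric(my_dataset$its_column)

class(my_dataset$its_column)   # "numeric"
mean(my_dataset$its_column)    # 2
```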

It’s almost always helpful to take a break and come back the next day.

Asking for help is a skill that you will get better at, but in general I recommend:

1. Provide an example of your data and what is going wrong.
2. Document what you have tried so far.
3. Document what outcome you would like.
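One common way to provide an example of your data is to cut it down to a few rows and columns and then print it as code that someone else can paste into their own session. The sketch below uses base R’s dput() for this; the built-in mtcars dataset is an assumption standing in for your own data.

```r
# Cut the data down to the few rows and columns that show the problem.
# mtcars is a built-in dataset used here as a stand-in for your own data.
small_example <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that recreates the object, which can be pasted
# straight into a question or an email.
dput(small_example)

# Anyone can rebuild the object from that printed code.
recreated <- eval(parse(text = capture.output(dput(small_example))))
```

Sharing the printed code rather than a screenshot means the person helping you can run exactly what you ran.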

Reproducible examples

A Reproducible Example, or ReprEx,

Mentality

(Y)ou are a real, valid, competent user and programmer no matter what IDE you develop in or what tools you use to make your work work for you

(L)et’s break down the gates, there’s enough room for everyone

Sharla Gelfand, 10 March 2020.

I’m a little hesitant to make suggestions with regard to mentality. If you write code, then you’re a coder regardless of how you do it, what you’re using it for, or who you are. But I want to share a few traits that I have found useful to cultivate in myself. That said, whatever works for you is great, so take or leave this section.

• Focused: I’ve found that having an aim to ‘learn R’ or something similar tends to be problematic, because there’s no real end point to that. Instead I would recommend smaller, more specific goals, such as ‘make a histogram about the 2019 Canadian Election with ggplot’. That is something that you can focus on and achieve. With more nebulous goals it becomes easier to get lost on tangents and much more difficult to get help, and I’ve noticed that people who have nebulous goals seem to give up.
• Curious: I’ve found that it’s useful to just have a go. In general, the worst that happens is that you waste your time and have to give up. You can rarely break something irreparably. If you want to know what happens if you pass a vector instead of a dataframe to ggplot then just try it.
• Pragmatic: At the same time, I’ve found that it’s best to try to stick within the bounds of what I know and just make one small change each time. For instance, if you want to run a regression and are curious about using the tidymodels package instead of lm, perhaps you could use just one aspect of the tidymodels package initially and then make another change next time. Ugly code that gets the job done is better than beautiful code that doesn’t.
• Tenacious: This is a balancing act. I always find there are unexpected problems and issues with every project. On the one hand, persevering despite these is a good tendency. But on the other hand, I’ve learnt that sometimes I need to be prepared to give up on something if it doesn’t seem like a break-through is possible. This is where I have found that mentors can be useful, as they tend to have a better sense of whether a break-through is likely. This is also where planning comes in.
• Planned: I have found it is very useful to plan out what you are going to do. For instance, you may want to make a histogram of the 2019 Canadian Election. I find it useful to plan the steps that are needed and even to sketch out how I might implement each step. For instance, the first step is to get the data. What packages might be useful? Where might the data be? What is our back-up plan for if we can’t find the data in that initial spot?
• Done is better than perfect: We all have perfectionist tendencies to some extent, but I recommend that you try to turn them down when it comes to R. In the first instance just try to write code that works, especially in the early days. You can always come back and improve aspects of it. But it is important to actually ship.

Style

There is no one way to write code, especially in R. However there are some general guidelines that will make it easier for you even if you’re just working on your own.

Comments in R can be added by including the # symbol. The RStudio shortcut on Mac to comment or uncomment a line is Command + Shift + C. You don’t have to put a comment at the start of the line; it can be midway through. In general you don’t need to comment what every aspect of your code is doing, but you should comment parts that are not obvious. For instance, if you read in some value then you may like to comment where it is coming from. You should also try to comment why you are doing something.

I like to break my code into sections. For instance, setting up my workspace, reading in datasets, manipulating and cleaning the datasets, analysing the datasets, and finally producing tables and figures. While it can be difficult to speak generally, I usually separate each of those at the very least with comments explaining what is going on, and sometimes into separate files, depending on the length.

Additionally, at the top of each file I put basic information, such as the purpose of the file, any pre-requisites or dependencies, the date, the author and contact information, and finally any red flags, bodies, or todos.
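As one possible layout, the sketch below shows a short script with a header and commented sections along the lines described above. The author, contact, and date details are placeholders, and the built-in mtcars dataset stands in for data that would normally be read from a file.

```r
#### Preamble ####
# Purpose: Summarise fuel efficiency by number of cylinders (illustration)
# Author: Your Name (placeholder)
# Contact: your.name@example.com (placeholder)
# Date: 1 January 2020 (placeholder)
# Pre-requisites: None beyond base R
# To do: Replace the built-in data with the real dataset

#### Workspace set-up ####
# Only base R is used, so there are no packages to load here.

#### Read in data ####
raw_data <- mtcars  # built-in dataset, standing in for a file read

#### Clean ####
# Keep only the columns needed, so that later steps are easier to check.
cleaned_data <- raw_data[, c("mpg", "cyl")]

#### Analyse ####
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = cleaned_data, FUN = mean)

#### Tables and figures ####
print(mean_mpg_by_cyl)
```

In a longer project each of these sections might become its own file, with the same kind of header at the top of each.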

Learning more

One of the great aspects of R is that there is a friendly community of people who use it. There are a variety of ways that I learn about new tricks, functions, and packages including:

Another great way to learn is by exchanging your code with others. Initially, just have them read it and give you feedback about it. But after you get a bit more confident, run each other’s code. The fastest I’ve ever improved in my R journey has been by having Monica try to run my code.

Developing research questions

Both qualitative and quantitative approaches have their place, but here we focus on quantitative approaches. (Qualitative research is important as well, and often the most interesting work has a little of both - ‘mixed methods’.) This means that we are subject to issues surrounding data quality, scales, measures, sources, etc. We are especially interested in trying to tease out causality.

Broadly there are two ways to go about research: 1) data-first, 2) question-first. If you get a job somewhere, typically you will initially be data-first. This means that you will need to work out the questions that you can reasonably answer with the data available to you. After you show some promise, you may be given the latitude to explore specific questions, possibly even gathering data specifically for that purpose. Contrast this with the example of the Behavioural Insights Team (IEP, p. 23), who got to design and then carry out experiments given the remit of the entire British government (as they were spun out of the prime minister’s office).

When deciding the questions that you can reasonably answer with the data that are available, you need to think about:

1. Theory: Do you have a reasonable expectation that there is something causal that could be determined? Charting the stock market: maybe, but you might be better off with a haruspex because at least that way you have something you could eat. You need a reasonable theory of how $x$ may be affecting $y$.
2. Importance: There are plenty of trivial questions that you could ask, but it’s important to not waste your time. The importance of a question also helps with motivation when you are on your fourth straight week of cleaning data and de-bugging your code. It also (and this becomes important) makes it easier to get talented people to work with you, or similarly to convince people to fund you or allow you to work on this project.
3. Availability: Can you reasonably expect to get more data about this research question in the future or is this the extent of the data that could be gathered?
4. Iteration: Is this something that you can come back to and run often or is this a once-off analysis?

Aaron Miller points to the ‘FINER framework’ as a mnemonic device used in medicine. This framework reminds us to ask questions that are (quoting from Hulley S, Cummings S, Browner W, et al. Designing clinical research. 3rd ed. Philadelphia (PA): Lippincott Williams and Wilkins; 2007):

• Feasible: Adequate number of subjects; adequate technical expertise; affordable in time and money; manageable in scope.
• Interesting: Getting the answer intrigues investigator, peers and community.
• Novel: Confirms, refutes or extends previous findings.
• Ethical: Amenable to a study that institutional review board will approve.
• Relevant: To scientific knowledge; to clinical and health policy; to future research.

Farrugia P, Petrisor BA, Farrokhyar F, Bhandari M. (2010) build on this in terms of developing research questions and recommend PICOT (quoting from Farrugia, et al.):

• Population: What specific population are you interested in?
• Intervention: What is your investigational intervention?
• Comparison group: What is the main alternative to compare with the intervention?
• Outcome of interest: What do you intend to accomplish, measure, improve or affect?
• Time: What is the appropriate follow-up time to assess outcome?

I want to follow up on a couple of aspects here:

Time: Often this will be constrained, possibly in interesting ways. If we are interested in the effect of Trump’s tweets on the stock market, then that can be done just by looking at the minutes (milliseconds?) after he tweets. But what if we are interested in the effect of a cancer drug on long term outcomes? If the effect takes 20 years then we either have to wait a while, or we need to look at people who were treated in 2000, but then we have selection effects and different circumstances from those we would face if we gave the drug today. Often the only reasonable thing to do is to build a statistical model, but then we need adequate sample sizes, etc.

Comparison: The creation of a counterfactual is crucial. A counterfactual is an if-then statement in which the if is false. Consider the example of Humpty Dumpty from Lewis Carroll’s Through the Looking-Glass:

Humpty is satisfied with what would happen if he were to fall off, even though he is similarly satisfied that this would never happen. (I won’t ruin the story for you.) The comparison group often determines your results, e.g. the relationship between VO2 and athletic outcomes, cf. elite athletic outcomes.

Ethics: Often these guides are focused on ethics boards etc. But we often don’t have those in data science applications. Even if your intentions are unimpeachable, I want to suggest one additional aspect to think about, and that is Bayes’ theorem: $P(A|B) = \frac{P(B|A)\times P(A)}{P(B)}$

The probability of A given B depends on the probability of B given A, the probability of A, and the probability of B. To see why, let’s go to the canonical Bayes example: There is some test for a disease that is 99 per cent accurate both ways (that is, if a person actually has the disease there is a 99 per cent chance the test says positive, and if a person does not have the disease then there is a 99 per cent chance the test says negative). Let’s just say that only 0.5 per cent of the population has the disease. Then if we randomly pick someone from the general population, the chance that they have the disease is surprisingly low, EVEN IF THEY TEST POSITIVE: $\frac{0.99\times0.005}{0.99\times0.005 + 0.01\times0.995} \approx 0.332$, or about 33 per cent.
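The arithmetic can be checked directly in R. This is a small sketch of the calculation, using 0.99 for the accuracy of the test in both directions and 0.005 for the share of the population with the disease (the values that appear in the fraction above); the variable names are my own.

```r
sensitivity <- 0.99   # P(test positive | disease)
specificity <- 0.99   # P(test negative | no disease)
prevalence  <- 0.005  # P(disease) in the general population

# P(test positive), summing over people with and without the disease.
p_positive <- sensitivity * prevalence +
  (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(disease | test positive).
p_disease_given_positive <- sensitivity * prevalence / p_positive

round(p_disease_given_positive, 3)   # 0.332
```

Changing prevalence in this sketch shows how strongly the answer depends on how common the disease is in the first place.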