Randomised controlled trials

Required reading

Required viewing

Recommended reading

Recommended listening

Key concepts/skills/etc

Key libraries

Key functions/etc


  1. In your own words, what is the role of randomisation in constructing a counterfactual?
  2. What is external validity? What is internal validity?
  3. Suppose we have a dataset named ‘netflix_data’ with the columns ‘person’, ‘tv_show’ and ‘hours’ (person is a character-class unique ID for each person, tv_show is the character-class name of a TV show, and hours is a double giving the number of hours that person watched that TV show). Could you please write some code that would randomly assign people into one of two groups? The data looks like this:
netflix_data <- 
  tibble(person = c("Rohan", "Rohan", "Monica", "Monica", "Monica", 
                    "Patricia", "Patricia", "Helen"),
         tv_show = c("Broadchurch", "Duty-Shame", "Broadchurch", "Duty-Shame", 
                     "Shetland", "Broadchurch", "Shetland", "Duty-Shame"),
         hours = c(6.8, 8.0, 0.8, 9.2, 3.2, 4.0, 0.2, 10.2)
  )
  4. What does stratification mean to you (in the context of randomisation)?
  5. How could you check that your randomisation had been done appropriately?


First a note on Ronald Fisher and Francis Galton. Fisher and Galton are the intellectual grandfathers of much of the work that we cover. In some cases it is directly their work, in other cases it is work that built on their contributions. Both of these men believed in eugenics, amongst other things that we would generally find reprehensible today.

This chapter is about experiments. This is a situation in which we can explicitly control and vary some aspects. The advantage of this is that identification should be clear. There is a treatment group that is treated and a control group that is not. These are randomly split. And so if they end up different then it must be because of the treatment. Unfortunately, life is rarely so smooth. Arguing about how similar the treatment and control groups were tends to carry on indefinitely, because our ability to speak to internal validity affects our ability to speak to external validity.

It’s also important to note that the statistics of this were designed in agricultural settings (‘does fertilizer work?’, etc). In those settings you can more easily divide a field into ‘treated’ and ‘non-treated’, and the magnitude of the effect is large. In general, these same statistical approaches are still used today (especially in the social sciences) but often inappropriately. If you hear someone in economics or political science talking about power to identify effects and similar terms, they are not necessarily wrong, but it usually pays to take a step back and really think about what is being done.


Never forget: if your sampling is in any way non-representative, your observe[d] data is not sufficient for population estimates. You must deal with design, sampling issues, data quality, and misclassification. Otherwise you’ll just be wrong.

Dan Simpson, 30 January 2020.

When Monica and I moved to San Francisco, the Giants immediately won the baseball, and the Warriors began a historic streak. We moved to Chicago and the Cubs won the baseball for the first time in a hundred years. We then moved to Massachusetts, and the Patriots won the Super Bowl again and again and again. Finally, we moved to Toronto, and the Raptors won the basketball. Should a city pay us to live there or could their funds be better spent elsewhere?

One way to get at the answer would be to run an experiment. Make a list of the North American cities with major sports teams, roll a dice each year, and send us to live in whichever city comes up. If we had enough lifetimes, then we could work it out. But by using A/B testing and experiments we can try to use a larger sample size to work out causality a little more quickly.

Tea party

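Fisher (see note above) introduced a now-famous example of an experiment designed to see if a person can distinguish between a cup of tea in which the milk was added first and one in which it was added last (I’m personally very attached to this example as this issue also matters a lot to my father). The set-up is: eight cups of tea are prepared, four milk-first and four tea-first, and presented in a random order; the person is told that there are four of each and is asked to group them.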

We’ll now try this experiment.

To decide if the person’s choices were likely to have occurred at random or not, we need to think about the probability of this happening by chance. First count the number of successes out of the four that were chosen. There are: \({8 \choose 4} = \frac{8!}{4!(8-4)!}=70\) possible outcomes.

By chance, there are two ways for the person to be perfectly correct (because we are only asking them to be grouped): correctly identify all the ones that were milk-first (one outcome out of 70) or correctly identify all the ones that were tea-first (one outcome out of 70), so the chance of that is \(2/70 \approx 0.029\). So if the null is that they can’t distinguish, but they correctly separate them all, then at the five-per-cent level, we reject the null!

What if they miss one? Similarly, by chance there are 16 ways for a person to be ‘off-by-one’. They must think that one cup was milk-first when it was tea-first - there are, \({4 \choose 1}\), four ways this could happen - and also think that one cup was tea-first when it was milk-first - again, there are, \({4 \choose 1}\), four ways this could happen. These two choices are independent, so the number of ways multiplies and the probability is \(\frac{4\times 4}{70} \approx 0.229\). And so on. So, in that case, we fail to reject the null.
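The counts above can be checked in a few lines of base R. This is only a sketch of the arithmetic in the text; the helper `ways()` and the variable names are made up for illustration, and `dhyper()` is used as a cross-check because picking four cups out of eight without replacement is a hypergeometric draw.

```r
# Fisher's tea-tasting set-up: 8 cups, 4 milk-first, and the taster picks 4.
n_outcomes <- choose(8, 4)

# Ways to pick exactly k of the 4 milk-first cups (and so 4 - k tea-first cups).
ways <- function(k) choose(4, k) * choose(4, 4 - k)

# Perfect grouping: all four chosen cups are of one kind (either kind).
p_perfect <- (ways(4) + ways(0)) / n_outcomes

# Off by one, as counted in the text: 3 milk-first cups and 1 tea-first cup.
p_off_by_one <- ways(3) / n_outcomes

# The same count comes from the hypergeometric distribution.
p_off_by_one_check <- dhyper(x = 3, m = 4, n = 4, k = 4)

round(c(n_outcomes, p_perfect, p_off_by_one, p_off_by_one_check), 3)
```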

Randomised controlled trials bring this same logic everywhere.

Some unexpected properties from randomised sampling

Correlation can be enough in some settings, but in order to be able to make forecasts when things change and the circumstances are slightly different we need to understand causation. The key is the counterfactual: what would have happened in the absence of the treatment. Ideally we could keep everything else constant, randomly divide the world into two groups, and then treat one and not the other. Then we can be pretty confident that any difference between the two groups is due to that treatment. The reason for this is that if we have some population and we randomly select two groups from it, then our two groups (so long as they are both big enough) should have the same characteristics as the population. Randomised controlled trials (RCTs) and A/B testing attempt to get us as close to this ideal as we can hope. (RCTs are often described as the ‘gold standard’. That is not to say that they are perfect, just that they are generally better than most of the other options. There is plenty that is wrong with them.)

To see this, let’s generate a simulated dataset and then sample from it. (In general, this is a good way to approach problems: 1) generate a simulated dataset; 2) do your analysis on the simulated dataset; 3) take your analysis to the real dataset. The reason this is a good approach is that you know roughly what the outcomes should be in step 2, whereas if you go directly to the real dataset then you don’t know if unexpected outcomes are likely due to your own analysis errors, or actual results. The first time you generate a simulated dataset it will take a while, but after a bit of practice you’ll get good at it. There are also packages that can help, including DeclareDesign.)


library(tidyverse)

# Construct a population in which 25 per cent of people like blue and 75 per
# cent like white, and 80 per cent support the Maple Leafs.
population <- 
  tibble(person = c(1:10000),
         favourite_color = sample(x = c("Blue", "White"), 
                                  size  = 10000, 
                                  replace = TRUE,
                                  prob = c(0.25, 0.75)),
         supports_the_leafs = sample(x = c("Yes", "No"), 
                                  size  = 10000, 
                                  replace = TRUE,
                                  prob = c(0.80, 0.20))
         ) %>% 
  # Roughly half of the population is in the sampling frame.
  mutate(in_frame = sample(x = c(0:1),
                        size  = 10000, 
                        replace = TRUE)) %>% 
  # Randomly allocate everyone to one of ten groups...
  mutate(group = sample(x = c(1:10),
                        size  = 10000, 
                        replace = TRUE)) %>% 
  # ...but only those in the sampling frame keep their group.
  mutate(group = ifelse(in_frame == 1, group, NA))

As a reminder, the sampling frame is the subset of the population that can actually be sampled, for instance because they are listed somewhere. Lauren Kennedy uses the analogy of a city’s population and the phonebook - almost everyone is in there (or at least they used to be), so the population and the sampling frame are almost the same, but they are not.

Now look at the make-up of two groups drawn out of the sampling frame.

population %>% 
  filter(in_frame == 1) %>% 
  filter(group %in% c(1, 2)) %>% 
  group_by(group, favourite_color) %>% 
  count()
# A tibble: 4 x 3
# Groups:   group, favourite_color [4]
  group favourite_color     n
  <int> <chr>           <int>
1     1 Blue              114
2     1 White             420
3     2 Blue              105
4     2 White             369

We are probably convinced by looking at it, but to formally test if there is a difference in the two samples, we can use a t-test.


# Recode favourite colour as a 0/1 indicator so that we can compare means.
population <- 
  population %>% 
  mutate(color_as_integer = case_when(
    favourite_color == "White" ~ 0,
    favourite_color == "Blue" ~ 1,
    TRUE ~ 999
    ))

# Extract the indicator for each group as a plain numeric vector.
group_1 <- 
  population %>% 
  filter(group == 1) %>% 
  pull(color_as_integer)

group_2 <- 
  population %>% 
  filter(group == 2) %>% 
  pull(color_as_integer)

t.test(group_1, group_2)

    Welch Two Sample t-test

data:  group_1 and group_2
t = -0.30825, df = 988.57, p-value = 0.758
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05919338  0.04312170
sample estimates:
mean of x mean of y 
0.2134831 0.2215190 
# We could also use the tidy function in the broom package.
library(broom)
tidy(t.test(group_1, group_2))
# A tibble: 1 x 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
1 -0.00804     0.213     0.222    -0.308   0.758      989.  -0.0592    0.0431
# … with 2 more variables: method <chr>, alternative <chr>

If properly done then not only will we get a ‘representative’ share of people with the favourite color blue, but we should also get a representative share of people who support the Maple Leafs. Why should that happen when we haven’t randomised on these variables? Let’s start by looking at our dataset.

population %>% 
  filter(in_frame == 1) %>% 
  filter(group %in% c(1, 2)) %>% 
  group_by(group, supports_the_leafs) %>% 
  count()
# A tibble: 4 x 3
# Groups:   group, supports_the_leafs [4]
  group supports_the_leafs     n
  <int> <chr>              <int>
1     1 No                   102
2     1 Yes                  432
3     2 No                    81
4     2 Yes                  393

This is very powerful. We have a representative share on ‘unobservables’ (in this case we do ‘observe’ them - to illustrate the point - but we didn’t select on them). But it will break down in a number of ways that we will discuss. It also assumes large enough groups - if we sampled in Toronto are we likely to get a ‘representative’ share of people who support the Canadiens? What about F.C. Hansa Rostock? If we want to check that the two groups are the same then what can we do? Exactly what we did above - just check if we can identify a difference between the two groups based on observables (we looked at the mean, but we could look at other aspects as well).
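One way to make this kind of check concrete is a chi-squared test of independence between group membership and a characteristic we did not randomise on. The following is a minimal sketch in base R under stated assumptions: the data frame `balance_check` is a hypothetical stand-in for the population above, simulated afresh so the block runs on its own, and the seed is arbitrary.

```r
set.seed(853)  # arbitrary seed, chosen only so the sketch is reproducible

# Hypothetical stand-in for the population above: one unobserved trait
# (supports_the_leafs) and a random split into two groups.
n <- 1000
balance_check <- data.frame(
  supports_the_leafs = sample(c("Yes", "No"), size = n, replace = TRUE,
                              prob = c(0.80, 0.20)),
  group = sample(c(1, 2), size = n, replace = TRUE)
)

# Cross-tabulate group membership against the trait.
balance_table <- table(balance_check$group, balance_check$supports_the_leafs)
balance_table

# Share of supporters within each group (rows sum to 1).
prop.table(balance_table, margin = 1)

# Chi-squared test of independence between group and trait.
balance_test <- chisq.test(balance_table)
balance_test$p.value
```

A large p-value here would be consistent with the two groups having the same share of supporters; it does not prove that they are identical.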


Analysis of Variance (ANOVA) was introduced by Fisher while he was working on statistical problems in agriculture (to steal a joke from Darren L Dahly, cited in an earlier reading, “Q: What’s the difference between agricultural and medical research?” “A: The former isn’t conducted by farmers.”). Typically, the null is that all of the groups are from the same distribution.

We can run ANOVA with the function built into R - aov().

just_two_groups <- population %>%
  filter(in_frame == 1) %>%
  filter(group %in% c(1, 2))

aov(group ~ favourite_color, 
    data = just_two_groups) %>% 
  tidy()
# A tibble: 2 x 6
  term               df    sumsq meansq statistic p.value
  <chr>           <dbl>    <dbl>  <dbl>     <dbl>   <dbl>
1 favourite_color     1   0.0238 0.0238    0.0952   0.758
2 Residuals        1006 251.     0.250    NA       NA    

In this case, we fail to reject the null that the samples are the same.

Treatment and control

If the treated and control groups are the same in all ways and remain that way, then we have internal validity, which is to say that our control will work as a counterfactual and our results can speak to a difference between these groups. If the group to which we applied our randomisation were representative of the broader population, and the experimental set-up were fairly similar to outside conditions, then we further have external validity. That means that the difference that we find does not just apply in our own experiment, but also in the broader population. But this means we need randomisation twice: once to select the people in the experiment from the broader population, and once to assign them to treatment or control. How does this trade-off happen and to what extent does it matter?

As such, we are interested in the effect of being ‘treated’. This may be that we charge different prices (continuous treatment variable), or that we compare different colours on a website (discrete treatment variable, and a staple of A/B testing). If we consider just discrete treatments (so that we can use dummy variables) then we need to make sure that all of the groups are otherwise the same. How can we do this? One way is to ignore the treatment variable and to examine all other variables - can you detect a difference between the groups based on any other variables? In the website example, are there a similar number of:

These are all threats to the validity of our claims.

But if done properly, that is if the treatment is truly independent, then we can estimate an ‘average treatment effect’, which in a binary treatment variable setting is: \[\mbox{ATE} = \mbox{E}[y|d=1] - \mbox{E}[y|d=0].\]

That is, the difference between the treated group, \(d = 1\), and the control group, \(d = 0\), when measured by the expected value of some outcome variable, \(y\). So the mean causal effect is simply the difference between the two expectations!

Let’s again get stuck into some code. First we need to generate some data.

example_data <- tibble(person = c(1:1000),
                       treatment = sample(x = 0:1, size  = 1000, replace = TRUE)
                       )
# We want the outcome to be higher, on average, if they were treated than if not.
example_data <- 
  example_data %>% 
  rowwise() %>% 
  mutate(outcome = if_else(treatment == 0, 
                           rnorm(n = 1, mean = 5, sd = 1),
                           rnorm(n = 1, mean = 6, sd = 1)
                           )
         ) %>% 
  ungroup()

example_data$treatment <- as.factor(example_data$treatment)

example_data %>% 
  ggplot(aes(x = outcome, 
             fill = treatment)) +
  geom_histogram(position = "dodge",
                 binwidth = 0.2) +
  theme_minimal() +
  labs(x = "Outcome",
       y = "Number of people",
       fill = "Person was treated") +
  scale_fill_brewer(palette = "Set1")

example_regression <- lm(outcome ~ treatment, data = example_data)

tidy(example_regression)

# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     5.00    0.0430     116.  0.      
2 treatment1      1.01    0.0625      16.1 5.14e-52
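
The `treatment1` coefficient in the regression output is the estimated average treatment effect, and with a single binary treatment it is numerically the same as the difference in the two group means. The following base R sketch shows that equivalence on freshly simulated data; the seed and variable names are arbitrary, and the true effect of 1 is an assumption that mirrors the simulation above.

```r
set.seed(853)  # arbitrary seed for reproducibility of this sketch only

# Simulate a binary treatment with a true effect of 1 on the outcome.
n <- 1000
treatment <- sample(0:1, size = n, replace = TRUE)
outcome <- rnorm(n, mean = 5 + treatment, sd = 1)

# ATE estimate as a difference in sample means.
ate_means <- mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])

# The same estimate as the slope from a regression on the treatment dummy.
ate_lm <- unname(coef(lm(outcome ~ treatment))["treatment"])

c(difference_in_means = ate_means, regression_slope = ate_lm)
```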

But then reality happens. Your experiment cannot run for too long otherwise people may be treated many times, or become inured to the treatment, but it cannot be too short otherwise you can’t measure longer term outcomes. You cannot have a ‘representative’ sample on every cross-tab, but if not then the treatment and control will be different. Practical difficulties may make it difficult to follow up with certain groups.

Questions to ask (if they haven’t been answered already) include:

Bias and other issues are not the end of the world, but you need to think about them carefully. In a famous example, Abraham Wald was given data on the planes that came back to Britain after being shot at in WW2. The question was where to place the armour. One option is to place it over the bullet holes. Wald recognised that there is a selection effect here: these are the planes that made it back, so the places where they were hit did not need the armour. Instead, the armour should go where the returning planes had no bullet holes.

To consider an example that may be closer to home - how would the results of a survey differ if I only asked students who completed this course what was difficult about it and not those who dropped out?

While, as Dan suggests, we should work to try to make the dataset as good as possible, it may be possible to use the model to control for some of the bias. If there is a variable that is correlated with, say, attrition, then we can add it to the model, either by itself or as an interaction.
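
As a rough illustration of the two specifications, here is a base R sketch. Everything in it is hypothetical: `older` is a made-up covariate that we assume is related to attrition, the effect sizes are invented, and the simulation does not model attrition itself. It only shows what ‘by itself’ and ‘as an interaction’ look like in a model formula.

```r
set.seed(853)  # arbitrary seed for this sketch

# Hypothetical data: `older` is a made-up covariate assumed to be related to
# attrition and to the outcome; the treatment effect is larger for older people.
n <- 2000
older <- sample(0:1, size = n, replace = TRUE)
treatment <- sample(0:1, size = n, replace = TRUE)
outcome <- rnorm(n, mean = 5 + 1 * treatment + 0.5 * older +
                   0.5 * treatment * older, sd = 1)
sim <- data.frame(outcome, treatment, older)

# Covariate added by itself.
additive_model <- lm(outcome ~ treatment + older, data = sim)

# Covariate added as an interaction (this also includes both main effects).
interaction_model <- lm(outcome ~ treatment * older, data = sim)

round(coef(additive_model), 2)
round(coef(interaction_model), 2)
```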

What if there is a correlation between the individuals? For instance, what if there were some ‘hidden variable’ that we didn’t know about, such as province, and it turned out that people from the same province were similar? In that case we could use ‘wider’ standard errors.

But a better way to deal with this may be to change the experiment. For instance, we discussed stratified sampling - perhaps we should stratify by province? How would we implement this?
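
One possible implementation is to randomise separately within each province, so that every province contributes equally to treatment and control. The sketch below uses base R only; the participants, province labels and stratum sizes are made up, and the helper `assign_within()` is a hypothetical name. It assumes each stratum has an even number of people.

```r
set.seed(853)  # arbitrary seed for this sketch

# Hypothetical participants; the province labels and counts are made up.
participants <- data.frame(
  person = 1:12,
  province = rep(c("Ontario", "Quebec", "Alberta"), times = c(6, 4, 2))
)

# Within each province, shuffle a balanced vector of 0s and 1s so that
# half of each stratum is treated (strata here have even sizes).
assign_within <- function(n) sample(rep(0:1, length.out = n))
participants$treatment <- ave(
  participants$person,
  participants$province,
  FUN = function(x) assign_within(length(x))
)

# Each province should have an equal split between treatment and control.
table(participants$province, participants$treatment)
```

With odd-sized strata the split within a province would differ by one, and a real design would need a rule for that case.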

And of course, these days we’d not really use a 100-year-old method but would instead use Bayes-based approaches.

TBD: Add code

Case study - TBD

Next steps

Large-scale experiments are happening all around us. These days I feel we all know a lot more about healthcare experiments than perhaps we’d like to know. The AstraZeneca/Oxford situation is especially interesting; see, for instance, Oxford-AstraZeneca (2020), but also Bastian (2020) for how it is possibly more complicated than it first appears.

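There are also well-known experiments that tried to see if big government programs are effective, such as the RAND Health Insurance Experiment (Brook et al. 1984) and the Oregon Health Insurance Experiment (Finkelstein et al. 2012).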

Bastian, Hilda. 2020. “A Timeline of the Oxford-Astrazeneca Covid-19 Vaccine Trials.” http://hildabastian.net/index.php/100.

Brook, Robert H, John E Ware, William H Rogers, Emmett B Keeler, Allyson Ross Davies, Cathy D Sherbourne, George A Goldberg, Kathleen N Lohr, Patricia Camp, and Joseph P Newhouse. 1984. “The Effect of Coinsurance on the Health of Adults: Results from the Rand Health Insurance Experiment.”

Finkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph P Newhouse, Heidi Allen, Katherine Baicker, and Oregon Health Study Group. 2012. “The Oregon Health Insurance Experiment: Evidence from the First Year.” The Quarterly Journal of Economics 127 (3): 1057–1106.

Oxford-AstraZeneca. 2020. “AZD1222 Vaccine Met Primary Efficacy Endpoint in Preventing Covid-19.” https://www.astrazeneca.com/media-centre/press-releases/2020/azd1222hlr.html.