A/B testing

Required reading

Required viewing

Recommended reading

Key concepts/skills/etc

Key libraries

Key functions/etc

Pre-quiz

Introduction

The past decade has probably seen more experiments run than ever before, by several orders of magnitude, because of the extensive use of A/B testing on websites. Every time you are online you are probably subject to tens, hundreds, or potentially thousands of different A/B tests. If you use apps like TikTok then this could run to the tens of thousands. While A/B tests have several interesting features, which we will discuss, at their heart they are still just surveys that result in data that need to be analysed. In this section of the notes we augment our main textbooks with a less formal book focused on A/B testing that is very popular in industry at the moment.

I can’t do much better than to quote the opening example from Kohavi, Tang, and Xu (2020, 3).

In 2012, an employee working on Bing, Microsoft’s search engine, suggested changing how ad headlines display (Kohavi and Thomke 2017). The idea was to lengthen the title line of ads by combining it with the text from the first line below the title, as shown in Figure 1.1.

Nobody thought this simple change, among the hundreds suggested, would be the best revenue-generating idea in Bing’s history!

The feature was prioritized low and languished in the backlog for more than six months until a software developer decided to try the change, given how easy it was to code. He implemented the idea and began evaluating the idea on real users, randomly showing some of them the new title layout and others the old one. User interactions with the website were recorded, including ad clicks and the revenue generated from them. This is an example of an A/B test, the simplest type of controlled experiment that compares two variants: A and B, or a Control and a Treatment.

A few hours after starting the test, a revenue-too-high alert triggered, indicating that something was wrong with the experiment. The Treatment, that is, the new title layout, was generating too much money from ads. Such “too good to be true” alerts are very useful, as they usually indicate a serious bug, such as cases where revenue was logged twice (double billing) or where only ads displayed, and the rest of the web page was broken.

For this experiment, however, the revenue increase was valid. Bing’s revenue increased by a whopping 12%, which at the time translated to over $100M annually in the US alone, without significantly hurting key user-experience metrics. The experiment was replicated multiple times over a long period.

The example typifies several key themes in online controlled experiments:

In these notes, I’m going to use A/B testing to refer strictly to the situation in which we’re dealing with a tech firm and some type of change to code. If we are dealing with the physical world then we’ll stick with the term RCT. You may think that it’s easy to go to a workplace and say ‘hey, let’s test stuff before we spend thousands, or millions, of dollars’. You’d be wrong. The hardest part of A/B testing isn’t the science, it’s the politics.

Unique complications of A/B testing

Delivery

This is from Chapter 12 of Kohavi, Tang, and Xu (2020).

In the case of an RCT it’s fairly obvious how we deliver the treatment - for instance, we have participants come to a doctor’s clinic and inject them with the drug or a placebo. In the case of A/B testing, it’s less obvious - do you run it ‘server-side’ or ‘client-side’? That is, do you just change the website (server-side), or do you change an app (client-side)?

This may seem like a silly issue, but it affects two aspects:

  1. Release.
  2. Data transmission.

In the case of release, it’s easy and normal to update a website all the time, so small changes can be implemented quickly server-side. However, for a client-side change, let’s say to an app, it’s likely a much bigger deal:

  1. It needs to get through an app store (a bigger or lesser deal depending on which one).
  2. It needs to go through a release cycle (a bigger or lesser deal depending on the specifics of the company and how it ships).
  3. Users have the opportunity to not upgrade. Are they likely different to those that do upgrade? (Yes.)

Now, in the case of data transmission, server-side is again less of a big deal - you get the data more or less automatically as the user interacts with the site. But client-side it’s not necessarily the case that the user will have an internet connection at the time they’re using your application, and even if they do, they may have limits on data uploads. The phone may also restrict data transmission depending on its effect on battery, CPU, general performance, and so on. So then you decide to cache the data locally, but then the user may find it weird that some minor app takes up as much space as their photos.

The effect of all this is that you need to plan, and build this into your expectations - don’t promise results the day after a release if you’re evaluating a client-side change. Recognise that your results are conditional, and gather data on those conditions, such as battery level. Adjust in your analysis for different devices and platforms. This is a lovely opportunity for multilevel regression.
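
As a rough sketch of what that kind of adjustment could look like, the following fits a multilevel model with a random intercept for device type. The file name and the columns (outcome, treatment, battery_level, device) are hypothetical, made up for illustration rather than taken from Kohavi, Tang, and Xu (2020).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical experiment log: one row per user, with the metric of interest
# ('outcome'), the assigned arm ('treatment'), and the conditions at upload time.
experiment_data = pd.read_csv("experiment_results.csv")  # hypothetical file name

# Random intercept for each device type, so the treatment effect is estimated
# while allowing baseline behaviour to differ across devices.
model = smf.mixedlm(
    "outcome ~ treatment + battery_level",
    data=experiment_data,
    groups=experiment_data["device"],
)
result = model.fit()
print(result.summary())
```

Whether device enters as a random or a fixed effect, and which conditions are worth controlling for, will depend on the application.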

Instrumentation

This is from Chapter 13 of Kohavi, Tang, and Xu (2020).

I would change the name of this from instrumentation, but I don’t have a good replacement. The point is that you need to consider how you are getting your data in the first place. For instance, if we put a cookie on a user’s device then different types of users will remove it at different rates. Using things like beacons can be great (this is when the user unknowingly ‘downloads’ some tiny thing, such as a one-pixel image, so that you know they’ve been somewhere - this is how email opens are often tracked). But again, there are practical issues - do we force the beacon to load before the main content, which makes for a worse user experience, or do we allow it to load after the main content, in which case we may get a biased sample?

There are likely different servers and databases behind different parts of the product. For instance, Twitter in Australia, compared with Twitter in Canada, compared with Twitter accessed through my phone’s app, compared with Twitter accessed via the browser. Joining these different datasets can be difficult and requires either a unique id or some probabilistic matching approach.
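
To make the joining point concrete, here is a minimal sketch that assumes both surfaces log a shared user_id; the file names and columns are made up for illustration.

```python
import pandas as pd

# Hypothetical event logs from two surfaces of the same product.
app_events = pd.read_csv("app_events.csv")  # logged client-side
web_events = pd.read_csv("web_events.csv")  # logged server-side

# An outer join keeps people who only show up on one surface; a large
# non-overlap is itself informative about identity-tracking problems.
combined = app_events.merge(
    web_events, on="user_id", how="outer", suffixes=("_app", "_web")
)
print(combined["user_id"].nunique())
```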

Kohavi, Tang, and Xu (2020, 165) recommend changing the culture of your workplace to ensure instrumentation is normalised, which, I mean, yeah.

Randomisation unit

This is from Chapter 14 of Kohavi, Tang, and Xu (2020).

What are we actually randomising over? Again, this is something that’s fairly obvious in normal RCTs, but gets really interesting in the case of A/B testing. Consider the malaria netting experiments - a person, village, or state either gets a net or it doesn’t. Easy (relatively). But in the case of server-side A/B testing - are we randomising the page, the session, or the user?

To think about this, let’s use colour. Say we change our logo from red to blue. If we’re randomising at the page level, then a user might see the blue logo on the ‘home’ page but the red logo when they go to the ‘about’ page. If we’re randomising at the session level, then the logo will stay blue while they’re using the website that time, but if they close it and come back then it could be red. Finally, if we’re randomising at the user level then it’ll always be red for me, but always blue for my friend. That last bit assumes perfect identity tracking, which might be generally okay if you’re Google or Facebook, but for anyone else is going to be a challenge - what if you visit cbc.ca on your phone and then on your laptop? You’re likely considered a different ‘user’.

Does this matter? It’s a trade-off: coarser units, such as users, give a consistent experience, while finer units, such as page views, give you more observations to work with.
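
One common way this choice shows up in code (not specific to any particular platform) is deterministic, hash-based assignment: whatever identifier you hash is, in effect, your randomisation unit. A minimal sketch, with made-up identifiers:

```python
import hashlib

def assign_variant(unit_id: str, experiment: str) -> str:
    """Deterministically assign a unit to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Whatever identifier we hash determines the randomisation unit.
print(assign_variant("user-123", "blue-logo"))      # user-level: stable everywhere
print(assign_variant("session-456", "blue-logo"))   # session-level: stable within a session
print(assign_variant("pageview-789", "blue-logo"))  # page-level: can flip every page view
```

In practice you would want more than two buckets and a per-experiment salt, but the point stands: the identifier you hash is the unit you are randomising over.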

Case study - Upworthy

To see this in action, let’s look at the Upworthy Research Archive (Matias et al. 2019).
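
As a starting point, the following sketch computes click-through rates for the headline packages within each test. It assumes a local CSV of the archive with columns named clickability_test_id, headline, impressions, and clicks - check the archive documentation at https://upworthy.natematias.com for the exact file and variable names before relying on this.

```python
import pandas as pd

# Hypothetical local copy of the archive; confirm the real file and column
# names against the archive documentation before using this.
upworthy = pd.read_csv("upworthy-archive.csv")

# Each clickability_test_id identifies one test; the rows within it are the
# headline packages that were randomly shown to visitors.
upworthy["ctr"] = upworthy["clicks"] / upworthy["impressions"]

summary = (
    upworthy.groupby("clickability_test_id")
    .agg(
        packages=("headline", "size"),
        best_ctr=("ctr", "max"),
        worst_ctr=("ctr", "min"),
    )
)
print(summary.head())
```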

References

Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.

Matias, J. Nathan, Kevin Munger, Marianne Aubin Le Quere, and Charles Ebersole. 2019. “The Upworthy Research Archive.” https://upworthy.natematias.com.