Introduction

As a data scientist, I have the privilege of working on deploying my models to production and the honor of testing their efficacy. This article is a culmination of all the experience I have with A/B testing as I’ve been setting up experimentation processes and platforms at various organizations with the aim of creating better models.

What is AB Testing?

Gronau et al provided a great definition for AB test:

Booming in business and a staple analysis in medical trials, the A/B test assesses the effect of an intervention or treatment by comparing its success rate with that of a control condition

But what are treatment and control?

Control group = A group of users/customers/people that will not receive any treatment or anything or they are status quo.
Treatment/Intervention Group = A group of users/customers/people that will receive a particular treatment or changes applied to it.

This leads to some interesting uses of the AB test!

Does the modification of a company website increase the number of online purchases? Does a new drug result in a lower mortality rate? These are just two examples of the kinds of questions that can be addressed with AB testing, a procedure popular not only in business and medical clinical trials, but also in fields such as psychology, neuroscience, and biology.

For instance, suppose a programmer alters code that should leave the appearance of a website unaffected. An AB test may be conducted to confirm that the code changes did not lead to unintended consequences.

Usually, we only AB test between one treatment and one control, what if there’s more than one treatment? That would be called multivariate AB testing.

Why do we do AB testing?

Spotify put it very simply:

To learn what works and what doesn’t. The learnings give us insights and fuel new product ideas

Focus on the word “learn”. That’s the only objective in an AB test, it is to learn. What do you do after learning? You apply it. Hence, the direct outcome of an AB test is the learnings that have been generated but in order to make an impact, you have to apply what you’ve learned. You don’ t do an AB test and expect the treatment to magically work after it, you do an AB test because you wanted to learn what works and what doesn’t.

AB testing is the gold standard method for causal inference as they’re on the highest levels of the evidence ladder by providing the clearest evidence.

The key to unlocking maximum business value is to understand causality, or in simple explanation: finding out why certain things happen and the causes of it.

How do we do AB testing?

This section will cover the essential steps in getting AB testing right and we will be utilizing the Frequentist Approach in AB testing which is the classic form of AB testing. Although there are certain things mentioned in our steps that certain party X might agree but party Y might not agree but at the end of the day, what we want is to maximize the confidence that we have in our AB test’s result and knowing that we have performed the necessary steps prior to the launch of the AB test.

Formulating our hypothesis and Determine Success Metrics

We will state our null and alternative hypotheses and success metrics:

Null hypothesis: States that there is no difference between the control and variant group. For example, would be that the two designs Control and Variant have the same efficacy, i.e. that they produce an equivalent click-through rate, or average revenue per user, etc.
Alternative Hypothesis: States that there is a difference between the control and variant group. Success Metric: The metric that will be impacted by supposed treatment. For example, it could be the first activation rate, conversion rate etc. The success metric will be used to derive p-value for treatment [More on p-value later].

Create Control and Variation Groups

We will determine the number of variation groups (including Control) based on the number of treatments to be administered. Once it has been determined, it would be used to calculate the required sample size for each variation. Why?

It’s essential that we determine the minimum sample size for your AB test prior to conducting it so that we can eliminate under coverage bias, bias from sampling too few observations.

Under coverage bias explanation by Formplus:

In other words, undercoverage happens when a significant entity in your research population has an almost-zero probability of getting selected into the research sample. For example, let’s say you’re conducting a product evaluation survey via Formplus to find out what users think about a product. To accurately gather data for this research, you’ll need to collect feedback from both new and existing users of the product. If any of these groups are excluded or poorly represented in your data sample, then your survey will suffer from undercoverage bias.

A big sample size doesn’t mean it would lead to statistical significance for our treatment but just helps us to have stronger confidence in our result and reduce biases. Only a strong treatment leads to statistical significance.

Calculating Required Sample Size for Groups and Determining Duration of Test

We can use a calculator to determine the required sample size and duration of testing: Calculator. The test should be run for at least 2 weeks to capture two business cycles from the weekends.

From the previous step, we should know how many groups including control will be created. Thus, we need to know how many users/visitors are required to allocate to them. This is called finding sample size.

In order to obtain meaningful results, we want our test to have sufficient statistical power. And, sample size influences statistical power. More sample size allows more sampling and allows us to gauge the power clearly. So, which means we need to get the appropriate sample size for our test through calculation!

Statistical Power

Most of the time, our distribution would have close or high overlaps between variation and control. Thus, we need to maximize be able to gauge the small differences with high confidence. Statistical power is the probability that the test rejects the null hypothesis when it should be rejected (correctly reject it). It is basically 1 minus beta. A common value for statistical power is 0.80 (so beta is 0.20). Used in accessing and controlling Type I and II errors. Type I error happens when we reject the null hypothesis when it should not be rejected. Type I error rate is the probability when Type I error happens, also known as significance level, or alpha. A common value for alpha is 0.05. This is called reliability. Type II error happens

A calculator can be used to derive sample size and duration with inputs:

Number of groups (including control) Statistical power (80% as standard)
Confidence level or P-value (95% or 0.05 respectively)
Weekly visitors/traffic/users (Approx.)
Weekly conversion (Approx.)

If the Bayesian approach was utilized: There’s no need to calculate the sample size and the minimum duration should still be kept at 2 weeks.

Last Step, Conduct Test and Analyze Result

No result peeking allowed! Just don’t do it! Read the FAQ if you want to know why.

Once the appropriate sample size and duration has been calculated. It is time to launch the test!

Visitors/users should be assigned randomly (Random Sampling) into the control or variant groups during or before the test. Randomness ensures that both groups are clones “on average”. This enables you to deduce causal estimates from AB tests because the only way they differ is the treatment.

Randomization or Random Sampling

Random sampling is one most common sampling techniques. Each sample in a population has an equal chance of being chosen. Random sampling is important in hypothesis testing because it eliminates sampling bias, and it’s important to eliminate bias because you want the results of your AB test to be representative of the entire population rather than the sample itself.

Once we’ve reached the required calculated duration, the result can be inputted into the calculator to calculate its statistical significance.

In order to determine the winner, we need to assess the statistical significance of our observations.

We will reject the Null Hypothesis if the p-value is < 0.05, and select the variant as the winner. Even though the null hypothesis has been rejected, it does not immediately establish an alternative hypothesis to be true but rather, we will evaluate it based on the statistical power to decrease false negatives

Statistical Significance and P-value explained

Assuming there is no underlying difference between Control and Variant, how often will we see a difference as we do in the data just by chance? The answer is P-value. Statistical significance is then determined by the p-value. Simply put, a low significance level means that there’s a big chance that our “winner” is not a real winner. Insignificant results carry a larger risk of false positives (known as Type I errors). Note: The p-value does not tell you the probability that B is better than Control as it does not tell you how big/small is the differences. The pvalue is just the probability of seeing a result (or a more extreme one) given that the null hypothesis is true. Or, “How surprising is this result?” or “Something weird is going on here”.

So, When Do You do AB Testing?

Tal Raviv, a product manager from Patreon gave a great checklist on the appropriateness of running AB test to avoid overusing it and not getting the right feedback. If we can provide justification based on the checklist given, we shall go ahead with the AB test

And cherry-picking some guidelines from the material given, here are some useful guidelines:

Run your experiment for the full length.
Monitor multiple metrics, but have one goal
1. The more metrics you test, the higher your chances of false positives
Don’t run tons of variants
1. The more variants, the fewer people in each one, and the lower your ability to detect a statistical effect
2. Stick to only a treatment and control most of the time, and don’t ever do more than three variations
Don’t segment after the fact looking for differences.
1. Do this with care and only if we have enough sample sizes for the segments. For example, if we only have a conversion of 87 vs 92 in two segments, it doesn't make sense to say that there’s something alarming in the result.
2. Avoid the temptation to splice up the data after the fact to find a new hypothesis
3. If you believe there's a difference outside of your hypothesis, run a separate test

Conclusion

In conclusion, running A/B tests effectively requires careful planning and adherence to best practices. By following a structured checklist and guidelines, such as running experiments for the full duration, focusing on one primary goal, and limiting the number of variations, we can ensure more reliable results. It's crucial to avoid over-segmenting data and to resist the urge to find patterns after the fact. Instead, if new hypotheses arise, they should be tested separately. By maintaining a disciplined approach, we can make informed decisions that truly benefit our projects.

FAQ

Can we change the experiment settings after the exp started?

No. Once the experiment is live, product teams cannot change any aspect of it. This is done so that the experiment results are not skewed and reduces the chance of giving false results. Users aren't allowed to be exposed to any modification of experiments until it has reached its natural end-state or it is aborted manually.

Will every experiment reach statistical significance?

No, not every experiment will reach statistical significance. There may be cases when the predetermined experiment duration has been reached but the Control and Treatment have no clear winner. This means that there is no significant difference to be found. And if we don’t reach statistical significance, we shall not segment after the fact looking for differences.

My experiment has not reached statistical significance, what can I do?

If the experiment has not reached statistical significance even after the duration of the experiment, then the experiment is not conclusive. Running the experiment for a more prolonged time is not allowed as that will skew the results. In this case, the product team has 3 options:

If the product team has a strong preference for the B variant, it can move 100% of traffic from A to B (also known as “non-statistical win”)
The product team might revise their hypothesis and run a new test
The product team can decide to stay with control and mark the hypothesis as false.

Is it advisable to see experiment results before the experiment ends?

This is called result peeking. No, it is not advised to see the results before the end of the experiment because the result is not final and the experiment has not yet reached a statistically significant point. Seeing the result is fine, but acting upon the results (prematurely ending the experiment before it reaches its predetermined natural end-state) is a wrong practice of A/B testing and will make the results invalid.

Does selecting the right success metrics matter?

It does. Success metrics should be evaluated very carefully as a success metric with bad intentions will yield a biased result. Eg. Using “stars” acquired as a success metric to compare between Control and Treatment whereby the control group will never get a “stars” so the treatment group will always be deemded a winner.

References

Informed Bayesian Inference for the A/B Test arxiv.org/pdf/1905.02068.pdf
engineering.atspotify.com/2020/10/29/spotif..
shopify.engineering/using-quasi-experiments..
When: reforge.com/brief/please-please-don-t-a-b-t..
Guidelines: reforge.com/brief/12-data-science-guideline..
cxl.com/blog/ab-testing-statistics
youtube.com/watch?v=vemZtEM63GY&t=394s
youtube.com/watch?v=Rsc5znwR5FA (It's good, trust me)
youtube.com/watch?v=J6kqvWnUE2Q&t=722s
dynamicyield.com/lesson/bayesian-testing
stats.idre.ucla.edu/other/mult-pkg/faq/gene..
cxl.com/blog/one-tailed-vs-two-tailed-tests

dynamicyield.com/glossary/probability-to-be..

All You Need To Know about AB Testing

Table of contents