A/B Testing: Common Mistakes
Running more than one A/B test at the same time
When you are running an A/B test for a new feature, you wait to reach that magical 95% confidence level so you have statistical significance. You declare a victor and update your site accordingly. When you are running many A/B tests for multiple features, you wait for each of your tests to reach that magical 95% confidence level so you've reached statistical significance for each test. You declare the victors and update your site accordingly. Unfortunately, you're actually not done. If you haven't already, please read this previous post first, which explains how a statistical test works.
What changes if I’m running many A/B tests for different features?
If there is a single A/B test running, 95% significance means that a difference at least as large as the one we observed will only happen 5% of the time by chance, assuming the treatment has no effect. However, if we have many A/B tests running, it is more likely that at least one of them will show a large difference just by chance. If we use a 95% significance level for each test individually, we have a larger than 5% chance that some test shows a large difference and we accidentally reject the null hypothesis.
Let's run some examples in R. In the previous post, we created two test sets, both drawn from the same distribution, and saw how often we would have rejected the null hypothesis. With a 95% confidence level, we found that, as expected, we rejected the null hypothesis in about 5% of the experiments.
Now, we will create more test data sets from the same distribution. But this time, we will create 8 different data sets and run 4 tests at the same time. Let's see how often at least one of the tests fails the t-test.
set.seed(2015)
run_experiment_once <- function(x) {
# Test 1 and its p-value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_1 <- t.test(data1, data2)$p.value
# Test 2 and its p-value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_2 <- t.test(data1, data2)$p.value
# Test 3 and its p-value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_3 <- t.test(data1, data2)$p.value
# Test 4 and its p-value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_4 <- t.test(data1, data2)$p.value
# Since we only want to find if any of the tests fail, we only
# need to return the most significant test
min(p_value_1, p_value_2, p_value_3, p_value_4)
}
# sapply will repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)
# "< 0.05" will compare the result of each experiment (the p-value) with 0.05 which will
# create a list of "TRUE" and "FALSE" values
reject.null.hypothesis <- result < 0.05
# sum() will add up the "TRUE" and "FALSE" values where TRUE=1 and FALSE=0. So this gives
# the number of "TRUE" values
true.count <- sum(reject.null.hypothesis)
# Finally, divide by 10,000 to get the percentage
true.count / 10000
## [1] 0.185
We can see that, with 4 simultaneous tests, we reject at least one test 18.5% of the time at a 95% significance level, not the 5% we might expect. This matches the theoretical family-wise error rate for 4 independent tests: each test has a 95% chance of correctly not rejecting, so the chance that at least one rejects is 1 - 0.95^4 ≈ 18.5%.
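We can also compute this family-wise error rate directly instead of simulating it. A quick sketch, assuming the four tests are independent:
# Probability that at least one of m independent tests at the 5% level
# rejects by chance when the null hypothesis is true for all of them
alpha <- 0.05
m <- 4
1 - (1 - alpha)^m
# 1 - 0.95^4 = 0.1855, which matches the simulated 18.5% above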
What do I do now?
There are multiple ways to fix this problem. The easiest is the Bonferroni correction. Let's say we are running 4 A/B tests. Instead of looking for a 100% - 5% = 95% confidence level, we now look for a 100% - (5% / 4) = 98.75% confidence level. That is, we divide the 5% by the number of tests and compute a new confidence level to test against. This is a conservative correction, meaning that we are less likely to reject than necessary, but it is very easy to compute. Depending on the exact situation, there are other corrections that are less conservative, but they are outside the scope of this article.
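In practice, R's built-in p.adjust() function can apply the Bonferroni correction (and less conservative alternatives) for you. A minimal sketch, using made-up p-values from 4 hypothetical tests:
# Hypothetical p-values from 4 simultaneous A/B tests (made-up numbers)
p_values <- c(0.030, 0.200, 0.011, 0.650)
# Option 1: compare the raw p-values to the Bonferroni threshold 0.05 / 4
p_values < 0.05 / length(p_values)
# Option 2: equivalently, adjust the p-values and compare to the usual 0.05
p.adjust(p_values, method = "bonferroni") < 0.05
Both options flag the same tests as significant; only the third hypothetical test survives the correction.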
So let's repeat the previous experiment where we ran 4 simultaneous A/B tests, but this time we apply the Bonferroni correction and look for a 98.75% significance level.
set.seed(2015)
run_experiment_once <- function(x) {
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_1 <- t.test(data1, data2)$p.value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_2 <- t.test(data1, data2)$p.value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_3 <- t.test(data1, data2)$p.value
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=1, sd=2)
p_value_4 <- t.test(data1, data2)$p.value
min(p_value_1, p_value_2, p_value_3, p_value_4)
}
# The Bonferroni correction is applied here: 0.05 / 4 = 0.0125
sum(sapply(1:10000, run_experiment_once) < 0.0125) / 10000
## [1] 0.0506
Perfect! By using a Bonferroni correction, we reject at least one experiment only about 5% of the time, which falls in line with our original 95% confidence level.
But what happens as the number of tests change over time?
Say you have 10 tests running right now, so you're using a 100% - (5% / 10) = 99.5% confidence level. What if 8 of your tests end, leaving you with 2? If you update your confidence level to 100% - (5% / 2) = 97.5%, suddenly one of the two remaining tests might become statistically significant! Personally, I would suggest staying conservative and keeping the 99.5% level for those two tests. In other words, use the highest confidence level that each test required at any point during its life.
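As a quick illustration of how the corrected level moves with the number of simultaneous tests (the helper function below is just for this post, not a standard one):
# Bonferroni-corrected confidence level for m simultaneous tests
bonferroni_confidence <- function(m, alpha = 0.05) 1 - alpha / m
bonferroni_confidence(10)  # 0.995 -> 99.5% while 10 tests are running
bonferroni_confidence(2)   # 0.975 -> 97.5% if only 2 tests remain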
Additional Information
Here are some Wikipedia articles on correcting for multiple tests. I have also included a link about false discovery rates, which Optimizely uses in its new stats engine.
http://en.wikipedia.org/wiki/Familywise_error_rate
http://en.wikipedia.org/wiki/Bonferroni_correction
http://en.wikipedia.org/wiki/False_discovery_rate
Conclusion
If you are running many A/B tests, don’t forget to change your significance level. Otherwise, you’ll declare statistical significance when you don’t actually have it.