A/B Testing - Common Mistakes - Outliers
A/B Testing: Common Mistakes
Exploratory Data Analysis: Outliers
Outliers can easily cause problems with your A/B test. You may have seen strange anomalies in your metrics, with one metric far higher or lower than the others. You may have seen your statistical test start out significant and then become non-significant. These problems may be caused by outliers in your data.
Let's look at an example in R. We will create two data sets, the first drawn from a distribution with a smaller mean than the second.
set.seed(2015)
# Create a list of 100 random draws from a normal distribution
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal distribution
# with mean 2 and standard deviation 2
data2 <- rnorm(100, mean=2, sd=2)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.0005304826
We can see that this t-test gives a p-value of about 0.0005, so the difference is significant at the 99.95% confidence level. Now let's add a single outlier to the first data set.
# append 1000 to the first data set only
data1 <- c(data1, 1000)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.369525
As you can see, even though we had 100 points in each data set, a single large outlier made the difference non-significant: the p-value rose to about 0.37, which corresponds to a confidence level of only 63%.
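To see why a single point has this much effect, it helps to look at the two quantities the t-test is built from: the sample mean and the sample standard deviation. The sketch below regenerates the same data (same seed) and compares them before and after adding the outlier. The names `data1_with_outlier` and `summary_table` are illustrative and not part of the original analysis.

```r
set.seed(2015)
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=2, sd=2)
# Same outlier as in the example above
data1_with_outlier <- c(data1, 1000)

# Compare mean and standard deviation of each sample
summary_table <- data.frame(
  mean = c(mean(data1), mean(data1_with_outlier), mean(data2)),
  sd   = c(sd(data1), sd(data1_with_outlier), sd(data2)),
  row.names = c("data1", "data1 + outlier", "data2")
)
print(summary_table)
```

The outlier pulls the mean of the first data set up by roughly 10, but it inflates the standard deviation far more. The t statistic divides the difference in means by a standard error built from these standard deviations, so the inflated spread swamps the difference and the p-value rises.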
How do we fix this?
There are multiple ways to fix this problem. Student’s t-test is not robust against outliers, so we can run a Mann-Whitney U test instead. A deeper discussion of this approach is outside the scope of this article.
We can also detect the outliers and consider removing them. This approach must be taken very carefully: we should only remove data that does not come from our target population. If we see a single outlying data point, we may simply have been unlucky to draw such a strange value. However, we may equally have been unlucky to draw only one such value, in which case it reflects real behavior. Outliers should therefore be removed only after careful examination. For A/B testing, this usually means removing data that comes from bots rather than humans. This can be difficult because not all bots and scripts report their User Agent properly.
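As a brief illustration only, base R provides the Mann-Whitney U test through `wilcox.test` (it is the same test as the Wilcoxon rank-sum test). The sketch below reruns the earlier example with the outlier included. The names `p_t` and `p_mw` are illustrative.

```r
set.seed(2015)
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=2, sd=2)
# Same outlier as in the example above
data1 <- c(data1, 1000)

# t-test: sensitive to the magnitude of the outlier
p_t <- t.test(data1, data2)$p.value
# Mann-Whitney U test: works on ranks, so 1000 only counts as "the largest value"
p_mw <- wilcox.test(data1, data2)$p.value
print(c(t_test = p_t, mann_whitney = p_mw))
```

Because the rank-based test only sees the outlier as the largest observation, its p-value should stay close to what it would be without that point.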
Let's look at some diagnostic tools in R. We will create 100 points from the same distribution and then add 2 outliers to the data set.
set.seed(2015)
# Create a list of 100 random draws from a normal distribution
# with mean 1 and standard deviation 2
data <- rnorm(100, mean=1, sd=2)
# Add two outliers
data <- c(data, 20, 40)
# Create an image with two plots side by side
par(mfrow=c(1, 2))
hist(data)
boxplot(data, main="Box plot of data")
On the left is a histogram. We can clearly see the outlier at 40; whether 20 is an outlier is more questionable. In the box plot on the right, the box spans the 25th to the 75th percentile of the data, so it contains the middle 50%. The thick line inside the box is the median. The “whiskers” at the top and bottom extend to the minimum and maximum of the data, excluding “outliers”. By default, these are points more than 1.5 times the box height beyond the box, and they are drawn individually. A good explanation of outliers in box plots in R can be found at the bottom of this page: http://msenux.redwoods.edu/math/R/boxplot.php
So now we have found a few outliers in our data. Remember, it is important to consider each point carefully before removing it, since we could easily have seen more data near that point rather than only one observation. One technique is to perform the test again with the point removed. If the test gives the same result, we might as well leave the data point in.
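The points that the box plot draws individually can also be listed numerically. A minimal sketch, using base R's `boxplot.stats` with its default rule (the same 1.5 coefficient that `boxplot` uses by default); `flagged` is an illustrative name:

```r
set.seed(2015)
data <- rnorm(100, mean=1, sd=2)
data <- c(data, 20, 40)

stats <- boxplot.stats(data)
# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)
# Points lying beyond the whiskers
flagged <- stats$out
print(flagged)
```

Both added points, 20 and 40, fall beyond the upper whisker. Any other values listed are ordinary draws that happen to land past the whiskers, which is a reminder that this rule flags candidates for inspection rather than proving that a point is bad.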
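One way to sketch this check is to rerun the test with and without the suspect point and compare the p-values. The example reuses the two-sample data from the first example. The helper `p_value_without` is a hypothetical name introduced here for illustration.

```r
set.seed(2015)
data1 <- rnorm(100, mean=1, sd=2)
data2 <- rnorm(100, mean=2, sd=2)
data1 <- c(data1, 1000)

# Hypothetical helper: p-value of the t-test with the observations at
# positions `idx` dropped from the first sample
p_value_without <- function(x, y, idx) {
  t.test(x[-idx], y)$p.value
}

# Treat the point farthest from the median as the suspect
suspect <- which.max(abs(data1 - median(data1)))
p_with <- t.test(data1, data2)$p.value
p_without <- p_value_without(data1, data2, suspect)
print(c(with_point = p_with, without_point = p_without))
```

Here the conclusion flips when the point is removed, so this particular point deserves a close look before deciding whether it belongs to the target population.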
Outliers in more than one dimension
If your data contains two variables, there is another type of outlier to look for. Let's look at this plot. It again has 100 points, but with two correlated variables, plus a single added outlier.
set.seed(2015)
# Create a list of 100 random draws from a normal distribution
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
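# Create a second variable that is correlated with the first.
# Note: `correlation` acts as a mixing weight here; the resulting correlation
# between data1 and data2 is high, but not exactly 0.95.
correlation <- 0.95
data2 <- correlation * data1 + sqrt(1-correlation) * rnorm(100, mean=1, sd=2)
# Add our outlier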
data1 <- c(data1, -3)
data2 <- c(data2, 4)
par(mfrow=c(1, 1))
plot(data1, data2)
Here most of the data lies close to the diagonal running from lower left to upper right. A single point sits in the upper left of the plot. In either dimension on its own, that point is well within the range of the data, but in two dimensions combined it becomes an outlier.
We can find this point by computing a quantity called leverage. Although leverage is generally used to find outliers in linear regression, we can use it here to help detect some outliers.
# create a matrix with our two data sets
data_matrix <- matrix(c(data1, data2), nrow=101, ncol=2)
tail(data_matrix)
## [,1] [,2]
## [96,] 0.1888523 0.37660990
## [97,] -2.3504425 -2.08643945
## [98,] 0.9110532 0.06500559
## [99,] -1.0946785 -1.09385952
## [100,] -2.4602479 -2.40095114
## [101,] -3.0000000 4.00000000
# leverage is also known as hat values
leverage <- hat(data_matrix)
tail(leverage)
## [1] 0.01121853 0.03700946 0.02901691 0.02292416 0.04241248 0.71791422
Above are the last 6 rows of the matrix, with our outlier as the last point, followed by the corresponding leverage values. You can see the very high leverage value for the last point. As a rule of thumb, leverage values that exceed twice the average leverage should be examined more closely. However, an A/B test has many observations and a wide range of leverage values. In that case, I would start by examining the highest-leverage points and work my way down.
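The rule of thumb can be sketched in a few lines. This example rebuilds the same two variables and flags every row whose leverage exceeds twice the average, then orders the flagged rows from highest leverage down. The names `threshold`, `flagged`, and `review_order` are illustrative.

```r
set.seed(2015)
data1 <- rnorm(100, mean=1, sd=2)
correlation <- 0.95
data2 <- correlation * data1 + sqrt(1-correlation) * rnorm(100, mean=1, sd=2)
data1 <- c(data1, -3)
data2 <- c(data2, 4)

# hat() adds an intercept column by default, as in a linear regression
leverage <- hat(cbind(data1, data2))

# Rule of thumb: examine points above twice the average leverage
threshold <- 2 * mean(leverage)
flagged <- which(leverage > threshold)

# Review the flagged rows starting from the highest leverage
review_order <- flagged[order(leverage[flagged], decreasing=TRUE)]
print(data.frame(row = review_order, leverage = leverage[review_order]))
```

The added outlier (row 101) comes out at the top of the review list. Any other rows that appear are simply the most extreme of the ordinary points, and they are candidates for inspection, not automatic removals.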
Alternatives
As mentioned above, Student's t-test is not very robust to outliers. Other tests are more robust because they are based on the rank of each observation instead of its actual value. You can start by looking at the Mann-Whitney U test and enter the world of non-parametric statistics:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
http://en.wikipedia.org/wiki/Nonparametric_statistics
http://www.originlab.com/index.aspx?go=Products/Origin/Statistics/NonparametricTests
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Nonparametric/BS704_Nonparametric2.html
Additional Information
Here are some other links about leverage and other types of outlier detection:
http://en.wikipedia.org/wiki/Leverage_(statistics)
http://en.wikipedia.org/wiki/Outlier
http://onlinestatbook.com/2/regression/influential.html
http://pages.stern.nyu.edu/~churvich/Undergrad/Handouts2/31-Reg6.pdf
Conclusion
Bots and scripts can cause problems in your A/B tests. It is important to try to detect these users in your data and remove them since they do not represent your target population.