A/B Testing – Nonparametric tests
Most A/B testing platforms use Student’s t-test to test for statistical significance. However, this test has assumptions that need to be met, and it has some known shortcomings. This is where the Mann-Whitney U test comes in handy. It has fewer assumptions and a different set of shortcomings. Instead of using the data directly, this test converts every data point into a rank by pooling all test groups into one group and ranking the pooled values. It then analyzes the groups separately using those combined ranks.
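To make the ranking step concrete, here is a minimal sketch in R. The two small groups `a` and `b` are made-up values chosen for illustration, not data from this post. The sketch pools the groups, ranks the pooled values, and computes the U statistic for the first group from its rank sum.

```r
# Two small hypothetical groups (illustrative values only)
a <- c(1.2, 3.4, 2.2, 5.0)
b <- c(2.9, 6.1, 4.8, 7.3)

# Rank all observations together, then pick out the ranks of group a
combined_rank <- rank(c(a, b))
rank_a <- combined_rank[seq_along(a)]

# U statistic for group a: its rank sum minus the smallest possible rank sum
u_a <- sum(rank_a) - length(a) * (length(a) + 1) / 2

# wilcox.test reports the same quantity as its statistic W
w <- unname(wilcox.test(a, b)$statistic)
c(u_a = u_a, w = w)
```

Only the ordering of the pooled values enters the calculation, which is why the actual magnitudes of the data points do not matter to this test.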
Normality of Mean Assumption
This is not the same as assuming that the data itself is Normal. The assumption is that if we hypothetically repeated this test many times and computed the mean each time, those means would follow a Normal distribution. The Central Limit Theorem is what justifies this: as you get more and more data points, the distribution of the sample mean becomes closer and closer to Normal. This holds for any data with finite variance, even if the data itself is not Normally distributed. However, the further your data is from Normal (for example, if it is heavily skewed), the more data you need before the Central Limit Theorem approximation becomes accurate. More details about this are given in the links below.
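A small simulation can illustrate how slowly the mean of skewed data becomes Normal. This is only a sketch: the helper names `skewness` and `sample_means`, the sample sizes 5 and 500, and the 2000 repetitions are assumptions made for illustration. It draws from the same skewed Gamma distribution (shape=0.2, rate=10) used in the example further down and measures how asymmetric the distribution of the sample mean still is.

```r
set.seed(1)

# Sample skewness (third standardized moment); 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Means of `reps` repeated samples of size n from a skewed Gamma distribution
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(rgamma(n, shape = 0.2, rate = 10)))
}

skew_small <- skewness(sample_means(5))    # few data points per sample
skew_large <- skewness(sample_means(500))  # many data points per sample
c(n5 = skew_small, n500 = skew_large)
```

The skewness of the sample means should be much closer to zero for the larger sample size, which is the Central Limit Theorem at work.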
The Mann-Whitney U test does not require this assumption. Because it works on ranks, it discards almost all details of the specific distribution of the data.
So if the data is not normally distributed, the Mann-Whitney U test can be more “efficient” than the t-test, and it is almost as “efficient” when the data is normally distributed. (Reference)
Let’s look at an example, using the Gamma distribution with shape=0.2 and rate=10.
x <- seq(0, 3, length=100)
y <- dgamma(x, shape=0.2, rate=10)
plot(x, y, type="n", main="Gamma Density Function (shape=0.2, rate=10)")
lines(x, y)
We can see this distribution is not normally distributed. Let’s draw a sample of 1000 from this distribution, and another sample of 1000 from a slightly different distribution. We will perform both a t-test and a Mann-Whitney U test on the two samples and record the p-values. Let’s repeat this 100 times and find the mean of the p-values for each test.
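The helper used for this experiment is not shown in the post, so the following is a hypothetical reconstruction rather than the original code. The argument order `(n, shape1, shape2, rate)` is inferred from the later call `run_once(1000, 10, 10.4, 10)`, and the second shape of 0.24 is taken from the code in the section on ties.

```r
# Hypothetical reconstruction: draw two Gamma samples and return both p-values
run_once <- function(n, shape1, shape2, rate) {
  data1 <- rgamma(n, shape = shape1, rate = rate)
  data2 <- rgamma(n, shape = shape2, rate = rate)
  c(t.test(data1, data2)$p.value,       # row 1: t-test
    wilcox.test(data1, data2)$p.value)  # row 2: Mann-Whitney U test
}

set.seed(2015)
# One column per repetition, one row per test
p.value <- sapply(1:100, function(i) run_once(1000, 0.2, 0.24, 10))
mean(p.value[1, ])  # average t-test p-value
mean(p.value[2, ])  # average Mann-Whitney U p-value
```

The two averages printed next are the ones reported in the post. This sketch is not guaranteed to reproduce them digit for digit.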
## [1] 0.1681351
## [1] 0.004257654
With an average p-value of 0.17, the t-test does not reach significance at the usual 0.05 level. The Mann-Whitney U test does, with an average p-value of 0.004.
Let’s try this again, but with something that looks more Normally distributed. We will use a Gamma distribution with shape parameter=10 and rate parameter=10.
x <- seq(0, 3, length=100)
y <- dgamma(x, shape=10, rate=10)
plot(x, y, type="n", main="Gamma Density Function (shape=10, rate=10)")
lines(x, y)
Let’s run our experiment again.
p.value <- sapply(1:100, function(x) run_once(1000, 10, 10.4, 10))
mean(p.value[1,])
## [1] 0.04503111
mean(p.value[2,])
## [1] 0.05102875
We can see both tests give about the same significance. The average t-test p-value is 0.045 and the average Mann-Whitney U p-value is about 0.051.
Outliers
The t-test has problems dealing with outliers (Link). The Mann-Whitney U test doesn’t suffer from this problem because everything is converted into ranks: an outlier is treated no differently from any other value that is slightly larger than the rest.
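A tiny sketch shows why. The vectors `x` and `x_outlier` are made-up values for illustration: they differ only in their largest element, which is 3.1 in one and 1000 in the other.

```r
# Same data, except the largest value is replaced by an extreme outlier
x         <- c(0.5, 1.8, 2.4, 3.1)
x_outlier <- c(0.5, 1.8, 2.4, 1000)

# The means differ enormously...
mean(x)
mean(x_outlier)

# ...but the ranks are exactly the same, so a rank-based test sees no change
same_ranks <- identical(rank(x), rank(x_outlier))
same_ranks
```

The t-test works with means and variances, both of which the outlier distorts, while the ranks only record that the outlier is the largest value.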
Let’s revisit the example from the other post, but this time also run the Mann-Whitney U test. We will create two data sets with different means, then add a single outlier point to one of them.
set.seed(2015)
# Create a list of 100 random draws from a normal distribution
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal distribution
# with mean 2 and standard deviation 2
data2 <- rnorm(100, mean=2, sd=2)
# Perform a t-test on these two data sets
# and get the p-value
t.test(data1, data2)$p.value
## [1] 0.0005304826
# Perform a Mann-Whitney-U test on these two data sets
# and get the p-value
wilcox.test(data1, data2)$p.value
## [1] 0.0002706636
# append 1000 to the first data set only
data1 <- c(data1, 1000)
# Perform a t-test on these two data sets
# and get the p-value
t.test(data1, data2)$p.value
## [1] 0.369525
# Perform a Mann-Whitney-U test on these two data sets
# and get the p-value
wilcox.test(data1, data2)$p.value
## [1] 0.0004766358
We can see the Mann-Whitney U Test finds statistical significance before adding the outlier. After we add the outlier, the p-value increases slightly, but the result is still significant. The t-test has significance before the outlier, but after the outlier, the t-test loses significance.
Ties in the values
The Mann-Whitney U test works best if every value is unique. This is normally not a problem for continuous data, but if you have many zeros in your data, or you have count data, there will be many ties. There are various ways to handle ties, but the resulting p-values are then approximate rather than exact. Approximate isn’t necessarily bad, since the t-test is also approximate if the data is not normally distributed.
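As a minimal sketch of what ties look like in R, the example below uses made-up count data (`counts`, `a` and `b` are illustrative values, not data from this post). R’s `rank()` gives tied values the average of the ranks they occupy by default, and passing `exact = FALSE` to `wilcox.test` asks for the Normal approximation directly instead of an exact p-value.

```r
# Made-up count data with many ties
counts <- c(0, 0, 0, 1, 1, 3)

# Tied values share the average of the ranks they would occupy
tied_ranks <- rank(counts)
tied_ranks

# Two small made-up groups of counts with ties within and between groups
a <- c(0, 0, 1, 2, 0, 3)
b <- c(1, 2, 2, 4, 0, 5)

# exact = FALSE requests the Normal approximation, which is what ties require
res <- wilcox.test(a, b, exact = FALSE)
res$p.value
```

Here the three zeros each receive rank 2 and the two ones each receive rank 4.5, so tied observations can no longer be ordered relative to each other.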
Let’s repeat our first test, which found statistical significance with the Mann-Whitney U test (wilcox.test in R). This time, we will round off our values to create ties and see how the test performs.
set.seed(2015)
run_once <- function() {
  # Draw 1000 values from a Gamma distribution with shape=0.2 and rate=10
  data1 <- rgamma(1000, shape=0.2, rate=10)
  # Draw 1000 values from a Gamma distribution with shape=0.24 and rate=10
  data2 <- rgamma(1000, shape=0.24, rate=10)
  # Mann-Whitney U test on the unrounded data
  p_raw <- wilcox.test(data1, data2)$p.value
  # Mann-Whitney U test after rounding to the nearest 1/500 (0.002)
  p_500 <- wilcox.test(round(data1*500)/500, round(data2*500)/500)$p.value
  # Mann-Whitney U test after rounding to the nearest 1/100 (0.01)
  p_100 <- wilcox.test(round(data1*100)/100, round(data2*100)/100)$p.value
  # Mann-Whitney U test after rounding to the nearest 1/25 (0.04)
  p_25 <- wilcox.test(round(data1*25)/25, round(data2*25)/25)$p.value
  # Mann-Whitney U test after rounding to the nearest 1/5 (0.2)
  p_5 <- wilcox.test(round(data1*5)/5, round(data2*5)/5)$p.value
  # One p-value per rounding level, from finest to coarsest
  c(p_raw, p_500, p_100, p_25, p_5)
}
p.value <- sapply(1:100, function(x) run_once())
# mean p-value, unrounded (continuous) data
mean(p.value[1,])
## [1] 0.004257654
# mean p-value, data rounded to the nearest 1/500
mean(p.value[2,])
## [1] 0.01127112
# mean p-value, data rounded to the nearest 1/100
mean(p.value[3,])
## [1] 0.02937118
# mean p-value, data rounded to the nearest 1/25
mean(p.value[4,])
## [1] 0.08112481
# mean p-value, data rounded to the nearest 1/5
mean(p.value[5,])
## [1] 0.3527484
We can see here that the average p-value is 0.004 before we start rounding off the data. As we round off the data more and more, we create more and more ties and we can see that we lose significance.
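To see how many ties each rounding level creates, we can count the distinct values that remain after rounding. This is a sketch under the same assumptions as the experiment above (1000 draws from a Gamma distribution with shape=0.2 and rate=10). The names `levels_k` and `n_unique` are introduced here for illustration.

```r
set.seed(2015)
x <- rgamma(1000, shape = 0.2, rate = 10)

# Rounding levels used in the experiment: nearest 1/500, 1/100, 1/25 and 1/5
levels_k <- c(500, 100, 25, 5)

# Number of distinct values left after rounding at each level
n_unique <- sapply(levels_k, function(k) length(unique(round(x * k) / k)))
names(n_unique) <- paste0("1/", levels_k)
n_unique
```

The fewer distinct values remain, the more observations share the same rank, and the less information the test has to separate the two groups.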
Additional Information
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
http://www.statisticalengineering.com/central_limit_theorem.htm
http://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/
Conclusion
If your data is not Normally distributed or contains outliers, consider using the Mann-Whitney U test.