11 Apr

A/B Testing - Basics - Statistical Tests

If we have two data sets, each with lots of variance, how do we tell whether one is higher than the other on average? We can see that there is a difference between the averages of the two data sets. But is that difference real, or just due to chance? If we hypothetically ran this A/B test again, we would get two different data sets with two different averages, and a new difference between those averages. The difference might be smaller or larger. What would that mean? It turns out we can use this idea of repeated hypothetical experiments to determine whether the difference is real.

The Null Hypothesis

First, we assume that the treatment doesn't do anything. Then statistics tells us how often we would encounter our observed data just by chance. If our observed data would not happen often just by chance, then we have evidence that our assumption is incorrect and that the treatment does do something.

Let me try to explain it in more technical terms. Let's assume that both data sets are samples from the same distribution. This is called the Null Hypothesis, and we assume that it is true. Using statistics, we can estimate what would happen if we ran this experiment many times, getting different data sets each time, and looked at the difference between the averages of the two data sets. Statistics can then tell us how often a difference as large as (or larger than) the one in our actual test data would occur under the Null Hypothesis. If a difference that large does not happen often, we have evidence that our assumption is not true. We say that the no-effect assumption does not hold and we “reject the Null Hypothesis”.
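As a rough illustration of this idea, we can simulate many hypothetical experiments under the Null Hypothesis and count how often the simulated difference in averages is at least as large as the one we observed. This is not exactly what the t-test computes (the t-test uses the t distribution rather than simulation), and the numbers below (mean=1, sd=2, and an observed difference of 0.8) are made up purely for illustration.

set.seed(42)
# A made-up observed difference between the two averages,
# used only for this illustration
observed.diff <- 0.8
# Simulate 10,000 experiments where both groups really do come
# from the same Normal distribution (the Null Hypothesis)
null.diffs <- replicate(10000, {
    group1 <- rnorm(100, mean=1, sd=2)
    group2 <- rnorm(100, mean=1, sd=2)
    mean(group2) - mean(group1)
})
# Fraction of simulated experiments where the difference is at
# least as extreme as the observed one -- a simulation-based
# analogue of the p-value
mean(abs(null.diffs) >= observed.diff)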

Examples: Two different distributions

Let's look at some examples written in R. Let the first data set be 100 Normally distributed points with mean=1 and standard deviation=2. Let the second data set be 100 Normally distributed points with mean=2 and standard deviation=2. What does Student's t-test tell us?

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal
# distribution with mean 2 and standard deviation 2
data2 <- rnorm(100, mean=2, sd=2)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.0005304826

In this case, the data was actually created from two different distributions. We can see that, if we assume they came from the same distribution (the Null Hypothesis), the t-test says only 0.05% of the time would we observe data this far apart or further. That corresponds to a 99.95% confidence level. So we reject the Null Hypothesis and declare that the second data set is higher than the first.
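As a side note, if you want more than just the p-value, the same t.test result also contains the 95% confidence interval for the difference in means and the two sample means. A small sketch (output omitted here):

# The t.test result also contains a 95% confidence interval
# for the difference between the two means...
t.test(data1, data2)$conf.int
# ...and the two sample means themselves
t.test(data1, data2)$estimate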

Now, let's move the second data set closer to the first. Let's change its mean to 1.3 and keep the first data set's mean at 1.0.

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal
# distribution with mean 1.3 and standard deviation 2
data2 <- rnorm(100, mean=1.3, sd=2)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.3258681

Now, even though the data was created from two different distributions, the t-test shows that we have about a 33% chance of observing data this far apart or further under the Null Hypothesis. That is only a 67% confidence level. So we cannot reject the Null Hypothesis, and we declare that we don't have a winner yet.
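One reason we fail to reject here is statistical power: with 100 points per group, a true difference of 0.3 against a standard deviation of 2 is hard to detect. As a rough sketch, base R's power.t.test can estimate the power of this test at a 5% significance level (the reported power should come out low, somewhere around 20%):

# Estimate the power of a two-sample t-test with 100 points per
# group, a true difference of 0.3, and standard deviation 2
power.t.test(n=100, delta=0.3, sd=2, sig.level=0.05,
             type="two.sample", alternative="two.sided")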

Examples: A single distribution

Let's look at a different type of example. This time we will create two data sets, but draw both from the same distribution. We will repeat this experiment 10,000 times and see what happens.

set.seed(2015)
run_experiment_once <- function(x) {
    # Create a list of 100 random draws from a 
    # specific Normal distribution
    data1 <- rnorm(100, mean=1, sd=2)
    # Create a second list of 100 random draws from the 
    # same specific Normal distribution
    data2 <- rnorm(100, mean=1, sd=2)
    # Perform a t-test on these two data sets and get
    # the p-value
    t.test(data1, data2)$p.value
    # because this is the last expression, only the p-value
    # will be returned from this function
}

# sapply will repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)

# "< 0.05" will compare the result of the experiment
# (the p-value) with 0.05. This will create a list of
# "TRUE" and "FALSE" values
reject.null.hypothesis <- result < 0.05

# sum() will add up the "TRUE" and "FALSE" values where 
# TRUE=1 and FALSE=0. So this gives the number of "TRUE"
# values
true.count <- sum(reject.null.hypothesis)

# Finally, divide by 10,000 to get the percentage
true.count / 10000
## [1] 0.051

Even though the two data sets came from the same distribution, we still reject the Null Hypothesis 5.1% of the time. This falls in line with our 5% significance level (a 95% confidence level). Remember, we assume that the two distributions are actually equal, which was genuinely the case in this example. Then we determine how often a difference this large (or larger) occurs by chance, and we chose 5% as our cutoff.
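As a usage note, mean() of a TRUE/FALSE vector gives the fraction of TRUE values directly, so the same rate can be computed in one line. Reusing the result vector from above, we can also see how the rejection rate tracks whichever threshold we pick (exact values will vary slightly with the random seed):

# The fraction of experiments where p < 0.05 -- the same 5.1%
# rejection rate computed above
mean(result < 0.05)
# With a stricter threshold we wrongly reject the Null Hypothesis
# less often
mean(result < 0.01)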

Additional Information

Just in case my explanation wasn't quite your style, here are some other links:

http://en.wikipedia.org/wiki/Null_hypothesis

http://en.wikipedia.org/wiki/P-value

https://statistics.laerd.com/statistical-guides/hypothesis-testing-3.php

http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis

Frequentist Perspective

I just want to mention that this explanation follows Frequentist Statistics. This is what almost every introductory statistics class teaches. The other main branch is Bayesian Statistics. A larger discussion of the two branches of statistics is outside the scope of this article, but I have included a few links below:

http://stats.stackexchange.com/questions/22/bayesian-and-frequentist-reasoning-in-plain-english

http://www.quora.com/What-is-the-difference-between-Bayesian-and-frequentist-statisticians

http://simplystatistics.org/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential/

Conclusion

A statistical test tells us how often we would see data like ours across many hypothetical replicated experiments, assuming the treatment has no effect. If data like ours would not happen often under that assumption, we have evidence that the two groups are different.
