Experimentation is a great way to boost performance. It’s becoming easier and easier to run tests on website design, marketing copy, product features and much else besides. Amazon is famous for running thousands of experiments day-in day-out, constantly learning and improving their products.
Tools like Optimizely’s Web Experimentation allow non-statisticians to easily run controlled experiments on websites to optimise KPIs such as dwell times, click-through rates or sales.
The outputs these tools provide can often be confusing. One such output is called a p-value. A p-value is a statistical concept which, like all statistical concepts, can be a little tricky to get your head around if you don’t work with statistics every day.
In this post we unpack in simple terms what p-values are and how to interpret them.
A/B tests can be used to determine the best attributes for your website. A/B tests are a great way to boost website performance and to understand what ideas, colours and distractions your target audience responds well to.
An A/B test is an example of a ‘two-sample hypothesis test’: that is, a statistical test that compares average values from two samples.
For example, is the sales conversion rate higher for email campaign group A than for group B?
Two samples are collected, for example by sending web visitors to two different versions of the website at random, or randomly selecting recipients for one of two email campaigns. Summary stats are calculated, for example average conversion rates. These stats are then compared.
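To make the steps above concrete, here is a minimal Python sketch of random assignment and the summary stat. The function names (assign_groups, conversion_rate), the visitor IDs and the fixed seed are all hypothetical, chosen for illustration; a tool like Optimizely handles this for you and may well do it differently.

```python
import random

def assign_groups(visitor_ids, seed=42):
    """Send each visitor to version A or B at random (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    groups = {"A": [], "B": []}
    for visitor in visitor_ids:
        groups[rng.choice("AB")].append(visitor)
    return groups

def conversion_rate(conversions, visitors):
    """Summary stat: the share of visitors who converted."""
    return conversions / visitors if visitors else 0.0

groups = assign_groups(range(1000))  # 1,000 made-up visitor IDs
print(len(groups["A"]), len(groups["B"]))  # roughly, but not exactly, 500 each

# The rates used in the example in this post: 5% for A, 10% for B
print(conversion_rate(5, 100), conversion_rate(10, 100))
```

Because the assignment is random, the two groups will rarely be exactly the same size, which is fine: the comparison is between rates, not raw counts.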
Making judgements from data
You’ve run your experiment and collected the data. Your old site design (version A) gets a sales conversion rate of 5%; your new design (version B) gets a sales conversion rate of 10%.
In this experiment version B beats version A by 5 percentage points. But that does not necessarily mean B will beat A every time such an experiment is run. The result could have been due to random chance. What if you only had 20 visitors to your website in each sample? Is that enough evidence to make a decision?
Statistics is all about how you make these judgements. If you flip a coin five times, and each time you get heads, do you conclude the coin must be biased? Intuitively you know the larger the sample size, the more times you flip the coin, the more confident you can be in your judgement.
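The coin example can be worked out directly. The sketch below, assuming a fair coin, computes the chance of getting at least a given number of heads; the function name is made up for illustration.

```python
from math import comb

def prob_at_least_k_heads(n_flips, k_heads, p_heads=0.5):
    """Chance of k_heads or more heads in n_flips (binomial distribution)."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(k_heads, n_flips + 1)
    )

print(prob_at_least_k_heads(5, 5))    # 0.03125: five heads out of five
print(prob_at_least_k_heads(10, 10))  # 0.0009765625: ten heads out of ten
```

A fair coin lands heads five times in a row about 3% of the time, so five flips is suggestive but hardly conclusive. Ten heads in a row happens less than 0.1% of the time, which is much stronger evidence of bias. Note this counts runs of heads only; if a run of tails would surprise you just as much, the figures double.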
And p-values, in a neat single figure, help you make this call. A p-value is the probability you would have gotten a difference as big as or bigger than the one observed across the two samples if no difference truly exists.
Assume for a moment there is in fact, across all potential visitors to your site, no preference for one version or the other. That is, if you took thousands of samples of visitors over and over again for months you’d find on average an equal chance of conversion to sales for both the A and the B version of your site.
But any one of those samples will probably show some difference, just by chance. This is called sampling variation, and the probability of getting a sample with a difference at least as big as the one you observed is the p-value.
If the p-value is small, say 1%, that means a difference as big as the one you observed would turn up by random accident only 1% of the time if your two site versions truly performed the same. Therefore, you conclude there IS a difference.
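One way to see sampling variation is to simulate it. The sketch below assumes both versions share the same true conversion rate (the pooled rate from the experiment), reruns the experiment many times, and counts how often chance alone produces a gap at least as big as the observed one. The function name, the number of simulations and the visitor counts are assumptions for illustration: 5% vs 10% is taken as 1 vs 2 conversions out of 20 visitors, and as 50 vs 100 out of 1,000.

```python
import random

def simulated_p_value(conv_a, conv_b, n_per_group, n_sims=2000, seed=1):
    """Estimate the p-value by simulating experiments with no true difference."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    observed_gap = abs(conv_b - conv_a)
    # Under "no difference", both versions convert at the same pooled rate
    pooled_rate = (conv_a + conv_b) / (2 * n_per_group)
    hits = 0
    for _ in range(n_sims):
        sim_a = sum(rng.random() < pooled_rate for _ in range(n_per_group))
        sim_b = sum(rng.random() < pooled_rate for _ in range(n_per_group))
        if abs(sim_b - sim_a) >= observed_gap:
            hits += 1
    return hits / n_sims

p_small = simulated_p_value(1, 2, n_per_group=20)       # 20 visitors each
p_large = simulated_p_value(50, 100, n_per_group=1000)  # 1,000 visitors each
print(p_small, p_large)
```

The same 5% vs 10% result gives a large p-value with 20 visitors per version and a very small one with 1,000, which is the intuition about sample size from the coin example put into numbers.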
It is typical to use the threshold of 5%: if your p-value is lower than 5% you conclude the difference observed in your experiment is statistically significant.
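As a rough sketch of how that 5% rule might be applied, the code below uses a two-proportion z-test, one common way to get a p-value when comparing conversion rates. This is an assumption for illustration, not necessarily the method your testing tool uses, and the normal approximation behind it is unreliable for very small samples. The function names and visitor counts are made up.

```python
from math import erfc, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if there is no difference
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions or all conversions: nothing to compare
    z = (rate_b - rate_a) / std_err
    return erfc(abs(z) / sqrt(2))  # two-sided tail area of the normal curve

def is_significant(p_value, threshold=0.05):
    """Apply the conventional 5% threshold."""
    return p_value < threshold

p_20 = two_proportion_p_value(1, 20, 2, 20)           # 5% vs 10%, 20 visitors
p_1000 = two_proportion_p_value(50, 1000, 100, 1000)  # 5% vs 10%, 1,000 visitors
print(is_significant(p_20), is_significant(p_1000))   # False True
```

With 20 visitors per version the 5-point gap does not clear the threshold; with 1,000 per version it does comfortably.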
Statistics is a tricky but increasingly valuable subject. It is very easy to misinterpret results and make bad decisions without sufficiently understanding what the numbers are telling you.
DS Analytics are a data consultancy. We help our clients get value from their data. Get in touch to find out more or email us at firstname.lastname@example.org.