Whether you’re new to A/B testing or a seasoned practitioner, here are 10 things you should know about this essential method for quantifying the user experience.
- A and B refer to Alternate Versions: A/B testing is often called split-testing because you split a sample of users so that half use one version, arbitrarily called A, and the other half use the other version, arbitrarily called B. You can split-test comparable designs, competing products or old applications versus new applications. When possible you should split and test simultaneously instead of sequentially (such as running treatment A for a week then B for the next week), as seasonal variation, holidays, weather and all sorts of other undesirable variables can impact your results.
- Anything Can Be Split-Tested: A/B testing is associated with websites using products like Google Website Optimizer. However, A/B testing can also be conducted on desktop software or physical products. I think of A/B testing more generally and apply it along WITH, not instead of, many other user research methods. In fact, A/B testing is at the heart of the scientific method, with a rich history of researchers randomly assigning different treatments to patients, animals and almost anything else for hundreds of years.
More recently, split testing gained traction in direct mail marketing, when different versions of printed material were sent to subsets of long lists of physical addresses. With physical marketing material you have the real costs of printing and shipping. By splitting the list you could see which image, tagline or envelope resulted in more sales or inquiries. With electronic campaigns there’s little to no direct cost, just the opportunity cost of using the less optimal campaign. As with direct mail, by splitting the media into smaller pieces (headlines, links, buttons, pictures, pages, form fields) and testing each, you can identify which elements increase the intended Key Performance Indicators (KPIs): signups, purchases, calls or just task completion.
- Understanding Chance: A fundamental principle when working with any subset or sample of a larger user population is the role of random chance. Just because you observe that 5% of users purchase with treatment A and 7% purchase with treatment B doesn’t necessarily mean that, when all users are exposed to treatment B, more of them will purchase, let alone exactly 7%. Statistical significance, now part of our lexicon, tells us whether the difference we observed is actually greater than what we’d expect from just chance fluctuations in the sample we selected.
- Determining Statistical Significance: To determine if two conversion rates (which are proportions expressed as percentages) are statistically different, use the A/B test calculator. It uses a statistical procedure called the N-1 Chi-Square test. It’s a slight variation on the more common Chi-Square test that is taught in introductory statistics classes, but it has been shown to work well for large (>10,000) and small (<10) sample sizes. For example, if 100 out of 5000 users click through on an email (2% conversion rate) and 70 out of 4800 click through on a different version (1.46% conversion rate), the probability of obtaining a difference this large or larger, if there really was no difference, is 4% (p-value = .04). That is to say, it’s statistically significant: you just don’t see differences this large very often from chance alone.
Technical Note: When sample sizes get small (expected cell counts less than 1) the calculator uses the Fisher Exact Test, otherwise it uses the N-1 Chi-Square test, which is equivalent to the N-1 Two Proportion test that we teach in our courses. Some calculators will use just the Chi-Square test or Z-test (often called the normal approximation to the binomial) which generally work fine as long as sample sizes are reasonably large (and expected cell counts are large, usually above 5). See Chapter 5 in Quantifying the User Experience for a more detailed account of the formula.
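As a rough illustration of the email example above, here is a minimal Python sketch of the N-1 two-proportion test (the z-test form that the technical note describes as equivalent to the N-1 Chi-Square test). The function name is made up for this illustration, and the sketch does not include the Fisher Exact Test fallback for very small samples.

```python
import math


def n_minus_1_two_proportion_test(x_a, n_a, x_b, n_b):
    """Return (z, two-sided p-value) for two conversion counts.

    x_a, x_b are the number of conversions; n_a, n_b are the group sizes.
    Hypothetical helper: it is not the calculator's actual code.
    """
    total = n_a + n_b
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / total
    # Standard error of the difference under the "no difference" assumption.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    # The N-1 adjustment scales the usual z statistic by sqrt((N - 1) / N).
    z = (p_a - p_b) / se * math.sqrt((total - 1) / total)
    # Two-sided tail area of the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


z, p = n_minus_1_two_proportion_test(100, 5000, 70, 4800)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

With 100 of 5000 versus 70 of 4800, the p-value rounds to the .04 quoted above.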
- Confidence and P-Value: Many calculators, including ours, often convey statistical significance as confidence. This is usually done by subtracting the p-value from 1. For example, the p-value from the earlier example was 0.04 which gets expressed as 96% confidence. While the p-value and confidence level are different things, in this context, little harm comes from thinking of them in the same way (just be prepared for the more technically minded to call you out on it). The p-value is what you get after a test is run and tells you the probability of obtaining a difference that large if there really was no difference, while the confidence level is what you set before the test and affects the confidence interval around the difference [see below].
- Use a Two-Sided P-value: Many calculators, including ours, provide one and two-tailed p-values, also expressed as confidence. When in doubt, use the 2-sided p-value. You should only use the 1-sided p-value when you have a very strong reason to suspect that one version is really superior to the other. See Chapter 9 in Quantifying the User Experience for a more detailed discussion on the issue of 1 versus 2 tailed p-values.
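The arithmetic behind these two points is small enough to sketch. The snippet below assumes a symmetric test statistic such as the z-test, where the one-tailed p-value is half the two-tailed one when the observed difference is in the predicted direction. The function and key names are hypothetical, chosen only for this illustration.

```python
def summarize_p_value(p_two_sided, direction_as_predicted=True):
    """Express a two-sided p-value as confidence and as a one-sided p-value.

    Assumes a symmetric test statistic (e.g. a z-test). If the observed
    difference goes against the prediction, the one-sided p-value is large.
    """
    if direction_as_predicted:
        p_one_sided = p_two_sided / 2
    else:
        p_one_sided = 1 - p_two_sided / 2
    return {
        "two_sided_p": p_two_sided,
        "two_sided_confidence": 1 - p_two_sided,
        "one_sided_p": p_one_sided,
        "one_sided_confidence": 1 - p_one_sided,
    }


# The p-value of .04 from the earlier email example.
for name, value in summarize_p_value(0.04).items():
    print(f"{name}: {value:.2f}")
```

For the p-value of .04 this gives the 96% confidence mentioned above for the two-sided case. The one-sided figure always looks more favorable, which is one reason to default to the two-sided value.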
- Sample Sizes: As with every statistical procedure, one of the most common questions is “What sample size do I need?” For A/B testing, the “right” sample size comes down largely to how large a difference you want to be able to detect, should one exist at all. The other factors are the level of confidence, power and variability (values closer to 50% have higher variability). However, we can usually hold confidence and power to typical levels of 90% and 80% respectively and pick a reasonable range for conversion rates, say around 5%. Then we can just vary the difference between A and B and see what sample size we’d need to be able to detect a difference as statistically significant. The table below shows the sample size needed to detect differences as small as .1% (over half a million in each group) or as large as 50% (just 11 in each group).
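| Difference | Sample Size (Each Group) | Sample Size (Total) | A | B |
|---|---|---|---|---|
| 0.1% | 592,905 | 1,185,810 | 5% | 5.1% |
| 0.5% | 24,604 | 49,208 | 5% | 5.5% |
| 1.0% | 6,428 | 12,856 | 5% | 6.0% |
| 5.0% | 344 | 688 | 5% | 10.0% |
| 10.0% | 112 | 224 | 5% | 15.0% |
| 20.0% | 40 | 80 | 5% | 25.0% |
| 30.0% | 23 | 46 | 5% | 35.0% |
| 40.0% | 15 | 30 | 5% | 45.0% |
| 50.0% | 11 | 22 | 5% | 55.0% |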
Table 1: Sample size needed to detect differences from .1% to 50%, assuming 90% confidence and 80% power and conversion rates hovering around 5%.
One approach to sample size planning is to take the approximate “traffic” you expect on a website and split it so half receives treatment A and half receives treatment B. If you expect approximately 1000 pageviews a day, then you’d need to plan on testing for about 13 days. At that sample size, if there was a difference of 1 percentage point or larger (e.g. 5% vs 6%) then that difference would be statistically significant [see the row in Table 1 that starts with 1.0%].
If you want to determine if your new application has a completion rate at least 20 percentage points higher than the older application, then you should plan on testing 80 people (40 in each group).
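The article does not state the formula behind Table 1, so the following is only a sketch using one common normal-approximation formula for comparing two proportions, with a two-sided 90% confidence level and 80% power. The function names and the 0.5 added before rounding up are assumptions of this sketch. With these settings it returns 6,428 per group for 5% versus 6% and 40 per group for 5% versus 25%, in line with Table 1.

```python
import math
from statistics import NormalDist


def sample_size_per_group(rate_a, rate_b, confidence=0.90, power=0.80):
    """Approximate users needed in EACH group to detect rate_a vs rate_b.

    Assumed formula: n = 2 * (z_alpha + z_beta)^2 * p(1 - p) / d^2 + 0.5,
    where p is the average of the two rates and d is their difference.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (rate_a + rate_b) / 2
    d = abs(rate_b - rate_a)
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / d ** 2 + 0.5
    return math.ceil(n)


def days_needed(n_per_group, daily_visitors):
    """Days of traffic required when visitors are split evenly between A and B."""
    return math.ceil(2 * n_per_group / daily_visitors)


n = sample_size_per_group(0.05, 0.06)
print(n, "per group,", days_needed(n, 1000), "days at 1000 visitors a day")
```

Because the required sample size scales with one over the squared difference, halving the difference you want to detect roughly quadruples the number of users you need.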
- Stopping Early: There is some controversy about stopping A/B tests early, rather than waiting for the predetermined sample size. The crux of the argument against peeking and stopping is that you’re inflating the chance of getting a false positive: saying there’s a difference between A and B when one doesn’t really exist. This is related to a problem called alpha inflation, which we address in Chapter 9 of Quantifying the User Experience. For example, suppose you plan on testing for 13 days to achieve the total sample size of about 13k, but after four days you check your numbers and see a statistically significant difference between conversion rates of 2% (40 out of 2000) and 3% (60 out of 2000). Do you stop or keep going?
If you are publishing a paper you should probably wait for the full 13 days, especially if you have a grant and need to spend the funds, and it will help get any picky reviewers off your back. If you want to make a decision on which is the better version, you should almost certainly stop and go with treatment B. There are merits to the argument that multiple tests will inflate your significance level and lead you to more false positives. However, a false positive in this case means saying there’s a difference when one doesn’t exist.
In other words, A and B might just be the same, so going with either one would be fine and it would be better to spend your efforts on another test! In fact, given these results, it’s highly improbable (less than a 3% chance) that A is BETTER than B. Even using very conservative adjustments to the p-value to account for alpha inflation (not that I recommend that), B will still be the better choice. In applied research, the goal is usually picking the best alternative, not publishing papers.
In most cases you will do no harm, and will probably do better, by going with B. By cutting your testing short you’re also reducing the opportunity cost of delaying the better element’s introduction into your design. You should keep testing for the full number of days only when the stakes are high, the cost of switching from A to B is high (say it involves a lot of technical implementation), the cost of additional sample is low, and the opportunity cost of not inserting the better treatment is low.
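To get a feel for how large the alpha inflation in this debate can be, here is a small simulation sketch. It runs A/A tests (both groups share the same true conversion rate, so every "significant" result is a false positive) and compares checking at every look with checking only once at the end. All parameter values and function names are assumptions chosen for illustration, and the test used is a plain pooled two-proportion z-test.

```python
import math
import random


def two_sided_p(x_a, n_a, x_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (x_a / n_a - x_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_tests=400, looks=5, users_per_look=200, rate=0.05,
             alpha=0.10, seed=1):
    """Return (false positive rate when peeking, rate when testing once at the end)."""
    rng = random.Random(seed)
    hits_peeking = 0
    hits_end_only = 0
    for _ in range(n_tests):
        x_a = x_b = n = 0
        significant_at_any_look = False
        for _ in range(looks):
            # Each look adds a fresh batch of users to both groups.
            x_a += sum(rng.random() < rate for _ in range(users_per_look))
            x_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = two_sided_p(x_a, n, x_b, n)
            if p < alpha:
                significant_at_any_look = True
        hits_peeking += significant_at_any_look
        hits_end_only += p < alpha  # p here is from the final look only
    return hits_peeking / n_tests, hits_end_only / n_tests


peeking_rate, end_only_rate = simulate()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single test at the end:         {end_only_rate:.1%}")
```

The peeking rate can never be lower than the end-only rate, and it tends to come out well above the nominal 10%. Whether that matters depends on the cost of a false positive, which is the point of the discussion above.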
- One Variable at a Time is Simple but Limited: The simplicity of A/B testing is also its weakness. While you can vary things like headlines, pictures and button colors one at a time, you miss out on testing all combinations of these variables. Multivariate analysis (also referred to as Full and Partial Factorials) allows you to understand which combinations of variables tested simultaneously generate the highest conversion rate. This is not a reason to exclude A/B testing, but rather a reason to understand that while you are making improvements, you could be making MORE improvements with multivariate testing.
- Statistical Significance Does Not Mean Practical Significance: With a lot of the focus on chance, statistical significance, optimal sample sizes and alpha inflation, it’s easy to get distracted and lose sight of the real reason for A/B testing: making real and noticeable improvements in interfaces. As you increase your sample size, the chance that you will find a statistically significant difference between treatments increases. Table 1 shows that when you have over 10,000 users in each group, differences of less than a percentage point are statistically significant. For high transaction websites, a difference this small could translate into thousands or millions of dollars more.
In many cases though, small differences may go unnoticed and have little effect. So just because there is a statistical difference between treatments doesn’t mean it’s important. One way to qualify the impact of the statistical significance is to use a confidence interval around the difference as is done in the Stats Usability Package.
For example, with the observed difference of 1% between treatment A at 2% and treatment B at 3%, we can be 90% confident the difference, if exposed to the entire user population, would fall between 0.2% and 1.8%. Depending on the context, even a 0.2% improvement might be meaningful to sales or leads. Or an improvement of at most 1.8% might not be worth the cost of implementing the new change at all. Context dictates what makes a statistical difference of practical importance, but the confidence interval provides the boundaries on the most plausible range of that difference.
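The interval in this example can be approximated with a short sketch. It assumes the counts from the stopping-early example (40 out of 2000 for A and 60 out of 2000 for B) and uses a simple normal-approximation (Wald) interval, which may differ from the method the Stats Usability Package uses. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def difference_confidence_interval(x_a, n_a, x_b, n_b, confidence=0.90):
    """Wald confidence interval for the difference in conversion rates (B - A)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Unpooled standard error: each group keeps its own observed rate.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


low, high = difference_confidence_interval(40, 2000, 60, 2000)
print(f"90% CI for the difference: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval rounds to the 0.2% and 1.8% quoted above. If the lower bound already clears the smallest improvement you care about, the result is practically as well as statistically significant.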