If you’re in User Experience, chances are you didn’t get into the field because of your love of math.
As UX continues to mature, it’s becoming harder to avoid using statistics to quantify design improvements.
One of my goals is to help make challenging concepts more serviceable and accessible.
We have taught the material to over 1,000 professionals and students at companies and conferences. Much of it was the inspiration for our book Quantifying the User Experience, in which we present many counter-intuitive concepts that take time to master.
Here are five of the more critical but challenging concepts. We didn’t just pick some arbitrary geeky stuff to stump math geeks (or get you an interview at Google). These are fundamental concepts that take practice and patience but are worth the effort to understand.
1. Using statistics on small sample sizes: The belief that statistics require large samples is a common misconception that is extremely hard to break. You do not need a sample size in the hundreds or thousands, or even above 30, to use statistics. We regularly compute statistics on small sample sizes (fewer than 15) and find statistically significant differences. With samples this small we are limited to seeing only large differences in our measures, such as 30% to 50% differences; however, in most early design stages we care the most about those large differences. Jim and I recently wrote an article for UX Magazine with several examples showing how we used statistics with small sample sizes to draw some meaningful conclusions.
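To make this concrete, here is a minimal sketch of one test that works with small samples: a two-sided Fisher exact test on completion counts. The data are hypothetical (11 of 12 users completing a task on one design versus 5 of 12 on another), and Fisher's test is just one reasonable choice for counts this small, not necessarily the method used in the article mentioned above.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in row 1 with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Add up every possible table that is no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )

# Hypothetical data: design A, 11 pass and 1 fail; design B, 5 pass and 7 fail.
p_value = fisher_exact_two_sided(11, 1, 5, 7)
print(f"p = {p_value:.3f}")
```

With only 12 users per design, a 50-point gap in completion rates still yields a p-value below .05 in this made-up example.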
2. Power: To compute the appropriate sample size when comparing products or designs you need to account for power. Power is sort of like the confidence level for detecting a difference. You don’t know ahead of time if one design has a higher completion rate than another. A difference could exist but you might not see it in the sample of users you test. The ability to detect that difference in a study is called power. The more power a study has, the greater the chance of finding smaller differences between products and concluding they aren’t due to chance alone. We cover power and sample size in Chapter 7 of our book Quantifying the User Experience. The “bible” of power analysis is Jacob Cohen’s 1988 book Statistical Power Analysis for the Behavioral Sciences. I still use and reference this book, but it’s not exactly light reading or an easy introduction to the concept.
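As a rough illustration of how power grows with sample size, the sketch below approximates the power of a two-sided test comparing two completion rates. It assumes a normal approximation and equal group sizes, and the completion rates (50% versus 80%) are hypothetical; the function name is mine, and this is a textbook approximation rather than the exact procedure from the book.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two independent proportions."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error if there were no difference (pooled) and under the true rates.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    # Chance that the observed difference clears the significance cutoff.
    return z.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# Hypothetical true completion rates of 50% and 80%.
for n in (20, 50, 100):
    print(n, round(power_two_proportions(0.5, 0.8, n), 2))
```

The same true difference is far more likely to be detected with 100 users per design than with 20.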
3. The p-value: There are books written about the letter “p.” The p-value stands for probability value. Loosely, it’s the probability that the difference you observed in a study is due to chance; more precisely, it’s the probability of seeing a difference at least that large if there were really no difference. I also call the p-value the punch line, as it’s usually one of the only things a statistical test will spit out. Some examples of p-values are .012, .21 or .0001; a p-value of .012 indicates that there’s a 1.2% chance of seeing a difference this large between products from chance alone. Given that this is a pretty low percentage, in most cases we’d conclude it’s not due to chance and call it statistically significant. By convention, journals and statisticians say something is statistically significant if the p-value is less than .05. There’s nothing sacred about .05 though; in applied research, the difference between .04 and .06 is usually negligible. Would you really reach a different conclusion if I said I was 94% confident there was a difference versus 96% confident?
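One way to see what a p-value measures is to simulate chance directly. The sketch below shuffles users between two designs many times and counts how often the shuffled difference in average task time is at least as large as the one observed. The task times are made up, and a permutation test is only one of several ways to get a p-value.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=1):
    """Share of random relabelings with a mean difference at least as large as observed."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles

# Hypothetical task times in seconds for eight users on each design.
design_a = [34, 41, 29, 52, 38, 45, 31, 40]
design_b = [48, 55, 62, 44, 58, 51, 66, 49]
print(permutation_p_value(design_a, design_b))
```

If chance alone rarely produces a gap as large as the real one, the p-value is small.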
4. Sample Size: Sample size calculation remains a dark art for many practitioners. There are many counterintuitive concepts, including power, confidence and effect sizes. One complication is that there are different ways to compute sample size. There are basically three ways to find the right sample size for just about any study in user research.
Problem Detection: If you’ve been in the field of User Experience for a few months then you’ve probably heard about the “test with five users” rule. The magic number five does apply to sample sizes in usability tests, but only when you’re looking to uncover problems in an interface and the problems are relatively easy to detect. Specifically, with five people, you will have about an 85% chance of seeing problems if they affect at least 31% of the population.
Comparing: If you are comparing designs or competing products, the sample size is largely based on how large a difference you want to detect. To detect small differences you need a larger sample size. The confidence level, power and variability of the population also play a role, but it’s the size of the difference that matters most.
Precision: If you’re conducting a survey and want to estimate the prevalence of an attitude, such as whether users agree with statements, then you compute sample sizes based on how precise an estimate you want. It’s basically working backwards from a confidence interval. For example, you find the sample size you need to achieve a 10% margin of error around your metrics. If you need to cut your margin of error in half, then you need to roughly quadruple your sample size. A good place to start is the 20/20 Rule.
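Two of these approaches reduce to short formulas, sketched here under stated assumptions. For problem detection, the chance of seeing a problem at least once is 1 - (1 - p)^n, where p is the share of users it affects and n is the number of users tested. For precision, I use the common textbook approximation n = z² × p(1 - p) / margin², with the worst-case p of 0.5; the function names are mine, and the book's own procedure may differ in its details.

```python
from math import ceil
from statistics import NormalDist

def chance_of_seeing_problem(p_affected, n_users):
    """Probability that at least one of n_users encounters the problem."""
    return 1 - (1 - p_affected) ** n_users

def sample_size_for_margin(margin, confidence=0.95, p=0.5):
    """Approximate sample size for a desired margin of error around a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Five users, problem affecting 31% of the population.
print(round(chance_of_seeing_problem(0.31, 5), 2))

# Halving the margin of error roughly quadruples the sample size.
print(sample_size_for_margin(0.10), sample_size_for_margin(0.05))
```

The first number comes out just under the 85% quoted in the five-user rule, and the second pair shows the fourfold jump in sample size when the margin of error is cut in half.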
5. Confidence intervals get wider as you increase your confidence level: The “95%” in the 95% confidence interval you see on my site and in publications is called the confidence level. A confidence interval is the most plausible range for the unknown population mean. But you can’t be sure an interval contains the true average. By increasing our confidence level to 99% we make our intervals wider. The price for being more confident is that we have to cast a wider net. That is, if we want to be really sure we have an interval that contains the unknown completion rate or average satisfaction rating then we need to make the interval wider (assuming we hold the sample size and variability constant).
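Here is a small sketch of that trade-off for a completion rate, holding the data fixed and changing only the confidence level. It uses the adjusted-Wald interval, which is one common choice for small samples, and the data (14 of 20 users completing a task) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def adjusted_wald_interval(successes, n, confidence=0.95):
    """Adjusted-Wald confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2 / 2 successes and z^2 trials before computing the interval.
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical data: 14 of 20 users completed the task.
for level in (0.90, 0.95, 0.99):
    low, high = adjusted_wald_interval(14, 20, level)
    print(f"{level:.0%} interval: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

Same users, same completion rate, but each step up in confidence level stretches the interval.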