The difference is statistically significant.
When using statistics to make comparisons between designs, it’s not enough to just say differences are statistically significant or only report the p-values.
With large sample sizes in surveys, unmoderated usability testing, or A/B testing you are likely to find statistical significance with your comparisons.
What you need to know is how big of a difference was detected. The size of the difference gives you a better idea about the practical significance and impact of the statistical result.
For example, a new design that reduced the time to complete a task from 100 seconds to 50 seconds is a rather large difference: a reduction in time of 50%. Compare that to another task in which the task time was reduced by 5 seconds, from 100 seconds to 95 seconds, a 5% reduction in time. With a large enough sample size both of these differences can be statistically significant, but all things being equal, the 50% reduction in time represents a much larger difference. The p-value doesn’t tell you how large a difference there is.
Completion rates also offer a generally intuitive unstandardized measure of effect size. Knowing that a task completion rate increased 30 percentage points from 50% to 80% is generally understandable. Context of course will provide even more insight on whether 80% is good enough.
However, we often work with companies that create their own rating scales to measure attitudes toward brand and product experiences. When we find a statistical difference, it’s helpful to know if it’s a big deal or just a minor difference. The raw mean difference is usually less intuitive. For example, is a difference of .4 points between an old design and a new design small, medium, or large? Using percentage differences is helpful (e.g., a 10% higher score), but it doesn’t take into account the variability in each sample, a critical ingredient in understanding chance variation.
Effect sizes are a systematic way of understanding how large differences are. They are particularly helpful when the underlying measure and context are not as familiar as task times or completion rates.
One of the most common ways to compute a standardized effect size is using a measure known as Cohen’s d. To compute Cohen’s d for differences between means you also need the standard deviations for one or ideally both groups being compared.
For example, in a recent comparative website study, we found statistically different attitudes in satisfaction on a 10-point scale between a new design and a competitor. The new design had a mean of 8.8 (sd = 1.23) and the competitor had a mean of 7.5 (sd = 2.62). That’s a difference of 1.3 points. That sounds like a reasonably big difference, but it’s hard to know unless you are familiar with 10-point scales.
To compute Cohen’s d, we’ll take the mean difference of 1.3 points and divide it by an average of the standard deviations. (Note: the averaging is done on the variances, that is, the squared standard deviations, which generates a pooled standard deviation of 2.05.) This provides an effect size of d = .64.
In comparison, another question asked users to rate their perceived difficulty of completing a task on a 7-point scale. The new design had a mean of 5.6 (sd = 1.2) and the competitor had a mean of 5.8 (sd = 1.25). That’s a difference of .2 points and an effect size of d = .16.
We can also convert differences in task times into the same standardized effect size. For another study we compared the time to locate the price of a car on a third-party automotive website. Before the redesign the mean time was 136 seconds (sd = 69), and after the redesign a year later the mean time was 96 seconds (sd = 66). That’s a difference of 40 seconds and an effect size of d = .59.
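The three effect sizes above can be reproduced with a few lines of Python. This is a minimal sketch, not the calculator mentioned in this article: the function name `cohens_d` is made up for illustration, and the simple averaging of the two variances assumes the groups are about the same size (the group sizes aren't reported here).

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference (hypothetical helper, equal group sizes assumed)."""
    # Average the variances (the squared SDs), then take the square root
    # to get the pooled standard deviation.
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    # Report the size of the difference regardless of its direction.
    return abs(mean_a - mean_b) / pooled_sd


satisfaction = cohens_d(8.8, 1.23, 7.5, 2.62)  # 10-point satisfaction scale
difficulty = cohens_d(5.6, 1.2, 5.8, 1.25)     # 7-point difficulty scale
task_time = cohens_d(136, 69, 96, 66)          # task time in seconds

print(f"{satisfaction:.2f} {difficulty:.2f} {task_time:.2f}")
# prints: 0.64 0.16 0.59
```

Because the result is expressed in standard deviation units, the same function works for rating scales and task times alike.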
Effect size formulas exist for differences in completion rates, correlations, and ANOVAs. They are a key ingredient when thinking about finding the right sample size.
When sample sizes are small (usually below 20), the effect size estimate is a bit overstated (that is, biased). To correct for this bias, a slight adjustment to Cohen’s d is recommended, called Hedges’ g. I’ve created a free calculator that will compute both Cohen’s d and Hedges’ g for comparing means.
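As a rough illustration of how small the adjustment is, here is a sketch of the commonly used approximate correction factor, 1 - 3 / (4(n1 + n2) - 9). The calculator mentioned above may compute it differently. The function name `hedges_g` and the group sizes of 10 are hypothetical, chosen only to show the direction and size of the correction.

```python
def hedges_g(d, n_a, n_b):
    """Shrink Cohen's d slightly to correct small-sample bias (approximate formula)."""
    # The correction factor approaches 1 as the total sample size grows.
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return d * correction


# Hypothetical group sizes: 10 participants per group (not taken from any study here).
g = hedges_g(0.64, 10, 10)
print(f"{g:.3f}")
# prints: 0.613
```

With 10 participants per group the estimate shrinks by about 4%. With hundreds of participants the two measures are nearly identical.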
Interpreting Effect Sizes
This still raises the question of whether these are big or small effects. The best way to know is to compare these standardized effect sizes across a number of similar studies. Usually researchers have at best a handful of studies to compare against, which makes comparisons more difficult.
Jacob Cohen was a prolific writer and researcher and did a lot of pioneering work in measuring and understanding effect sizes and offered some general guidance on interpreting them. He wrote the seminal book, Statistical Power Analysis for the Behavioral Sciences, which is still considered the Bible of power and sample size research today. From surveying the psychological literature he came up with the following rules of thumb for interpreting effect sizes when making comparisons between means: .2 are small, .5 are medium and .8 are large.
| Effect Size || Cohen’s d || Sample Size Needed (80% Power, alpha = .10)
Using Cohen’s rough approximation, we can see that, of the three examples above, an effect size of .16 is smallish, and effect sizes of .59 and .64 are on the medium-to-large side.
These should be used at best as rough guides when interpreting effect sizes. Cohen himself warned against taking them too rigidly, although like many guides, in the absence of other information they become mandates instead of suggestions. You can think of effect sizes as differences in standard deviations. That is, an effect size of .5 is a difference of about half a standard deviation.
Sample Size Planning & Effect Sizes
You can use effect sizes to determine the required sample size for detection using the typical conventions of 80% power and an alpha of .10. The table above also shows the approximate sample size you should plan for in each group to have a good chance of seeing those differences. For example, you should plan on having around 452 participants (226 in each group) to detect a smaller effect size of .2.
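The 226-per-group figure can be approximated with the standard normal-approximation formula for comparing two means. This sketch rests on an assumption: the alpha of .10 is treated as one-sided, which is the reading that reproduces the number above. The function name `n_per_group` is made up for illustration, and the Usability Statistics package may use a more exact, t-based method.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.10, power=0.80):
    """Approximate participants needed in each group to detect effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # one-sided critical value (assumption)
    z_power = z.inv_cdf(power)      # z-score corresponding to the desired power
    # Normal approximation for two independent groups of equal size.
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


small = n_per_group(0.2)
print(small, small * 2)
# prints: 226 452
```

Because the effect size is squared in the denominator, halving the effect you want to detect roughly quadruples the sample size you need.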
One of the biggest problems with comparative studies in user research, and in behavioral studies in general, is that the sample sizes often aren’t big enough to detect differences. While a lot of work may go into the design and coding of a new website, the resulting change often has a more muted effect on user behavior. The result is a lot of effort spent surveying users to find no significant differences in attitudes. Even so, a non-significant difference provides information you didn’t have: that a new or proposed design is not significantly worse than the existing design.
If you are only able to run a study with 100 participants in each group, that doesn’t necessarily mean an effect size of .2 won’t be statistically significant; it just means your chances are less than 80% (in this case the chance is about 46%). You can use the Usability Statistics package to compute sample sizes from particular effect sizes or raw differences. Computations are explained in Quantifying the User Experience, Chapter 6.
You’ll also notice from the table above that with sample sizes less than about 30, you are limited to seeing big effects, that is, big differences in measures. That’s not necessarily a bad thing, as we usually care most about differences that are noticeable to users. If you plan a study with approximately 30 users, just don’t expect to detect small effects.
To help researchers in human-computer interaction, we have started aggregating effect sizes from published and unpublished usability studies. Knowing what a large or small effect size is will help with sample size planning and set expectations for when results come in. We hope to have something published this year!