Some people think that if you have a small sample size you can’t use statistics.

Put simply, this is wrong, but it’s a common misconception.

There are appropriate statistical methods to deal with small sample sizes.

Although one researcher’s “small” is another’s large, when I refer to small sample sizes I mean studies that have typically between 5 and 30 users total—a size very common in usability studies.

But user research isn’t the only field that deals with small sample sizes. Studies involving fMRI scans, which are expensive to run, have limited sample sizes, as do studies using laboratory animals.

While there are equations that allow us to properly handle small “n” studies, it’s important to know that there are limitations to these smaller sample studies: you are limited to seeing big differences or big “effects.”

To put it another way, **statistical analysis with small samples is like making astronomical observations with binoculars**. You are limited to seeing big things: planets, stars, moons and the occasional comet. But just because you don’t have access to a high-powered telescope doesn’t mean you cannot conduct astronomy. Galileo, in fact, discovered Jupiter’s moons with a telescope with the same power as many of today’s binoculars.

The same holds for statistics: just because you don’t have a large sample size doesn’t mean you cannot use statistics. Again, the key limitation is that you can only detect large differences between designs or measures.

Fortunately, in user-experience research we are often most concerned about these big differences—differences users are likely to notice, such as changes in the navigation structure or the improvement of a search results page.

Here are the procedures we’ve tested for common small-sample user research; we will cover them all at the UX Boot Camp in Denver next month.

## Comparing

If you need to compare completion rates, task times, and rating scale data for two independent groups, there are two procedures you can use for small and large sample sizes. The right one depends on the type of data you have: continuous or discrete-binary.

**Comparing Means**: If your data is generally continuous (not binary), such as task time or rating scales, use the two sample t-test. It’s been shown to be accurate for small sample sizes.
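As a minimal sketch of that comparison, here is a two-sample t-test in Python using scipy; the rating data for the two design groups are made-up numbers for illustration, not from any study:

```python
# Hypothetical rating-scale scores from two independent design groups.
from scipy import stats

design_a = [72.5, 80.0, 65.0, 77.5, 85.0, 70.0]
design_b = [55.0, 62.5, 50.0, 67.5, 60.0, 57.5]

# Welch's t-test (equal_var=False) is a robust default for small samples
# whose variances may differ.
t_stat, p_value = stats.ttest_ind(design_a, design_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With a difference this large relative to the variability, even twelve total users is enough for the test to flag a statistically significant difference.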

**Comparing Two Proportions**: If your data is binary (pass/fail, yes/no), then use the N-1 Two Proportion Test. This is a variation on the better-known Chi-Square test (it is algebraically equivalent to the N-1 Chi-Square test). When expected cell counts fall below one, the Fisher Exact Test tends to perform better. The online calculator handles this for you, and we discuss the procedure in Chapter 5 of Quantifying the User Experience.
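The N-1 adjustment is simple enough to sketch by hand: the z statistic is the ordinary pooled two-proportion z multiplied by the square root of (N−1)/N. The completion counts below are hypothetical, and scipy is assumed only for the normal p-value:

```python
import math
from scipy import stats

def n_minus_1_two_prop_test(x1, n1, x2, n2):
    """Two-proportion z-test with the N-1 adjustment
    (algebraically equivalent to the N-1 chi-square test)."""
    p1, p2 = x1 / n1, x2 / n2
    N = n1 + n2
    p_pool = (x1 + x2) / N  # pooled proportion under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se * math.sqrt((N - 1) / N)  # the N-1 adjustment
    p_value = 2 * stats.norm.sf(abs(z))          # two-tailed
    return z, p_value

# Hypothetical data: 11 of 12 users completed the task on design A,
# 5 of 12 on design B.
z, p = n_minus_1_two_prop_test(11, 12, 5, 12)
```

The adjustment slightly shrinks the test statistic, which improves accuracy for the small sample sizes typical of usability studies.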

## Confidence Intervals

When you want to know what the plausible range is for the user population from a sample of data, you’ll want to generate a confidence interval. While the confidence interval width will be rather wide (usually 20 to 30 percentage points), the upper or lower boundary of the intervals can be very helpful in establishing how often something will occur in the total user population.

For example, suppose you wanted to know whether users would read a sheet that said “Read this first” when installing a printer, and six out of eight users in an installation study didn’t read the sheet. You’d know that at least 40% of all users would likely skip it—a substantial proportion.

There are three approaches to computing confidence intervals based on whether your data is binary, task-time or continuous.

**Confidence interval around a mean**: If your data is generally continuous (not binary) such as rating scales, order amounts in dollars, or the number of page views, the confidence interval is based on the t-distribution (which takes into account sample size).
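A short sketch of a t-based confidence interval, using hypothetical 7-point ease ratings from ten users (scipy supplies the t critical value):

```python
import statistics
from scipy import stats

# Hypothetical 7-point rating-scale responses from 10 users.
ratings = [5, 6, 4, 7, 5, 6, 3, 5, 6, 4]

n = len(ratings)
mean = statistics.mean(ratings)
sem = statistics.stdev(ratings) / n ** 0.5  # standard error of the mean

# 95% interval from the t-distribution with n-1 degrees of freedom;
# the t critical value widens the interval for small samples.
t_crit = stats.t.ppf(0.975, df=n - 1)
low, high = mean - t_crit * sem, mean + t_crit * sem
```

With only ten users the interval spans well over a point on the 7-point scale, illustrating how small samples widen the plausible range.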

**Confidence interval around task-time**: Task time data is positively skewed. There is a lower boundary of 0 seconds. It’s not uncommon for some users to take 10 to 20 times longer than other users to complete the same task. To handle this skew, the time data needs to be log-transformed and the confidence interval is computed on the log-data, then transformed back when reporting. The online calculator handles all this.
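The log-transform procedure can be sketched in a few lines: take the log of each time, build a t-based interval on the logs, then exponentiate the endpoints back to seconds. The task times below are hypothetical, with the long right tail typical of time data:

```python
import math
import statistics
from scipy import stats

# Hypothetical task times in seconds; note the long right tail.
times = [40, 55, 62, 70, 85, 110, 300]

logs = [math.log(t) for t in times]
n = len(logs)
mean_log = statistics.mean(logs)
sem_log = statistics.stdev(logs) / n ** 0.5

t_crit = stats.t.ppf(0.975, df=n - 1)
# Compute the interval on the log scale, then transform back to seconds.
low = math.exp(mean_log - t_crit * sem_log)
high = math.exp(mean_log + t_crit * sem_log)
```

Exponentiating the log-scale endpoints yields an asymmetric interval around the geometric mean, which respects the 0-second lower boundary.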

**Confidence interval around a binary measure**: For an accurate confidence interval around binary measures like completion rate or yes/no questions, the Adjusted Wald interval performs well for all sample sizes.
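A minimal sketch of the Adjusted Wald interval: add z²/2 successes and z² trials to the observed counts, then compute an ordinary Wald interval on the adjusted proportion. Applied to the printer-sheet example above (six of eight users), the lower bound lands at about 40%:

```python
import math
from scipy import stats

def adjusted_wald(successes, n, confidence=0.95):
    """Adjusted Wald confidence interval for a binary proportion."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    # Adjust: add z^2/2 successes and z^2 trials before the Wald formula.
    p_adj = (successes + z ** 2 / 2) / (n + z ** 2)
    se = math.sqrt(p_adj * (1 - p_adj) / (n + z ** 2))
    low = max(0.0, p_adj - z * se)
    high = min(1.0, p_adj + z * se)
    return low, high

# Six of eight users didn't read the "Read this first" sheet.
low, high = adjusted_wald(6, 8)
```

The adjustment keeps the interval accurate even at sample sizes where the plain Wald interval breaks down.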

## Point Estimates (The Best Averages)

The “best” estimate for reporting an average time or average completion rate for any study may vary depending on the study goals. Keep in mind that even the “best” single estimate will still differ from the actual average, so using confidence intervals provides a better method for estimating the unknown population average.

For the best overall average for small sample sizes, we have two recommendations for task-time and completion rates, and a more general recommendation for all sample sizes for rating scales.

**Completion Rate**: For small-sample completion rates, there are only a few possible values for each task. For example, with five users attempting a task, the only possible outcomes are 0%, 20%, 40%, 60%, 80% and 100% success. It’s not uncommon to have 100% completion rates with five users. There’s something about reporting perfect success at this sample size that doesn’t resonate well. It sounds too good to be true.

We experimented with several estimators at small sample sizes and found the Laplace estimator and the simple proportion (referred to as the Maximum Likelihood Estimator) generally work well for the usability test data we examined. When you want the best estimate, the calculator will generate it based on our findings.
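Both estimators are one-liners. The Laplace estimator simply adds one success and one failure to the observed counts, which tempers a perfect score at small sample sizes; the counts below are hypothetical:

```python
def laplace_estimate(successes, n):
    """Laplace ('add one') estimate: add one success and one failure."""
    return (successes + 1) / (n + 2)

def mle_estimate(successes, n):
    """Maximum Likelihood Estimate: the simple observed proportion."""
    return successes / n

# Hypothetical task: five of five users completed it.
mle = mle_estimate(5, 5)          # 100% -- sounds too good to be true
laplace = laplace_estimate(5, 5)  # 6/7, a more tempered estimate
```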

**Rating Scales**: Rating scales are a funny type of metric, in that most of them are bounded on both ends (e.g., 1 to 5, 1 to 7, or 1 to 10), unless you are Spinal Tap of course. For small and large sample sizes, we’ve found the mean to be a better average to report than the median. There are in fact many ways to report the scores from rating scales, including top-two boxes. Which one you report depends on both the metric’s sensitivity and what’s used in your organization.

**Average Time**: One long task time can skew the arithmetic mean and make it a poor measure of the middle. In such situations, the median is a better indicator of the typical or “average” time. Unfortunately, the median tends to be less accurate and more biased than the mean when sample sizes are less than about 25. In these circumstances, the geometric mean (average of the log values transformed back) tends to be a better measure of the middle. When sample sizes get above 25, the median works fine.
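The geometric mean described above can be sketched directly: average the log times, then transform back. The task times below are hypothetical, with one slow user skewing the arithmetic mean:

```python
import math
import statistics

# Hypothetical task times in seconds; one slow user skews the mean.
times = [35, 40, 45, 50, 60, 200]

arith_mean = statistics.mean(times)
median = statistics.median(times)
# Geometric mean: average the log times, then exponentiate back.
geo_mean = math.exp(statistics.mean(math.log(t) for t in times))
```

The geometric mean lands between the median and the outlier-inflated arithmetic mean, making it a sensible measure of the middle for small, skewed time samples.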