There Might Not Be a Magic Number but There Are Magic Ranges

Jeff Sauro, PhD • Jim Lewis, PhD


We’ve all been trained from years of math education to expect a single answer to that question—a single sample size number.

But earlier, we warned against the quixotic quest to identify the one true sample size to use for UX research—the “magic number.”

Because sampling error is real but not insurmountable, you can and should have a plan for coming up with sample sizes. However, a single magic number is not the answer.

The right number (sample size) is a function of the research goal. In UX research, most research questions can be categorized into one of three goals:

  1. Finding problems/insights (Discovery)
  2. Estimating parameters (Estimating)
  3. Comparing parameters (Comparing)

But even within each of these broad goals, there still isn’t one magic number. In other words, there is no magic sample size number for any type of UX research.

There are, however, Goldilocks Zones (“magic ranges”) that balance the various statistical and logistical forces that drive appropriate sample sizes for different goals:

  1. Discovery: Key drivers are the discovery goal and the probability of detection.
  2. Estimation: Key drivers are the desired confidence level, the expected measurement variability, and the required precision (margin of error).
  3. Comparison: Key drivers are similar to those for estimation, but include power and the experimental design (between- or within-subjects) plus the number of tails for the test of significance (typically two except when testing against a set benchmark).

Avoid looking for one magic number that always works (like 5 or 30 or 999). Instead, start with the typical (“magic”) ranges for the research goals shown in Table 1.

| Research Goal | Sample Size Computation | Typical Range |
| --- | --- | --- |
| Discovery (finding problems) | Discovery model (discovery goal, probability of detection) | 5 to 20 |
| Estimation (estimating parameters) | Confidence interval (confidence level, variability, precision) | 30 to 300 |
| Comparison (comparing parameters) | Hypothesis test (confidence level, variability, precision, power, two-tailed test, experimental design) | 40 to 400 within-subjects (20 to 200 per group between-subjects) |

Table 1: Typical “magic” ranges for three key UX research goals.

In this article, we show how we arrived at these “magic ranges” by balancing the logistical (e.g., study cost) and statistical (e.g., data types) aspects of sample size planning.

Discovery

To get the magic range for discovery studies, we examined a graph we’d previously published that shows expected discovery (at least once) as a function of sample size (from 1 to 25) for various problem detection probabilities (Figure 1).


Figure 1: Expected problem discovery rates by sample size for different problem detection probabilities (magic range from n = 5 to 20 highlighted in green).

The graph shows steeper discovery curves for problems that are likely to be uncovered than for those that are less likely. For example, if one out of every two people encounters a problem (50%), then by the time you watch five people, the likelihood of discovering that problem (seeing it at least once) is just over 95%. When the problem detection probability is 30% and n = 5, the discovery rate is over 80%. If one out of twenty people encounters a problem (5%), then the likelihood of discovering the problem with five people is just over 20%.
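If you want to check these figures yourself, here is a minimal Python sketch of the binomial discovery model behind Figure 1: the probability of seeing a problem at least once in n sessions is 1 − (1 − p)^n (the function name is ours):

```python
def discovery_rate(p: float, n: int) -> float:
    """Chance of seeing a problem at least once in n sessions,
    given a per-participant detection probability p."""
    return 1 - (1 - p) ** n

for p in (0.50, 0.30, 0.05):
    print(f"p = {p:.0%}, n = 5: {discovery_rate(p, 5):.1%}")
# p = 50%, n = 5: 96.9%
# p = 30%, n = 5: 83.2%
# p = 5%, n = 5: 22.6%
```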

This means that when n = 5, the expectation within the constraints of the study (e.g., population of participants and tasks) is that you’ll discover about 95% of problems that affect 50% of the people, about 80% of problems that affect 30% of the people, and about 20% of problems that affect 5% of the people.

Across the wide range of problem probabilities in the graph, discovery rates are pretty good when n = 5. If your focus is on discovering problems that are likely to happen to at least a third of the people who have an opportunity to experience the problem, there’s not much point in having n > 5.

As the sample size increases from 5 to 10 to 20, however, there are significant increases in the discovery rates in the middle of the range of problem probabilities (from 5% to 15%), but the benefit achieved by increasing the sample size from 20 to 25 is much smaller (Table 2).

| Range of n | 1% | 5% | 10% | 15% | 30% | 50% | 75% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| From 5 to 20 | 13% | 42% | 47% | 40% | 17% | 3% | 0% |
| From 20 to 25 | 4% | 8% | 5% | 2% | 0% | 0% | 0% |

Table 2: Improvement in discovery rates from n = 5 to 20 and n = 20 to 25 for various problem probabilities.

We focus on what happens in the middle of the probability range because that’s where the differences due to sample size are the largest. When problem probabilities are very low (e.g., 1% or less), then the expected discovery rate is also low. When problem probabilities are very high (e.g., 75% or more), then discovery is almost certain even when n = 2.

If you’re planning to run iterative discovery studies, then you might lean toward the lower end of the magic range. If you’re only going to run one discovery study, you should probably plan at the higher end of the range.

Your decision also depends on the smallest problem probability for which you want to have a reasonable chance of discovery. For example, the claim that you can discover 85% of problems by observing five participants only applies when the probability of problem occurrence (given the bounds of the study’s tasks and types of participants) is a bit higher than 30%. If you need to detect less frequently occurring problems, you’ll need a larger sample size (e.g., with 20 participants, you are likely to discover over 85% of problems that have a probability of occurrence of 10% and almost 60% of problems with a probability of 5%).
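Inverting the same model gives the sample size needed to hit a discovery target. Here is a short sketch assuming the 1 − (1 − p)^n model above (the function is our illustration):

```python
import math

def discovery_sample_size(p: float, goal: float = 0.85) -> int:
    """Smallest n with at least `goal` probability of seeing a
    problem of probability p at least once."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(discovery_sample_size(0.30))  # 6
print(discovery_sample_size(0.10))  # 19
print(discovery_sample_size(0.05))  # 37
```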

Estimation

Since the publication of Benchmarking the User Experience, we have produced numerous tables to guide sample size planning, most recently and comprehensively in Chapter 17 of Surveying the User Experience. Table 3 is based on that book’s Tables 17.3 and 17.4 (with green highlighting for sample sizes from 30 to 300), organized by increasing sample sizes.

| Sample Size | Binary (s = 50%), 90% Confidence | Binary (s = 50%), 95% Confidence | 0–100-Pt Rating Scale (s = 25), 90% Confidence | 0–100-Pt Rating Scale (s = 25), 95% Confidence |
| --- | --- | --- | --- | --- |
| 10 | 23.1% | 26.3% | 14.5 | 17.9 |
| 20 | 17.3% | 20.1% | 9.7 | 11.7 |
| 30 | 14.4% | 16.8% | 7.8 | 9.3 |
| 40 | 12.6% | 14.8% | 6.7 | 8.0 |
| 50 | 11.3% | 13.4% | 5.9 | 7.1 |
| 60 | 10.4% | 12.3% | 5.4 | 6.5 |
| 70 | 9.6% | 11.4% | 5.0 | 6.0 |
| 80 | 9.0% | 10.7% | 4.7 | 5.6 |
| 90 | 8.5% | 10.1% | 4.4 | 5.2 |
| 100 | 8.1% | 9.6% | 4.2 | 5.0 |
| 150 | 6.7% | 7.9% | 3.4 | 4.0 |
| 200 | 5.8% | 6.9% | 2.9 | 3.5 |
| 300 | 4.7% | 5.6% | 2.4 | 2.8 |
| 400 | 4.1% | 4.9% | 2.1 | 2.5 |
| 500 | 3.7% | 4.4% | 1.8 | 2.2 |
| 1000 | 2.6% | 3.1% | 1.3 | 1.6 |

Table 3: Sample size estimates for binary and rating scale data for 90% and 95% confidence intervals. Green highlighting indicates sample sizes between 30 and 300.

What you get with sample sizes between 30 and 300 depends on the type of metric (binary vs. rating scale) because they differ in their typical standard deviations (50% for binary data like completion rates, 25 for rating scales that have been interpolated to 0–100-point scales). It also depends on the desired level of confidence (90% or 95%), although that choice matters less as the sample size increases.

For binary data with n = 30, you’ll get a margin of error of ±14.4% with 90% confidence or ±16.8% with 95% confidence. For either confidence level, the margin of error is around ±15%, which is adequate for many research contexts. At the higher end, with n = 300, the margins of error for 90% and 95% confidence intervals for binary data are, respectively, ±4.7% and ±5.6%. Getting more precise estimates requires increasing the sample size by hundreds to thousands of additional people, which is rarely worth it.

Because rating scale data is less variable than binary data (about 25% of the range of the scale), the margins of error in the “magic range” are lower. For 90% confidence at the lower boundary of the range (n = 30), the margin of error is just under ±8, and for 95% confidence, it is just over ±9. At the higher end (n = 300), the margins of error for 90% and 95% are, respectively, ±2.4 and ±2.8. Further reducing the margins of error for rating scales to ±1 or 2 is usually prohibitively expensive due to required increases in sample size.
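As a rough check on Table 3, the following sketch reproduces the tabled margins of error under two assumptions on our part: that the binary column behaves like an adjusted-Wald interval at its worst case (p = 50%), and that the rating scale column behaves like a t-based interval with s = 25:

```python
from math import sqrt
from scipy import stats

def binary_moe(n: int, confidence: float = 0.90) -> float:
    """Adjusted-Wald margin of error at the worst case, p = 50%."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return z * sqrt(0.25 / (n + z ** 2))  # adjusted-Wald adds z^2 to n

def rating_moe(n: int, s: float = 25.0, confidence: float = 0.90) -> float:
    """t-based margin of error for a 0-100-point rating scale."""
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return t * s / sqrt(n)

print(f"{binary_moe(30):.1%}, {rating_moe(30):.1f}")    # 14.4%, 7.8
print(f"{binary_moe(300):.1%}, {rating_moe(300):.1f}")  # 4.7%, 2.4
```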

Comparison

Sample size computations get trickier when comparing estimated parameters because the appropriate sample size for a test of significance depends on power, experimental design (between- or within-subjects), and one- or two-tailed tests. For our comparison tables, we assume two-tailed testing, set power to its usual value of 80%, and provide separate sample size estimates for (1) binary and rating scale data, (2) 90% and 95% confidence for a range of critical differences (analogous to margins of error for confidence intervals), and (3) between- and within-subjects studies.

Within-Subjects

We’ll start with the slightly simpler computations for within-subjects designs in Table 4.

| Sample Size | Binary (s = 75%), 90% Confidence | Binary (s = 75%), 95% Confidence | 0–100-Pt Rating Scale (s = 25), 90% Confidence | 0–100-Pt Rating Scale (s = 25), 95% Confidence |
| --- | --- | --- | --- | --- |
| 10 | — | — | 21.5 | 24.9 |
| 20 | 42% | — | 14.5 | 16.5 |
| 30 | 33% | 38% | 11.7 | 13.2 |
| 40 | 29% | 33% | 10.0 | 11.4 |
| 50 | 26% | 29% | 8.9 | 10.1 |
| 60 | 23% | 27% | 8.1 | 9.2 |
| 70 | 22% | 25% | 7.5 | 8.5 |
| 80 | 20% | 23% | 7.0 | 7.9 |
| 90 | 19% | 22% | 6.6 | 7.5 |
| 100 | 18% | 21% | 6.3 | 7.1 |
| 150 | 15% | 17% | 5.1 | 5.8 |
| 200 | 13% | 15% | 4.4 | 5.0 |
| 300 | 11% | 12% | 3.6 | 4.1 |
| 400 | 9% | 11% | 3.1 | 3.5 |
| 500 | 8% | 10% | 2.8 | 3.1 |
| 1000 | 6% | 7% | 2.0 | 2.2 |

Table 4: Sample size estimates for binary and rating scale data for 90% and 95% confidence when conducting within-subjects comparisons. Green highlighting indicates sample sizes between 40 and 400.

Because binary measurement is very coarse (based on scores of just 0 or 1) and, for within-subjects comparisons, the appropriate test of significance is the McNemar test of dependent proportions, it takes very large sample sizes to be able to detect small differences (n > 1,000 to reliably detect differences of 5%). At the low end of the range (n = 40), the critical difference (smallest difference the study can detect at the specified confidence level) is 29% with 90% confidence and 33% with 95% confidence. At the higher end (n = 400), the critical differences are about 9% for 90% confidence and 11% for 95% confidence.

For rating scales with n = 40, the critical differences are about 10 points with 90% confidence and about 11 with 95% confidence. When n = 400, the critical difference for 90% confidence is 3.1 points and for 95% confidence is 3.5. Investing in additional participants would lead to little improvement in sensitivity but significant increase in cost.
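A normal-approximation sketch of the critical difference, d = (z_α + z_β) · s / √n, comes close to the within-subjects values. This is our reconstruction rather than the book’s exact method (the tables appear to use t-based refinements, so results differ slightly), and reading the binary s = 75% as a difference-score standard deviation of 0.75 is our assumption:

```python
from math import sqrt
from scipy import stats

def within_critical_difference(n: int, s: float, confidence: float = 0.90,
                               power: float = 0.80) -> float:
    """Smallest detectable within-subjects difference (two-tailed,
    80% power), via the normal approximation."""
    z_alpha = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)
    return (z_alpha + z_beta) * s / sqrt(n)

print(f"{within_critical_difference(40, 25):.1f}")    # ~9.8 (Table 4: 10.0)
print(f"{within_critical_difference(400, 25):.1f}")   # ~3.1
print(f"{within_critical_difference(40, 0.75):.0%}")  # ~29%
```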

Between-Subjects

It’s well known that within-subjects comparisons have an advantage in sensitivity over between-subjects experiments because there is only one source of variance. A further increase to the estimated sample sizes comes from there being at least two groups per comparison, so in Table 5, we show the sample size per group and the total sample size for two groups, with the magic range based on the sample size for two groups. Even if you have more than two groups in the study, at some point you’ll likely be comparing two at a time.

| Sample Size/Group | Sample Size/2 Groups | Binary (s = 50%), 90% Confidence | Binary (s = 50%), 95% Confidence | 0–100-Pt Rating Scale (s = 25), 90% Confidence | 0–100-Pt Rating Scale (s = 25), 95% Confidence |
| --- | --- | --- | --- | --- | --- |
| 5 | 10 | 98% | — | 43.5 | 50.5 |
| 10 | 20 | 57% | 64% | 29.0 | 33.1 |
| 15 | 30 | 46% | 52% | 23.3 | 26.5 |
| 20 | 40 | 40% | 45% | 20.1 | 22.7 |
| 25 | 50 | 36% | 40% | 17.9 | 20.2 |
| 30 | 60 | 32% | 36% | 16.3 | 18.4 |
| 35 | 70 | 30% | 34% | 15.0 | 17.0 |
| 40 | 80 | 28% | 32% | 14.0 | 15.9 |
| 45 | 90 | 26% | 30% | 13.2 | 14.9 |
| 50 | 100 | 25% | 28% | 12.5 | 14.1 |
| 75 | 150 | 20% | 23% | 10.2 | 11.5 |
| 100 | 200 | 18% | 20% | 8.8 | 10.0 |
| 150 | 300 | 14% | 16% | 7.2 | 8.1 |
| 200 | 400 | 12% | 14% | 6.2 | 7.0 |
| 250 | 500 | 11% | 13% | 5.6 | 6.3 |
| 500 | 1000 | 8% | 9% | 3.9 | 4.4 |

Table 5: Sample size estimates for binary and rating scale data for 90% and 95% confidence when conducting between-subjects comparisons. Green highlighting indicates sample sizes in the range of 40 to 400 for two groups.

For between-subjects comparisons of binary data, the appropriate test of significance is the N−1 two-proportion test. Compared to the within-subjects comparisons with the McNemar test in Table 4, the between-subjects comparisons in Table 5 (assuming two groups) have critical differences that are about 35% larger at the same total sample size, making them less sensitive.

At the low end of the range (n = 40 made up of two groups of 20), the critical difference (smallest difference the study can detect at the specified confidence level) is 40% with 90% confidence and 45% with 95% confidence. At the higher end (n = 400 made up of two groups of 200), the critical differences are about 12% for 90% confidence and 14% for 95% confidence.

The mathematical relationship between the t-test used to assess within-subjects differences in rating scales and the t-test used to assess between-subjects differences is very simple: for the same total sample size, the between-subjects standard error is twice the within-subjects standard error, so the critical differences in Table 5 are twice those in Table 4.

For rating scales with n = 40, the critical differences are 20.1 points with 90% confidence and 22.7 with 95% confidence. When n = 400, the critical difference for 90% confidence is just over 6 points, and for 95% confidence, the difference is 7.
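The within-subjects sketch extends to between-subjects comparisons by doubling the variance of the comparison (two independent groups), which is where the 2× relationship comes from. Again, this is a normal approximation of our own, so expect small differences from Table 5:

```python
from math import sqrt
from scipy import stats

def between_critical_difference(n_per_group: int, s: float,
                                confidence: float = 0.90,
                                power: float = 0.80) -> float:
    """Smallest detectable difference between two independent groups
    of n_per_group each (two-tailed, 80% power), via the normal
    approximation."""
    z_alpha = stats.norm.ppf(1 - (1 - confidence) / 2)
    z_beta = stats.norm.ppf(power)
    return (z_alpha + z_beta) * s * sqrt(2 / n_per_group)

print(f"{between_critical_difference(20, 25):.1f}")   # ~19.7 (Table 5: 20.1)
print(f"{between_critical_difference(20, 0.5):.0%}")  # ~39% (Table 5: 40%)
```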

Despite the lower sensitivity of between-subjects comparisons, the competing forces of logistics and statistics still support focusing on between-subjects comparisons of two groups that have combined sample sizes from 40 to 400 participants, especially at the higher end of the range. Increasing the sample size to 500 or even to 1,000 simply doesn’t lead to much improvement in the ability of the tests to detect statistically significant differences.

Balancing Statistics and Logistics

Magic sample size numbers for UX research aren’t always wrong, but they’re rarely right. Instead of searching for one magic number to use for all research questions, start your sample size planning with more appropriate magic ranges that are customized for different research questions. Then find the best spot inside that range to satisfy your specific research needs. If, however, nothing inside the range accomplishes your goals, then it’s time to look outside the “magic range.”

The boundaries we set around these magic ranges aren’t set in stone but are informed by our UX research experience.

Choosing a Lower Bound

The lower bound of a magic range is driven by the least acceptable level of precision.

Discovery Studies

For discovery studies, precision refers to the smallest problem probability for which there is a reasonable chance (e.g., 80%) of detection in the study. When n = 5 in a discovery study, the precision for an 80% likelihood of discovery (at least once) is 27.5%. That means there is a good chance of discovering problems that happen to about a quarter of the population represented by the sample, a decent chance of detecting problems with slightly lower frequencies of occurrence, and a very good chance of detecting more frequent problems.

Estimation Studies

For estimation studies, precision is the size of the margin of error. For the magic range in Table 3, we focused on the precision for binary data because it is more variable than rating scale data. When n = 30 with 95% confidence, the margin of error for binary data is ±16.8% and for rating scale data is ±9.3 points (when rescaled from 0 to 100 points). We often use 90% confidence in our planning, but as shown in Table 3, there isn’t much difference in the margins of error as a function of this small difference in confidence (±14.4% for binary data and ±7.8 points for rating scales).

Comparison Studies

For comparison studies, precision is the critical difference—the smallest difference that will be found to be statistically significant. Because the magnitudes of the critical differences are greatly affected by the experimental design (within- or between-subjects), we created separate tables for them (Tables 4 and 5). All other things being equal, comparison studies are less precise than estimation studies, so we set the lower bound to n = 40.

For within-subjects comparisons with 90% confidence, this lower bound allows significant detection (p < .10) of differences of 29% for binary data and 10 points for rating scales. For between-subjects comparisons, these critical differences are, respectively, 40% and 20 points. Using 95% rather than 90% confidence consistently raises these critical differences, but not by much.

Choosing an Upper Bound

The upper bound of a magic range is driven by the diminishing returns associated with improvements in precision as sample sizes get larger.

Discovery Studies

As shown in Figure 1, increasing the sample size increases the likelihood of problem detection. When n = 20, the likelihood of discovery is 80% for problems that will happen to 7.8% of participants, a significant improvement over the lower bound of n = 5, where the precision is 27.5%. An additional five participants (n = 25) reduces that precision only to 6.3%, so in many research contexts, the cost of the additional five participants would not justify the small (1.5 percentage point) gain in precision.
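Both the 27.5% lower-bound precision and these upper-bound figures come from solving the discovery model for p. Here is a quick sketch (our arithmetic, which matches the cited values within rounding):

```python
def discovery_precision(n: int, goal: float = 0.80) -> float:
    """Smallest problem probability with at least `goal` chance of
    being seen at least once in n sessions (solves 1-(1-p)^n = goal)."""
    return 1 - (1 - goal) ** (1 / n)

print(f"{discovery_precision(5):.1%}")   # 27.5%
print(f"{discovery_precision(20):.1%}")  # ~7.7%
print(f"{discovery_precision(25):.1%}")  # ~6.2%
```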

Estimation Studies

With an upper bound at n = 300 (Table 3), the binary margins of error are ±4.7% for 90% confidence and ±5.6% for 95% confidence; for rating scales, they are ±2.4 and ±2.8 points, respectively. Increasing the sample size to 400 would reduce the binary margins of error by less than 1% and the rating scale margins of error by less than half a point, little gain for the additional expense.

Comparison Studies

At the upper bound of n = 400 (Tables 4 and 5), there is little difference between the estimates for 90% and 95% confidence. Averaging across the confidence levels, the within-subjects precision (critical difference) for binary data is about 10% and just over 3 points for rating scales. For between-subjects, the binary precision is about 13%, and for rating scales, it is about 6 points. Increasing sample sizes to 1,000 participants improves precision to an extent (about 6.5% and 2.1 points within subjects; about 8.5% and 4.1 points between subjects), but this requires more than twice the number of participants.

Summary and Discussion

These graphs and tables make it very clear that there is no “magic number” for UX research studies. There isn’t even one “magic range.”

By considering competing statistical and logistical considerations, we’ve recommended ranges for three of the most common high-level UX research goals: discovery, estimation, and comparison.

When planning a UX research study, the first step is to identify the research goal, which establishes the magic range (5 to 20 for discovery, 30 to 300 for estimation, 40 to 400 for comparison).

The magic ranges are too wide for precise planning, so the next step is to look inside the magic range to find the best balance between precision and cost for your specific study. Only if you can’t find that balance inside the magic range should you look outside the range.

For example, suppose you need to conduct a between-subjects comparison of two groups and have a budget for only 15 participants per group. Consulting Table 5, this total sample size of 30 means you can detect, with 90% confidence, a binary difference of 46% or a rating scale difference of 23 points. If that level of precision is satisfactory, go ahead and run the study. If not, then you should save your money. As David Salsburg wrote in his book on the history of statistics, The Lady Tasting Tea (p. 265):

A careful examination of resources available often produces the conclusion that it is not possible to answer that question with those resources. I think that some of my major contributions as a statistician were when I discouraged others from attempting an experiment that was doomed to failure for lack of adequate resources. For instance, in clinical research, when the medical question posed will require a study involving hundreds of thousands of patients, it is time to reconsider whether that question is worth answering.

On the other hand, there might be a legitimate reason to plan for a margin of error with 90% confidence of ±2.6% for binary data or ±1.3 points for rating scales, even though either would require n = 1,000 (Table 3). When the need justifies the expense, run the study.

And don’t you wish you could go back to those math tests and provide a range instead of that one number? The answer is between 30 and 300!
