Sample Size in Usability Studies: How Well Does the Math Match Reality?

Jeff Sauro, PhD

We’ve written extensively about how to determine the right sample size for UX studies.

There isn’t one sample size that will work for all studies.

The optimal sample size is based on the type of study, which can be classified into three groups:

  1. Comparison studies: Comparing metrics for statistical differences
  2. Standalone studies: Estimating a population metric (such as completion rate or perceived ease)
  3. Problem discovery: Uncovering problems in an interface (this is the classic usability study)

The sample size needed for the problem discovery study type (the classic usability study) is the source of one of the more enduring and misunderstood controversies in UX research: the magic number five.

There are times when five users will be sufficient, but many times five will fall far short. In usability studies where the goal is to uncover usability issues, five users can be enough. You can use a mathematical model to help plan how many users you should test based on how likely problems are to occur.
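As a rough illustration of that kind of planning model, here is a minimal Python sketch. It assumes the commonly cited discovery formula 1 − (1 − p)^n, where p is the proportion of users a problem affects and n is the number of users tested. The function names are illustrative, not taken from the book.

```python
def discovery_probability(p, n):
    """Chance that a problem affecting proportion p of users is seen at least once among n users."""
    return 1 - (1 - p) ** n

def users_needed(p, goal):
    """Smallest n whose discovery probability reaches the goal (assumes 0 < p <= 1 and goal < 1)."""
    n = 1
    while discovery_probability(p, n) < goal:
        n += 1
    return n

for p in (0.31, 0.10, 0.05):
    print(f"p = {p:.2f}: chance of seeing the problem with 5 users = {discovery_probability(p, 5):.0%}")
# p = 0.31: chance of seeing the problem with 5 users = 84%
# p = 0.10: chance of seeing the problem with 5 users = 41%
# p = 0.05: chance of seeing the problem with 5 users = 23%
```

Under this model, the rarer the problem, the more users it takes to have a good chance of seeing it even once.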

Jim Lewis and I wrote extensively about the math in Chapter 7 of Quantifying the User Experience. But it’s easy to get lost in the math and qualifications of problem occurrence rates and likelihood-of-detection rates. In this article I’ll use some real usability test data to see what happens with just five users.

Five Users Reveal Most of the Common Issues

The plain-language way to think of what five users can do for you is this: with five users, you’ll see MOST of the COMMON problems. The two key words are most and common. It’s wrong to think that five users will uncover all common problems—or even worse, all problems.

To help provide some idea about what it means to see most of the common problems with five users, I pulled together usability data from seven larger-sample usability studies, some of which we conducted in the last year.

The studies were a mix of in-person moderated studies and unmoderated studies with videos. All studies had one or two researchers coding the usability issues. Across the seven studies there were four different researchers (two conducted multiple studies).

The interfaces tested included three consumer websites, one that featured real-estate listings and two that were rental car websites (Budget and Enterprise, which we tested in 2012). Two interfaces were quite technical web-based applications for IT engineers. There was one mobile website for a consumer credit card. For the final interface, study data was sent to us anonymized with only the problem matrix; we knew few details about the problems or interface other than it was a B2B application. The sample sizes were relatively large compared to typical formative studies, with the smallest having 18 participants and the largest 50.

The datasets contain a mix of usability problems (e.g., users struggle to find “Add-ons” when renting a car) and insights (users suggested some new features but didn’t encounter any problems) that were collected to fulfill specific research requests for each project. These datasets provide a reasonable range of typical usability issues and insights reported in usability reports and offer a good range of different usability problem coding types (including both granular and broader issues) and facilitation styles.

Table 1 shows an overview of the seven datasets, including the type of interface, the sample size, the number of problems, and the average (unadjusted) problem occurrence. The problem occurrence is unadjusted in that it’s just the average of all problems across the number of users. For example, if there are three problems, one that affected 10 out of 30 users (33%), another that affected 20 out of 30 (67%) and another that affected only 1 out of 30 (3%), the average problem occurrence is the average of all three, or 34%.

This is different from the adjusted problem occurrence that takes into account the number of unique issues and may be a better estimate of problem occurrence, especially when the sample sizes are smaller than the ones in this analysis. Jim describes this in detail in Chapter 7 of Quantifying the User Experience.
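For the unadjusted figure (not the adjusted one), the three-problem example works out as follows in this minimal Python sketch. The function and variable names are illustrative.

```python
def average_problem_occurrence(users_affected, sample_size):
    """Unadjusted average: the mean of each problem's share of affected users."""
    rates = [count / sample_size for count in users_affected]
    return sum(rates) / len(rates)

# Three problems seen by 10, 20, and 1 of 30 users, as in the example above.
example = average_problem_occurrence([10, 20, 1], 30)
print(f"{example:.0%}")  # 34%
```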

App Type          Sample Size   Issues Found   Problem Occurrence (Unadjusted)
---------------   -----------   ------------   -------------------------------
Credit Card App   50            25             0.07
B2B IT App        30            131            0.12
B2B IT App        30            141            0.08
Enterprise Web    45            33             0.12
Budget Web        38            24             0.13
B2B App           18            41             0.23
Real-Estate Web   20            11             0.20
Total             231           406

Table 1: Seven large-sample usability studies, with interface type, sample size, number of problems/insights discovered, and average (unadjusted) problem occurrence.

One of the first things you should notice from Table 1 is that the average problem occurrence for every study is well below .31 (or 31%), which was the average problem occurrence Nielsen and Landauer found when they reviewed usability studies in the 1990s. This low problem occurrence is partially explained by the large sample sizes we used for this analysis. Nielsen and Landauer had an average sample size of 20.4 (13, 15, 20, 24, and 30) compared to this dataset, with an average of 33 participants. But the lower occurrence is also likely a consequence of interfaces (or parts of interfaces) that have less common usability issues. Three were consumer-based highly trafficked websites and two were B2B IT apps that tested only a small portion of the “onboarding” experience.

406 Total Problems and Insights Uncovered

Across all the datasets, there were 406 unique issues uncovered from the 231 users. Some issues were encountered by a lot of users within a study, whereas most were encountered by only a few or even one of the users. I sorted the problems by frequency to identify four tiers of problems as shown in Table 2.

The first tier is the most common problems. I used the Nielsen threshold of 31% as the lower bound. This is how “common” is defined in the five-user context, which is the first important qualification about the five-user convention.

The average problem occurrence for these 35 common problems was 54%—meaning on average, each problem was encountered by around half of the users in the study.

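Problem Tier   Number of Problems   Avg. p   % Found in 1st 5 Users   # Found in 1st 5 Users
------------   ------------------   ------   ----------------------   ----------------------
31% to 100%    35                   54%      91%                      32
20% to 30%     27                   24%      63%                      17
5% to 19%      164                  10%      40%                      65
<5%            180                  3%       22%                      39
All            406                  -        38%                      153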

Table 2: The 406 problems and insights sorted by their frequency of occurrence.

The next tier included problems that were encountered by 20% to 30% of participants. These 27 problems had an average problem occurrence of 24%. The next two tiers contained most of the problems but were encountered by a smaller percentage: 5% to 19% had 164 problems and less than 5% had 180 problems. This last group also contains the “one-off” problems that only one user encountered.

The final two columns in Table 2 show the percentage and number of problems we would have seen if we stopped testing at five users. You can see that for the most common problem tier, the first five users uncovered 91% of the problems (32 of the 35). For the second-most common problem tier, five users still uncovered the majority of the issues (63%) but less than the first tier. As the average problem occurrence dropped in the final two tiers, the percentage of problems seen in the first five also, as expected, went down. Five users only uncovered 40% of problems that affected between 5% and 19% of users and only 22% of the least common problems. Keep in mind that problem frequency (how many users encounter an issue) should be treated separately from how impactful (or severe) the problem is on users.

The final row in Table 2 shows that five users uncovered 38% (153) of the 406 problems across all datasets. This should illustrate how five users will find most of the common issues, but certainly not most of the problems, especially in cases when the problems only affect a small percentage of users.
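As a rough cross-check, the tier averages in Table 2 can be plugged into the commonly cited discovery formula 1 − (1 − p)^n with n = 5. The Python sketch below is a simplification, not the analysis behind the table: it treats every problem in a tier as if it occurred at the tier's average rate, which real problems do not. The names are illustrative.

```python
def discovery_probability(p, n):
    """Chance that a problem affecting proportion p of users is seen at least once among n users."""
    return 1 - (1 - p) ** n

# (tier, average occurrence, share observed in the first five users), copied from Table 2.
tiers = [
    ("31% to 100%", 0.54, 0.91),
    ("20% to 30%", 0.24, 0.63),
    ("5% to 19%", 0.10, 0.40),
    ("<5%", 0.03, 0.22),
]

# Assumption: each tier is represented by its average occurrence alone.
predicted = {label: discovery_probability(p, 5) for label, p, _ in tiers}

for label, p, observed in tiers:
    print(f"{label:>12}: predicted {predicted[label]:.0%}, observed {observed:.0%}")
```

The predictions should not be expected to match the observed shares exactly, because each tier mixes problems of different frequencies and the observed figures come from one particular set of first five users per study.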

Discussion and Takeaway

A review of seven usability problem datasets revealed:

Five users revealed most of the most common issues. The first five users uncovered the vast majority (91%) of the most common issues. The most common issues were defined as those that impacted 31% to 100% of all users in their respective usability study. These issues were on average quite common, with an average problem occurrence of 54%. For the next most common tier of problems, five users still uncovered MOST of these issues, 17 of the 27 (63%).

Five users didn’t uncover most problems. Of the 406 issues, the first five users uncovered 38% of the issues, leaving most of the issues undiscovered. The first five users only uncovered 22% of the least common issues (those that affected fewer than 5% of users). It may sound counterintuitive, but the difference is all about the frequency of problems. Many researchers and articles either neglect to consider, or don’t understand, the importance of problem frequency when considering using only five users for a study. Five users will find MOST of the COMMON issues, not most of ALL issues.

Results were expected despite loose controls. The formula for uncovering problems (the binomial probability formula) assumes all things stay the same from user 1 through user 100 and beyond. Sort of like flipping the same coin the same way 100 times. That means the tasks need to stay the same, the interface doesn’t change, the evaluators don’t change, and the user type (their experience especially) doesn’t change. If any one of these changes as you test more users, you’ll likely start seeing new issues and the formula won’t be as reliable. Of course, these sorts of controls may seem too idealistic for the often ad hoc nature of usability testing. What’s interesting about this analysis is that there weren’t necessarily tight controls over all these aspects. The evaluators and tasks stayed the same, but the consumer websites may have changed in subtle ways due to A/B testing experiments or natural fluctuations in pricing or availability (rental car changes and real-estate listings). Despite the looser controls (being more ecologically valid), the formula still showed that a few users will uncover most common issues even under these typical testing conditions.

Users’ prior experiences affect problem discovery. One of the most unpredictable parts of testing with many users is that people’s variable experiences with a product (or the domain) can have unexpected results. For example, we’ll often see the first five or six users interact with an interface in a similar way but then the seventh or tenth user will interact with it quite differently. The result is a new set of problems not seen with the first few users. Often (but not always) this is a function of the participant’s different experience with the product or the domain. For example, they may use the product a lot less or a lot more, or just use the interface in a different way. This can happen even when we try to control for years of experience and frequency of use. For hard-to-find user populations, recruiting criteria might get relaxed to gather some insights. Even if you recruit for users with NO experience, participants may have experience with similar products or know more about the domain than others (e.g., have more IT admin knowledge). Usually diversity of user types is a good thing for uncovering more problems even though it may make predicting the number of problems less stable.

The evaluator effect may play a role. We have written about how different evaluators (facilitators and researchers) tend to uncover different problems even when watching the same users. We didn’t explore how that may impact the results here; it would most likely reduce the number of common issues uncovered if different evaluators are used. This is a subject of a future analysis.

More data and simulations in future analyses. In this analysis we only examined what would happen if we stopped at the first five chronological users in each study. A future analysis can also examine what percent of problems would be encountered by looking at different combinations of five users, for example, in a Monte Carlo type simulation.
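A Monte Carlo type simulation of this kind could be sketched as follows in Python. This is one possible approach under stated assumptions, not the planned analysis: the problem matrix is invented for illustration (it is not data from the seven studies), and the function names, occurrence rates, and iteration count are all hypothetical.

```python
import random

def simulate_discovery(matrix, sample_size=5, iterations=2000, seed=0):
    """Average share of known problems seen by random subsets of users.

    matrix holds one row per user; each row has one 0/1 flag per problem
    (1 means that user encountered the problem).
    """
    rng = random.Random(seed)
    n_problems = len(matrix[0])
    total_share = 0.0
    for _ in range(iterations):
        sample = rng.sample(matrix, sample_size)  # users drawn without replacement
        found = sum(1 for j in range(n_problems) if any(row[j] for row in sample))
        total_share += found / n_problems
    return total_share / iterations

def make_hypothetical_matrix(n_users, occurrence_rates, seed=1):
    """Invented data for illustration only, not taken from the seven studies."""
    rng = random.Random(seed)
    matrix = [[int(rng.random() < p) for p in occurrence_rates] for _ in range(n_users)]
    # Keep only problems that at least one user encountered, as a real study would.
    seen = [j for j in range(len(occurrence_rates)) if any(row[j] for row in matrix)]
    return [[row[j] for j in seen] for row in matrix]

matrix = make_hypothetical_matrix(30, [0.6, 0.4, 0.25, 0.1, 0.05, 0.03])
print(f"Average share found by 5 random users: {simulate_discovery(matrix):.0%}")
```

Averaging over many random groups of five removes the dependence on which five users happened to come first in a study.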
