If you ask five users to take a look at a website or application you will find usability problems. If you fix those problems then ask another five users you will get another set of problems. Over time there will be fewer and fewer problems found, but a new set of users will still continue to find new problems. Why? Because each user is doing slightly different things with slightly different parts of your interface. Only certain combinations of functions and actions will reveal problems with the user experience (most problems aren’t inherent to the code).

Most people can understand that there is a diminishing return with testing users. Fewer believe you can actually quantify the percent of problems found, and so they are dubious when they hear claims such as "five users can detect 85% of problems." There's good reason for the skepticism. They're right: you can never know the total number of problems (if you did, you'd go and fix them). Instead, you can only quantify the percent of problems found among those that affect a certain percent of your users, for a specific set of tasks.

So after testing five users you have only found about 85% of the problems that affect 31% or more of your users given those tasks. The sample size computation based on the binomial works well provided you don't switch tasks, switch users or use open-ended exploratory tasks. This is a limitation of the mathematical model, but every scientific model is an oversimplification of the real world. All models are wrong, but some are useful. The binomial model is useful because it is simple, familiar and works, provided we don't try to overstate our results.
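The 85% figure can be reproduced with a few lines of arithmetic. The sketch below assumes the simplest form of the binomial model described above, where each user independently encounters a given problem with the same probability; the function name `discovery_rate` is made up for illustration.

```python
def discovery_rate(p: float, n: int) -> float:
    """Chance that a problem affecting a proportion p of users
    is seen at least once when n users attempt the same task.

    Assumes each user hits the problem independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Probability that all n users miss the problem is (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


# Problems affecting 31% of users, tested with five users.
rate = discovery_rate(0.31, 5)
print(f"{rate:.1%}")  # prints 84.4%
```

The result is a little over 84%, which is the number usually rounded up to 85%.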

So open-ended requests like "go shopping on the website for a few minutes" or "take a look at the site and tell us what you think" mean users are likely to encounter vastly different parts of your interface. Using this unfocused strategy would be like giving a survey to your users but changing some or all of the questions and answer choices while you're still collecting data (not recommended).

Even if you have specific tasks you will still only uncover some of the problems (most of the obvious ones but few of the not-so-obvious ones). For example, if you ask five members of your subscriber base to add the same product to the shopping cart on your website, you will see most of the obvious problems. Don't be surprised if after three weeks someone complains about a problem with a field on your shopping cart. You didn't find all the problems with five users; you only found the obvious ones. So the problem this user reported is likely experienced by fewer than 31% of your users. Even if it only affects 1 out of 100 users, you should fix it if you can, especially if it is a critical problem.
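To see why the less obvious problems slip through, it helps to run the same binomial arithmetic for problems of different frequencies. This is a rough sketch under the same independence assumption; the helper name `chance_seen` and the 10% row are illustrative choices, not figures from a study.

```python
def chance_seen(p: float, n: int = 5) -> float:
    """Chance that at least one of n users hits a problem
    that affects a proportion p of all users."""
    return 1 - (1 - p) ** n


# How likely are five users to reveal problems of varying frequency?
for p in (0.31, 0.10, 0.05, 0.01):
    print(f"affects {p:.0%} of users -> "
          f"{chance_seen(p):.0%} chance five users reveal it")
```

Under these assumptions, a problem that affects 1 in 100 users has only about a 5% chance of showing up in a five-user test, so hearing about it weeks later is what the model predicts.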

So is the five user heuristic even useful?  It is if you:

  1. Know who your users are.
  2. Have users perform realistic closed-ended tasks with clear objectives (e.g. add a 40 inch Samsung Flat-Screen TV to the shopping cart).
  3. Know that with five users you have only identified 85% of the more obvious problems (those affecting more than a third of all users) and just a few of the less obvious problems.
  4. Start over whenever you change the users or tasks.

If you decided you only had time to test your shopping cart, don’t be surprised if you get complaints about your registration page, contact form or search screen—you didn’t test these.

If you need to be sure you’ve found more than the more obvious problems then you need to test more than five users.

And even if you've diligently tested 37 users on the same closed-ended task you will still see new problems. Why? Because the problems still left to discover affect a smaller and smaller percent of your users (less than 5%). On a website that gets thousands of visitors a day, that means you'll quickly see new problems that were not found in testing, and so testing with a larger sample size might be necessary.
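The figure of 37 users follows from solving the same binomial formula for the sample size. Here is a minimal sketch, assuming an 85% discovery target and independent users; `users_needed` is a hypothetical helper name.

```python
import math


def users_needed(p: float, target: float = 0.85) -> int:
    """Smallest number of users n such that a problem affecting a
    proportion p of users has at least a `target` chance of being seen.

    Solves 1 - (1 - p) ** n >= target for n.
    """
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p))


print(users_needed(0.05))  # prints 37
print(users_needed(0.01))  # prints 189
```

Going from problems that affect 5% of users to problems that affect 1% raises the required sample from 37 to 189, which is why chasing ever rarer problems gets expensive quickly.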

When you run out of money, time and patience for testing, know that there are still problems out there waiting to be encountered by your users. But take comfort in knowing that these problems affect a smaller and smaller percent of your users, and move on to finding and fixing problems in other parts of the application.