Wondering about the origins of the sample size controversy in the usability profession? Here is an annotated timeline of the major events and papers which continue to shape this topic.
The Pre-Cambrian Era (Up to 1982)
It’s the dawn of Usability Evaluation and the first indications of diminishing returns in problem discovery are emerging.
1981: Alphonse Chapanis and colleagues suggest that observing about five to six users reveals most of the problems in a usability test.
1982: Wanting a more precise estimate of sample size than 5-6, Jim Lewis published the first paper describing how the binomial distribution can be used to model the sample size needed to find usability problems. The model gives the chance of observing, at least once in a sample of “n” users, a problem that affects a proportion “p” of users, for a given set of tasks and user population.
The Dark Ages (1983-1989)
As big-hair ’80s bands proliferate, little happens in the sample size literature.
The Cambrian Explosion (1990-1994)
The use of GUIs explodes, and the need for more precision in sample size estimates generates multiple papers that reassuringly propose using the binomial to model sample sizes.
1990: Robert Virzi details three experiments at the HFES conference replicating earlier work from Nielsen. His paper explicitly uses 1-(1-p)^n, which is the same binomial formula Jim Lewis used 8 years earlier. He later published these findings in more detail in a 1992 Human Factors paper. The two papers state:
- Additional subjects are less and less likely to reveal new information
- The first 4-5 users find 80% of problems in a usability test (avg. p of .32)
- Severe problems are more likely to be detected by the first few users.
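As a rough check on the figures above, here is a minimal sketch of the 1-(1-p)^n calculation. The function name `proportion_discovered` is made up for this illustration, and the sketch assumes every problem has the same probability p of being hit by each user, independently of the other users.

```python
def proportion_discovered(p: float, n: int) -> float:
    """Expected share of problems seen at least once by n users.

    Assumes each user independently encounters a given problem
    with the same probability p (the simple binomial model).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # (1 - p) ** n is the chance that all n users miss the problem.
    return 1.0 - (1.0 - p) ** n


# Discovery curve for Virzi's average p of .32.
for n in range(1, 9):
    print(f"n={n}: {proportion_discovered(0.32, n):.1%}")
```

With p = .32, four users come out a little under 80% and five users a little over 85%, which is in line with the "first 4-5 users find 80%" summary.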
1991: Wright and Monk also show how 1-(1-p)^n can be used to identify sample sizes in iterative usability testing.
1993: Jakob Nielsen and Tom Landauer, in a separate set of eleven studies, found that a single user or heuristic evaluator on average finds 31% of problems. Using the Poisson distribution they also arrive at the formula 1-(1-p)^n.
1994: Jim Lewis, who had been sitting in the audience at Robert Virzi’s 1990 HFES talk, wondered how severity and frequency could be associated. His 1994 paper confirmed Virzi’s first finding (the first few users find most of the problems) and partially confirmed the second (his p was .16, not .32). His data did not show that severity and frequency are associated. It could be that more severe problems are easier to detect or it could be that it is very difficult to assign severity without being biased by frequency. There has been little published on this topic since then.
Dot-Com Boom (1995-2000)
Usability goes mainstream and people are too busy counting stock options to write much about sample sizes.
2000: Nielsen publishes the widely cited web article “Why you only need to test with five users,” which summarizes the past decade’s research. Its graph comes to be known as the “parabola of optimism.”
While practitioners are still sitting in Aeron chairs and counting near-worthless options, skepticism builds over the magic number five.
2001: Jared Spool & Will Schroeder show that serious problems were still being discovered even after dozens of users (disagreeing with Virzi’s findings but agreeing with Lewis’s). This was later reiterated by Perfetti and Landesman. Unlike most studies, these authors used open-ended tasks, allowing users to freely browse up to four websites looking for unique CDs.
2001: Caulton argues that different types of users will find different problems and suggests including an additional parameter for the number of sub-groups of users.
2001: Hertzum and Jacobsen caution that an average problem frequency estimated from the first few users will be inflated.
2001: Lewis provides a correction for estimating the average problem occurrence from the first 2-4 users.
2001: Woolrych and Cockton argue that problems don’t uniformly affect users so a simple estimate of problem frequency (p) is misleading. Instead they state a new model is needed to account for the distribution of problem frequency.
2002: Carl Turner, Jim Lewis and Jakob Nielsen respond to criticisms of 1-(1-p)^n at a panel at UPA 2002.
2003: Laura Faulkner also shows variability in users encountering problems. While on average five users found 85% of problems in her study, some combinations found as few as 55% or as many as 99%.
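The kind of spread described here can be illustrated with a small simulation. This is only a sketch under loud assumptions: the problem frequencies in `HYPOTHETICAL_PROBS` are invented for the example (they are not Faulkner's data), users are treated as independent random draws rather than resampled from real participants, and `simulate_discovery` is a made-up helper name.

```python
import random


def simulate_discovery(problem_probs, n_users, n_trials, seed=0):
    """Return the share of problems found in each simulated test of n_users."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    shares = []
    for _ in range(n_trials):
        found = 0
        for p in problem_probs:
            # A problem counts as found if at least one user hits it.
            if any(rng.random() < p for _ in range(n_users)):
                found += 1
        shares.append(found / len(problem_probs))
    return shares


# Hypothetical frequencies: a few common problems and several rare ones.
HYPOTHETICAL_PROBS = [0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

shares = simulate_discovery(HYPOTHETICAL_PROBS, n_users=5, n_trials=2000)
mean_share = sum(shares) / len(shares)
print(f"min={min(shares):.0%} mean={mean_share:.0%} max={max(shares):.0%}")
```

The point of the sketch is only that individual five-user samples scatter around the average, so an average discovery rate says little about what any one sample of five will find.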
2003: Dennis Wixon argued the discussion about how many users are needed to find problems is mostly irrelevant and the emphasis should be on fixing problems (RITE method).
2003: A CHI panel with many of the usual suspects defends and debates the legitimacy of the “Magic Number 5.”
Clarifications (2006 – Present)
2006: In a paper based on the panel at UPA four years earlier, Carl Turner, Jim Lewis and Jakob Nielsen review the criticisms of the sample size formula but show how it can and should be legitimately used.
2006: Jim Lewis provides a detailed history of how we find sample sizes using “mostly math, not magic.” It includes an explanation of how Spool and Schroeder’s results can be explained by estimating the value of p for their study and putting that value into 1-(1-p)^n.
2007: Gitte Lindgaard and Jarinee Chattratichart using CUE-4 data remind us that if you change the tasks you’ll find different problems.
2008: In response to calls for a better statistical model, Martin Schmettow proposes the beta-binomial to account for the variability in problem frequency but with limited success.
2010: I wrote an article visually showing how the math in the binomial predicts sample sizes fine–the problem is in how it’s often misinterpreted. The article reiterates the important caveats made for the past decades about the magic number 5:
- You won’t know if you’ve seen 85% of ALL problems, just 85% of the more obvious problems (the ones that affect 31% or more of users).
- The sample size formula only applies when you test users from the same population performing the same tasks on the same applications.
- As a strategy, don’t try to guess the average problem frequency. Instead, choose a minimum problem frequency you want to detect (p) and the binomial will tell you how many users you need to observe to have a good chance of detecting problems with at least that probability of occurrence.
If you approach sample sizes this way you avoid the problem of the variability in problem frequency and don’t have to make any assumptions about the total number of problems in an interface.
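One way to turn that strategy into a calculation is to solve 1-(1-p)^n for n. The sketch below does that; the function name `users_needed` and the 85% target in the example are choices made for illustration, not something prescribed by the papers above.

```python
import math


def users_needed(min_p: float, target: float) -> int:
    """Smallest n such that 1 - (1 - min_p)**n >= target.

    min_p  : lowest problem frequency you want to be able to detect
    target : desired chance of seeing such a problem at least once
    """
    if not 0.0 < min_p < 1.0:
        raise ValueError("min_p must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Rearranging 1 - (1 - p)^n >= target gives
    # n >= log(1 - target) / log(1 - p).
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - min_p))


# Users needed for an 85% chance of seeing problems of various frequencies.
for min_p in (0.31, 0.10, 0.05, 0.01):
    print(f"p >= {min_p:.2f}: {users_needed(min_p, 0.85)} users")
```

Note that five users give roughly an 84% chance of seeing a problem that affects 31% of users, so a strict 85% target rounds up to six. Rarer problems need far larger samples, which is why the choice of minimum p matters more than any single magic number.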
The Pre-Cambrian Era (Up to 1982)
- Al-Awar, J., Chapanis, A., and Ford, R. (1981). Tutorials for the first-time computer user. IEEE Transactions on Professional Communication, 24, 30-37.
- Lewis, J. R. (1982). Testing small system customer setup. In Proceedings of the Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: HFES. [pdf]
The Cambrian Explosion (1990-1994)
- Virzi, R. A. (1990). Streamlining the design process: running fewer subjects. Proceedings of the Human Factors Society 34th Annual Meeting (pp. 291-294). Santa Monica, CA: HFES.
- Wright, P. C., and Monk, A. F. (1991). A cost-effective evaluation method for use by designers. International Journal of Man-Machine Studies, 35, 891-912.
- Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-471.
- Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 206-213). Amsterdam: ACM.
- Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier. [pdf]
- Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.[pdf]
- Caulton, D. A. (2001). Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20, 1-7. [pdf]
- Spool J., & Schroeder W. (2001). Testing web sites: five users is nowhere near enough, CHI ’01 extended abstracts on Human factors in computing systems, March 31-April 05, Seattle, Washington. [pdf]
- Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 15, 2010 from
- Turner, C. W., Lewis, J. R., & Nielsen, J. (2002). UPA Panel: How many users is enough? Determining usability test sample size
- Wixon, D. (2003). Evaluating usability methods: Why the current literature fails the practitioner. interactions, 10(4), July + August.
- Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479. [pdf]
- Hertzum, M. & Jacobsen, N. J. (2003 – corrected version, original published in 2001). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15, 183-204. [pdf]
- Woolrych, A. & Cockton, G., (2001), Why and when five test users aren’t enough. In Vanderdonckt, J., Blandford, A. and Derycke A. (eds.) Proceedings of IHM-HCI 2001 Conference, Vol. 2 (Toulouse, France: Cépadèus Éditions), pp. 105-108. [pdf]
- Bevan, N., Barnum, C., Cockton, G., Nielsen, J., Spool, J., and Wixon, D. 2003. The “magic number 5”: is it enough for web testing?. In CHI ’03 Extended Abstracts on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA, April 05 – 10, 2003). CHI ’03. ACM, New York, NY, 698-699
Clarifications (2006 – Present)
- Turner, C. W., Lewis, J. R., & Nielsen, J. (2006). Determining usability test sample size. In W. Karwowski (ed.), International Encyclopedia of Ergonomics and Human Factors (pp. 3084-3088). Boca Raton, FL: CRC Press. [pdf]
- Lewis, J. R. (2006). Sample sizes for usability tests: mostly math, not magic. interactions 13, 6 (Nov. 2006), 29-33.
- Lindgaard, G., & Chattratichart, J. (2007). Usability testing: what have we overlooked?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose, California, USA, April 28 – May 03, 2007). CHI ’07. ACM, New York, NY, 1415-1424. [pdf]
- Schmettow, M. (2008), “Heterogeneity in the Usability Evaluation Process,” in Proceedings of the 22nd British HCI Group Annual Conference on HCI 2008: People and Computers XXII: Culture, Creativity, Interaction – Volume 1, ACM, Liverpool, UK, pp. 89-98. [pdf]
- Sauro (2010) Why you only need to test with five users (explained) Retrieved July 15, 2010