The discerning usability analyst should employ a mix of both qualitative and quantitative methods when discovering usability problems. Relying heavily on a qualitative approach can lead to a severe misdiagnosis, especially when usability problems are difficult to detect. This article is a response to Nielsen’s “Risks of Quantitative Studies” and shows how the problems voters had with the “butterfly ballot” in the Florida 2000 election would not have been detected with popular discounted qualitative methods. The problems with relying on one-size-fits-all usability guidelines such as “testing with only five users” and the inherent bias of pay-for-hire gurus are also discussed.

Introduction

In his most recent article, “Risks of Quantitative Studies” [8], Jakob Nielsen presents some valid points on the limits of quantitative methods, yet his points are lost in a sea of bombastic exaggerations and over-generalizations. Jakob essentially warns against unnecessary “number-fetishism” via half-baked quantitative methods and against going “fishing for a significant p value.” Yet risks are inherent to every method of usability engineering. There are as many risks (in my contention, more risks) in relying solely or even heavily on qualitative methods and dismissing the importance of quantitative analysis in usability testing. The emphases in Nielsen’s article are clear:

  1. Try as you might to use quantitative methods, you’re not experienced enough so you should hire a usability guru to do it for you.
  2. Any new ideas in the usability literature are probably hyped-up by researchers trying to get published, whereas true understanding of usability has been shown already and very little new can come from this “very stable field.”

There is no doubt that qualitative problem discovery is an invaluable approach to identifying problems that may fall through the cracks of traditional quantitative methods, especially during the early phases of usability improvement. However, in many cases a quantitative approach that employs statistics is necessary, such as when trying to show a usability improvement in a new release of a product or when the likelihood of detecting a problem is low.

Any method improperly applied and interpreted can lead to erroneous conclusions; this doesn’t necessarily make the method inappropriate or invalid. A well-intentioned usability analyst using qualitative methods can misdiagnose a problem and its solution just as easily as she can find a spurious correlation. A responsible usability analyst should employ a mix of both qualitative and quantitative methods in discovering and fixing usability problems. There are inherent risks to both methods: understanding the risks and mitigating them is the solution, not dismissing the methods’ efficacy. Understanding when and why a technique is germane is the mark of an experienced analyst. Blindly following any guideline from an expert or guru is always dangerous and can lead to misguided efforts.

One should approach usability analysis in much the same way as one would approach picking a stock to purchase. The analogy is helpful because when your money is on the line you really want to know the best techniques. For example, you can look at the quantitative metrics of a stock (P/E ratio, EPS, growth rates), but a prudent investor cannot ignore qualitative aspects such as knowing that the CEO isn’t a scrupulous businessperson. Warren Buffett once said his approach is 80% quantitative from Benjamin Graham and 20% qualitative from Philip Fisher. Sloppy work is sloppy work no matter what your p value is or how much you paid a high-priced guru.

Qualitative Studies: Even More Intrinsic Risks

Testing hundreds of users is time consuming and expensive. Doing so is not required to use quantitative methods or statistics. Instead, statistics are used to understand and manage your uncertainty about a problem. A research hypothesis should be clearly defined and the appropriate methodology should be used to test the hypothesis. The careful analyst should always be aware of the limitations of their data, the “observer effect” and other unknown factors. Take the example given by Jakob [8] about the “Butterfly-Ballot” problem:

The “butterfly ballot” in the 2000 election in Florida is a good example: a study of 100 voters would not have included a statistically significant number of people who intended to vote for Al Gore but instead punched the hole for Patrick Buchanan, because less than 1% of voters made this mistake. A qualitative study, on the other hand, would likely have revealed some voters saying something like, “Okay, I want to vote for Gore, so I’m punching the second hole … oh, wait, it looks like Buchanan’s arrow points to that hole. I have to go down one for Gore’s hole.” Hesitations and almost-errors are gold to the observant study facilitator, but to translate them into design recommendations requires a qualitative analysis that pairs observations with interpretive knowledge of usability principles.

Let’s hypothetically use Jakob’s recommendation of using 5 users [7] to test the ballot (he really only recommends 3-4 for testing [1], but we’ll use 5). To test this ballot we would have watched five users attempt to cast a vote for their candidate. Also assume that we split our sample of five into two users who intended to vote for Gore, two for Bush and one for Buchanan. Because we’ve sub-grouped, to be thorough we would probably need 3-4 users for each intended vote, since the two Bush voters further reduce our chances of detecting the problem among Gore/Buchanan voters. We’ll keep things simple, though, with a total of five users.

According to Tog’s article “one out of ten users had some problem with this design with approximately one in one hundred failing completely[14].” After observing our sample of five users we most likely would not have detected users having the alignment problem and most certainly would not have observed a user “failing completely” regardless of what method we used. Let’s say that one of the users encountered another problem (they didn’t understand part of the instructions). After observing only five users we’d have a totally different sense of the ballot’s state of usability. Herein lies the larger problem with only approaching user testing qualitatively. The analyst doesn’t know if the problems he’s found affect 1 out of 5 users or 1 out of 100, which results in “gut-feel” recommendations: sometimes you’re right, sometimes you’re wrong.
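To put rough numbers on “most likely would not have detected,” here is a minimal sketch. It assumes every user independently encounters the problem with the same probability p, which is the simple binomial assumption behind the five-user formula; the function name is illustrative only.

```python
def chance_of_seeing(p: float, n: int) -> float:
    """Probability that at least one of n users encounters a problem
    that affects a proportion p of all users (assumes independence)."""
    return 1 - (1 - p) ** n


# Rates taken from Tog's figures: 1 in 10 have some trouble, 1 in 100 fail.
print(f"{chance_of_seeing(0.10, 5):.0%}")  # some trouble, five users
print(f"{chance_of_seeing(0.01, 5):.0%}")  # complete failure, five users
print(f"{chance_of_seeing(0.10, 2):.0%}")  # some trouble, the two Gore voters only
```

Under these assumptions, five users give roughly a 41% chance of seeing the one-in-ten problem even once and roughly a 5% chance of seeing a complete failure. If only the two intended Gore voters can run into the problem, the chance drops to about 19%.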

Even Tog in his article suggests “A professional usability tester running twenty subjects in a 10 minute test each would have amply shown that something was seriously amiss” [14]. Actually, according to the formula Nielsen uses to derive sample size, you would have needed to observe 22 users, and even then you would only have a 90% chance of seeing this problem once. If you wanted to see this problem twice (with a 90% likelihood), you would need to observe 37 users! Tog doesn’t cite where he derived the 1 in 10 probability of having some problem, but the 1% problem of complete voting failure has been documented [17]. To actually observe a user completely failing to punch the hole of their intended candidate (1 in 100 chance), you would have to observe 225 users! To see how these numbers are derived see the sidebar Deriving a Problem Discovery Sample Size.
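The 22-user figure can be checked with a short sketch. It assumes the simple form of the discovery formula, 1 - (1 - p)^n, solved for n; the function name and the 90% default are illustrative, and the sidebar’s own derivation is not reproduced here.

```python
import math


def users_needed(p: float, goal: float = 0.90) -> int:
    """Smallest n for which 1 - (1 - p)**n >= goal, i.e. the number of users
    needed to see a problem of frequency p at least once with probability goal."""
    if not (0 < p < 1 and 0 < goal < 1):
        raise ValueError("p and goal must both lie strictly between 0 and 1")
    return math.ceil(math.log(1 - goal) / math.log(1 - p))


print(users_needed(0.10))  # the one-in-ten "some problem" rate
print(users_needed(0.01))  # the one-in-a-hundred complete failure rate
```

For p = 0.10 this gives 22 users. For p = 0.01 the simple form gives roughly 230, in the same range as the figure quoted above; the exact numbers in the text, including the 37 users needed to see the problem twice, come from the sidebar’s derivation.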

With that many users you also would have seen that the fictitious instructions problem encountered by the one user was not as common a problem (a false positive). At this point you would look to confirm or deny the hypothesis with more users or simply change the design. During any research investigation there are chances that you identified a false problem or failed to identify a real one (Type I and Type II errors, respectively); it’s a fundamental problem of any empirical investigation. The use of quantitative analysis and statistics allows us to better understand the chances that we’re missing a problem or falsely identifying a problem that doesn’t exist. What’s more, Tog goes on to state:

Studies show that most users don’t make mistakes when confronted with bad interfaces, they just slow way down. (The Florida results bear this out. One in ten people have trouble with the ballot, but only one in one hundred end up making an error.)[14]

I’m not sure what “studies” Tog is referring to but I agree with the implication: You need to take measurements of task time! In other words, quantitative methods in this case will provide information qualitative methods cannot.

Blindly computing numbers to produce attractive charts or fishing for a p-value is poor research. Making usability testing look like magic that only an oracle of usability can perform is misguided. While taking quantitative measures such as time on task, the analyst should also carefully observe the user and afterward ask about any hesitations or problems they might have had. Use the new information in addition to your quantitative measures to rebuild your hypothesis as you continue to test more users and look to confirm or deny that the problem exists.

The Pay-for-Hire Guru Bias

Every researcher and analyst has a bias, and the prudent reader should identify and understand the implications of that bias when reading any research report (including this one). A drug study sponsored by a drug company should raise a red flag, as should a report on online advertising sponsored by an advertising agency, as Jakob rightly points out [8]. The publication bias of “highlight[ing] new and interesting stories” [8] is real and should not be dismissed. However, such a bias can be mitigated over time as other researchers attempt to replicate results or uncover flaws in the methodology. Time is the final arbiter of veracity. More research should always be encouraged, not discouraged.

In a similar vein of cautious investigation, one should also interpret with caution the reports of gurus and consulting agencies that necessarily must survive on commissions. Their bias can be to bolster research that supports their methods and dismiss data that discounts their methods. You can see this in every issue of popular business publications such as The Harvard Business Review. There is a never-ending flow of “new” and “unique” approaches to identifying the next “aha” in business development or a new technique that will push you ahead of your competition. Of course the techniques are always just the tip of the iceberg: to really use them effectively you need to hire the high-priced firms and gurus who put forward those “new” and “unique” ideas.

One Size Doesn’t Fit All in Usability

The Florida ballot example, among many others, is reason to be cautious about one-size-fits-all usability testing guidelines such as “You only need to test with five users.” [7]

Today, almost everyone who does user testing has concluded that they learn most of what they’ll ever learn with about five users. [8]

I’m not sure who Jakob is talking to, but since the publication of his “Curve of Optimism” [6,7] there have been several articles pointing out significant limitations of his formula for calculating the number of users you need to test. Most notable are:

  1. More severe problems are not detectable by the first few users.[2]
  2. The formula fails to detect both frequency and severity of problems experienced by users. [18]
  3. The likelihood of detecting all problems is not equal—some problems require a lot more users to uncover than others [18]
  4. You need to know what usability problems you’re looking for ahead of time.[18]
  5. With a small sample (n<5) the formula overestimates the number of problems you have discovered. To correct this problem you need to re-estimate your discovery rate p after 2 users then again after 4 users by using an advanced statistical treatment to correct the overestimation of p. [3]
  6. The formula has limitations when applied to open-goal searching across multiple websites. [13]
  7. The formula is not applicable to testing usability improvements across versions through task times, error counts, completion rates or other measurements of comparable user performance (validation); it applies only to the more fuzzy “insights.” [1]
  8. Research from Molich et al. suggests that problem discovery is not as simple as it sounds, with different evaluators reporting different numbers of problems and different severities. [5]
  9. The butterfly-ballot example as explained in the sidebar.

Different Articles, Conflicting Views

Jakob’s valuable contribution to the field of HCI is unquestionable. His Alertbox articles of late, however, are sending out mixed messages. Just a few months prior, I whole-heartedly agreed with Jakob’s article on the value of using Six Sigma Quality Assurance methods in usability engineering [11]. His recent article seems to undercut that suggestion: “quantitative studies are often too narrow to be useful and are sometimes directly misleading.” [8] In advocating Six Sigma, Jakob advises: “We’d be wise to adapt some of the Six Sigma methodologies to aid our quest for improved Web quality.” [11] Six Sigma’s main tenet is that if you don’t measure something you truly don’t know it. In short, quantitative methods allow you to make usability improvements unattainable through qualitative methods. This concept is made very clear on the link Jakob provides to find more information on Six Sigma:

Six Sigma is a rigorous and disciplined methodology that uses data and statistical analysis to measure and improve a company’s operational performance by identifying and eliminating “defects” in manufacturing and service-related processes. From isixsigma

In the Six Sigma article Jakob explains that “Time-on-task is particularly important because the company is paying for employees’ time as they slowly slug their way across the intranet.”[11] Does Jakob propose qualitatively measuring task time and showing an improvement? And again will only five users be sufficient to show your users are actually “more efficient at completing a task”?

I usually advocate qualitative usability studies, because usability’s main goal is to drive the design. For formal quality assurance, however, you must run quantitative studies to collect hard numbers that show how well or poorly your design scores on the usability criteria you defined above.[11]

I’ll give Jakob the benefit of the doubt and assume he’s somehow making a distinction between Quality Assurance Usability Testing and the rest of Usability Testing. To me, there is no such distinction, as his Butterfly Ballot example shows. The diligent analyst should have approached the Florida Butterfly Ballot usability test like any usability test, with all the quantitative and qualitative methods at hand that provide for a thorough understanding of potential problems. When done properly, a usability test should provide both qualitative problem descriptions and quantitative measures such as frequency of occurrence, impact on task completion rates and task time.

A problem occurring only 1 in 100 times is difficult to detect, yet its consequences are real. This is where Six Sigma adds real value. A defect rate of 1 out of 100, or 1%, translates into a sigma value of 3.82 (2.32 before the conventional 1.5 sigma shift is added). When you have half a million opportunities for a defect, that translates into 5,000 voters casting the wrong vote. For a defect that can change an election, the sigma value needs to be much closer to 6, meaning only 1.7 out of 500,000 should cast the wrong vote. It’s easy to identify problems after they’ve occurred and show how your “expert method” would have detected them. Only through applying rigorous analyses such as Six Sigma or other appropriate quantitative methods ahead of time will such problems be identified and remedied.
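The arithmetic behind these figures can be sketched as follows. The sketch assumes the usual Six Sigma convention of adding a 1.5 sigma shift to the normal z-score of the defect-free proportion; the function names are illustrative, and the half-million figure is the number of opportunities used above.

```python
from statistics import NormalDist

SHIFT = 1.5  # conventional Six Sigma long-term shift (an assumption of this sketch)


def sigma_level(defect_rate: float) -> float:
    """Sigma level for a given defect rate: z-score of the defect-free
    proportion plus the 1.5 sigma shift."""
    return NormalDist().inv_cdf(1 - defect_rate) + SHIFT


def defect_rate_at(sigma: float) -> float:
    """Defect rate implied by a given (shifted) sigma level."""
    return 1 - NormalDist().cdf(sigma - SHIFT)


opportunities = 500_000
print(round(sigma_level(0.01), 2))                  # sigma level at a 1% defect rate
print(round(0.01 * opportunities))                  # expected wrong votes at 1%
print(round(defect_rate_at(6) * opportunities, 1))  # expected wrong votes at six sigma
```

Under this convention a 1% defect rate sits at a sigma level of roughly 3.8, which is 5,000 wrong votes in half a million, while a six sigma process would produce about 1.7.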

Bad Quantitative = Bad Qualitative

In a sidebar clarification to “Risks of Quantitative Studies,” Jakob goes further to explain the problem of statistical analysis in “Probability Theory and Fishing for Significance” [9] by using an example most of us encountered in an undergraduate statistics class.

This is why it is not valid research to conduct a study, collect lots of data about lots of variables, and then claim significance because some of the variables seem to correlate. Doing so is exactly the same as tossing lots of quarters, then reporting on the few coins that had an unusual outcome.[9]

I’m not sure of Jakob’s motivation for the article [8], but this example presents a dangerous rationale. It implies that usability practitioners who use statistics somehow cannot tell the difference between a coin-toss fluke predicted by the binomial probability formula and relevant user behavior. One should always be aware of the limitations of research methods (confounding effects, covariates and normality assumptions, to name a few), but these limitations should not prevent the methods from being used. Was Nielsen tossing quarters when he reported that “There is a strong positive association between users’ average task performance and their average subjective satisfaction…” in his 1994 article “Measuring Usability: Preference vs. Performance” [12], one of many papers relying heavily on analyzing “a bunch of variables and looking for a correlation”? I certainly don’t think so.

Discount Usability’s Cost, not its Methods

As more companies understand the importance of User Centered Design methods and use them as part of their product development, the easily detected low-hanging fruit identified through “expert-reviews” and other discount methods will provide less and less usability value. Usability practitioners will need to continue to refine their skills and understand the importance of quantitative assessments, something you can’t teach a product manager in a “Three-Day Usability Boot-Camp.” The only thing about usability testing that should be discounted is the cost, not the depth of analysis. Discounting your methods by relying on qualitative methods when a thorough quantitative analysis is warranted will result in discounted results and a discounting of your reputation. In time the discounted methods will clear themselves “off HCI’s shelves” [19].

References

  1. Bevan, Nigel; Barnum, Carol; Cockton, Gilbert; Nielsen, Jakob; Spool, Jared; Wixon, Dennis “The “magic number 5”: is it enough for web testing?” in CHI ’03 Extended Abstracts Conference on Human factors in Computing Systems, p.698-699, April 2003
  2. Lewis, James “Sample Sizes for Usability Studies: Additional Considerations” in Human Factors 36(2) p. 368-378, 1994
  3. Lewis, James “Evaluation of Procedures for Adjusting Problem-Discovery Rates Estimated from Small Samples” in The International Journal of Human-Computer Interaction 13(4) p. 445-479 December 2001
  4. Lewis, James “Testing Small System Customer Setup” in Proceedings of the Human Factors Society 26th Annual Meeting p. 718-720 (1982)
  5. Molich, R et al. “Comparative Evaluation of Usability Tests.” In CHI 99 Extended Abstracts Conference on Human Factors in Computing Systems, ACM Press, 83-84 1999
  6. Nielsen, Jakob and Thomas K. Landauer, “A mathematical model of the finding of usability problems,” Proceedings of the SIGCHI conference on Human factors in computing systems, p.206-213, April 24-29, (1993)
  7. Nielsen, Jakob “Why you only need to test with 5 users” Alertbox (2000) http://www.useit.com/alertbox/20000319.html
  8. Nielsen, Jakob “Risks of Quantitative Studies” Alertbox (2004) http://www.useit.com/alertbox/20040301.html
  9. Nielsen, Jakob “Probability Theory and Fishing for Significance” Alertbox (2004) Sidebar to Risks of Quantitative Studies http://www.useit.com/alertbox/20040301_probability.html
  10. Nielsen, Jakob “Understanding Statistical Significance” Alertbox (2004) Sidebar to Risks of Quantitative Studies http://www.useit.com/alertbox/20040301_significance.html
  11. Nielsen, Jakob “Two Sigma: Usability and Six Sigma Quality Assurance” Alertbox (2003) http://www.useit.com/alertbox/20031124.html
  12. Nielsen, Jakob and Levy, Jonathan “Measuring Usability: Preference vs. Performance” in Communications of the ACM, Volume 37 p. 66-76 April 1994
  13. Spool, J. and Schroeder, W. “Testing Websites: Five Users is Nowhere Near Enough” in CHI 2001 Extended Abstracts, ACM, 285-286 (2001)
  14. Tognazzini, Bruce “The Butterfly Ballot: Anatomy of a Disaster” in AskTOG (2001) http://www.asktog.com/columns/042ButterflyBallot.html
  15. Virzi, Robert, “Streamlining the design process: Running fewer subjects” in Proceedings of the Human Factors Society 34th Annual Meeting p. 291-294 (1990)
  16. Virzi, Robert, “Refining the Test phase of Usability Evaluation: How many subjects is enough?” in Human Factors (34) p 457-468 1992
  17. Wand, Jonathan et al. “Voting Irregularities in Palm Beach County” November 28, 2001 http://elections.fas.harvard.edu/election2000/palmbeach.pdf
  18. Woolrych, A. and Cockton, G., “Why and When Five Test Users aren’t Enough,” in Proceedings of IHM-HCI 2001 Conference, eds. J. Vanderdonckt, A. Blandford, and A. Derycke, Cépaduès Éditions: Toulouse, Volume 2, 105-108, 2001
  19. Woolrych, A. and Cockton, G., “Sale must end: should discount methods be cleared off HCI’s shelves?” in Interactions Volume 9, Number 5 (2002)