After I conducted my first usability test in the 1990s, I was struck by two things:
- just how many usability problems are uncovered, and
- how quickly the same problems repeat after observing just a few users
In almost every usability test I’ve conducted since then I’ve continued to see this pattern.
Even after running 5 to 10 users in a moderated study, there are usually too many problems for even the most dedicated and well-funded development team to address. Providing a prioritized list is an obvious and essential approach.
Problems can be prioritized by both how many users encountered the problem (frequency) and the impact the problem has on performance or other key metrics (severity).
For about as long as the modern usability profession has, well, been a profession, an important question has been asked: If you only test with, say, 5 to 10 users, are you more likely to see the critical usability problems in those first few users? Put more directly, are problem frequency and problem severity correlated? Small sample sizes will uncover the more frequently occurring issues (this is easily demonstrated with probability), but if frequency and severity are correlated, then small sample sizes will also uncover the most frequent and severe issues.
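The probability claim can be illustrated with a quick sketch. Assuming independent sessions and a fixed proportion p of users affected by a problem (a standard simplification, not a claim from the studies discussed here), the chance of seeing the problem at least once in n users is 1 − (1 − p)^n:

```python
# Probability that a problem affecting a proportion p of users
# is observed at least once in a sample of n users, assuming
# independent sessions (a simplifying assumption).
def p_detect(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# A problem affecting 31% of users is very likely to surface with
# only 5 participants; a problem affecting 10% of users often is not.
for p in (0.31, 0.10):
    print(f"p = {p:.2f}, n = 5 -> {p_detect(p, 5):.2f}")
# p = 0.31, n = 5 -> 0.84
# p = 0.10, n = 5 -> 0.41
```

This is why small samples reliably surface the frequent problems; whether they also surface the severe ones is exactly the correlation question.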
The relationship between problem frequency and severity has been the subject of an ongoing discussion in usability labs and, to a lesser extent, the literature. Most famously, Robert Virzi found[pdf] that more severe usability issues tended to happen more frequently across two usability studies reported in 1990 and 1992. He found a positive correlation between problem severity and frequency (r = .463). In other words, those first few users were likely to uncover the more critical issues.
A conclusion from these findings is that practitioners conducting usability tests would need fewer users to detect more severe problems. In his studies, virtually all the problems rated high in severity were found with the first 5 participants! This is important as many lab-based usability studies today are still run with a small number of participants, typically between 4 and 10.
When attempting to replicate Virzi’s findings, Jim Lewis did find support for the idea of using small sample sizes for uncovering problems, but he failed to find a similar relationship between the frequency of a problem and its severity. In 1994, Lewis examined[pdf] the usability data from 15 participants attempting tasks on productivity software (word processing, spreadsheets, calendar, and mail). The correlation he found between severity and frequency was not significant (r = .06).
Lewis recommended treating severity and frequency as independent. That is, a usability problem is just as likely to be one of low severity as one of high severity. Despite the obvious importance of this topic, the only other study we’ve found that has addressed this issue was one by Law and Hvannberg in 2004. Their results supported Lewis in finding no correlation. We decided to investigate this relationship with some of our datasets.
We looked at usability problems from nine usability tests conducted on websites and mobile applications. The tests included both in-person and remote moderated data on ecommerce websites, a sports merchandise website, an iPhone and iPad app from a cable provider, and an ecommerce website used on a tablet.
To help reduce the bias of knowing problem frequency before rating problem severity, we had multiple evaluators (between 2 and 4) rate the severity of the problems independently on a 3-point severity scale with defined levels (1=minor, 2=moderate, 3=critical).
Problem severities were then aggregated and an average problem severity was generated. For example, a problem from one study was “Software screenshots appeared interactive,” which received a moderate severity rating (2) from one evaluator and a minor (1) from another evaluator who did not observe the sessions. The average problem severity from these two evaluators was a 1.5. The average problem severity for all problems was then correlated with the problem frequency for each of the problem sets.
The correlations for each study are shown below. For example, in Study 1, 75 issues were reported from observing 17 users. Four evaluators rated the severity of the issues and the correlation between severity and frequency was r = .09. This correlation is both low and not statistically different than 0.
Table 1: Polychoric correlations for the nine usability studies. * Indicates correlations statistically different than 0 at the p < .05 level. The Fisher transformation was applied to the correlations before averaging.
The correlations range from a low of r = -.39 to a high of r = .47 with an average correlation of r = .056. This average is not statistically different from 0. Of the two datasets that were significantly different than 0 (studies 7 and 8), one showed a significant negative correlation! That is, for study 7, less severe problems actually happened more frequently than more critical ones.
Study 8 had the highest positive correlation between frequency and severity, which is similar to, and not statistically different than, the correlation reported by Virzi twenty years ago (r = .463). One possible reason for the correlation is that the evaluators may have remembered some of the more frequently occurring issues when rating severity. In fact, in every study, one of the evaluators assigning severities WAS the facilitator for the test—so we should expect at least some influence and therefore some positive correlation.
What’s more, all of the evaluators used in these studies work in our lab and therefore had some idea about what issues were more frequent, even if they didn’t facilitate the studies.
To help mitigate the bias, we sent the 29 problems from Study 8 to an independent evaluator with decades of experience conducting usability tests. He was provided the same three-point rating scale and the same problem descriptions the two evaluators received. His correlation between severity and frequency was positive, but smaller at r = .21, which was not statistically different than 0.
The advantage of looking at multiple studies using different devices, facilitators, and evaluators is that we don’t need to rely on a single study with its potential flaws and idiosyncrasies to draw a conclusion about the relationship between frequency and severity. Here are some of the key takeaways:
- Frequency & severity aren’t correlated: The analysis of these nine studies suggests there is as much evidence that more severe problems happen less frequently than trivial ones as there is evidence that more severe problems happen more frequently.
- The first few users are NOT more likely to find the more critical issues: With little evidence supporting a correlation, it means those first five users are NOT more likely to uncover the more severe issues.
- Small sample size testing is still valuable: Just because the first few users won’t be more likely to uncover more severe problems DOES NOT mean that testing with smaller sample sizes should be dismissed. Problem severity ratings are subjective, and the first few users still will uncover the most frequent issues (it’s basically a mathematical tautology: high-frequency issues will be seen more often).
- Frequent issues often just appear more critical: When a problem affects a lot of users, even a trivial one, it just seems to be more critical, even if the impact is minimal on the experience. This co-mingling of the concepts affects our ability to accurately judge a correlation. While this is a problem for assessing the correlation between frequency and severity it’s probably not that harmful in practice.
- It’s difficult to independently test severity and frequency: In actual practice it doesn’t make sense NOT to have the facilitator of a usability test assign the severity ratings. Even the best written problem descriptions are difficult to understand without context. To mitigate the bias we have at least one additional evaluator rate the severity and then average them.
- Don’t be surprised by a correlation: Given the biases and co-mingling of frequency and severity, don’t be surprised to see a positive correlation in your data. We were surprised that most studies didn’t show a positive correlation despite these biases.
- Problem severity ratings can be inconsistent: One of the reasons we recommend averaging ratings is that assigning severity ratings is a difficult and subjective job. For one study we had the same evaluators re-assign severity ratings after a two-day delay. While these intra-rater ratings correlated reasonably well (r ~ .5), the exercise still showed that there is inherent unreliability in this task. Future studies may examine whether more reliable rating methods change the correlation with problem frequency.