Does this man need
back surgery?
Does this woman have
breast cancer?
Does this website have
usability problems?

Chances are you’re not qualified to answer the first two questions, but you can probably say something about the third. The image above comes from the Hotel Pennsylvania website, which was the subject of the fourth Comparative Usability Evaluation (CUE-4). Seventeen usability teams independently evaluated the website, and their lists of problems overlapped surprisingly little.

When you get an X-ray, an MRI, or a mammogram, a trained radiologist interprets the image and tells you whether there is a problem. In a usability evaluation, a trained usability expert tells you what the problems are, based on an examination of the interface or on data from watching users perform tasks.

The CUE studies show that if multiple usability experts review, or conduct usability tests of, the same website, you’re likely to get different lists of problems. Would you get a different diagnosis if another radiologist read your image? Jared Spool asked a similar question three years ago in the Journal of Usability Studies, and inferred that you would expect to get a similar diagnosis.

It turns out you would be likely to get different diagnoses. The degree of disagreement depends on the type of image and on what the doctor is diagnosing. For some tasks there is reassuringly high agreement; for others, troublingly low agreement. The same medical image can generate diagnoses ranging from impending death to a clean bill of health.
Some radiologists see something ambiguous and are comfortable calling it normal. Others see something ambiguous and get suspicious. For decades there have been studies on the degree of disagreement in medical imaging.

The first image above is from an MRI of a spine from a 1994 study with a high level of both disagreement and inaccuracy.

The second image is a mammogram. Mammography has among the highest false-positive rates and observer variability in medical imaging.

The problem of observer variability isn’t unique to medical imaging. A comprehensive bibliography, now almost 20 years old, contains hundreds of examples across every medical specialty (Elmore 1992). Despite methodological flaws and criticisms of individual studies, there’s a clear need to understand and reduce variability in medicine. The stakes are high.

The Perception Problem

While life and death are rarely the consequences of a usability evaluation, medical imaging and usability suffer from the same challenge: a perception problem. Much of the variability between evaluators doesn’t come from inadequate equipment (Morae or MRIs) but from disagreement about whether something is perceived as a problem.

The first studies of variability between usability evaluators appeared in 1998 and have drawn the same sort of criticism and concern as those in the medical literature. Since then, at least a dozen studies have shown low agreement among usability evaluators. While the level of disagreement is discouraging, it is reassuring that we’ve begun examining the issue relatively early in the life of the profession. There are two important lessons to take from the medical literature:

  1. Disagreement depends on the task and the tools.
  2. Where there is more judgment, there is more disagreement.
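Agreement of this kind can be quantified rather than just debated. One common summary is Cohen’s kappa, which corrects raw percent agreement for the agreement two evaluators would reach by chance. Below is a minimal sketch, with hypothetical binary judgments (1 = the evaluator flagged the item as a problem); the data are illustrative, not from any CUE study:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters making binary (0/1) judgments
    on the same set of items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of items the two raters agree on
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's base rate
    pa1 = sum(rater_a) / n
    pb1 = sum(rater_b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (po - pe) / (1 - pe)

# Two hypothetical evaluators judging the same 8 candidate issues
evaluator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
evaluator_2 = [1, 0, 0, 1, 1, 0, 1, 0]
print(cohens_kappa(evaluator_1, evaluator_2))  # → 0.5
```

Raw agreement here is 75%, but kappa is only 0.5 once chance agreement is removed, which is one reason raw overlap figures from studies like CUE-4 can overstate how much evaluators really agree.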

Tasks and Tools

Physicians do many things and have many tools to help them. Usability engineers do many things and have few tools. Finding problems in an interface is a signature task of the usability evaluator, and for this task there is good evidence of low agreement. What about the other things usability evaluators do?

  • Measuring completion rates and task times in benchmark testing
  • Proposing a design solution
  • Rating problem severity
  • Agreeing on whether a problem is real or a false positive

More data is needed on agreement in these tasks, but there are already some clear recommendations on how to improve practice.
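Of the tasks above, benchmark measurement is the one where practice can most easily be made less judgment-dependent: rather than reporting a raw completion rate, report it with a confidence interval so small-sample uncertainty is visible. A minimal sketch using the adjusted-Wald (Agresti-Coull) interval, a common choice for small usability samples (the function name is my own):

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted-Wald (Agresti-Coull) confidence interval for a
    completion rate, at ~95% confidence by default."""
    # Add z^2 pseudo-observations (half successes, half failures)
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid [0, 1] range for a proportion
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical benchmark: 9 of 10 users completed the task
low, high = adjusted_wald_ci(9, 10)
print(f"completion rate 90%, 95% CI: {low:.2f} to {high:.2f}")
```

With 9 of 10 users succeeding, the interval stretches from roughly 57% to 100%: a reminder that two evaluators reporting 90% and 70% completion from small samples may not actually disagree.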

Education & Certification

Will certification or training improve agreement? Yes, to some extent: it will add a barrier to entry that may skim off some of the bottom-performing practitioners.

Will more education help? While most usability evaluators haven’t spent as much time in school as radiologists, quite a few hold PhDs and master’s degrees, which suggests more schooling is unlikely to make things better. Education and certification can’t fix the problem of perception.

What makes usability more complicated is that it’s not just about spotting abnormalities in a static image or in a single person; it’s about understanding the interaction between many people and changing interfaces. Not all people act the same given the same set of controls, images, and labels.

Many Paths up the Same Mountain

High agreement would be nice to have, but it’s not necessarily the goal of usability evaluation. There is some evidence that any usability evaluation is better than none. Jim Lewis calls this a “hill climbing” activity: there are many paths up the same mountain, and evaluators are going to disagree on the best path. Finding and fixing problems tends to generate better interfaces. Finding more problems and reporting fewer false positives would, of course, make for more usable interfaces and a more credible practice.

Medical imaging has been able to add value despite disagreement; usability evaluations can as well. We should continue to investigate the sources of expert disagreement and find ways to reduce it. We should also keep in mind that even with vast improvements in technology and methods, there will always be some disagreement, because of the perception problem.