Frameworks for Classifying UI Problems

Jeff Sauro, PhD • Jim Lewis, PhD

Finding and fixing problems is a core activity of much of UX research (similar to identifying and preventing software bugs and product defects).

The problems found while users attempt tasks are often broadly referred to as UI problems because the friction points tend to involve issues that blur the lines between bugs, functional deficits, and more traditional usability issues.

If you observe enough people using an interface to attempt something like setting up a printer, purchasing clothing, booking a flight, shopping for groceries, or finding a hotel, you’ll start to notice patterns in the problems that people encounter.

In fact, categorizing interface problems and finding common themes is what led to the development of heuristic evaluation (HE) by Nielsen and Molich (1990), including their now-famous ten heuristics (e.g., Visibility of System Status). After their initial publication, Nielsen (1994) refined the heuristics [pdf] based on a factor analysis of 249 usability problems.

There have been other attempts to group problems, such as Weinschenk and Barker’s 20 heuristics, and there are even more granular guidelines, like Smith and Mosier’s 944 UI design guidelines (1986), Apple’s Human Interface Guidelines [pdf], and Microsoft’s Windows User Interface Guidelines [pdf].

High-level heuristics (like Nielsen and Molich’s) are relatively easy to apply to some UX activities (e.g., design guidance and problem discovery). Their suitability for UI problem classification is less clear due to their level of description and focus on traditional usability issues.

Researchers have reported overlaps among the categories when there are multiple evaluators (e.g., Georgsson et al., 2014), and there are gaps associated with some UI issues (e.g., system reliability and responsiveness). Such overlaps and gaps can contribute to misclassification of UI problems, potentially affecting the identification of the most appropriate design solutions.

But why categorize?

If we understand the common problems, identify their root causes, and design to prevent them, we all will have more usable interfaces. Specifically, a good categorization system should help find the root causes of the problem, provide guidance for its resolution, assess the potential impact if not resolved, and enable tracking of problems over time.

In this article, we review several frameworks for categorizing usability problems. For a categorization scheme to be effective, it needs to be reliable, valid, and not too difficult to implement. That is, evaluators should be able to consistently categorize the same issues and this categorization should lead to better products (something a lot harder to measure).

We start back in the late 1990s.

Usability Problem Taxonomy (UPT)

The Usability Problem Taxonomy (UPT) was described by Keenan et al. (1999). The taxonomy was derived from a review of 400 usability problems from five software projects.

The UPT framework consists of two high-level components: artifacts and tasks. These two components contain five subcomponents (Artifacts: Visualness, Language, Manipulation; Tasks: Task-Mapping, Task Facilitation) and 20 end nodes, such as “Feedback Messages” and “Object (screen) layout.” Problems can be (but are not necessarily) classified twice, once in the artifact component and once in the task component.

For example, a problem observed with a design, such as “OK Button is a different size on different screens,” is classified under Visualness > Object Appearance and is considered fully classified. Because there’s no information about an observed task behavior (e.g., a user had trouble with an OK button), it has no classification for the task.

Keenan et al. provided a web-based tool with instructions to help evaluators apply the UPT. They conducted a reliability study to see how consistently different evaluators used the UPT framework. In their study, seven evaluators classified the same 20 problems pulled from the original problem set used to create the UPT framework. All evaluators had limited experience with the UPT.

Agreement between evaluators was measured using kappa at the primary category level only, not for subcategories (the authors felt the cell sizes were too low). Kappas ranged from .095 for the task component (slight agreement) to .403 for the artifact component (fair to moderate agreement). More than half of the classifiers agreed on 17 of the 20 artifact classifications and 16 of the 20 task classifications.

A brief note about kappa: There are different methods for assessing the magnitude of interrater agreement. One of the best-known is the kappa statistic (Fleiss, 1971). Kappa measures the extent of agreement among raters that exceeds estimates of chance agreement. Kappa can take values between −1 and 1 and is often interpreted with the Landis and Koch guidelines (poor agreement: ≤ 0, slight: 0.01–0.20, fair: 0.21–0.40, moderate: 0.41–0.60, substantial: 0.61–0.80, almost perfect agreement: 0.81–1.00).
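To make the kappa computation concrete, here is a minimal sketch of Fleiss’ kappa for multiple raters, using hypothetical counts (not data from any of the studies reviewed): each row is one usability problem, each column is a classification category, and each cell is the number of evaluators who chose that category.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of counts.

    ratings[i][j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: for each item, the proportion of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(col) for col in zip(*ratings)]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Interpret kappa with the Landis and Koch guidelines."""
    if kappa <= 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Hypothetical example: 5 problems, 4 evaluators, 3 categories.
counts = [
    [4, 0, 0],
    [3, 1, 0],
    [0, 4, 0],
    [1, 1, 2],
    [0, 0, 4],
]
k = fleiss_kappa(counts)
print(round(k, 3), landis_koch(k))  # → 0.596 moderate
```

Note that raw percent agreement for these data would look much higher; kappa is lower because it discounts the agreement expected by chance given how often each category is used overall.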

User Action Framework (UAF)

Another approach, the User Action Framework (UAF), was developed by Andre et al. (2001). The UAF was an extension of the UPT (with one author in common with Keenan et al., 1999) based on Norman’s Theory of Action model. In the UAF, evaluators progress through a classification path that can require up to six decisions (levels in the decision tree) to get to an end-node description.

In a study of the reliability of the framework, ten usability experts used the UAF to classify 15 different usability problems. The authors reported kappas that were smallest at the lowest part of the decision tree (Level 6) and highest at the top of the tree (Level 1). All values of kappa were statistically significant (p < .01): Level 1: .978, Level 2: .972, Level 3: .783, Level 4: .762, Level 5: .719, Level 6: .299; aggregated overall: .583. Using the guidelines for interpreting kappa, agreement was almost perfect for Levels 1 and 2, substantial for Levels 3–5, and fair for Level 6.

In a follow-up study, nine of the same ten usability experts categorized the same problems using Nielsen’s ten heuristics with kappa = .325 (fair; p < .01) for the heuristic evaluation, a value similar to the lowest level of the UAF but substantially lower than the higher levels.

Khajouei et al. (2011) described an extension to the UAF using severity ratings. In their validation study, two expert evaluators categorized 57 problems into 29 different UAF classes. They reported high kappa (0.94) at the first level of the hierarchy (consistent with Andre et al.) but didn’t calculate it for the final levels.

Hornbæk and Frøkjær (2007) included the UAF in a study of techniques for matching usability problem descriptions. Participants in the study were 52 undergraduate computer science students taking a class in human-computer interaction (HCI), all of whom had performed at least one think-aloud usability study. There was a learning curve, but based on participants’ descriptions of their experience using the method, the authors concluded, “After learning the UAF, participants seem to find this technique quite effective and clear” (p. 512).

Classification of Usability Problems (CUP)

The Classification of Usability Problems (CUP) method was developed by Hvannberg and Law in 2003. It involves ten attributes, some of which do not require judgment (e.g., Identifier, Defect Removal Activity) and four that do (Failure Qualifier, Expected Phase, Actual Phase, and Cause). In an initial validation study, two evaluators used CUP to categorize 39 problems from a usability test with ten participants and 52 problems from a heuristic evaluation. For the usability test, kappas for the four judgment attributes were .036 (poor) for Cause, .537 (moderate) for Expected Phase, .632 (substantial) for Actual Phase, and .746 (substantial) for Failure Qualifier. Results for the heuristic evaluation followed a similar pattern (−.06 for Cause, .44 for Actual Phase, .47 for Expected Phase, .54 for Failure Qualifier) but were consistently smaller than those for the usability test.

In a follow-up study, Vilbergsdóttir et al. (2006) pulled 21 problems from a usability test of eleven users (five teachers and six students) on a learning management system. Eight evaluators (six students and two developers) were selected to categorize the problems using CUP. Evaluators received 1–2 hours of instruction on using CUP (slides and booklets). The authors reported low kappas for the novices, from −.27 for Expected Phase to a high of .31 for Problem Severity (.25 for Failure Qualifier). For the two expert evaluators, the kappas were higher: .22 (Expected Phase), .33 (Problem Severity), and .50 (Failure Qualifier). The novice participants rated their ease of use of and intention to use CUP; perceived ease of use was a dominant driver of whether they intended to use CUP. The evaluators felt this classification method took too long to use.

The authors included a validity check to see how well CUP helps developers understand and prioritize problems. Based on interviews, they reported that developers were more focused on creating intuitively holistic solutions to a set of problems than on systematically correcting individual problems.

Orthogonal Defect Classification (ODC)

For the Orthogonal Defect Classification (ODC) method, Geng et al. (2014) extended the UPT by creating seven usability problem attributes: (1) artifact-related problem, (2) task-related problem, (3) problem trigger, (4) problem in learning, (5) problem in performing given tasks by users, (6) user perception, and (7) problem severity. They used 70 usability problems pulled from a heuristic evaluation and a usability test with 20 participants. They reported that three experts independently classified the problems on the seven attributes with “agreement of 90%,” but they did not explain how they calculated 90%, nor did they compute kappa.

Open-Source Usability Defect Classification (OSUDC)

Yusop et al. (2020 [pdf]) implemented an updated version of the UPT for open-source software. They argued that Keenan’s approach to classification relies on high-quality defect descriptions, which are rarely present in open-source usability defect reports. They collected 377 usability defects from open-source products (e.g., Mozilla Thunderbird, Firefox) into what they call the Open-Source Usability Defect Classification (OSUDC) taxonomy. Combining elements of existing frameworks, the taxonomy sorts defects by defect type, effect attribute, and failure qualifier.

To validate the OSUDC, the authors conducted a validation study with five usability problems selected from a problem list. In an online survey, 41 evaluators (mostly students) with some training in HCI but not in problem identification were given an overview of the OSUDC framework and asked to categorize each of the five problems.

They computed agreement at the primary defect category level (not on the end nodes) and reported a kappa of .304 (fair).

UI Tenets and Traps

A recent addition to the categorization of problems is a tool called UI Tenets and Traps (initially developed in 2009 by Michael Medlock and Steve Herbst and offered externally in 2017). Medlock and Herbst built a set of 26 “Traps” that are categorized under nine Tenets (e.g., Understandable, Comfortable, Responsive). They have made learning and applying the Traps very easy by providing a nice-looking set of portable cards available for purchase on their website.

Nothing has been formally published on Tenets and Traps, but it was built in a way similar to the UAF: by synthesizing observations of many problems over years. In a personal communication, the authors described their origin story. As seasoned researchers at some of the biggest software companies, they’d seen a lot of product failures. For example, while the two were at Microsoft, problems with the Windows Phone’s initial release were predictably bad, and they saw the same issues again in Xbox and Windows 8 projects. Different products were being released by the same company, but the same types of problems kept cropping up in earlier releases and getting fixed only in subsequent releases.

The Trap cards apply not just to websites and mobile apps but also to physical products and voice user interfaces (VUIs) such as Alexa. They also cover more than traditional usability issues. For example, is waiting for an ad, or defaulting to a more expensive product on a website, a problem? These are features, but they fall into Traps: people are annoyed by them.

Similar to other frameworks, they are distilled from existing UI heuristic tools and research. We love to evaluate and use new frameworks, methods, and tools, and we think Tenets and Traps shows promise. We’ll look to answer some questions in future articles, such as:

  • Does knowing Traps help identify problems?
  • What is the agreement level for Tenets (high-level categories) or Traps (low-level categories)?
  • Are there Traps missing (source for classification gaps)?
  • Do some Traps cover the same ground (source for classification overlaps)?
  • Are they better than using Nielsen and Molich’s ten heuristics or other UI problem classification methods?
  • How effectively do they prevent problems during design or categorize existing ones after discovery?

Summary and Discussion

We reviewed several frameworks for categorizing usability problems.

A core theme of system and software adoption is usefulness and usability, so methods for UI problem classification need to achieve both goals. They should help identify problems and point to appropriate resolutions (i.e., be useful). They must not be too time-consuming or hard to understand for researchers or developers (i.e., be usable). They should also be reliable (i.e., different evaluators working with the same information should arrive at the same classification decision).

Table 1 summarizes the overall kappa coefficients (when reported) as measures of agreement among evaluators. The kappas generally fall within a range of .30 to .58 (fair to moderate agreement).

Framework                                              Overall kappa
Usability Problem Taxonomy (UPT)                       0.403
User Action Framework (UAF)                            0.583
Heuristic Evaluation (HE, from UAF study)              0.325
Classification of Usability Problems (CUP)             0.360
Open-Source Usability Defect Classification (OSUDC)    0.304
Orthogonal Defect Classification (ODC)                 n/a
UI Tenets and Traps                                    n/a
Table 1: Summary of kappa reliability for various UI problem classification frameworks.

The key takeaways from this literature review are:

Frameworks differ in their strengths and weaknesses. Of the seven frameworks we reviewed (listed in Table 1), the most reliable (highest kappa) is the UAF, and the least reliable were OSUDC and HE. Reliability, though important, is only one framework attribute that should affect adoption. Other key drivers of adoption, such as perceived ease and usefulness, have not been evaluated for all these frameworks. The CUP study included a validation step to assess the impact of the problem taxonomy with developers, but it was primarily a qualitative activity derived from interviews with development teams. Computer science undergraduates reported initial difficulty using the UAF, but after some experience found it effective and clear.

Overall, reliability was modest. Most frameworks provided some information about the reliability using the kappa coefficient, and in some cases, multiple kappas were reported. In general, reliability was modest, with the highest overall kappa for UAF (.58, on the border between moderate and substantial). Overall kappas for UPT, HE, CUP, and OSUDC ranged from .304 to .403 (solidly fair to bordering on moderate).

Reliability is higher at higher levels in frameworks. Frameworks with multiple levels of categorization like the UAF typically show high agreement at the top level. That high level of agreement can be a bit deceiving, as the few options at the highest level in a classification tree mean fewer opportunities for disagreement. Agreement is often reduced significantly as more levels are added. (Although for UAF, agreement was substantial to almost perfect for the first five levels, only dropping to fair for Level 6.)

Some training is required. All frameworks required some level of familiarity and training. Typical training in the reliability studies lasted from a few minutes to a few hours. Some of the findings suggested that more training would improve reliability and future uptake.

Usability problem descriptions matter. In the reliability studies, common problems involved usability problem descriptions that were vague or partial. Not surprisingly, this led to more disagreement. This issue was partially addressed by the problem description forms developed for CUP, but the interrater agreement was only fair.

Is UI problem classification necessary? Some of these classification schemes (UPT, UAF, CUP) have been around for over two decades, and it’s been almost ten years since the publication of ODC. Yet we don’t see anything near widespread practitioner adoption of UI problem classification methods beyond heuristic evaluation. This might be due to pressure on UX practitioners to produce results and recommendations as quickly as possible or indifference to UI problem classification from software developers (e.g., not perceived as useful). This is one reason why we are particularly interested in continuing research on the reliability, ease, and usefulness of UI Tenets and Traps.
