Assessing the Reliability of UI Trap Cards

Jeff Sauro, PhD • Steve Jenks, PhD • Dylan Atkins • Jim Lewis, PhD

Would having a system for classifying usability problems be helpful to UX researchers and designers? Would it reduce the evaluator effect?

Categorization frameworks have been around for decades, but in our experience, they haven’t seen a lot of adoption by commercial development teams. There are probably a few reasons for this: they might not be perceived as helpful, they can be difficult to learn, and they take time to use.

For a method to be valid (in this case, to improve problem detection and correction), it first needs to be reliable. That is, independent evaluators using a categorization framework should come to the same or similar conclusions when presented with the same problems.

In an earlier article, we reviewed different methods for categorizing problems uncovered in usability testing. The potential benefits of using a categorization framework are to help find the root causes of problems, assess their potential impact if left unresolved, and provide an easier way to track problems over time.

In that analysis, however, we found only modest to good agreement (as measured with the kappa coefficient) for some frameworks. Additionally, some data suggest that these frameworks can be difficult to apply, require training, and take time to use.

One of the more recent frameworks we reviewed is UI Tenets and Traps. It was developed in 2009 by Michael Medlock and Steve Herbst, both alumni of Microsoft and currently working at Meta and Amazon respectively.

They have found success in training teams to use UI Tenets and Traps, so much so that Microsoft continues to offer its employees formal training in the method even after Michael and Steve left. The method is also widely used and formally taught at companies like Meta and Amazon.

Part of the reason for the success may be the delivery of the framework. There are 26 color-coded cards with up-to-date examples using hardware and software products. Each card describes a “trap” (common lower-level design problems that degrade user experiences, such as accidental activation, unnecessary steps, and unwanted disclosure) and indicates the associated tenet (high-level general attributes of good UI design such as comfortable, efficient, and discreet). More than 10,000 decks have been sold to researchers around the world (Herbst, personal communication, Dec 27, 2023).

We purchased a set of the trap cards in 2018 and wanted to get a sense of how they could be used to systematically identify and code UI problems for projects we work on.

While it’s hard to assess the validity of using a framework—details of product success metrics are usually confidential—we certainly can assess interrater reliability.

Reliability Study Details

To assess the reliability of the UI Tenets and Traps framework, we conducted two studies following a process similar to the earlier studies on the other frameworks we summarized.

In Study 1, five user experience researchers working at MeasuringU used the Tenets and Traps cards to categorize a preexisting list of problems. All five evaluators have experience conducting usability studies and coding and describing problems. The problem list included 85 unique problems from two datasets:

  • An airline website with 46 problems from 17 users
  • A printer setup study with 39 issues from 12 users

In Study 2, four evaluators watched five think-aloud videos of participants in an unmoderated usability study using an online restaurant reservation website to uncover usability problems. Independently, they assigned a trap card to each issue uncovered.

Study 1 Results

In Study 1, because the issues were already identified and available to all evaluators, we assessed reliability using three measures: the average any-2 agreement rate, the number of traps used per issue, and kappa.

Average Agreement Rate

The average any-2 agreement rate was developed by Hertzum and Jacobsen (2003) to assess the consistency of the usability problems uncovered by all pairs of evaluators. When the number of problems is fixed, as it is in this study, the computation of the agreement rate between two evaluators simplifies: divide the number of identical classifications by the total number of items being classified.

For each pair of evaluators, we coded a 1 if both assigned the same trap card to a usability problem and a 0 if they assigned different ones. We repeated this for all 85 problems and averaged across the ten possible pairs of evaluators to compute an average agreement rate.
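To make the computation concrete, here is a minimal Python sketch (our illustration, not the analysis code used in the study) with made-up trap assignments for a handful of problems:

```python
from itertools import combinations

# Toy data (not the study data): each evaluator's trap label for the same five problems.
assignments = {
    "S": ["System Amnesia", "Invisible Element", "Unnecessary Steps", "Accidental Activation", "Unwanted Disclosure"],
    "N": ["System Amnesia", "Invisible Element", "Accidental Activation", "Accidental Activation", "Unwanted Disclosure"],
    "G": ["System Amnesia", "Effectively Invisible Element", "Unnecessary Steps", "Accidental Activation", "Invisible Element"],
}

def pair_agreement(a, b):
    """Simplified any-2 agreement for a fixed problem set:
    proportion of problems where two evaluators assigned the same trap."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rates = [pair_agreement(assignments[e1], assignments[e2])
         for e1, e2 in combinations(assignments, 2)]
print(f"Average any-2 agreement: {sum(rates) / len(rates):.0%}")
```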

For example, Table 1 shows that Evaluators S and N assigned the same trap card to 42 of the 85 problems (42/85 = 49% agreement). The lowest agreement between evaluators was 34% (D and G, S and G), and the highest was 54% (N and E). Across all pairs, the average agreement rate was 44%. This level of agreement is comparable to our other assessments of any-2 agreement from controlled studies where the focus is on problem discovery.

|   | S  | N   | G   | D   | E   |
|---|----|-----|-----|-----|-----|
| S | 85 | 49% | 34% | 36% | 45% |
| N | 42 | 85  | 42% | 48% | 54% |
| G | 29 | 36  | 85  | 34% | 45% |
| D | 31 | 41  | 29  | 85  | 52% |
| E | 38 | 46  | 38  | 44  | 85  |

Table 1: Number (below the diagonal) and percent (above the diagonal) agreement on trap cards assigned between all pairs of evaluators for the 85 usability problems.

Number of Traps Used Per Issue

Another way to assess reliability is to see how much variability there was in the number of different trap cards assigned to each issue. If all evaluators selected the same trap, agreement would be perfect. In contrast, if all five evaluators selected different traps (five traps), there would be complete disagreement. Table 2 shows the number and percentage of issues by how many different traps were assigned to them. Only 13% of issues had perfect agreement (one trap), 58% had one or two trap cards, and only 4% (three issues, each with five traps) had complete disagreement. We don’t have data from other frameworks to compare with this level of agreement.

| Number of Traps per Issue | Number of Issues | % of 85 |
|---------------------------|------------------|---------|
| 1 | 11 | 13% |
| 2 | 38 | 45% |
| 3 | 24 | 28% |
| 4 | 9  | 11% |
| 5 | 3  | 4%  |

Table 2: Number and percentage of issues by the number of traps used.
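A similarly hedged sketch of the tally behind Table 2: for each problem, count how many distinct traps the evaluators assigned, then bucket the problems by that count (again using toy data, not the study data):

```python
from collections import Counter

# Toy data: trap labels per evaluator, indexed by the same four problems.
assignments = {
    "S": ["System Amnesia", "Invisible Element", "Unnecessary Steps", "Accidental Activation"],
    "N": ["System Amnesia", "Invisible Element", "Accidental Activation", "Accidental Activation"],
    "G": ["System Amnesia", "Effectively Invisible Element", "Unnecessary Steps", "Unwanted Disclosure"],
}

# For each problem, count the distinct traps assigned across evaluators
# (1 distinct trap = perfect agreement), then tally problems by that count.
evaluators = list(assignments.values())
distinct_per_issue = [len({ev[i] for ev in evaluators}) for i in range(len(evaluators[0]))]
tally = Counter(distinct_per_issue)
for n_traps, n_issues in sorted(tally.items()):
    print(f"{n_traps} trap(s): {n_issues} issue(s), {n_issues / len(distinct_per_issue):.0%}")
```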

Kappa

The third measure of agreement we used was kappa, a commonly used measure of interrater reliability that corrects for chance agreement in a way that the average agreement rate does not. It was also the most common measure reported in our earlier analysis of frameworks. This allows us to easily compare the Trap Card framework to other studies.

We computed Fleiss’ kappa (a statistic for multiple simultaneous evaluators versus Cohen’s kappa, which is for only two evaluators). Kappa can take values between −1 and 1, which are often interpreted with the Landis and Koch guidelines (poor agreement: ≤ 0, slight: 0.01–0.20, fair: 0.21–0.40, moderate: 0.41–0.60, substantial: 0.61–0.80, almost perfect agreement: 0.81–1.00).
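As an illustrative sketch (assuming the statsmodels package, which provides a Fleiss’ kappa implementation; the study may have used different software), the computation takes a problems × evaluators matrix of trap labels:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy label matrix (not the study data): rows = problems (subjects), columns = evaluators (raters).
labels = np.array([
    ["System Amnesia", "System Amnesia", "System Amnesia"],
    ["Invisible Element", "Invisible Element", "Effectively Invisible Element"],
    ["Unnecessary Steps", "Accidental Activation", "Unnecessary Steps"],
    ["Accidental Activation", "Accidental Activation", "Unwanted Disclosure"],
])

# aggregate_raters converts raw labels into a subjects x categories count table.
table, categories = aggregate_raters(labels)
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.3f}")
```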

The kappa for this study showed a fair level of agreement between the researchers (κ = .378, 95% CI: .357 to .399, p < .001) and a moderate level of agreement when the traps were collapsed to tenets (κ = .455, 95% CI: .415 to .495, p < .001).

Table 3 shows the kappas from this study alongside published kappas for other frameworks, arranged from highest to lowest. UI Tenets have the second-highest kappa (.455), as expected because tenets are coded at a higher, more general level. UI Traps have the fourth-highest kappa (.378), suggesting reliability comparable to other frameworks, achieved here with evaluators who had only a modest level of training.

| Framework | Kappa |
|-----------|-------|
| User Action Framework (UAF) | 0.583 |
| UI Tenets | 0.455 |
| Usability Problem Taxonomy (UPT) | 0.403 |
| UI Traps | 0.378 |
| Classification of Usability Problems (CUP) | 0.360 |
| Heuristic Evaluation (HE, from UAF study) | 0.325 |
| Open-Source Usability Defect Classification (OSUDC) | 0.304 |
| Orthogonal Defect Classification (ODC) | na |

Table 3: Kappas for UI Tenets and Traps with published kappas for other frameworks.

Study 2 Results

Some evaluators in Study 1 felt that if they had observed a problem rather than just read about it, the additional detail and context might have affected which trap they selected, leading to better evaluator agreement.

Consequently, we wanted to know how agreement might change if evaluators used the trap cards to discover and classify problems, not just categorize existing ones. In January 2023, four evaluators (S, G, D, and N from Study 1) independently reviewed five think-aloud videos of users following specific criteria to book a reservation on a popular restaurant reservation website (Opentable.com).

Each evaluator noted any issues and when each issue occurred. In total, 89 events were noted across the five participant videos. Two evaluators then reviewed each issue and timestamp to remove duplicates and consolidate the observations into 27 distinct issues.

We assessed reliability using two measures: the average any-2 agreement rate and the number of traps used per issue. We couldn’t compute kappa because in a discovery study evaluators don’t all classify the same fixed set of items, so the data aren’t amenable to that analysis.

Average Agreement Rate for Problems Detected

To assess the agreement rate in Study 1, we computed a simplified version of the average any-2 agreement rate as described by Hertzum and Jacobsen. In Study 2, we computed the standard average any-2 agreement rate: for each pair of evaluators, the number of usability issues they found in common divided by the total number of unique issues they found across the five videos, averaged across the six pairs of evaluators (see Table 4). For example, Evaluators S and G uncovered 18 and 20 issues respectively, 14 of which were the same, resulting in a 58% agreement rate (14/(18 + 20 − 14) = 14/24 = 58%). The lowest agreement rate was 46% (G and D) and the highest was 58% (S and G). Across all pairs, the average any-2 agreement rate was 55%. This agreement rate was similar to the average (about 59%) from our literature review of controlled studies in which evaluators watched the same videos. This suggests the trap cards don’t necessarily have a large effect (for better or worse) on the agreement rate between independent evaluators, although another study with a control group not using trap cards would be needed to better isolate their impact on problem discovery rates and agreement.

|   | S  | G   | D   | N   |
|---|----|-----|-----|-----|
| S | 18 | 58% | 57% | 55% |
| G | 14 | 20  | 46% | 57% |
| D | 12 | 11  | 15  | 55% |
| N | 12 | 13  | 11  | 16  |

Table 4: Number of issues found in common (below the diagonal) and any-2 agreement percentage (above the diagonal) between all six pairs of evaluators; the diagonal shows the total number of issues each evaluator found.
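In code, the standard any-2 agreement rate is an intersection-over-union computation averaged over pairs. A minimal sketch, using hypothetical issue IDs rather than the study data:

```python
from itertools import combinations

# Hypothetical sets of issue IDs uncovered by each evaluator (not the study data).
found = {
    "S": {1, 2, 3, 5, 8},
    "G": {1, 2, 4, 5, 9},
    "D": {2, 3, 5, 7},
    "N": {1, 2, 5, 8, 9},
}

def any2_agreement(a, b):
    """Issues found by both evaluators divided by issues found by either."""
    return len(a & b) / len(a | b)

rates = [any2_agreement(found[e1], found[e2]) for e1, e2 in combinations(found, 2)]
print(f"Average any-2 agreement: {sum(rates) / len(rates):.0%}")
```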

Average Agreement Rate for Trap Cards Used

We can compute the agreement rate on the trap cards only for problems that were uncovered by more than one evaluator. Table 5 shows that 19 of the 27 problems (70%) were uncovered by at least two evaluators. We then computed the average any-2 agreement rate for these problems detected by 4, 3, and 2 evaluators (also shown in Table 5).

| # Evaluators Uncovering the Problem | % of 27 | Number of Problems | Average Agreement on Traps |
|---|---|---|---|
| 1 | 30% | 8 | na |
| 2 | 15% | 4 | 25% |
| 3 | 26% | 7 | 46% |
| 4 | 30% | 8 | 42% |

Table 5: Number and percentage of problems found by 1 to 4 evaluators and the associated agreement rate for traps.

For example, one of the eight problems all four evaluators identified was “When switching tabs, information in search box was removed,” and all four associated the same trap, “System Amnesia,” with this problem (100% agreement).

Another problem found by all four evaluators was “No visual cue for how to leave a picture (X missing in some scales),” but there was disagreement on the trap: two evaluators selected “Invisible Element” and two selected the closely related trap “Effectively Invisible Element,” so only two of the six possible pairs of evaluators agreed (33% agreement rate).

We repeated this matching exercise for the 19 problems that were uncovered by more than one evaluator, which resulted in an average agreement rate of 40%, about the same as the 44% in Study 1. This comparable agreement rate suggests that using the trap cards during problem discovery doesn’t necessarily have a large impact on their reliability compared to applying them to an existing problem set (at least by the any-2 agreement measure).
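A small sketch of this per-problem computation (made-up function and data, matching the 33% example above): among the evaluators who uncovered a problem, divide the number of pairs that chose the same trap by the number of possible pairs.

```python
from itertools import combinations

def trap_agreement(traps):
    """Share of evaluator pairs that assigned the same trap to a single problem."""
    pairs = list(combinations(traps, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# The example from the text: two evaluators chose "Invisible Element" and two chose
# "Effectively Invisible Element" -> 2 of the 6 pairs agree.
traps = ["Invisible Element", "Invisible Element",
         "Effectively Invisible Element", "Effectively Invisible Element"]
print(f"Agreement: {trap_agreement(traps):.0%}")  # 33%
```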

Number of Traps Used Per Issue

Table 6 shows the number and percentage of issues by the number of trap cards assigned in Study 1 and Study 2. In Study 2, a higher percentage of issues had only a single trap card associated with them (44% vs. 13%), showing more consistency in applying the trap cards when they were used to uncover issues. This large difference was statistically significant (N−1 two-proportion test, p < .001).

Additionally, Table 6 shows that 85% of the issues had one or two trap cards compared to 58% for Study 1, suggesting at least nominally higher agreement when using the trap cards after observing the events rather than coding from existing problem descriptions (when using this measure of agreement).

| # of Traps Used | # Issues (Study 2) | Study 2 (% of 27) | Study 1 (% of 85) |
|---|---|---|---|
| 1 | 12 | 44% | 13% |
| 2 | 11 | 41% | 45% |
| 3 | 2  | 7%  | 28% |
| 4 | 1  | 4%  | 11% |
| 5 | 1  | 4%  | 4%  |

Table 6: Number and percentage of issues that had 1 to 5 trap cards assigned by researchers in Study 2, compared to the Study 1 percentages (see Table 2).
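For reference, a sketch of the N−1 two-proportion test applied to that comparison (our implementation: the pooled two-proportion z-test with the z statistic scaled by the square root of (N−1)/N), using the single-trap counts from Table 6 (12 of 27 vs. 11 of 85):

```python
from math import sqrt
from statistics import NormalDist

def n_minus_1_two_prop(x1, n1, x2, n2):
    """N-1 two-proportion test: returns the z statistic and two-tailed p value."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    n = n1 + n2
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se * sqrt((n - 1) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = n_minus_1_two_prop(12, 27, 11, 85)  # 44% vs. 13% single-trap issues
print(f"z = {z:.2f}, p = {p:.4f}")
```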

Discussion and Summary

In our analysis of the reliability of the UI Tenets and Traps across two studies, we found:

UI traps have comparable reliability to other frameworks. Kappa, the most common measure of agreement, was .38 at the trap level (fair agreement) and .46 at the tenet level (moderate agreement), consistent with published findings for other frameworks.

Any-2 agreement for trap cards was good in both studies. The average any-2 agreement in Study 1 (for existing problems) was 44%, and in Study 2 (new problems found by 1+ evaluators) any-2 agreement was 40%. While there isn’t a historical standard for any-2 agreement rates like there is with kappa, our prior work using this metric for uncovering problems from videos showed that rates in the range of 40%–60% are on the high side.

Any-2 agreement was similar for categorizing existing problems and finding new ones. Some evaluators in Study 1 felt that having access to more details about the problems (including videos) would increase the agreement/consistency of using the trap cards. However, in Study 2, where evaluators used the trap cards in conjunction with uncovering problems, we found roughly the same any-2 agreement (40% in Study 2 vs. 44% in Study 1). However, Study 2 used only five videos, so a replication with other videos (and additional statistical analyses comparing agreement rates with a larger dataset) might have different results.

The number of traps used per issue shrank for new issues. While the any-2 agreement rates were comparable for existing problems (Study 1) and new problems (Study 2), evaluators did use fewer trap cards per issue for the newly uncovered problems in Study 2. The use of fewer trap cards per issue indicates better agreement; using only one trap card for a problem signals perfect agreement. In Study 1, only 13% of the problems had a single trap card. In contrast, in Study 2, 44% (12 of 27) had a single trap card.

More training will likely increase reliability. When there is low agreement among evaluators using a rubric, a natural next step is to increase the amount of training and time spent using the method. The evaluators in this study had no prior experience with the trap cards, but as they progressed through the problems, they felt they got better at identifying the most appropriate traps.

High reliability doesn’t mean high validity. This study established at least adequate, if not high, reliability for a method used by evaluators with minimal training. However, it’s unclear whether higher reliability increases the method’s validity (e.g., do trap cards lead to uncovering more issues or to better products?).

The studies lacked control groups. We didn’t have a control group for either study, which would have allowed us to better isolate the effects of the trap cards on both problem discovery agreement rates and agreement rates for the trap cards. Future studies could include a control group.
