The fundamental goal of usability testing is to produce highly usable products and services.

That’s an uncontroversial statement.

Where things can get a bit confusing is how different approaches to usability testing have different ways of achieving that goal.

In earlier articles, we described the different types of usability tests, but many types still share common goals.

In this article, we’ll first revisit the common distinction between formative and summative usability testing and dig deeper into how they work together to progress toward improved usability.

We then distinguish between the three primary goals of usability testing: discovering problems, comparing against benchmarks, and comparing against other interfaces. We also show how these evolve from the classic view of formative and summative testing.

Formative vs. Summative Testing

One of the most common distinctions in usability testing is between formative and summative tests.

“Formative” refers to usability studies in which the primary activity is the detection and, through iterative design, elimination (or reduction of the impact) of usability problems. “Summative” refers to usability studies in which the primary activity is collecting objective and subjective measurements related to the accomplishment of a set of tasks.

This distinction comes from educational theory, where the emphasis is on evaluating and improving the performance of students rather than interfaces. But even this distinction needs more refinement.

Two of the foundational books on usability testing are Rubin’s (1994) Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests and Dumas and Redish’s (1999) A Practical Guide to Usability Testing. Here’s how each of them characterizes the goal of usability testing:

  • Rubin (1994): “The overall goal of usability testing is to identify and rectify usability deficiencies existing in computer-based and electronic equipment and their accompanying support materials prior to release.”
  • Dumas and Redish (1999): “A key component of usability engineering is setting specific, quantitative, usability goals for the product early in the process and then designing to meet those goals.”

Rubin describes a formative process; Dumas and Redish a summative one. These processes are not in direct conflict, but they do suggest different focuses that can lead to differences in practice. For example, a focus on measurement typically leads to more formal testing (less interaction between observers and participants); whereas a focus on problem discovery typically leads to less formal testing (more interaction between observers and participants).

Thus we have a distinction between diagnostic problem discovery (formative testing) and measurement tests (summative testing). Furthermore, there are two common types of measurement test: comparison against objectives (benchmarks) and comparison of products. Let’s dig into problem discovery tests first.

Goal 1: To Discover Problems

The primary activity in diagnostic problem discovery tests is the discovery, prioritization, and resolution of usability problems. This is the most common testing goal and likely what most practitioners (and the broader product development community) have in mind when they talk about usability testing.

Problem discovery tests aren’t necessarily devoid of measurement. The most common measurements typically comprise

  • Problem frequency: Percent/number of participants who encountered a problem (often with confidence intervals), and
  • Problem severity: The impact on the participants.
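As a rough illustration of reporting problem frequency with a confidence interval, the sketch below computes an adjusted-Wald (Agresti-Coull) interval for a proportion. The choice of interval method and the example counts (3 of 8 participants) are assumptions made for illustration, not figures from a real study.

```python
from statistics import NormalDist

def adjusted_wald_interval(x, n, confidence=0.95):
    """Adjusted-Wald (Agresti-Coull) interval for the proportion x/n."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2/2 successes and z^2/2 failures before computing a Wald interval.
    n_adj = n + z ** 2
    p_adj = (x + z ** 2 / 2) / n_adj
    margin = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical data: 3 of 8 participants encountered the problem.
low, high = adjusted_wald_interval(x=3, n=8)
print(f"Observed frequency: {3 / 8:.0%}, 95% CI from {low:.0%} to {high:.0%}")
```

With samples this small the interval is wide, which is one reason problem discovery studies lean on iteration rather than on precise estimates.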

Because the focus is not on the precise measurement of the performance or attitudes of the participants, problem discovery studies tend to be informal, with a considerable amount of interaction between observers and participants.

The number of participants in each iteration of testing should be fairly small, but the overall test plan should be for multiple iterations, each with some variation in participants and tasks. When the focus is on problem discovery and resolution, the assumption is that more global measures of user performance and satisfaction will take care of themselves.

Some typical stopping rules for iterations are a preplanned number of iterations or a specific problem discovery goal, such as “Identify 90% of the problems available for discovery for these types of participants, this set of tasks, and these conditions of use.” We’ll cover this more in an upcoming article.
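A discovery goal such as 90% can be turned into a rough participant count under a simple binomial model. The sketch below assumes that every problem has the same probability p of being encountered by any one participant and that participants are independent. Both are simplifying assumptions, and the values of p are hypothetical.

```python
import math

def discovery_rate(p, n):
    """Chance that a problem with per-participant probability p is seen at least once in n participants."""
    return 1 - (1 - p) ** n

def participants_needed(p, goal=0.90):
    """Smallest n for which discovery_rate(p, n) reaches the goal."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

# Hypothetical per-participant problem probabilities.
n_common = participants_needed(p=0.30)  # problems that affect about 3 in 10 users
n_rare = participants_needed(p=0.10)    # problems that affect about 1 in 10 users

print(n_common, round(discovery_rate(0.30, n_common), 3))
print(n_rare, round(discovery_rate(0.10, n_rare), 3))
```

Under these assumptions, rarer problems need substantially more participants to reach the same discovery goal, so the goal only makes sense relative to a stated problem probability.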

Goal 2: To Compare Against a Benchmark

Benchmarks are standard points of reference that make measures more meaningful. They are used across many fields (e.g., hardware performance and business profitability), including usability testing.

In our experience, many researchers collect metrics in their usability tests, but many of them don’t compare those metrics against one or more benchmarks.

Studies that have a primary focus of comparison against quantitative benchmarks include two fundamental activities:

  1. Development of the usability benchmarks, typically done for specific tasks (post-task metrics) and for the study (post-study metrics), as described below.
  2. Iterative testing to determine whether the product has exceeded the benchmarks.

Benchmarks can be set at the task and study level. In practice, the most common task metrics are

  • Successful task completion rates (effectiveness)
  • Mean task completion times (efficiency)
  • Mean participant satisfaction/ease ratings (satisfaction).

The most common study-level metrics are some form of a standardized questionnaire that measures attitudes.

Study-level metrics also often include behavioral intention, using measures such as:

  • Intent to Recommend (NPS)
  • Intent to revisit/repurchase.

Practitioners could consider many other measurements, including but not limited to

  1. Number of user errors
  2. Number of repeated errors (a user committing the same error more than once)
  3. Number of tasks completed within a specified time limit
  4. Number of wrong menu choices
  5. A variety of additional subjective measurements.

Setting the Benchmarks

The first steps are setting the appropriate benchmarks (e.g., post-task ease scores above 5.6) and determining whether the results statistically exceed those benchmarks.

Ideally, the benchmarks should have an objective basis and shared acceptance across the various stakeholders, such as marketing, development, and test groups. The best objective basis for benchmarks is data from previous usability studies of predecessor or competitive products.

We cover both steps in more detail in Benchmarking the User Experience and Chapter 4 of Quantifying the User Experience.

Meeting or Exceeding Benchmarks

After the benchmarks have been set, the next step is to collect data to determine whether the product has met its goals. In either moderated or unmoderated usability studies, representative participants perform the target tasks in the specified environment. The target measurements are recorded to the extent possible within the constraints of a more formal testing protocol, and details about any usability problems that occur are captured.

Sample sizes will be larger than in problem discovery tests to achieve enough precision to determine whether a benchmark has been exceeded. This is done using either a statistical test (e.g., one-proportion test for completion rates) or through visual inspection of confidence intervals (discussed in detail in Chapter 4 of Quantifying the User Experience).
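To make the one-proportion comparison concrete, here is a minimal sketch of a one-sided exact binomial test of a completion rate against a benchmark. The 70% benchmark and the 18-of-20 result are hypothetical, and the exact binomial is only one of several reasonable one-proportion tests.

```python
from math import comb

def prob_at_least(successes, n, benchmark):
    """P(X >= successes) if the true completion rate equals the benchmark."""
    return sum(
        comb(n, k) * benchmark ** k * (1 - benchmark) ** (n - k)
        for k in range(successes, n + 1)
    )

# Hypothetical study: 18 of 20 participants completed the task,
# compared against an assumed benchmark completion rate of 70%.
p_value = prob_at_least(successes=18, n=20, benchmark=0.70)
print(f"One-sided p-value: {p_value:.4f}")
```

A small p-value (here about 0.035) suggests the observed completion rate would be unlikely if the true rate were only at the benchmark, which supports the claim that the benchmark has been exceeded.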

Goal 3: To Compare Against Another Interface

A second type of measurement test (and the third goal) is to conduct usability tests for the purpose of direct comparison of two or more interfaces. Interfaces can range from physical products to websites to mobile apps to enterprise software. The comparison may also be with a prior study where the goal was a comparison against a benchmark.

Interfaces are compared using one or multiple task- or study-level metrics as described in the benchmark study type section. For example, interfaces can be compared using task-level metrics to determine which had the higher completion rate (effectiveness), had the lower average task time (efficiency), or had the higher scores on the UMUX-Lite (measures of usefulness and ease).

Metrics can be compared individually using the recommended statistical tests (usually t-tests and two-proportion tests; see Chapter 5 of Quantifying the User Experience) and confidence intervals. Comparing multiple simultaneous metrics (e.g., completion rates, time, and ease) can be more complicated and requires either the careful use of multivariate statistical tests (using software such as SPSS) or combining the metrics into an aggregated score, such as the Single Usability Metric (SUM).
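As a sketch of comparing a single metric between two interfaces, the code below runs a plain pooled two-proportion z-test on completion rates. This is a textbook version that may differ from the specific variant recommended in the book cited above, and the counts are hypothetical.

```python
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 40 of 50 completed the task on interface A, 30 of 50 on interface B.
z, p_value = two_proportion_z_test(40, 50, 30, 50)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

The normal approximation used here becomes unreliable with small samples or completion rates near 0% or 100%, so treat it as a starting point rather than a recommended procedure.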

To achieve statistical significance, sample sizes also tend to be larger (dozens to hundreds). The actual sample size depends on how large a difference exists. Large differences between interfaces (e.g., a completion rate of 80% for one interface and 25% for another) need smaller sample sizes. Small differences (e.g., completion rates of 80% vs. 85%) require larger sample sizes.
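The relationship between the size of the difference and the sample size can be sketched with the standard normal-approximation formula for comparing two independent proportions. The two-sided alpha of 0.05 and the power of 80% are assumptions chosen for illustration, and the results are approximate planning figures only.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate participants needed per interface to detect p1 vs. p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    )
    return math.ceil((numerator / (p1 - p2)) ** 2)

n_large_gap = sample_size_per_group(0.80, 0.25)  # large difference in completion rates
n_small_gap = sample_size_per_group(0.80, 0.85)  # small difference in completion rates
print(n_large_gap, n_small_gap)
```

Under these assumptions the large difference needs on the order of a dozen participants per interface, while the five-point difference needs roughly nine hundred per interface. The normal approximation is crude at the small end.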

As with a benchmark test, this type of study will be more formal than a problem-discovery study, but there is usually ample opportunity to identify usability problems. If you’re studying competitors who have made different design decisions, this type of test helps you understand where these design decisions have degraded or improved the user experience relative to your design, primarily at the level of the UX metrics you’ve collected, but also with regard to specific usability problems. Investigating the competitive landscape in this way can reveal opportunities for product improvement (goldmines) and aspects of designs to avoid (landmines).


