Does Thinking Aloud Uncover More Usability Issues?

Jeff Sauro, PhD • Jim Lewis, PhD

One of the most popular UX research methods is Think Aloud (TA) usability testing.

In TA testing, participants speak their thoughts while attempting tasks. The process helps researchers identify usability problems and potential fixes. But is the process of thinking aloud necessary to uncover problems and insights?

Earlier, we investigated the effects of TA on UX metrics and found it may have some unintended consequences. While it has little effect on study-level metrics such as the SUPR-Q® and UX-Lite®, thinking aloud increased task times and dropout rates.

So, there are some clear unintended consequences of having participants think aloud while performing tasks. Do the perceived benefits outweigh these costs? It may seem like a question with an obvious answer (such as “thinking aloud takes longer”), but in our experience, the benefit of addressing such questions with data is not a simple yes or no but a quantified effect of the method. For example, does TA modestly or substantially improve the number of problems found?

The benefits of TA may also depend on what people say when they think aloud. Reading task instructions and the occasional um and ah may not provide a lot of insight for a researcher. This is especially the case when a moderator is not present to prompt a participant to think aloud.

In our earlier analysis of coding what people say during unmoderated TA studies, we found that the bulk of utterances were the more helpful Level 2 and Level 3 verbalizations, which are more likely to reveal confusion—the precursor to problems. On average, less than 20% of participants’ verbalization time was spent reading text (for example, task instructions and on-screen text). This suggests an opportunity to garner more insights from what people are saying and not just doing.

Although the TA method has been used with an attending moderator for decades, technological advances over the past 15 years have allowed participants to think aloud remotely using software (like our MUIQ® platform) without the need for a moderator. We continued our investigation into the effects of thinking aloud on data collected during these unmoderated studies by examining the number of problems uncovered in thinking-aloud versus silent task attempts.

Comparing Problems Uncovered in TA vs. Non-TA Videos

We used a subset of the data we collected and reported on in our earlier articles on the effect of TA on time and dropout rates. Between January and September 2022, we conducted ten studies with participants from an online US-based panel using the MUIQ platform. In each study, participants were randomly assigned to a TA or non-TA condition. In both conditions, participants attempted one task on a website. The websites were in the travel, real estate, and restaurant industries. We selected four datasets (United, United repeat, Zillow, and OpenTable) for our analysis. We covered details about several of these studies (including descriptions of several of the tasks) in our earlier article on TA-related dropout rates.

We used six evaluators to uncover the usability issues and insights (coded with initials). All evaluators had some experience with detecting usability issues and conducting usability tests. The experience ranged from being relatively new (one evaluator had observed 10–20 sessions before this study) to highly experienced (over 1,000 sessions).

The evaluators were instructed to identify problems and insights by watching all the videos in batches (all TA or all non-TA). They varied the order in which they watched the video types, sometimes starting with the TA batch and sometimes with the non-TA batch. Each evaluator coded between 12 and 24 videos per condition (TA vs. non-TA) for each website, which worked out to between 30 and 41 videos per website and a total of 374 video evaluations.

Examples of identified problems included a participant trying to select a seat from the seat selection legend (Video 1) and a participant expecting seat selection to occur earlier in the task flow (Video 2).

Video 1: A user tries to make a selection from the seat legend.

Video 2: A participant expects seat selection would have happened earlier.

Given the inherent variability between evaluators, we had multiple evaluators code some of the same sets of videos, leaving 153 unique videos split between TA and non-TA. Because there were different numbers of videos in each condition and website (e.g., 17 TA versus 24 non-TA videos on United), we used the number of problems per participant video to control for the different numbers of videos.

Table 1 shows the number of videos watched and the number of issues coded for each condition, website, and evaluator. We calculated the number of problems found per video and compared the ratio between TA videos and non-TA videos.

                                TA                         Non-TA
Website     Evaluator   Videos  Issues  Prob/Video   Videos  Issues  Prob/Video   Ratio
United      Y             17      35      2.06         24      30      1.25         65%
            E             16      31      1.94         24      29      1.21         60%
            C             17      38      2.24         24      37      1.54         45%
            D             16      22      1.38         23      11      0.48        188%
Zillow      E             18      23      1.28         12      11      0.92         39%
            C             18      22      1.22         12      19      1.58        −23%
            Y             18      11      0.61         12       8      0.67         −8%
OpenTable   E             23      29      1.26         18      15      0.83         51%
            W             23      25      1.09         18      15      0.83         30%
United 2    S             21      13      0.62         20       8      0.40         55%
Overall                  187     249      1.33        187     183      0.98         50%

Table 1: Number of problems identified by evaluator (initial used) per video and study for both TA and non-TA conditions.

For example, evaluator Y identified 35 unique issues across 17 TA videos in the United study compared to 30 issues from 24 non-TA videos. That works out to 2.06 problems per TA video compared to 1.25 problems per non-TA video—65% more problems uncovered per TA video than per non-TA video for this evaluator. We see a similar pattern across evaluators.
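The per-video rates and the percentage difference can be reproduced directly from the counts in Table 1; a minimal sketch in Python using evaluator Y's United numbers:

```python
# Counts from Table 1 (evaluator Y, United study)
ta_issues, ta_videos = 35, 17
non_ta_issues, non_ta_videos = 30, 24

ta_rate = ta_issues / ta_videos              # problems per TA video
non_ta_rate = non_ta_issues / non_ta_videos  # problems per non-TA video
pct_more = (ta_rate / non_ta_rate - 1) * 100

print(f"{ta_rate:.2f} vs. {non_ta_rate:.2f} problems/video ({pct_more:.0f}% more)")
# 2.06 vs. 1.25 problems/video (65% more)
```

The same calculation applied to each evaluator-website pair produces the Ratio column in Table 1.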

Across the ten evaluator-website combinations, more problems per video were uncovered in the TA condition in eight of the ten, and on average (taking the mean of the ten ratios), 50% more problems were identified in the TA videos. Aggregating across all videos, the number of problems per video was 1.33 for TA and .98 for non-TA (36% more discovery in TA). This difference is statistically significant using a paired t-test (t(9) = 3.13, p = .01).
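Both summary figures and the paired t statistic can be checked from the problems-per-video values in Table 1 (transcribed below) using only the Python standard library; the pairs are the ten evaluator-website combinations:

```python
from statistics import mean, stdev

# Problems per video for the ten evaluator-website combinations (Table 1)
ta =     [2.06, 1.94, 2.24, 1.38, 1.28, 1.22, 0.61, 1.26, 1.09, 0.62]
non_ta = [1.25, 1.21, 1.54, 0.48, 0.92, 1.58, 0.67, 0.83, 0.83, 0.40]

# Mean of the ten per-pair ratios (the 50% figure)
mean_of_ratios = mean(a / b - 1 for a, b in zip(ta, non_ta)) * 100

# Aggregate discovery rate from total issue counts (the 36% figure)
agg_more = (249 / 183 - 1) * 100  # 249 TA issues vs. 183 non-TA issues

# Paired t statistic: mean difference over its standard error, df = n - 1 = 9
diffs = [a - b for a, b in zip(ta, non_ta)]
t = mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)

print(f"mean of ratios = {mean_of_ratios:.0f}%, aggregate = {agg_more:.0f}%, t(9) = {t:.2f}")
# mean of ratios = 50%, aggregate = 36%, t(9) = 3.13
```

This reproduces both analytical summaries and the reported t(9) = 3.13 from the same ten pairs of values.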

The advantage of using different evaluators on the same dataset is that we can see the variation in problem discovery for the same videos. For the United study, the number of issues uncovered by the four evaluators varied quite a bit: the TA condition ranged from a low of 22 to a high of 38 issues, and the range was even more substantial (from 11 to 37) in the non-TA condition.

This is another example of the evaluator effect, where all the evaluators watching the same videos will uncover some issues in common, but each evaluator will also uncover different issues not found by the others. If we were to compare different sets of evaluators and videos, the effects of thinking aloud on problem discovery would be confounded because of this expected variation among evaluators. Would differences be because of the evaluators or because of thinking aloud? By using a within-subjects approach in this study, we were able to control for differences between evaluators and better isolate the effects of thinking aloud.

When speaking with the evaluators, a common comment was that TA helped clarify ambiguous situations in a video. For example, was someone confused or just pausing while attempting a task? The verbalizations (usually Level 2 and Level 3) added that context.

We used the number of problems identified by each evaluator as a primary metric. But are they really problems? It’s hard to know. It could be that TA generates false positives—that is, an evaluator might flag something as a problem because of what they hear. Verifying the validity of the identified problems would require a more complicated analysis. However, this analysis suggests that whatever an evaluator considers a problem, more of them will likely be revealed in TA than in non-TA videos collected in an unmoderated usability test.

Summary and Discussion

An evaluation of 153 videos of participants thinking aloud compared to not thinking aloud revealed that

TA videos generated 36–50% more problems. The average number of problems identified per video by each evaluator was 1.33 for Think Aloud videos and .98 for non-TA unmoderated videos, 36% more for TA. The mean of the ratios for the combinations of evaluator and website was 50% more for TA. Regardless of the analytical method, it seems clear that problem discovery is substantially greater when participants think aloud.

There was clear variability between evaluators. As expected, different evaluators uncovered different numbers of issues even after watching the same participant videos. This difference was seen in both TA and non-TA conditions. A future analysis will dig deeper into the disagreement between evaluators and investigate whether TA helps or hurts agreement.

This wasn’t a double-blind study. It’s possible that having evaluators watch a TA video and a non-TA video on the same website may predispose or anchor them to finding more problems in the TA video (especially if they felt TA was already a better method for uncovering problems). A future analysis could attempt to blind evaluators to the study intention and use separate evaluators for the TA and non-TA videos (which would require a larger sample).

It’s unclear whether the problems identified are all real problems. We used the number of problems as a primary metric. It could be that evaluators in this type of study tend to generate false positives at a higher rate in TA. It’s a complicated topic: what makes something a legitimate problem? Also, in this analysis, we have not looked at the overlap in problems between methods or between evaluators (a possible topic for future research).
