10 Key Takeaways from the Latest Research on Thinking Aloud in Usability Testing

Jeff Sauro, PhD • Jim Lewis, PhD

In Think Aloud (TA) testing, participants speak their thoughts while attempting tasks. The process is meant to help researchers identify usability problems and potential fixes. It’s a distinctive method in UX research.

Despite its popularity, there are many open research questions about the efficacy and potential side effects of think-aloud research. Researchers still disagree on the “correct” way to think aloud (e.g., Fan et al., 2020; Molich et al., 2020; O’Brien & Wilson, 2023), and there are many different TA methods, with the most recent being unmoderated TA (no attending observer).

In the last two years, we’ve researched the unmoderated TA method extensively, contributing to the field’s understanding and advancing the method. Here is a summary of ten key findings from our research.

  1. TA identifies roughly 36–50% more problems. For such a popular method, you would think there would be well-documented data showing the benefits of having participants think aloud compared to attempting tasks silently. There wasn’t, but in our evaluation of 153 videos, split between TA and non-TA, we found that evaluators uncovered 36–50% more problems with think-aloud tasks than with silent task performance. This suggests TA is more effective at uncovering problems than silent task attempts, likely because participant utterances clarify behaviors that would otherwise be hard to interpret (balanced against some risk that utterances might occasionally be misleading).
  2. A lot of TA time is spent being silent. Participants don’t speak constantly during a typical TA session. In a detailed coding of 27 TA videos, we found that only about a third of the task time included verbalizations.
  3. Most verbalizations describe actions. In our coding of 27 TA videos, we found that the largest share of verbalizations (50%) involved what Ericsson and Simon called Level 2 verbalizations: participants describing actions as they attempted a task. A smaller but still substantial percentage (27%) of the verbalizations involved participants explaining their decisions (Level 3), which might provide more insight into problems. But do these verbalizations help or hinder agreement?
  4. TA might slightly improve agreement rates. The so-called evaluator effect is the phenomenon in which different researchers tend to find largely different sets of problems (typically only about 30% are in common). We looked at whether participant verbalizations led to higher agreement rates. We found the average any-2 agreement was higher for TA than non-TA using the same evaluators (41% vs. 34%), and slightly (though not statistically significantly) more problems were uncovered by all four evaluators in TA than in non-TA (18% vs. 14%). (A sketch of how any-2 agreement can be computed appears after this list.)
  5. Only around 10–20% of online panelists participate in TA. Not everyone is comfortable sharing their screen, microphone, or web camera. Our analysis across multiple datasets and over 1,000 participants found that roughly 9% of participants provide a usable think-aloud video when there is a delay between indicating a willingness to participate and the invitation to the study. For delayed invitations, if you need roughly ten usable think-aloud videos, expect to invite around 111 participants (see the planning sketch after this list).
  6. Country and age affect participation in TA studies. Large dropout rates are not only logistically challenging; they may also lead to systematic bias, depending on who ultimately participates. While there weren’t substantial differences in the few demographic variables we collected in our studies, we did find statistical differences between UK and US participants and between age ranges. Roughly 64% of the “No” group were 34 or younger, compared to 39% of the “Completed” group in that age range. Participants in the youngest cohort in our analysis (18–24) were more than twice as likely to decline as to participate (31% vs. 14%).
  7. Asking people to think aloud doubles the dropout rate. Across four online studies with 314 participants randomly assigned to TA or non-TA conditions, we found that participants who begin an online study and are then asked to think aloud are more than twice as likely to drop out (50% vs. 19%) as participants who are asked only to share their screen (no request to think aloud or share their webcam). The practical consequence is that if you ask participants to think aloud, plan for roughly twice as many prospective participants to start the study as the number of completed sessions you need.
  8. TA has little effect on study-level metrics such as the SUPR-Q® and UX-Lite®. Across six post-study metrics from ten remote unmoderated studies involving 423 participants, we found that thinking aloud had little impact on study-level measures. When there was an impact, it tended to lower metrics very slightly.
  9. TA does impact attitudinal data, such as perceptions of ease, at the task level. In the same ten studies, we found that thinking aloud tends to modestly depress post-task attitudinal metrics. Both post-task ease (the SEQ) and post-task confidence ratings were statistically significantly lower when aggregated across the studies. The differences, however, were not perfectly consistent (some studies showed opposite effects), and the effect was smaller (5% lower) than with task time (which was ~20% longer for TA). Thinking aloud had negligible effects on task completion rates in most studies, but it had a large impact in two studies. In one study, the completion rate for non-TA was significantly higher than for the TA condition (a difference of 24%, p < .10). That was balanced, however, by a 23% difference in the opposite direction in another study.
  10. TA increases task time by about 20%. In another analysis of the ten studies, we found that in nine of the studies participants took longer to complete tasks when thinking aloud (total task duration regardless of success). In only three studies were the differences statistically significant, but an ANOVA indicated a strong and significant main effect of method. On average, tasks took around 16% longer for participants in the TA condition. We found the same pattern when looking only at successful task attempts (i.e., a significant main effect of method). After removing the roughly 25% of attempts that didn’t meet the success criteria, participants in the TA condition took about 20% longer to complete their tasks.
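
To make the any-2 agreement figure in Takeaway 4 concrete, here is a minimal sketch of one common way to compute it, assuming a Jaccard-style pairwise definition (the number of problems found by both evaluators divided by the number found by either, averaged over all evaluator pairs). The evaluator problem sets below are hypothetical, not data from our studies.

```python
from itertools import combinations

def any_2_agreement(problem_sets):
    """Mean pairwise agreement: for each pair of evaluators, the size of the
    intersection of their problem sets divided by the size of the union,
    averaged over all pairs."""
    pairs = list(combinations(problem_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical problem sets from four evaluators who reviewed the same videos.
evaluators = [
    {"P1", "P2", "P3", "P5"},
    {"P1", "P2", "P4"},
    {"P2", "P3", "P5", "P6"},
    {"P1", "P2", "P5"},
]
print(round(any_2_agreement(evaluators), 2))  # 0.47
```

Values well below 100% are typical; that gap between evaluators is what the evaluator effect refers to.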

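The sample-size arithmetic behind Takeaways 5 and 7 is simple enough to script. The sketch below assumes the rates reported above (a roughly 9% yield for delayed invitations, and roughly 50% vs. 19% dropout); the function names are ours for illustration, not from the original studies.

```python
import math

def invites_needed(usable_needed: int, yield_rate: float) -> int:
    """Invitations required to expect `usable_needed` usable TA videos."""
    return math.ceil(usable_needed / yield_rate)

def starters_needed(completions_needed: int, dropout_rate: float) -> int:
    """Participants who must start the study to expect the desired completions."""
    return math.ceil(completions_needed / (1 - dropout_rate))

# Takeaway 5: ~9% of delayed invitations yield a usable TA video,
# so ten usable videos means roughly 111-112 invitations.
print(invites_needed(10, 0.09))   # 112

# Takeaway 7: ~50% of TA starters drop out vs. ~19% of non-TA starters.
print(starters_needed(10, 0.50))  # 20 starters for 10 completed TA sessions
print(starters_needed(10, 0.19))  # 13 starters for 10 completed non-TA sessions
```

Because math.ceil rounds up, these are whole-participant estimates that err on the side of inviting slightly more people.
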
Overall, the benefits of TA in unmoderated usability testing (Takeaways 1, 3, and 4) seem to outweigh its drawbacks (Takeaways 5, 7, 9, and 10), especially when the focus is on problem discovery and when participant utterances clarify behaviors that would otherwise be hard to interpret. When the focus of a usability study is something other than maximizing problem discovery and interrater reliability (for example, keeping sessions as short as possible or dropout as low as possible), non-TA will be the better choice.
