Effect of Thinking Aloud on UX Metrics: A Review of the Evidence

Jim Lewis, PhD and Jeff Sauro, PhD

Think Aloud (TA) usability testing is a popular UX research method.

Having participants speak their thoughts as they attempt tasks helps researchers understand possible sources of misunderstandings so they can identify and potentially fix problems.

The signature method of having users think aloud traces its roots to work by Freud (psychoanalysis), Wundt (introspection), and Skinner (behaviorism). In UX research, think aloud has moved from “fixing” people to fixing an interface.

In the history of psychology, psychoanalysis and introspection were developed and practiced at the turn of the 20th century. Behaviorism supplanted introspection in the early to middle of the 20th century, focusing only on external behaviors and avoiding speculation about internal mental processes. After that, cognitive psychology began to replace behaviorism as the dominant approach in experimental psychology, but even through the late 1970s, there was still a reluctance to treat participant verbalizations as data. That changed in 1980 with the publication of Ericsson and Simon’s influential paper, “Verbal Reports as Data.”

Ericsson and Simon provided evidence that certain kinds of verbal reports could produce reliable data. They identified three levels of verbalization:

Level 1: Verbalizations readily available in verbal form during task performance that require no additional cognitive processing (e.g., reading written text aloud).

Level 2: Verbalizations readily available in nonverbal form during task performance that require little or no additional cognitive processing (e.g., participants describing their actions while performing a task).

Level 3: Information not currently in the participant’s attention that requires substantial additional cognitive processing (e.g., participants asked to report all perceived traffic hazards while they are driving a car; participants asked to report feelings, motives, or explanations).

Level 3’s retrospective verbalizations can affect thought processes and behavior (famously demonstrated by Nisbett & Wilson, 1977), so Ericsson and Simon excluded them from their conception of TA. Interpreting Level 3 verbalizations as data can be risky because, when asked by a moderator, people may feel obliged to provide an answer. In practice, however, most TA sessions include all three levels of verbalizations and other interactions with the moderator (as opposed to unmoderated usability testing).

Although the TA method may be fruitful for identifying problems in an interface, does it cause unwanted side effects in what we measure? In our earlier eye-tracking study, we found that TA affected where and how people looked at website home pages (corroborating earlier findings). But does TA also interfere with other, more traditional UX metrics? Do TA participants complete tasks faster or slower? Do they rate tasks as harder or easier? Are the results consistent across studies?

Five Published Studies on Thinking Aloud and Task Time

Despite the popularity of think-aloud studies, there’s surprisingly little published about the effects of this methodology on UX metrics relative to silent (non-TA) task performance. Our review of the literature identified five studies that all included an examination of task time differences along with some comparison of other UX metrics.

Bowers and Snyder (1990) compared standard and retrospective TA protocols as participants performed tasks that differed in the minimum number of windows they needed to open on one of two monitors that varied in size and resolution. The experiment used a between-subjects design for protocol and monitor (12 participants in each condition) and a within-subjects design for the windowing tasks. The retrospective TA protocol took place in two stages: (1) task completion without TA, and (2) a review of task videos during which participants produced verbal reports about what they were doing and thinking. Completion times, number of steps, and ease ratings for standard TA and the non-TA stage of the retrospective protocol were not significantly different.

In a series of educational experiments, Berry and Broadbent (1990) had participants make repeated computerized queries, either silently or while thinking aloud, to solve a problem as quickly as possible (e.g., from a list of 16 factories and their pollutants, determine which factory is responsible for polluting a river by “testing” for pollutants one at a time). They found evidence that the process of thinking aloud during problem-solving invoked cognitive processes that reduced the number of queries needed to solve the problem, but only if people were (1) given verbal instructions on how to perform the task and (2) required to justify each action aloud. On the other hand, participants in the verbalization group needed more time to complete the task (the TA group was about 62% slower than the non-TA group). All participants eventually completed all tasks.
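The percent-slower comparisons quoted throughout this review are simple ratios of group mean completion times. A minimal sketch in Python (the `pct_slower` helper and the mean times are our own hypothetical illustrations, not data from the cited studies):

```python
def pct_slower(ta_mean: float, non_ta_mean: float) -> float:
    """Percent by which the TA group's mean task time exceeds the non-TA group's."""
    return (ta_mean / non_ta_mean - 1.0) * 100.0

# Hypothetical group means in seconds: a TA mean of 162 s against a
# non-TA mean of 100 s corresponds to a 62% slowdown, the size of the
# effect Berry and Broadbent (1990) reported.
print(round(pct_slower(162.0, 100.0)))  # 62
```

The same ratio with the TA and non-TA means swapped would describe a faster TA group, as in Wright and Converse (1992).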

Wright and Converse (1992) compared silent with TA usability testing protocols with the specific goal of seeing whether Level 3 verbalizations affected usability testing metrics such as completion times. Their participants were assigned to either a TA or non-TA group, who then attempted four increasingly difficult tasks with the MS-DOS operating system (popular in the 1980s and early 90s), ranging from counting files in a directory to changing an executable program via a hex editor. The results indicated that the TA group committed fewer errors and completed tasks about 43% faster than the silent group, and the difference between the groups increased as a function of task difficulty. There were no significant differences between TA and non-TA for workload (NASA TLX) or ratings of ease.

In research comparing silent task completion with two TA protocols, Hertzum, Hansen, and Andersen (2009) found that TA task completion times were significantly longer than non-TA times. In a completely within-subjects design, eight participants completed four tasks using four websites (two Danish TV station websites and two online bookstores). The task set was made up of two searches for facts and two tasks that required some assessment. Hertzum et al. employed an elaborate counterbalancing scheme that provided experimental control over all the variables except the order in which participants experienced the TA protocols.

In the first session, participants worked on two assigned tasks, silently on one and thinking aloud on the other (counterbalanced order of presentation). The TA protocol was consistent with the Ericsson and Simon (classic) method (only Level 1 and 2 verbalizations). In the second session, participants worked on the two remaining tasks, silently on one and thinking aloud on the other with a relaxed TA protocol that allowed Level 3 verbalizations. The time spent on assessment tasks was longer than on fact tasks. Averaging across task types, classic TA took about 37% longer than non-TA, and relaxed TA took about 85% longer than non-TA. For both TA and non-TA, task completion was longer in the second than the first session, but because this order was not counterbalanced, this result cannot be attributed to the different protocols (e.g., relaxed TA was generally longer than classic, but the increase in time also occurred for the non-TA protocol, which was methodologically the same across sessions). NASA TLX ratings indicated a generally higher workload for classic TA relative to non-TA and a significantly higher workload for relaxed TA relative to non-TA. There were no significant differences in the correctness of task solutions.

Olmsted-Hawala et al. (2010) used a double-blind procedure to investigate the effect of different TA procedures on successful task completion, completion times, and satisfaction. The TA conditions were classic TA, relaxed TA, and coaching TA (moderators could freely probe participants), and there was a non-TA control. Outcomes were similar for non-TA, classic, and relaxed conditions. The coaching TA condition had higher successful task completions and satisfaction ratings.

The key aspects of each of these research papers are summarized in Table 1.

| Research Paper | Finding | Sample Size | Experimental Design | Tasks |
|---|---|---|---|---|
| Bowers & Snyder (1990) | No difference in steps, time, or ease for the two protocols (concurrent vs. retrospective TA) | 48—probably college students (ad in Virginia Tech school newspaper) | Between-subjects for TA conditions, moderated | Large number of windowing tasks in 3 difficulty levels and 4 blocks |
| Berry & Broadbent (1990) | Most efficient solution with TA (fewer questions asked) but about 62% slower time | 24—Oxford University participant panel, ages 18–45 | Between-subjects for TA conditions, moderated | Computerized query task (e.g., river pollution problem) |
| Wright & Converse (1992) | TA 43% faster with fewer errors; no effect on ease or workload ratings | 24—mix of university students, faculty, staff, and volunteers | Between-subjects for TA conditions, moderated | Four disk management tasks of varying difficulty |
| Hertzum, Hansen, & Andersen (2009) | TA generally slower (37% for classic TA with Level 1 and 2 verbalizations; 85% slower for relaxed TA with Level 3 verbalizations); higher workload ratings for TA conditions; no difference in task success | 8—daily computer users | Within-subjects for TA conditions, moderated | Four information search tasks (two fact, two assessment), counterbalanced across four TV and bookstore websites |
| Olmsted-Hawala, Murphy, Hawala, & Ashenfelter (2010) | No significant differences in completion time for non-TA (silent), classic TA, relaxed TA, or coaching TA; higher task completion and satisfaction for coaching | 80—adults from U.S. Census Bureau Usability Lab volunteer panel | Between-subjects for TA conditions (double-blind), moderated | 8 simple “find” tasks searching for data on the U.S. Census Bureau website |

Table 1: Summaries of studies comparing completion times for TA and non-TA research protocols.

Other papers in the TA literature have compared different TA protocols (e.g., Krahmer & Ummelen, 2004; Hertzum & Holmegaard, 2013; McDonald, Mcgarry, & Willis, 2013; Alhadreti & Mayhew, 2017, 2018). We did not include those papers in this review because either they did not include a non-TA condition for comparison with TA or they had unique experimental manipulations affecting task times.

Summary and Discussion

The completion time outcomes of these studies were inconsistent. A research question common across all five studies was whether thinking aloud during task performance increased task completion time relative to a non-TA condition. There are three possible outcomes to a question like this: (1) slower completion time when thinking aloud, (2) faster completion time when thinking aloud, or (3) no difference (ignore for this discussion the distinction between statistical and practical significance). Three early attempts to answer this research question reported all three outcomes (no difference from Bowers & Snyder, 1990; slower task completion from Berry & Broadbent, 1990; and faster task completion from Wright & Converse, 1992). Two later papers did not resolve the question (slower task completion from Hertzum et al., 2009; no difference in Olmsted-Hawala et al., 2010).

The experimental design of Berry and Broadbent was very different from the other studies. Most TA research used tasks that focused on the UX of computer usage. The research of Berry and Broadbent did use a computer, but the focus was not on the UX of computer use; instead, it was on the role of instructions and thinking aloud in solving a problem efficiently (in fewer steps and less time). Indeed, they found that the combination of giving instructions on how to solve the puzzle in fewer steps plus verbalization led to problem solution in significantly fewer steps than verbalization alone, but working out that more efficient path required more time than a simpler brute force approach.

Verbalization levels differed or were manipulated in the studies with computer-related tasks. Bowers and Snyder (1990) reported low-level verbalizations. Wright and Converse (1992) explicitly encouraged Level 3 verbalizations. Hertzum et al. (2009) and Olmsted-Hawala et al. (2010) manipulated verbalization with classic conditions that matched Ericsson and Simon’s (1980) protocol and relaxed conditions designed to elicit Level 3 verbalizations. The surprise in Wright and Converse’s study was that despite the general expectation that Level 3 verbalizations would slow down task performance, their results indicated that Level 3 verbalizations reduced the time needed to complete the MS-DOS operating system tasks, especially when the tasks were relatively difficult. In contrast, Hertzum et al. reported slower task completion for TA, and Olmsted-Hawala et al. reported no significant differences.

Generally, there were no differences in subjective ratings, but not always. Both Bowers and Snyder (1990) and Wright and Converse (1992) reported no differences in ease ratings, as did Olmsted-Hawala et al. (2010) for satisfaction ratings. For NASA TLX workload ratings, Wright and Converse reported no difference, but Hertzum et al. (2009) found that TA participants reported higher workload than non-TA participants.

Most studies did not include information about successful task completion rates. Hertzum et al. (2009) reported no differences among conditions in successful task completions for their fact tasks (there were no criteria for success in the assessment tasks). Completion rates were higher in Olmsted-Hawala et al. (2010) for the coaching condition, but there were no significant differences among the other conditions (non-TA, classic TA, and relaxed TA).

Clearly, more data are needed. These outcomes illustrate the danger of relying on the results of any one study, no matter how carefully designed and executed, to draw conclusions about this type of research question. Based on these studies, it’s impossible to guide the expectations of UX practitioners and researchers regarding whether thinking aloud does or does not increase task completion times relative to a non-TA protocol. Also, the current literature does not provide any insight into how TA and non-TA protocols would affect completion times in unmoderated usability studies, which by definition have no moderators interacting with participants to elicit verbalizations (a research gap that we plan to address in future articles).
