Does Thinking Aloud Affect Task Metrics?

Jeff Sauro, PhD • Jim Lewis, PhD

One of the most popular UX research methods is Think Aloud (TA) usability testing.

Having participants speak their thoughts while working on tasks helps researchers identify usability problems and potential fixes. But does the added burden of speaking while attempting a task make the experience harder and affect perceptions of the website or app being evaluated? Or does articulating thoughts have a negligible or even positive effect on perception?

In an earlier article, we investigated the effect of thinking aloud on task times in remote unmoderated usability testing. It might seem obvious that thinking aloud would increase task completion time, but findings in five earlier studies on moderated TA have been inconsistent, and some researchers concluded that the additional cognitive effort involved in TA could even make task completion faster. However, the results of our research indicate that, on average, it took participants about 20% longer to complete a task when thinking aloud in an unmoderated study compared to participants who attempted tasks silently.

In our review of the five published studies that compared silent usability testing with moderated TA, we found mixed results for the impact of TA on other metrics (not just task completion time). These studies generally, though not always, found no differences in subjective ratings. There were no differences reported in ease ratings (two studies) or satisfaction ratings (one study), but one study reported higher NASA TLX workload ratings for participants who were thinking aloud.

Most studies did not include information about successful task completion rates. The two that did reported no differences between TA and non-TA conditions.

In this article, we’ll review the same dataset of ten studies used in the task-time analysis to examine the possible impacts of TA on other task-level UX metrics collected during the studies: ease, confidence, and completion.

Analysis of TA on Task Metrics from Ten Studies

Between January and September 2022, we conducted ten studies with participants from an online U.S.-based panel using the MUIQ® platform. In each study, participants were randomly assigned to a TA or non-TA condition. In both conditions, participants attempted one task on a travel, real estate, or restaurant website (e.g., United.com, Zillow.com).

We have covered details about several of these studies (including descriptions of several of the tasks) in our earlier article on TA-related drop-out rates. Each of these ten studies had between 40 and 60 unique participants. During our screening and recruiting process, we ensured that participants couldn’t take a study more than once. Furthermore, we ensured that participants in the TA condition were actually thinking aloud by reviewing the videos. Additional data preparation details are available in the article on TA task time analysis.

After completing the data quality steps across the ten studies, we had a final sample size of 423 participants; 221 in the TA condition and 202 in the non-TA condition. This is by far the largest dataset collected to date that is suitable for comparing TA with non-TA for remote unmoderated studies.

Task Completion Rates

Across the ten studies, task-success criteria were predetermined (e.g., the correct name of a hotel or the price of a flight). The bar for success was set relatively low in these studies, as task success wasn’t the main focus of analysis. Consequently, the success rate for most tasks was on the higher side (averaging 80%, ranging from 55% to 100%). Table 1 shows that in seven of the ten studies, participants in the TA condition had a lower task completion rate than those in the non-TA condition. The difference was statistically significant in only one study (United 3, p = .07). Sample sizes ranged from 12 to 24 per condition, so only large differences could be identified as statistically significant. Aggregated across studies, the average completion rates were similar, at 78% for those in the TA condition versus 82% for those in the non-TA condition.

| Study | Task Comp. TA | Task Comp. Non-TA | Diff |
|---|---|---|---|
| United 1 | 86% | 100% | 14% |
| United 2 | 73% | 70% | −3% |
| United 3* | 62% | 86% | 24% |
| Tripadvisor 1 | 83% | 89% | 6% |
| Tripadvisor 2 | 85% | 63% | −23% |
| Tripadvisor 3 | 88% | 90% | 2% |
| Kayak | 95% | 100% | 5% |
| Zillow | 59% | 55% | −4% |
| OpenTable | 78% | 89% | 11% |
| Hilton | 73% | 76% | 3% |

Table 1: Completion rates in the TA and non-TA conditions. Studies with a * indicate statistical significance at p < .10. Sample sizes ranged from 12 to 24 per condition.
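The aggregate figures quoted for Table 1 can be reproduced from the table itself. The sketch below is an illustration rather than the authors' analysis code: it assumes the aggregated rates are unweighted means of the ten per-study percentages (which happens to match the reported 78% and 82%), and the `summarize` helper is a hypothetical name.

```python
# Completion rates (%) transcribed from Table 1 as (TA, non-TA) per study.
completion = {
    "United 1": (86, 100),
    "United 2": (73, 70),
    "United 3": (62, 86),
    "Tripadvisor 1": (83, 89),
    "Tripadvisor 2": (85, 63),
    "Tripadvisor 3": (88, 90),
    "Kayak": (95, 100),
    "Zillow": (59, 55),
    "OpenTable": (78, 89),
    "Hilton": (73, 76),
}


def summarize(rates):
    """Return unweighted mean rates and the number of studies where TA was lower.

    Assumption: each study counts equally, regardless of its sample size.
    """
    n = len(rates)
    mean_ta = sum(ta for ta, _ in rates.values()) / n
    mean_non_ta = sum(non_ta for _, non_ta in rates.values()) / n
    ta_lower = sum(1 for ta, non_ta in rates.values() if ta < non_ta)
    return mean_ta, mean_non_ta, ta_lower


mean_ta, mean_non_ta, ta_lower = summarize(completion)
print(f"TA: {mean_ta:.0f}%  non-TA: {mean_non_ta:.0f}%  "
      f"TA lower in {ta_lower} of {len(completion)} studies")
# prints: TA: 78%  non-TA: 82%  TA lower in 7 of 10 studies
```

Weighting each study by its sample size would be a reasonable alternative, but the per-study counts are not listed in the table, so the unweighted mean is used here.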

Post Task Ease (SEQ)

The Single Ease Question (SEQ®) is a single seven-point item that asks participants to rate how difficult or easy they found the task. It has a historical average score of 5.5 and is highly correlated with the NASA TLX (a measure of mental effort).

Table 2 shows that in six of the ten studies, participants in the TA condition rated the task as more difficult (lower SEQ means). Two of these were statistically significant (p < .10). Of the four studies in which ease was higher for TA, two (United 1 and OpenTable) had statistically significantly higher SEQ scores.

An advantage of using the SEQ is that we have enough historical data to know when a task is relatively hard or easy. We can see some tasks were harder than average (e.g., United 2 TA, Tripadvisor 3 TA, Zillow TA) and some were easier than average (e.g., all Tripadvisor non-TA, OpenTable TA, Hilton TA).

| Study | Average SEQ TA | Average SEQ Non-TA | Diff | % Harder for TA |
|---|---|---|---|---|
| United 1* | 5.5 | 4.8 | −0.7 | −16% |
| United 2 | 3.9 | 4.6 | 0.7 | 16% |
| United 3 | 4.8 | 5.3 | 0.5 | 10% |
| Tripadvisor 1 | 5.3 | 6.0 | 0.7 | 12% |
| Tripadvisor 2* | 5.2 | 6.2 | 1.0 | 17% |
| Tripadvisor 3* | 4.8 | 6.1 | 1.3 | 21% |
| Kayak | 5.6 | 6.0 | 0.4 | 6% |
| Zillow | 4.4 | 4.3 | −0.05 | −1% |
| OpenTable* | 6.5 | 6.1 | −0.4 | −6% |
| Hilton | 6.5 | 6.0 | −0.5 | −7% |

Table 2: SEQ means in the TA and non-TA conditions. Studies with a * indicate statistical significance at p < .10. Sample sizes ranged from 12 to 24 per condition.

We also conducted an ANOVA on the combined dataset using method (TA vs. non-TA) and study as factors. The results showed statistically significant main effects for both method and study (Method: F(1, 403) = 4.4, p = .037; Study: F(9, 403) = 8.7, p < .0001) and a statistically significant interaction (F(9, 403) = 2.2, p = .022). Overall, the TA condition had 6% lower SEQ scores than non-TA (5.2 versus 5.5).
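The degrees of freedom in a two-factor ANOVA follow directly from the design, which gives a quick sanity check on reported F statistics. The sketch below assumes a fully crossed between-subjects design (2 methods by 10 studies) with all 423 participants included; `two_way_anova_df` is a hypothetical helper, not part of any library.

```python
def two_way_anova_df(levels_a, levels_b, n_total):
    """Degrees of freedom for a two-way between-subjects ANOVA with interaction."""
    df_a = levels_a - 1                       # main effect of factor A
    df_b = levels_b - 1                       # main effect of factor B
    df_interaction = df_a * df_b              # A x B interaction
    df_error = n_total - levels_a * levels_b  # N minus the number of cells
    return {
        "method": df_a,
        "study": df_b,
        "interaction": df_interaction,
        "error": df_error,
    }


# Factor A: method (TA vs. non-TA); factor B: the ten studies; N = 423.
df = two_way_anova_df(levels_a=2, levels_b=10, n_total=423)
print(df)
# prints: {'method': 1, 'study': 9, 'interaction': 9, 'error': 403}
```

Under these assumptions, the method effect has 1 numerator degree of freedom, the study effect and the interaction each have 9, and all three share 403 error degrees of freedom.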

Confidence

In addition to the SEQ, participants were asked how confident they were that they completed the task successfully using a single seven-point item (1 = not at all confident; 7 = extremely confident).

| Study | Average Conf. TA | Average Conf. Non-TA | Diff | % Less Confident for TA |
|---|---|---|---|---|
| United 1* | 6.1 | 5.2 | −0.9 | −18% |
| United 2 | 4.8 | 5.0 | 0.2 | 3% |
| United 3 | 5.3 | 5.9 | 0.6 | 9% |
| Tripadvisor 1 | 6.2 | 6.6 | 0.4 | 7% |
| Tripadvisor 2* | 6.3 | 6.5 | 0.2 | 4% |
| Tripadvisor 3* | 5.8 | 6.7 | 0.9 | 13% |
| Kayak | 6.1 | 6.7 | 0.6 | 8% |
| Zillow | 5.1 | 5.7 | 0.6 | 9% |
| OpenTable* | 6.5 | 6.7 | 0.2 | 3% |
| Hilton | 6.5 | 6.7 | 0.2 | 3% |

Table 3: Mean confidence scores in the TA and non-TA conditions. Studies with a * indicate statistical significance at p < .10. Sample sizes ranged from 12 to 24 per condition.

Table 3 shows that in nine of the ten studies, participants provided lower confidence ratings in the TA condition, of which only one was statistically significant (p < .10 for Tripadvisor 3). The one study in which TA participants provided higher confidence ratings was statistically significant (p < .10 for United 1).

We also conducted an ANOVA on the combined dataset using method (TA vs. non-TA) and study as factors. The results showed statistically significant main effects for both method and study (Method: F(1, 403) = 4.83, p = .029; Study: F(9, 403) = 9.08, p < .0001), but no statistically significant interaction (F(9, 403) = 1.55, p = .130). Overall, the TA condition had 4.4% lower confidence ratings than non-TA (5.9 versus 6.1).

Additional Analyses

Table 4 summarizes the overall differences across the three task-level metrics we analyzed. It shows a modest downward effect for TA of about 5%, which was statistically significant for SEQ and confidence.

| Metric | TA | Non-TA | % Lower TA |
|---|---|---|---|
| Task Comp. | 78% | 82% | 5% |
| SEQ | 5.2 | 5.5 | 6% |
| Confidence | 5.9 | 6.1 | 4% |
| Average % | | | 5% |

Table 4: Aggregated differences across task completions, SEQ, and confidence ratings.

While there was a modest overall effect, Tables 1–3 (and results of the SEQ ANOVA) show a more complicated interaction effect between website and conditions. Interaction effects can be hard to interpret. Why were the results different when we changed websites or tasks? We attempt to address that question next.

SEQ Interaction Effects

Although the data in these studies were independently collected from different participants across studies and conditions, we conducted the studies sequentially in 2022. We reviewed the results of one study before launching the next and often changed the task or website based on the findings, sometimes to replicate findings and other times to investigate how changing the experimental designs affected the results.

For example, in the original study on United Airlines (United 1), study participants rated the TA task as easier than the non-TA condition. So, a plausible explanation was that thinking aloud makes task completion “feel” easier. When we attempted to replicate the results a few months later, however, we found the opposite: United 2 and United 3 studies had lower SEQ scores in the TA than the non-TA conditions, despite sourcing participants from the same panel and giving them the same task as in the United 1 study.

The United 1 study was conducted between May 27th and June 3rd, 2022. The United 2 and United 3 studies were conducted later (on August 30th and between Sept 15th and Sept 16th, 2022, respectively). It could be that fluctuating factors on an airline website (fare prices, seat availability, and potentially even multiple people searching for the same flight) may have unexpectedly affected the difference in ratings between TA and non-TA conditions. It’s hard to tell. This issue sometimes happens when conducting UX research with real-world websites.

We also wondered whether task complexity (more cognitive demand) would interfere with a participant’s ability to think aloud, explaining the interaction effect. We hypothesized that perhaps with more complex tasks, thinking aloud would make the task experience harder, but with simpler tasks, it could make no difference or even make the task seem easier. In the Tripadvisor studies, we first used what we felt was a complex task similar to the United task (see Table 5 for the task description).

Across two Tripadvisor studies (Tripadvisor 1 and Tripadvisor 2), we found similar results: the TA condition was rated as harder. We then simplified the task by removing some parameters (see the task for Tripadvisor 3 in Table 5). However, in the Tripadvisor 3 study, the results showed an even lower SEQ score for the TA condition and about the same relatively high score for the non-TA condition. We just weren’t sure why. We didn’t know whether there was something unusual with Tripadvisor, so to check, we studied another travel site (Kayak), using the same less complex task as we did for the Tripadvisor 3 study. The Kayak results were directionally consistent with the Tripadvisor 3 results (the mean SEQ was lower for TA than for non-TA), but the magnitude of the difference was smaller (6% vs. 21%) and not statistically significant.

| Study | Task Complexity | Task Description |
|---|---|---|
| United (all) | More complex | You want to fly from Denver, CO to Los Angeles, CA with your significant other. Arrive in L.A. on September 8th and leave on September 12th [arrive on November 12th and leave on November 19th for study 3]. Select the earliest, nonstop flight on both of your travel days. Select the fare that allows you to choose your seat ahead of time. No checked bags. Choose seats next to your partner. |
| Tripadvisor 1 & 2 | More complex | Look for a hotel in Denver, CO from September 19th to the 22nd for you and your spouse that cost less than $200 per night. Make sure the hotel has a fitness center, conference rooms, free parking, and has more than 500 reviews. Of the hotels that meet these criteria, look through the hotel reviews and photos to determine which best suits your preferences. |
| Tripadvisor 3 | Less complex | Look for a hotel in Denver, CO from September 19th to the 22nd for you and your spouse that cost less than $200 per night and has at least a 4- out of 5-star review. |
| Kayak | Less complex | Look for a hotel in Denver, CO from September 19th to the 22nd for you and your spouse that cost less than $200 per night and has at least a 4- out of 5-star review. Stop the task after you have identified a hotel that meets these criteria. |

Table 5: Descriptions of key tasks with different levels of complexity.

Summary and Discussion

Our analysis of ten remote unmoderated usability studies with 423 participants (roughly half in a TA condition and half in a non-TA condition) revealed the following:

Thinking aloud tends to modestly depress post-task attitudinal metrics. Both post-task ease (the SEQ) and post-task confidence ratings were statistically significantly lower when aggregated across the studies. The differences, however, were not perfectly consistent (some studies showed opposite effects), and the effect was smaller (5% lower) than with task time (which was ~20% longer for TA).

Thinking aloud had a negligible effect on task completion rates. We found that participants in the TA condition had a slightly lower task completion rate (78%) compared to those in the non-TA condition (82%). A difference this small with a binary variable such as completion rate would require a sample size over a thousand to indicate statistical significance. That said, in the United 3 study, the completion rate for non-TA was significantly higher than for the TA condition (a difference of 24%, p < .10). That was balanced, however, by a 23% difference in the opposite direction in the Tripadvisor 2 study.

Longer times impact ratings. We have generally found medium to large correlations between post-task metrics of completion, time, and ease. In general, people who take longer to complete a task tend to rate tasks as more difficult (especially if they fail). As such, it’s not too surprising to see the lower TA post-task ratings, but it’s unclear whether the lower ratings are caused by the additional cognitive demands of attempting a task while thinking aloud, by the tasks simply taking longer, or by both.

Study context matters. We did not see a consistent effect across all studies. While most studies showed that ease, confidence, and task completion were lower for TA, in a few studies, TA had higher ratings.

There was a statistically significant interaction effect for post-task ratings of ease. This suggests additional variables are at play, and the effect of thinking aloud may be stronger or weaker depending on the website and task. Use caution when generalizing and applying findings from just one dataset.
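The remark above about needing a sample of over a thousand to detect the 78% versus 82% completion difference can be illustrated with a standard sample-size formula for comparing two independent proportions. This is a rough sketch, not the calculation behind the article's statement: the two-sided alpha of .05 and 80% power are assumptions chosen for illustration, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect p1 vs. p2.

    Uses the normal-approximation formula for two independent proportions:
    n = (z_alpha/2 + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    Assumptions: two-sided test, equal group sizes.
    """
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for the test
    z_beta = z.inv_cdf(power)              # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


n = n_per_group(0.78, 0.82)
print(f"about {n} participants per condition, {2 * n} in total")
```

Under these assumptions the estimate comes to well over a thousand participants per condition, far more than the 12 to 24 per condition in each study here. Relaxing alpha to .10 or accepting lower power reduces the number, but it stays far beyond the size of a typical usability study.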

Study Limitations

We repeated studies of tasks and websites. Although we had ten separate studies and unique participants in each study, several of the studies used the same website and the same or a similar task. We did this intentionally to see whether the results we obtained with one sample were repeatable and to get a start on understanding what variables might drive the interactions we found. However, this dependency between tasks and websites may make it harder to disentangle effects from thinking aloud.

Are tasks harder, or are the people different? From this analysis alone, we cannot definitively conclude either that the act of thinking aloud makes tasks seem harder or that people who complete TA studies tend to be different from those who don’t participate. For example, it could be that participants who agree to complete a TA study tend to be more deliberate, tech critical, tech savvy, or have other characteristics that make them different from participants who drop out or don’t participate in these studies. In our earlier analysis, we did see a strong and significant difference in dropout rates when we asked participants to think aloud. We’ll examine possible individual differences in future analyses.

There’s more to investigate on task-type interaction. Given our early findings, we suspected that the complexity (and, therefore, cognitive load) of the task may have impacted the post-task ratings. However, we found inconsistent results when we attempted to modulate the task complexity (e.g., using several criteria to filter on a flight or hotel). Future research is needed.

We conducted only remote unmoderated studies. We did this to fill a gap in the research literature, but we cannot necessarily generalize these findings to moderated studies, where there can be complex verbal interactions between moderators and participants.

Future Directions

We’re planning additional analyses of these data and future studies. In the future, we will look at the impact TA had on post-study metrics (e.g., SUPR-Q® and NPS), look for systematic differences in people who agree to participate in TA studies, and see whether TA uncovers more problems than non-TA.
