One of the most popular UX research methods is Think Aloud (TA) usability testing.
Having participants speak their thoughts while working on tasks helps researchers identify usability problems and potential fixes.
While the method has been used for decades with an attending moderator, technological advances over the past 15 years have allowed participants to think aloud remotely without a moderator by using software (for example, our MUIQ® platform). This has likely resulted in a significant increase in the number of participants speaking their thoughts, but little is known about unmoderated verbalization behaviors (e.g., how many utterances fall into the Level 1, 2, or 3 categories identified by Ericsson & Simon, 1980, which differ in the extent to which the utterances require additional cognitive processing).
Despite the proliferation of TA, there’s been little investigation of its impact on UX metrics such as time (and, as far as we know, no investigation of its impact in remote unmoderated usability testing). In an earlier article, we reviewed five published studies that compared silent usability testing with moderated TA, finding mixed results. The sample sizes were small, but the results generally indicated that moderated TA task completion time was not different from silent task completion time. A clear takeaway from the literature review is that more data is needed.
In this article, we report findings from 10 experiments we conducted to better understand the effect of TA on task time in unmoderated remote usability studies.
Ten New Studies on Think Aloud Impacts on Task Time
Between January and September 2022, we conducted ten studies with participants from an online U.S.-based panel using the MUIQ platform. In each study, participants were randomly assigned to a TA or non-TA condition. In both conditions, participants were asked to attempt one task on a website (e.g., United.com, Zillow.com). The websites were in the travel, real estate, and restaurant industries.
We have covered details about several of these studies (including descriptions of several of the tasks) in our earlier article on TA-related drop-out rates.
Each of these ten studies had between 40 and 60 unique participants. During our screening and recruiting process, we ensured that participants couldn’t take a study more than once. Task time was automatically collected in MUIQ, starting when participants clicked MUIQ’s Start Task button and ending when they clicked the End Task button. As these studies were unmoderated, no attending moderator was available to answer questions, probe, or provide any cues for the participant. Participants in the TA condition were shown a short video demonstrating how to think aloud before beginning the task.
Successful task completion was determined by using post-task validation questions (e.g., “How much did the hotel cost per night?”). Post-task ease (SEQ®), post-task confidence, SUPR-Q®, and NPS were also collected in all studies (we’ll cover the results in future articles).
We completed three steps to prepare the raw data for the task time analysis:
- Retained only participants with videos and audio. For either condition, we excluded participants who did not share their screen; for the TA condition, we also excluded those who did not share their audio.
- Reviewed videos. Of the participants who completed the task and shared their screen (and audio for TA), we reviewed the videos to ensure participants were making a concerted effort to complete their assigned task. Few were excluded in this step.
- Confirmed TA. For participants in the TA condition, we retained the participants who actually spoke (at least a minimal amount) as they attempted the task.
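The three data-preparation steps above can be sketched as a simple filter. This is a minimal illustration: the record fields and values are hypothetical, not the actual MUIQ export schema.

```python
# Hypothetical participant records; field names are illustrative,
# not the actual MUIQ export schema.
participants = [
    {"id": 1, "condition": "TA", "shared_screen": True, "shared_audio": True,
     "genuine_attempt": True, "spoke_during_task": True},
    {"id": 2, "condition": "TA", "shared_screen": True, "shared_audio": True,
     "genuine_attempt": True, "spoke_during_task": False},  # silent in TA: excluded
    {"id": 3, "condition": "non-TA", "shared_screen": False, "shared_audio": False,
     "genuine_attempt": True, "spoke_during_task": False},  # no screen share: excluded
]

def passes_quality_checks(p):
    # Step 1: screen sharing required for everyone; audio also required for TA
    if not p["shared_screen"]:
        return False
    if p["condition"] == "TA" and not p["shared_audio"]:
        return False
    # Step 2: video review must show a concerted effort on the task
    if not p["genuine_attempt"]:
        return False
    # Step 3: TA participants must have actually spoken during the task
    if p["condition"] == "TA" and not p["spoke_during_task"]:
        return False
    return True

retained = [p for p in participants if passes_quality_checks(p)]
print([p["id"] for p in retained])  # → [1]
```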
After completing these three data quality steps across the ten studies, we had a final sample size of 423 participants; 221 in the TA condition and 202 in the non-TA condition. This is by far the largest dataset collected to date that is suitable for comparing TA with non-TA for remote unmoderated studies.
Analysis of All Task Completion Times
We applied a log transformation to the raw times to account for the expected positive skew in task time data and performed statistical calculations on the log times (i.e., analysis of geometric means), analyzing all times regardless of task success. The geometric means for both conditions are shown in Table 1.
| Study | Average Time TA | Average Time Non-TA | TA Longer | % Longer for TA |
|-------|-----------------|---------------------|-----------|-----------------|

*Table 1: Geometric mean task times (all attempts) by study.*
Table 1 shows that in nine of the ten studies, the TA condition had a longer average task time, three of which were statistically significant (p < .10). Averaged across the studies, completion times were about 17% longer in the TA condition. The largest difference (TA 39% longer) was in the Tripadvisor 2 study.
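The log-scale comparison behind these geometric means can be sketched with SciPy. This is a minimal sketch with made-up task times; the function name and data are ours, not the study data.

```python
import numpy as np
from scipy import stats

def compare_log_times(times_ta, times_non_ta):
    """Compare two groups of task times on the log scale.

    Returns the geometric means, the percent difference, and the
    p-value of a t-test performed on the log-transformed times.
    """
    log_ta = np.log(times_ta)
    log_non = np.log(times_non_ta)
    gm_ta = float(np.exp(log_ta.mean()))
    gm_non = float(np.exp(log_non.mean()))
    pct_longer = (gm_ta / gm_non - 1) * 100
    t, p = stats.ttest_ind(log_ta, log_non)
    return gm_ta, gm_non, pct_longer, p

# Made-up task times (seconds) for one study
ta = [120, 240, 480, 300, 180]
non_ta = [100, 200, 400, 250, 150]
gm_ta, gm_non, pct, p = compare_log_times(ta, non_ta)
print(f"TA geometric mean {gm_ta:.0f}s vs non-TA {gm_non:.0f}s ({pct:.0f}% longer)")
```

Because each TA time here is exactly 1.2× its non-TA counterpart, the ratio of geometric means comes out to precisely 20% longer, illustrating how the percent differences in Table 1 are computed.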
We also conducted an ANOVA on the combined dataset using method (TA vs. non-TA) and study as factors. The results showed statistically significant main effects for both method and study (Method: F(1, 403) = 8.6, p = .004; Study: F(9, 403) = 20.2, p < .0001), but no statistically significant interaction (F(9, 403) = 0.8, p = .623). The TA condition took 17% longer than non-TA (312 versus 267 seconds).
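This two-factor ANOVA on log times can be sketched with statsmodels. The data below are simulated to mirror the reported pattern (roughly a 17% TA effect on the log scale, a study effect, and no interaction); they are not the study data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulate log task times: 10 studies x 2 methods, with a TA effect of
# about 0.16 on the log scale (roughly 17% longer) and no interaction.
rng = np.random.default_rng(42)
rows = []
for study in range(10):
    for method in ("TA", "non-TA"):
        mu = 5.5 + 0.1 * study + (0.16 if method == "TA" else 0.0)
        for log_time in rng.normal(mu, 0.4, size=30):
            rows.append({"study": f"s{study}", "method": method,
                         "log_time": log_time})
df = pd.DataFrame(rows)

# Two-factor ANOVA on log times: method, study, and their interaction
model = ols("log_time ~ C(method) + C(study) + C(method):C(study)",
            data=df).fit()
table = anova_lm(model, typ=2)
print(table)
```

With simulated data like these, the method and study rows show significant main effects while the interaction row does not, matching the pattern reported above.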
Time for Successful Task Completion Only
As is common with an analysis of task time in UX research, we also examined time differences for the subset of participants who successfully completed the task. For all ten studies, task-success criteria were predetermined (e.g., the correct name of a hotel, the price of a flight). The bar for success was set relatively low for these studies, as it wasn’t the main focus of our analysis. Consequently, the success rate for most tasks was on the higher side (averaging 80%, ranging between 50% and 100%).
Using only successful times reduced the number of times available for analysis by about 25% (from 423 to 317). Table 2 shows the average times for TA and non-TA for the successful task attempts. The pattern is similar to the times for all attempts.
| Study | Average Time TA | Average Time Non-TA | TA Longer | % Longer for TA |
|-------|-----------------|---------------------|-----------|-----------------|

*Table 2: Geometric mean task times (successful attempts only) by study.*
We also conducted an ANOVA on completion times for successful attempts with method (TA vs. non-TA) and study as factors. The results showed statistically significant main effects for both method and study (Method: F(1, 297) = 8.9, p = .003; Study: F(9, 297) = 18.4, p < .0001), but no statistically significant interaction (F(9, 297) = 0.56, p = .832). The TA condition took 20% longer than non-TA (337 versus 279 seconds).
Variability of Times for Successfully Completed Tasks
Next, we examined whether completion times were more variable when thinking aloud. To do so, we compared the variances of the raw (not log-transformed) times between the conditions for the ten studies using an F-test. Consistent with having slightly higher mean completion times, the TA conditions generally had more variable completion times (standard deviations about 27% higher than the non-TA condition). At the individual study level, there was only one statistically significant difference (Tripadvisor 3; see Table 3 and the dot plot in Figure 1), but a sign test across the ten studies (eight of which had higher variability in the TA condition) was statistically significant (mid-p = .03).
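The variance comparison and the overall sign test can be sketched with SciPy. The F-test helper is our own (the sample data would be the raw times per study); the mid-p calculation reproduces the reported .03 for eight of ten studies.

```python
import numpy as np
from scipy import stats

def var_ratio_f_test(x, y):
    """Two-sided F-test comparing the variances of two samples.

    Assumes approximate normality; the F-test on variances is
    sensitive to departures from it.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    df_x, df_y = len(x) - 1, len(y) - 1
    p_one = stats.f.sf(f, df_x, df_y) if f > 1 else stats.f.cdf(f, df_x, df_y)
    return f, min(2 * p_one, 1.0)

# Overall sign test: 8 of 10 studies had higher variance in the TA condition.
# The mid-p adjustment counts only half the probability of the observed count.
k, n = 8, 10
mid_p = stats.binom.sf(k, n, 0.5) + 0.5 * stats.binom.pmf(k, n, 0.5)
print(f"mid-p = {mid_p:.3f}")  # about .03, matching the reported result
```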
| Study | TA # Completed | TA SD | Non-TA # Completed | Non-TA SD | Avg Diff |
|-------|----------------|-------|--------------------|-----------|----------|

*Table 3: Standard deviations of raw completion times for successful attempts by study.*
Summary and Discussion
Here are the key findings, study limitations, and future directions for an analysis of ten remote unmoderated usability studies with 423 participants (roughly half in a TA condition and half in a non-TA condition).
Thinking aloud seems to increase total task time. Using total task duration (regardless of task success), we found in nine of the ten studies that participants took longer to complete tasks when thinking aloud. The differences were statistically significant in only three studies, but the ANOVA indicated a strong and significant main effect of method. On average, participants in the TA condition took about 17% longer.
Thinking aloud also seems to increase successful task completion time. We found the same pattern when looking only at successful task attempts (e.g., significant main effect of method). After removing the roughly 25% of attempts that didn’t meet the success criteria, participants in the TA condition took about 20% longer to complete the task.
TA makes times more variable (but not a lot). Longer mean times are usually associated with larger standard deviations, and that’s the case here. The TA condition had more variability in the times (as measured by the standard deviation). The higher variability is a result of a few participants taking a lot longer (which pulls up the mean and increases the standard deviation). At the individual study level, only one of the ten differences in variances was statistically significant, but a sign test of the overall trend (eight of ten had higher variance for TA) was statistically significant.
One dataset may not reveal patterns. The benefit of using a large sample size across several studies is that we were able to see that the typical pattern of TA taking longer didn’t always happen, and in most cases, the difference for individual studies wasn’t statistically significant. This illustrates the importance of using caution when generalizing and applying findings from just one dataset.
We repeated studies of tasks and websites. Although we had ten separate studies and unique participants in each study, several of the studies used the same website and the same or a similar task. We did this intentionally to see whether the results we obtained with one sample were repeatable. The results suggest that for task time, the TA effect appears most, but not all, of the time, and the factors that moderate it need to be better understood.
Is TA the root cause of longer times? From this analysis alone, we cannot definitively conclude whether the act of thinking aloud increases task times or whether people who complete TA studies tend to be different from those who don't participate. For example, it could be that participants who agree to complete a TA study tend to be more deliberate, tech-savvy, or have another characteristic that makes them different from participants who drop out or don't participate in these studies. In our earlier analysis, we did see a strong and significant difference in dropout rates when participants were asked to think aloud. We'll examine possible differences in people in future analyses.
Task-type interaction was not examined. This analysis didn’t examine the possible effect of the type of task on the total time. We wondered whether some tasks that require more cognitive demand (e.g., using several criteria to filter on a flight or hotel) would impact metrics differently than those with less cognitive demand (e.g., finding any hotel in Denver for a specified date and price).
We only conducted remote unmoderated studies. We did this to fill an obvious gap in the research literature, but we cannot necessarily generalize these findings to moderated studies, where there can be complex verbal interactions between moderators and participants.
We’re planning many additional analyses of these data. In the future, we will look at task success, examine the reasons for longer times, and examine the effects of TA on other metrics.