Why isn’t usability testing done more? And when it is done, why is the sample size small? One major reason is cost. It takes a lot of money and time to bring users into a lab and conduct a usability test. Even if users don’t get compensated for their time, it still takes a lot of time for a test facilitator to prepare for and attend each test. Each additional user tested takes more time, which is one reason sample sizes tend to be small. Is it possible to get the same data by having users test themselves?
The ability to quickly collect data from users by using the Internet to test websites or web applications is generally a good thing. It eliminates much of the time needed to attend each session. Research is still emerging on how usability data from both formative and summative testing differ between unattended and lab-based testing. See for example “Let your users do the testing.” I recently had the opportunity to examine the differences in data between a traditional attended lab-based test and a remote unmoderated test.
CUE-8
Along with Rolf Molich, Jurek Kirakowski and Tom Tullis, I helped organize and participated in the Comparative Usability Evaluation (CUE-8) workshop on quantitative usability data at the UPA 2009 conference. Fifteen international teams with varying levels of quantitative testing experience all tested the same five tasks on the budget.com rental car website.
About half the teams used lab-based testing and the other half used remote unmoderated testing. Many variables affected the outcome of the data. For example, some teams had users think aloud and “probed” user actions, and some international teams had issues with translation. The level of experience of the team also played a primary role (some teams had never conducted a quantitative usability test).
For that reason, it was difficult to make judgments about the similarity of unmoderated vs. moderated testing (since differences in results might be entirely due to differences in testing approaches or the aforementioned language issues). Based on the discussions, I assembled a report which provides answers to the most common quantitative usability questions.
In examining the data from the unmoderated teams, the first major problem is revealed: there is a lot of bad data. For example, some task times were impossibly short (1-5 seconds) or way too long (taking hours). Since the users in an unmoderated test aren’t directly observed, you have to decide which times to throw out and which times are legit. This introduces a fair amount of subjectivity.
Although you can collect task times automatically, you have to add questions at the end of each task to determine if a user legitimately completed the task. For example, you could ask what the location of the nearest rental car facility is to an address (something that has an objective right answer). Without these questions and a fair amount of diligence, you can easily get bad data (as many of the teams who had no experience conducting unmoderated tests did).
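The screening described above can be sketched in a few lines. This is only an illustration: the cutoff values, the `Attempt` record and the sample answers are hypothetical, and the right cutoffs depend on the task. The idea is to exclude implausible times first, then use the verification question to decide whether the remaining attempts count as completed or failed.

```python
from dataclasses import dataclass

# Hypothetical cutoffs for illustration only; the right values depend on the task.
MIN_SECONDS = 10
MAX_SECONDS = 30 * 60


@dataclass
class Attempt:
    user_id: str
    seconds: float  # task time recorded automatically
    answer: str     # response to the end-of-task verification question


def classify(attempt, correct_answer, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Label one attempt as 'excluded', 'completed' or 'failed'."""
    # Implausibly short or long times are treated as bad data and dropped.
    if not (min_s <= attempt.seconds <= max_s):
        return "excluded"
    # A plausible time only counts as a completion if the answer checks out.
    if attempt.answer.strip().lower() == correct_answer.strip().lower():
        return "completed"
    return "failed"


# Made-up attempts for a "find the nearest rental location" task.
attempts = [
    Attempt("u1", 3, "Main St"),     # impossibly fast
    Attempt("u2", 145, "Main St"),   # plausible time, correct answer
    Attempt("u3", 9000, "Main St"),  # hours long
    Attempt("u4", 160, "Airport"),   # plausible time, wrong answer
]
labels = {a.user_id: classify(a, "main st") for a in attempts}
print(labels)
```

Keeping the cutoffs as explicit parameters makes the subjective part of the scrubbing visible and easy to report alongside the results.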
I compared the completion rates, task times and SUS scores from my lab-based test of 12 users with those of another team that tested over 300 users using an unmoderated approach. While I cannot reveal the members of this team, I can say that they have a lot of experience conducting both lab and remote tests. In essence, I picked this team to minimize the unwanted effects of experience.
How do the data compare?
The figures below show comparisons of my metrics (Team G) and theirs (Team L). The large overlap in the confidence intervals by task shows a surprising amount of agreement given the different testing methods. It is even more surprising considering I tested only about 4% as many users as this team.
Table 1 below shows the completion rates and Figure 1 graphs the 95% Adjusted-Wald confidence intervals for the five tasks by team.
Figure 1: Comparison of 95% Confidence Intervals for Completion Rates for Team G and Team L. The large overlap in the confidence intervals shows similar results for completion rates despite using an unattended vs. lab based testing approach.
Tasks 1 and 5 differed the most (16% and 14%, respectively), while the other completion rates were within 5% of each other. Only task 1 was statistically different (p < .05). (Use the 2-proportion calculator available in the StatsPakage.) On average the completion rates were within 8% of each other.
Task | Team G | Team L | Difference |
1*   | 83.3%  | 96.8%  | 16%        |
2    | 91.7%  | 92.7%  | 1%         |
3    | 100%   | 97.7%  | 2%         |
4    | 83.3%  | 79.9%  | 5%         |
5    | 100%   | 86.3%  | 14%        |
Table 1: Completion Rates for Teams G and L. *Significantly different at p < .05.
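For readers who want to reproduce intervals like those in Figure 1, here is a minimal sketch of the adjusted-Wald interval (also known as the Agresti-Coull interval). The function name is my own, and the only data used is Team G's task 1 result, where 83.3% of 12 users corresponds to 10 completions.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_wald(successes, n, confidence=0.95):
    """Adjusted-Wald (Agresti-Coull) interval for a completion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2/2 successes and z^2 trials before applying the Wald formula.
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Team G, task 1: 10 of 12 users completed the task.
low, high = adjusted_wald(10, 12)
print(f"{low:.1%} to {high:.1%}")
```

With only 12 users the interval is wide, which is why so many of the small-sample intervals overlap the much narrower ones from the large unmoderated sample.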
Table 2 shows the average task times and Figure 2 shows the 95% confidence intervals using the log-transformed times by task for the two teams. Again there is a lot of overlap in the confidence intervals showing we arrived at similar times, although there is noticeably less agreement.
Figure 2: Comparison of 95% Confidence Intervals for Task Times for Team G and Team L. The large overlap in the confidence intervals shows similar results despite using a remote vs. lab based testing approach.
Task 1 differed by only 6%, but task 3 differed by 51%, and the average difference across tasks was 31%. Tasks 3 and 4 were statistically different (p < .05). (Use the 2-sample t calculator available in the StatsPakage.)
Task | Team G | Team L | Difference |
1    | 157    | 148    | 6%         |
2    | 114    | 150    | 32%        |
3*   | 53     | 80     | 51%        |
4*   | 122    | 158    | 30%        |
5    | 85     | 115    | 35%        |
Table 2: Average Task Times (in seconds) from log-transformed data for Teams G and L. *Significantly different at p < .05.
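Task times are typically positively skewed, so the averages and intervals here are computed on log-transformed times and then converted back to seconds. The sketch below shows that approach under stated assumptions: the task times are made up (not the CUE-8 data), the function name is my own, and the t critical value is passed in by hand rather than looked up by a statistics library.

```python
from math import exp, log, sqrt
from statistics import mean, stdev


def log_time_ci(times, t_crit):
    """Interval for the geometric mean task time, computed on log seconds."""
    logs = [log(t) for t in times]
    center = mean(logs)
    margin = t_crit * stdev(logs) / sqrt(len(logs))
    # Exponentiate to return to seconds: (lower bound, geometric mean, upper bound).
    return exp(center - margin), exp(center), exp(center + margin)


# Hypothetical task times in seconds for 12 users.
times = [48, 95, 110, 132, 140, 151, 163, 178, 190, 240, 305, 420]
# Two-sided 95% critical value of t with 11 degrees of freedom.
low, geo_mean, high = log_time_ci(times, t_crit=2.201)
print(f"{geo_mean:.0f} s ({low:.0f} to {high:.0f})")
```

The back-transformed average is the geometric mean, which is pulled less by a few very long times than the arithmetic mean is.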
Table 3 below shows the average SUS score and Figure 3 shows the 95% confidence intervals for the System Usability Scale (SUS) scores.
Figure 3: Comparison of 95% Confidence Intervals for SUS Scores for Team G and Team L. The large overlap in the confidence intervals shows similar results for SUS scores despite using a remote vs. lab based testing approach.
Metric | Team G | Team L | Difference |
SUS    | 79.6   | 78     | 2%         |
Table 3: Average SUS Score by Team.
The SUS scores collected after the tests were within 2% of each other, a surprisingly small and non-significant difference. This is also encouraging for SUS, which after all is called a quick and dirty usability scale. While SUS wasn’t designed for websites (they didn’t exist in 1986), two teams working independently with different users and testing approaches came to virtually the same SUS score. Maybe it’s a quick and not-so-dirty scale.
How much less time does it take?
As part of the CUE-8 workshop each team recorded how many hours they spent conducting the test. Here’s where the real payoff happens. I spent 40 hours testing, analyzing and reporting for my test while Team L spent 21 hours.
It’s not just that Team L took about half the time; it’s that in half the time they got data from about 300 more users than I did. With that many users they had really tight confidence intervals (more precise estimates). What’s more, Team L was roughly 50 times more efficient: it took me 3.33 hours to get data from one user, and it took them only about 4 minutes (21 hours across 314 users).
Team | Total Hours | Users | Hours/User |
G    | 40          | 12    | 3.33       |
L    | 21          | 314   | 0.07       |
Table 4: Hours spent by Team on testing.
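The per-user figures can be checked directly from the totals in Table 4. This short sketch uses only those totals and keeps full precision until the end, since rounding the hours per user before dividing shifts the efficiency ratio noticeably.

```python
# (total hours, users) for each team, taken from Table 4.
teams = {"G": (40, 12), "L": (21, 314)}

hours_per_user = {team: hours / users for team, (hours, users) in teams.items()}
minutes_per_user_l = hours_per_user["L"] * 60
# How many times more hours per user the lab test needed.
ratio = hours_per_user["G"] / hours_per_user["L"]

for team, value in hours_per_user.items():
    print(f"Team {team}: {value:.2f} hours per user")
print(f"Team L: {minutes_per_user_l:.1f} minutes per user")
print(f"Ratio G/L: {ratio:.0f}x")
```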
Conclusion
The agreement in the SUS data was very encouraging, and to a lesser extent the completion rate data also showed good agreement. While there was overlap in the task-time confidence intervals, an average difference of about 30% is more than I’m comfortable with, and it exposes the weak point of unmoderated testing. You really have to set some criteria to scrub task times and remove unrealistically fast or slow times. This is a topic which is covered in a forthcoming book on unmoderated testing. Even with this additional burden, it takes less effort to get a lot more data from users than in a traditional moderated test. To gain the real advantage of the method, you’d want to test dozens or hundreds of users in an unattended test.
So, is there a difference in data? There was to some extent but the method appears to show promise from this limited dataset. It shows promise considering none of the teams discussed their methods and data until after they were presented and each team recruited different users (relying mostly on volunteers).
I did cherry-pick one team out of the six that used an unmoderated approach, which makes this evidence far from conclusive. I’d be interested to see more research comparing unattended with attended testing, especially with task times. In the interim, unmoderated testing appears to provide a cost-effective alternative for gathering usability data.