Moderators have to bring users to a dedicated location, test each one (usually one at a time), and only then get results from only a handful of users.
Here are 10 things to know about this essential usability testing method that’s reducing the cost and improving the frequency of usability testing.
- It’s growing: According to the latest User Experience Professionals Association Survey in 2011, around 23% of respondents report using unmoderated testing (compared to 52% using lab-based testing). That is relative growth of about 28% since 2009, when 18% of respondents used it. The method wasn’t even listed as an option in 2007!
- Recruiting is a lot easier: Jakob Nielsen calls recruiting “unglamorous” and Steve Krug says he’s not very fond of it in “Rocket Surgery Made Easy.” Finding qualified participants is hard but necessary. Fortunately, for unmoderated tests, it’s a bit easier to find both more users and more specialized users through a variety of approaches. You can use panel companies like OP4G and Toluna, which are able to recruit and send users to your study, or you can pull users right off of a website. When we use intercepts to recruit off of websites we typically see a much higher attrition rate than when we use panel companies. In general, we like to include both types in our unmoderated studies, as together they provide a good mix of data from current and prospective users.
- Survey + Usability Study: We often start a project and have vague business questions to work with: Do customers understand our unique selling points? What do we change in our checkout form? Is our new homepage design better? We operationalize these questions into testable hypotheses, and use a mix of tasks and traditional survey questions. This allows us to examine both attitudes and action. Sometimes, it’s the percentage of users who click on a navigation element. Other times, it’s the answer to a question about how few users understood a concept that becomes most insightful.
- Metrics Fiesta: In unmoderated studies it’s easy and usually fairly automatic to collect User Experience metrics like completion rates, task time, task-difficulty, overall perceptions of usability, the Net Promoter Score, and task-level confidence. Products like MUIQ will even collect every click and click path to generate compelling heat maps and paths to help understand where users are going.
- With video it’s almost like the lab: In almost every unmoderated study we have a subset of users whom we record on video thinking aloud while they complete the tasks and go through the same study as the larger sample of users. We use Usertesting.com, which has a large panel of participants from the U.S., Canada and the UK. We are also able to recruit on more specific criteria like having a department store credit card, having researched or purchased a laptop online, or having purchased an item at Target.com in the last six months. The videos explain the numbers: we’ll see a low task completion rate and wonder what’s causing it, and it usually takes just a couple of videos to see why users are struggling. Sometimes it’s the complexity of the task, terminology problems, navigation issues or even a poorly placed pop-up.
- Setup takes about as long as lab studies: Setting up the study, and carefully designing and pre-testing tasks and questions, takes about as long as it does for moderated testing. There’s a fixed cost associated with any usability test: metrics, tasks, user profiles and research questions. Total time can still be lower: in the Comparative Usability Evaluation 9, the average time spent on unmoderated sessions was 37 hours versus 60 hours for moderated testing (a little more than half the time). The real benefit comes from the time invested per user.
- It’s more efficient than lab-based studies: The logistics involved in having people come to a physical lab in one or even a few locations aren’t trivial. It usually takes weeks to recruit and facilitate, and it takes at least one person’s full time during the study (hard to multitask in a lab!). While the average time spent on unmoderated sessions was 37 hours versus 60 hours for moderated studies, the payoff was greatest for teams that tested more users. For example, in the Comparative Usability Evaluation 9, Teams G and L both had similar tasks, data collection and methods. Team G tested 12 users in a lab and Team L tested 314 unmoderated users. The study found that it took over three hours per participant for the lab-based study but only 3.5 minutes for the unmoderated study. In just over half of the overall testing time, the unmoderated test collected similar data on 26 times as many users (see the table below)!
| Team | Hours | Users | Hours/User |
|---|---|---|---|
| G (Lab Based) | 40 | 12 | 3.33 |
| L (Unmoderated) | 21 | 314 | .06 |
Table 1: Hours spent on testing, users tested and hours per user for teams G and L (with type of testing) from CUE-9.
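The per-user arithmetic behind Table 1 can be sketched in a few lines. This is only an illustration: the hours and user counts come from the table, while the dictionary layout and function name are made up for the example. Because the table's hour totals appear to be rounded to whole hours, the result for the unmoderated team comes out near four minutes per user rather than exactly matching the reported 3.5 minutes.

```python
# Hours and user counts as listed in Table 1 (CUE-9).
# The structure and names below are illustrative, not from the study.
teams = {
    "G (lab-based)": {"hours": 40, "users": 12},
    "L (unmoderated)": {"hours": 21, "users": 314},
}


def hours_per_user(hours, users):
    """Total study hours divided by the number of users tested."""
    return hours / users


for name, t in teams.items():
    per_user = hours_per_user(t["hours"], t["users"])
    print(f"{name}: {per_user:.2f} hours/user ({per_user * 60:.0f} minutes)")

# How many times as many users, and what share of the lab team's hours.
user_ratio = teams["L (unmoderated)"]["users"] / teams["G (lab-based)"]["users"]
time_ratio = teams["L (unmoderated)"]["hours"] / teams["G (lab-based)"]["hours"]
print(f"{user_ratio:.0f}x the users in {time_ratio:.0%} of the hours")
```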
- Mostly Comparable to Lab Data: While there isn’t a lot of data comparing the results that come from the different methods, we found that measures of overall ease (using the System Usability Scale), task completion, and task-level difficulty were reasonably similar. Task time, however, was found to differ by a substantial 30%. This raises the question: Which task time is the “correct” time? Is the one in the artificial lab environment, with people watching from behind a one-way mirror, more accurate, or the one where users are on their own computer and might get “interrupted” by Facebook, Twitter or the toilet? In short, both are probably wrong, but when making comparisons in task time, sticking with the same method ensures a fair comparison. Synchronous (and face-to-face) interactions do, of course, allow you to follow up and engage in a dialogue with users, so unmoderated testing will never be a full replacement for moderated testing.
- You need a way to verify task completion: In a typical moderated usability study, the facilitator can determine whether a user has successfully completed the task. Because no one is watching the user in an unmoderated study, you need some way to determine success. This is done by using a validating question or a validating URL.
- Validation by Question: If users are asked to look for a specific product, you can ask for the price, model number or some other piece of information that can only be found if the task was completed successfully. For example, if the task is to look for the fair market value of a 2010 Honda Accord in a specific zip code, you can provide a few plausible values at the end of the task and have users select the correct one. We always include an “other” option because, despite detailed planning, there always seem to be exceptions or product variations we never counted on. The “other” responses allow us to go back and give credit where it’s due.
- Validation by URL: If there is a specific page on a website that a user can only get to if they’ve located the correct item or piece of information, you can use the software to check the final URL(s). For example, in a recent test of findability, we knew there were only three pages that contained the correct piece of information so we were able to verify whether users completed the task.
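The two validation approaches above can be sketched as follows. This is a minimal illustration, not how any particular testing product works: the function names, the URL normalization rule (host plus path, ignoring query strings and trailing slashes), and the example.com addresses are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so query strings and trailing
    slashes do not affect the comparison (an assumed rule)."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def completed_by_url(final_url, success_urls):
    """Validation by URL: success if the user ended on a known-correct page."""
    targets = {normalize(u) for u in success_urls}
    return normalize(final_url) in targets


def completed_by_question(answer, correct_answer, credited_other=()):
    """Validation by question: success if the user picked the correct value,
    or gave an "other" response a researcher later reviewed and credited."""
    cleaned = answer.strip().lower()
    if cleaned == correct_answer.strip().lower():
        return True
    return cleaned in {o.strip().lower() for o in credited_other}


# Hypothetical pages that contain the correct piece of information.
success_pages = [
    "https://example.com/products/123",
    "https://example.com/support/specs",
    "https://example.com/compare/123",
]

print(completed_by_url("https://Example.com/products/123/?ref=home", success_pages))
print(completed_by_url("https://example.com/search?q=123", success_pages))
print(completed_by_question(" $14,500 ", "$14,500"))
```

Keeping the credited "other" answers as an explicit list mirrors the manual review step: someone still has to read those responses and decide which ones deserve credit.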
- Statistical Precision: It is a common misconception that you need a large sample size to use statistics. With smaller sample sizes, however, you are limited to detecting only large differences between designs, and your confidence intervals will be rather wide. With a larger sample size you are able to detect smaller differences between designs. This can be especially important when you’re designing a new homepage or improving your navigation, where differences of 5%-15% can be meaningful. Because it’s easier to recruit and faster to test more users with unmoderated testing, you are able to detect smaller differences and have more precise metrics. For example, the table below shows that the typical margin of error around your metrics for a sample size of 20 will be approximately 18%, compared to approximately 6% for a sample size of 200. To compare, say, completion rates between two designs, the difference would have to be at least 60 percentage points if you tested 20 users (10 in each group). For a sample size of 200 (100 in each group), differences as small as 17 percentage points would be statistically significant. That is, if one group had a completion rate of 50% and the other 67% at a sample size of 200, a difference this large or larger would be unlikely to be explained by chance alone.
| Sample Size | Typical Margin of Error | Smallest Difference to Detect (90% Confidence & 80% Power) |
|---|---|---|
| 20 | +/- 18% | 60 percentage points (e.g. 20% vs. 80%) |
| 200 | +/- 6% | 17 percentage points (e.g. 50% vs. 67%) |
Table 2: Typical margin of error for two sample sizes (20 and 200) and smallest difference to detect when comparing designs.
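One rough way to see where numbers like those in Table 2 come from is the textbook normal approximation for proportions. The sketch below assumes a 50% completion rate (the widest case) and a two-sided 90% confidence level. The article does not say how the table was calculated, so treat this as an approximation: it gives the margins of error of about 18% and 6%, and detectable differences in the same range as the table's 60 and 17 points, but not necessarily the exact figures. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, confidence=0.90, p=0.5):
    """Half-width of a simple (Wald) confidence interval for a proportion.
    Assumes p = 0.5, which gives the widest interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


def smallest_detectable_difference(n_per_group, confidence=0.90, power=0.80, p=0.5):
    """Approximate smallest difference between two proportions that a
    two-group comparison can detect, using the normal approximation
    with both proportions assumed to be near p."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n_per_group)


for total in (20, 200):
    moe = margin_of_error(total)
    diff = smallest_detectable_difference(total // 2)
    print(f"n={total}: margin of error +/- {moe:.1%}, "
          f"detectable difference about {diff:.1%}")
```

Because both quantities shrink with the square root of the sample size, going from 20 to 200 users cuts them by a factor of roughly three, not ten.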