Just how much does the process of measuring impact the metrics we collect?
In measuring perceived usability of five popular websites, I found that a single difficult task lowered post-test usability scores by 8%.
This was largely driven by users with the least experience with the website, whose scores dropped by almost 20%. A difficult task doesn’t appear to affect the most experienced users’ attitudes.
Measuring Affects the Metrics
Users develop perceptions of the ease of use of websites and products over time but their perceptions are often measured in the context of a usability test.
In a usability test, users are asked to attempt tasks in a lab-based setting or a remote unmoderated session over the Web. The tasks are selected for a number of reasons. Some are selected to simulate actual usage. Others are chosen to probe new features or a difficult function. Performance metrics like task completion rates, time, errors and task-level satisfaction are collected.
At the end of each session, users also typically respond to a post-session questionnaire like the System Usability Scale (SUS) which provides an overall impression of an application’s usability.
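For readers who have not scored a SUS before, here is a minimal sketch of the standard scoring rule: ten items answered on a 1 to 5 scale, with odd items contributing the response minus 1, even items contributing 5 minus the response, and the sum multiplied by 2.5 to give a 0 to 100 score. The function name `sus_score` and the input checks are illustrative choices, not part of the questionnaire itself.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for ten responses on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


# A respondent who answers 3 (neutral) on every item lands at the midpoint.
neutral = sus_score([3] * 10)
```

Because the result is scaled to 0 to 100 it is easy to mistake for a percentage, but it is a score, not a percent.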
Usability tests are of course an artificial situation—the tasks are made up, the data is generic and the users know they are being watched. In a usability test we want to strike a good balance between simulating actual usage and controlling variables so we have a reasonably stable picture of the usability given a set of tasks, users and functions.
How Much Do Tasks Affect the Metrics?
The tasks moderators select may be those that users attempt at home or on the job. But some tasks being measured might be infrequently done or happen to be rather difficult ones.
In fact, tasks might tend to be more difficult in usability tests simply because we usually measure usability to improve it and it’s hard to improve easy tasks.
It is accepted practice that when developers or a usability test team have had difficulty designing a feature, they include tasks to probe it to see if the design is effective. Those tasks are sometimes difficult for users but their experience is invaluable to improving the design of that feature.
If you change a task, it makes sense that completion rates and times will differ depending on the complexity of the task. But how much will these changing task-scenarios impact the users’ overall attitude about the website or product’s usability?
If the tasks are easy, will they cast a positive halo on the whole product and inflate scores? If the tasks are hard, will they change users’ perceptions by lowering scores and make the product appear more difficult?
The best way to measure the overall perceived usability of products and websites is to use one of the standardized usability questionnaires like SUS, SUMI and PSSUQ [pdf] because their validity and reliability have been established. These questionnaires provide more stable measures of usability than task-level metrics, but they are still context dependent. That is, they are affected by the users’ history with a product and the sample of tasks the team selected to test. John Brooke, creator of the SUS, cautioned in 1996
So how much does task difficulty in usability tests affect our perceptions of overall usability?
Task performance correlates modestly with perceptions of usability
In some earlier research [pdf] we found that post-test questionnaires such as SUS don’t correlate strongly with task-level metrics. On average the correlation between task-completion rates and post-test questionnaires is a modest .24.
That means changes in task completion rates explain only around 6% of the variation in SUS scores (the correlation squared, .24 × .24 ≈ .058). This does establish that the tasks matter as we expect, but with roughly 94% of the variation explained by other factors, the tasks are at best modestly impacting our impressions of overall usability.
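The arithmetic behind that statement is the coefficient of determination: squaring a correlation gives the share of variance in one measure that is associated with the other. A minimal sketch, where the function name `variance_explained` is just an illustrative label:

```python
def variance_explained(r):
    """Share of variance shared by two measures with correlation r."""
    return r ** 2


r = 0.24  # average correlation between completion rates and post-test scores
explained_pct = round(variance_explained(r) * 100, 1)
unexplained_pct = round(100 - explained_pct, 1)

print(f"explained: {explained_pct}%  unexplained: {unexplained_pct}%")
# prints "explained: 5.8%  unexplained: 94.2%"
```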
Research by Cavallin et al. (2007), using SUMI (instead of SUS) during a usability test of drafting software, found SUMI scores differed by around 15% when two samples of users attempted different tasks on the same release of a product. But this difference only affected less experienced users as there were no statistical differences in SUMI scores for expert users.
One complication with this study is that the two tests were conducted at different times (a year apart), in different physical locations, and potentially on different types of users. Even without task differences, we have found that just these variables alone can cause unexpected results (see Sauro & Lewis 2011 [pdf]).
What’s more, it is unclear how different the tasks were in complexity and difficulty. For example, if the first set of users conducted rather simple tasks lasting 2-3 minutes and the second group attempted more complicated tasks lasting 20-30 minutes, one would expect some of this more difficult experience to be reflected in measures of usability. Perhaps even one difficult task might make a difference.
Understanding how tasks affect SUS scores will involve several experiments. I suspect the major factors are the duration of the study and the complexity of the tasks. That is, I’d expect SUS scores to be lower after a user attempted many difficult tasks for a long period of time compared to users who attempted a few short tasks for a short period of time.
It may also depend on when the easy or difficult tasks are attempted. For example, even one easy task attempted at the end of a session might increase SUS scores, while a late difficult task might reduce them.
I first wanted to understand the effect of task-difficulty on SUS scores. SUS and other usability questionnaires are often given to users outside the context of a usability test (in isolation) to generate a usability benchmark. Users are typically recalling their experience with the product.
How much would these scores differ from SUS scores administered in a usability test?
Five Websites, 224 Users & Three Conditions
To find out I recruited existing users for five websites (Apple, Walmart, eBay, Craigslist, Amazon) and ended up with a sample of 224 users. These users were randomly assigned to one of three conditions:
- No Tasks: Users were asked to fill out the SUS without attempting any tasks.
- Easy Task: Users were asked to go to the website and attempt one easy task then fill out the SUS.
- Hard Task: Users were asked to go to the website and attempt one hard task then fill out the SUS.
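One common way to randomly assign participants to three conditions like these is to shuffle the participant list and deal it out in turn. The sketch below assumes that approach; the actual assignment procedure used in this study is not described, and the function name, the integer user IDs and the seed are all hypothetical.

```python
import random
from collections import Counter

CONDITIONS = ["No Tasks", "Easy Task", "Hard Task"]


def assign_conditions(user_ids, seed=None):
    """Shuffle users, then deal them into conditions round-robin."""
    rng = random.Random(seed)      # seeded so the assignment is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    return {
        user: CONDITIONS[i % len(CONDITIONS)]
        for i, user in enumerate(shuffled)
    }


# Hypothetical IDs 1..224, matching only the sample size reported here.
assignment = assign_conditions(range(1, 225), seed=42)
counts = Counter(assignment.values())
```

Dealing round-robin forces near-equal group sizes (75, 75 and 74 for 224 users). Simple independent random draws per user would also be valid but would give less even groups.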
For example, on the Walmart site, the easy task was:
And the difficult task was:
There were between 14 and 18 users for each website and condition, for a total of 224 users. There were no differences in average prior experience with the websites between conditions (F(2, 221) = .73, p = .485).
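The F statistic with 2 and 221 degrees of freedom comes from a one-way ANOVA across the three conditions: 3 groups give 3 − 1 = 2 degrees of freedom between groups, and 224 users give 224 − 3 = 221 within groups. A minimal pure-Python sketch of that calculation follows. The experience ratings in it are invented for illustration, and turning F into a p-value would need an F-distribution routine from a statistics package, which is omitted here.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical prior-experience ratings for three small groups.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

An F value near or below 1, like the .73 reported above, means the group averages differ no more than the spread within groups would predict.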
To verify that there was a difference in task difficulty, I asked users to answer the Single Ease Question (SEQ) after they attempted the task. The SEQ is a 7-point scale where lower scores indicate a harder task. The average rating for the hard tasks was a 4.48 (sd=1.87) and the average rating for the easy task was a 6.38 (sd =1.22) p
While the difference in average ratings was large (30% lower for the difficult tasks), I wanted to confirm these were meaningful differences in difficulty. I compared these means to a large database of 200 web-based tasks and 2000 users in which the SEQ was administered. I found that only 8% of all tasks were rated harder than a score of 4.48, whereas 77% of all tasks were rated harder than a score of 6.38.
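Both comparisons are simple to reproduce. The sketch below computes how much lower the hard-task mean is relative to the easy-task mean, then shows how a score can be placed against a benchmark of task means. The benchmark list is made up for illustration (the real database is not reproduced here), and both function names are hypothetical.

```python
def percent_lower(reference, value):
    """How much lower `value` is than `reference`, as a percentage."""
    return (reference - value) / reference * 100


def share_rated_harder(benchmark_means, score):
    """Percent of benchmark tasks with a lower (harder) mean SEQ than `score`."""
    harder = sum(1 for m in benchmark_means if m < score)
    return harder / len(benchmark_means) * 100


easy_mean, hard_mean = 6.38, 4.48   # SEQ means reported for this study
drop = round(percent_lower(easy_mean, hard_mean), 1)
print(drop)  # prints 29.8

# Invented benchmark of ten task means, NOT the real database.
benchmark = [3.9, 4.6, 5.0, 5.2, 5.5, 5.8, 6.0, 6.2, 6.5, 6.7]
hard_rank = share_rated_harder(benchmark, hard_mean)
easy_rank = share_rated_harder(benchmark, easy_mean)
```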
So there is a clear difference in the perceived difficulty between the hard and easy tasks used in this study.
On average I found that the difficult tasks reduced the SUS score by 8% compared to users who attempted no tasks. Easy tasks tended to increase the SUS scores slightly but the difference wasn’t statistically significant at this sample size.
Figure 1: Mean SUS scores for the three conditions and 95% confidence intervals (yellow-bars).
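Confidence intervals like those in the figure can be approximated from a list of scores. This is a minimal sketch using the normal approximation, with made-up SUS scores; it is not necessarily how the intervals in the figure were computed. With groups as small as those in a typical usability test, a t-based interval would be somewhat wider.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(scores, confidence=0.95):
    """Normal-approximation confidence interval around the mean of `scores`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    m = mean(scores)
    margin = z * stdev(scores) / len(scores) ** 0.5  # z times standard error
    return m - margin, m + margin


# Hypothetical SUS scores for one condition.
low, high = mean_confidence_interval([70, 80, 90])
```

When the intervals for two conditions overlap heavily, as with the easy-task and no-task conditions described above, the difference between them is unlikely to be statistically significant.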
In looking at the effects of experience, I found that across all websites the users with the most website usage (those who report visiting the site at least weekly) generated 13% higher SUS scores than those with the least experience (p <.01). This again confirms the effect experience has on perceptions of usability.
Building on the finding reported by Cavallin et al., I also found that the most experienced users of a website were not as affected by task difficulty. The graph below shows that there was no significant difference between the 3 conditions for the most experienced users (in red) but the least experienced users show results similar to the full sample (in blue).
Figure 2: Mean SUS scores for the three conditions by the user’s experience level and 95% confidence intervals (yellow-bars).
In this case, the hard task reduced SUS scores by almost 20% for the least experienced users of the website compared to those who attempted no tasks. If difficult tasks do lower SUS scores for experienced users the difference is small.
By using only one task I’ve isolated the effect of a single task’s difficulty on post-test measures. Most usability tests include more than a single task. It would be good to repeat this sort of experiment but manipulate the length of time users interact with the system and understand the interaction between study length and difficulty. It would also be valuable to know whether the placement of a hard or easy task makes a difference, for example whether it matters if the difficult task is first or last.
I selected a handful of large, well-known websites with millions of users each. Websites in this class are already refined and probably more usable than most. It would be good to see what happens for well-known but less usable websites. Perhaps the easy tasks would increase the scores more than we have seen in this experiment.
Here’s how I would summarize the findings:
- Usability test-tasks alone do not seem to impact SUS scores and therefore our measures of perceived usability. That finding provides testers with some confidence that they are not always changing perceptions of the product by testing it.
- Easy tasks appear to have little to no effect on SUS scores regardless of experience. There is some evidence they may actually increase the scores modestly and future research is needed to confirm this finding. It would be interesting to know if having an easy task late or early in the session matters.
- Difficult tasks will likely have the most impact on SUS scores, on average lowering the scores by 8% compared to SUS scores taken in isolation.
- Most of the reduction in SUS scores can be attributed to less experienced users. This group on average had almost 20% lower SUS scores after attempting difficult tasks. This finding may indicate that the SUS scores of relatively novice users can be more easily changed by test conditions.
- Experienced users don’t appear to be substantially affected by even difficult tasks. This finding is consistent with what was reported by Cavallin et al 2007. This finding is also reassuring for testers because it shows that the bulk of users’ experience forms their perceptions and they are not changed much by a single event.
The “correct” measure of usability depends largely on what sort of statement you’re making about the product. If it is related to specific features or functional areas of a product, it probably makes the most sense to use data from usability tests. Keep in mind that the tasks are only modestly affecting the post-study questionnaire scores.
SUS scores taken in isolation probably more accurately reflect users’ current attitude about a product or website. While the attitudes might be vague or come from infrequent use, they are probably more reflective of how the perception of usability will generate positive or negative word-of-mouth.
This experiment suggests users with less experience have more amorphous attitudes about usability and are more susceptible to influence from tasks encountered in a usability test.
As with most measures, it’s the comparison over time or to a competitor that matters most. So it’s safe to compare scores across usability tests or those taken in isolation. It’s probably also OK to compare scores from usability tests and those taken in isolation for experienced users. If your sample contains a substantial number of less experienced users and difficult tasks, the comparison may be unfair as difficult tasks are likely artificially lowering actual usability attitudes. It may make sense to qualify metrics from usability tests by noting that difficult tasks may lower scores by 8-10%.