The System Usability Scale has been around for decades and is used by hundreds of organizations globally.
The 10-item SUS questionnaire is a measure of a user’s perception of the usability of a “system.”
A system can be just about anything a human interacts with: software apps (business and consumer), hardware, mobile devices, mobile apps, websites, or voice user interfaces.
The SUS questionnaire is scored by combining the 10 items into a single SUS score ranging from 0 to 100. From its creation, though, John Brooke cautioned against interpreting individual items:
“Note that scores for individual items are not meaningful on their own.” (John Brooke)
Brooke’s caution against examining scores for the individual items of the SUS was appropriate at the time. After all, he was publishing a “quick and dirty” questionnaire with analyses based on data from 20 people.
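As a minimal sketch of how the 10 items combine into the single 0 to 100 score mentioned above: this article doesn't spell out the arithmetic, so the code below assumes Brooke's standard scoring procedure. Odd-numbered (positively worded) items contribute the rating minus 1, even-numbered (negatively worded) items contribute 5 minus the rating, and the sum is multiplied by 2.5. The function name `sus_score` is only illustrative.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one respondent.

    `responses` holds the ten ratings (1-5) in item order, Item 1 first.
    Assumes the standard SUS scoring procedure.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd items are positively worded: higher agreement is better.
            total += rating - 1
        else:
            # Even items are negatively worded: lower agreement is better.
            total += 5 - rating
    # Each item contributes 0-4 points, so the sum is 0-40; scale to 0-100.
    return total * 2.5


print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```

A respondent who answers 3 to every item scores 50, which shows that the midpoint of the response scale isn't the same thing as an average SUS score.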
Conventional wisdom holds that multiple items are superior to single items; in fact, single-item measures and analyses are often dismissed in peer-reviewed journals.
Adding items will, by definition, increase the internal consistency reliability of a questionnaire when measured using Cronbach’s alpha. In fact, you can’t measure internal consistency reliability with only one item. However, there are other ways to measure reliability, including test-retest reliability. Single-item measures, such as satisfaction, brand attitude, task ease, and likelihood to recommend, also exhibit sufficient test-retest reliability, and little if anything may be gained by using multiple items.
John Brooke didn’t publish any benchmarks or guidance for what makes a “good” SUS score. But because the SUS has been used extensively by other researchers who have published the results, we have been able to derive a database of scores. Table 1 shows SUS grades and percentiles that Jim Lewis and I put together from that database, which itself is an adaptation of work from Bangor and Kortum.
|Grade||SUS Score||Percentile Range|
|B||74.1 – 77.1||70 – 79|
To use the table, find your raw SUS score in the middle column and then find its corresponding grade in the left column and percentile rank in the right column. For example, a SUS score of 75 is a bit above the global average of 68 and nets a “B” grade. A SUS score below 50 earns an “F” grade, with a percentile rank among the worst interfaces (worse than 86% of them, or better than only 14%).
Why Develop Item-Level Benchmarks?
While the SUS provides an overall measure of perceived ease and our grading scale provides a way to interpret the raw score, researchers may want to measure and set targets for other more specific experience attributes (e.g. perceptions of findability, complexity, consistency, and confidence). To do so, researchers would need to develop specific items to measure those more specific attributes.
Some attributes, such as findability, do not appear in the 10 SUS items. Other attributes, such as perceived complexity (Item 2), perceived ease of use (Item 3), perceived consistency (Item 6), perceived learnability (Item 7), and confidence in use (Item 9) do appear in the SUS.
Researchers who use the SUS and who also need to assess any of these specific attributes would need to decide whether to ask participants in their studies to rate this attribute twice (once in the SUS and again using a separate item) or to use the response to the SUS item in two ways (contributing to the overall SUS score and as a measure of the specific attribute of interest). The latter, using the response to the SUS item in two ways, is the more efficient approach.
In short, using item benchmarks saves respondents time because they answer fewer items, and it saves researchers time because they don’t have to derive new items. Researchers also get the bonus of benchmarks that make the responses more meaningful.
Developing SUS Item-Level Benchmarks
To make individual SUS items easier to interpret, Jim Lewis and I compiled data from 166 unpublished industrial usability studies and surveys, comprising 11,855 individual SUS questionnaires.
We then used regression equations to predict overall SUS scores from the individual items. We found each item explained between 35% and 89% of the variation in the full SUS score (a large percentage for a single item). Full details of the regression equations and process are available in the Journal of Usability Studies article.
To make item benchmarks easy to reference, we computed the score you’d need for an average “C” score of 68 or a good score of 80, an “A-.” Why 80? We’ve found that a SUS of 80 has become a common industrial goal. It’s also a good psychological threshold that’s attainable. Achieving a raw SUS score of 90 sounds better but is extraordinarily difficult (only one study in the database exceeded 90, with data from Netflix).
Table 2 shows the mean score you would need for each item to achieve an average “C” or good “A-” score.
|SUS Item||Target for Average Score||Target for Good Score|
|1. I think that I would like to use this system frequently.||≥ 3.39||≥ 3.80|
|2. I found the system unnecessarily complex.||≤ 2.44||≤ 1.85|
|3. I thought the system was easy to use.||≥ 3.67||≥ 4.24|
|4. I think that I would need the support of a technical person to be able to use this system.||≤ 1.85||≤ 1.51|
|5. I found the various functions in this system were well integrated.||≥ 3.55||≥ 3.96|
|6. I thought there was too much inconsistency in this system.||≤ 2.20||≤ 1.77|
|7. I would imagine that most people would learn to use this system very quickly.||≥ 3.71||≥ 4.19|
|8. I found the system very cumbersome to use.||≤ 2.25||≤ 1.66|
|9. I felt very confident using the system.||≥ 3.72||≥ 4.25|
|10. I needed to learn a lot of things before I could get going with this system.||≤ 2.09||≤ 1.64|
For example, if you’re using Item 3, “I thought the system was easy to use,” then a mean score of 3.67 would correspond to a SUS score of 68 (an average overall system score). For an above average SUS score of 80, the corresponding target for Item 3 would be a mean score of at least 4.24.
Note that due to the mixed tone of the SUS, the directionality of the item targets is different for odd- and even-numbered items. Specifically, for odd-numbered items, observed means need to be at or above the targets; for even-numbered items, observed means need to be at or below the targets. For example, for Item 2, “I found the system unnecessarily complex,” you would want to have a mean at or below 2.44 to achieve an average score (SUS equivalent of 68) and at or below 1.85 for a good score (SUS equivalent of 80).
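To make the directionality rule concrete, here is a minimal sketch of a lookup against the Table 2 targets. The target values are copied from the table; the names `ITEM_TARGETS`, `meets_target`, and `classify_item_mean`, and the three labels returned, are only illustrative and not part of the published benchmarks.

```python
# (target for average score, target for good score) per SUS item, from Table 2.
ITEM_TARGETS = {
    1: (3.39, 3.80),
    2: (2.44, 1.85),
    3: (3.67, 4.24),
    4: (1.85, 1.51),
    5: (3.55, 3.96),
    6: (2.20, 1.77),
    7: (3.71, 4.19),
    8: (2.25, 1.66),
    9: (3.72, 4.25),
    10: (2.09, 1.64),
}


def meets_target(item, mean, target):
    """Apply the direction rule: odd items need >=, even items need <=."""
    if item % 2 == 1:
        return mean >= target
    return mean <= target


def classify_item_mean(item, mean):
    """Label an observed item mean against the Table 2 targets.

    "good" corresponds to a SUS equivalent of 80, "average" to a SUS
    equivalent of 68; anything else is labeled "below average".
    """
    average_target, good_target = ITEM_TARGETS[item]
    if meets_target(item, mean, good_target):
        return "good"
    if meets_target(item, mean, average_target):
        return "average"
    return "below average"


print(classify_item_mean(3, 3.90))  # average
print(classify_item_mean(2, 1.70))  # good
```

The good target is checked first because any mean that meets it also meets the average target.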
The popularity of the SUS has allowed for the creation of normalized databases and guidance on what constitutes poor, good, or excellent scores. Researchers on some occasions may want to use single items from the SUS to benchmark more specific constructs (e.g., “I felt very confident using the system” representing user confidence). Using data from almost 12,000 participants, we were able to create benchmarks for individual SUS items to achieve average “C” scores and high “A-” SUS equivalent scores. These benchmarks allow researchers to know what mean value to aim for to achieve an average or good experience when interpreting single items from the SUS.