The System Usability Scale (SUS) is a ten-item questionnaire administered to users for measuring the perceived ease of use of software, hardware, cell phones and websites.
It’s been around for more than a quarter century, and its wide usage has allowed us to study it extensively and write about it in this blog and in the book, A Practical Guide to the System Usability Scale.
If you are unfamiliar with the SUS, see the earlier blog for some background and fundamentals. Here are 10 things to know when using the SUS:
- The average SUS score is 68: When looking at scores from 500 products, we found the average SUS score to be 68. It’s important to remember that SUS scores are not percentages, even though the SUS ranges from 0 to 100. A score of 68 is 68% of the maximum possible score, but it falls right at the 50th percentile. It’s best to report the raw number as a score and, when you want to express it as a percentage, convert the raw score to a percentile rank by comparing it to the database.
- SUS measures usability & learnability: Even though SUS was intended to be a measure of just usability (one-dimensional) we found that two of the items can be used as a measure of learnability: Item #4 (I think that I would need the support of a technical person to be able to use this system) and #10 (I needed to learn a lot of things before I could get going with this system). The graph below shows the relationship between the learnability score and usability score (from all 10 items as well as just the remaining eight items).
In general, we see that the learnability scores track higher than the usability scores and the original SUS score for a selection of 88 studies.
Depending on the type of system being tested and its maturity, measures of learnability may be just as important as measures of usability.
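As a rough illustration of how the two subscales can be pulled out of one participant's responses, here is a minimal Python sketch. It assumes responses are coded 1 (strongly disagree) to 5 (strongly agree) in item order. The function names and the rescaling multipliers (12.5 for the two learnability items, 3.125 for the remaining eight) are assumptions chosen so that each subscale spans 0 to 100; they are not taken from the text above.

```python
LEARNABILITY_ITEMS = (4, 10)  # the two item numbers named in the text above


def item_contributions(responses):
    """Convert ten raw 1-5 responses into 0-4 contributions (higher = better)."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    # Odd items are positively worded (response - 1);
    # even items are negatively worded, so they are reversed (5 - response).
    return [r - 1 if n % 2 == 1 else 5 - r
            for n, r in enumerate(responses, start=1)]


def sus_subscales(responses):
    """Return overall SUS, learnability and usability, each on a 0-100 scale."""
    contrib = item_contributions(responses)
    learn = sum(contrib[n - 1] for n in LEARNABILITY_ITEMS)
    usable = sum(contrib) - learn
    return {
        "sus": sum(contrib) * 2.5,      # 10 items, maximum raw total of 40
        "learnability": learn * 12.5,   # 2 items, maximum raw total of 8
        "usability": usable * 3.125,    # 8 items, maximum raw total of 32
    }


# Made-up responses from one participant, items 1 through 10.
scores = sus_subscales([4, 2, 4, 1, 4, 2, 4, 2, 4, 1])
print(scores)
```

For this made-up participant the sketch gives an overall SUS of 80, a learnability score of 100 and a usability score of 75, which mirrors the pattern of learnability tracking higher than usability.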
- Reversing the items causes more harm than good: SUS, like many questionnaires, alternates the tone of each item. The odd items are phrased positively (e.g. I think that I would like to use the system frequently), and the even items are phrased negatively (e.g. I found the system unnecessarily complex). The alternating tone is intended to reduce acquiescence and extreme response biases. If you’ve ever seen someone quickly answer a survey without carefully reading the items, then you may think this sort of thing is a good idea.
In a paper we published at CHI a few years ago, we actually found no difference in response biases between an all-positively worded version of SUS and the original version.
What we did see, unfortunately, was a side effect of alternating. Eleven percent of researchers mis-scored the SUS because they forgot to reverse the even items. What’s more, 17% of the studies we examined contained problems with participants forgetting to change the direction of their responses when answering negative items (users were agreeing with at least 3 positive and negative items). These errors are hard to detect because they still generate valid SUS scores. Despite these shortcomings, it’s OK to use the original SUS. Just be sure to double-check your item coding and, if possible, have a way to follow up with participants if the scoring looks wrong. To help reduce this problem, the SUS Calculator flags suspect responses for you.
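To make the reversal step concrete, here is a minimal Python sketch of standard SUS scoring along with a simple check for suspect responses. It assumes responses are coded 1 (strongly disagree) to 5 (strongly agree). The `looks_suspect` helper is hypothetical: it encodes one reading of the "agreeing with at least 3 positive and negative items" pattern and is not the rule the SUS Calculator uses.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one participant's ten 1-5 responses."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, r in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += r - 1   # odd (positive) items contribute response - 1
        else:
            total += 5 - r   # even (negative) items are reversed
    return total * 2.5


def looks_suspect(responses, min_agree=3):
    """Hypothetical flag: agrees (4 or 5) with at least `min_agree`
    positive items AND at least `min_agree` negative items."""
    agree_positive = sum(1 for r in responses[0::2] if r >= 4)
    agree_negative = sum(1 for r in responses[1::2] if r >= 4)
    return agree_positive >= min_agree and agree_negative >= min_agree


careful = [4, 2, 5, 1, 4, 2, 4, 1, 5, 2]   # made-up, consistent participant
straight_liner = [5] * 10                  # made-up, agrees with everything

print(sus_score(careful), looks_suspect(careful))
print(sus_score(straight_liner), looks_suspect(straight_liner))

# The researcher-side mistake: forgetting to reverse the even items.
unreversed = sum(r - 1 for r in careful) * 2.5
print(unreversed)
```

The consistent participant scores 85, while the same responses scored without reversing the even items come out at 50. The straight-liner also lands on 50. Both wrong results look like perfectly valid SUS scores, which is why these errors are hard to spot without a check like this.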
- Familiarity breeds content: In examining SUS scores from software and websites, we find that users’ prior experience with the application impacts perceptions of usability as measured by the SUS. In general, a user with a lot of prior experience will rate an application as more usable. This will especially be the case between the users with the most experience and those with the least (or none at all).
For websites, we found that repeat users rated the websites with SUS scores 11% higher than those of first-time users. The same pattern held for software. Users with five or more years of experience with software generated SUS scores 11% higher than users with 0-3 years of experience.
- Usability predicts customer loyalty: In general, we find that SUS scores predict around 40% of why customers recommend software and websites, as measured by the Net Promoter Score. Detractors have an average SUS score of 67 (slightly below average usability) and Promoters have an average score of 82 (well above average usability). In independent, large datasets, we’ve seen that you can estimate the Likelihood to Recommend (LTR) question used in the Net Promoter Score (a 0 to 10 scale) by simply dividing the SUS score by 10. For example, an SUS score of 72 would predict an LTR response of 7.2.
- Raw SUS scores aren’t normally distributed but the sample mean is: If you graph SUS scores from a study, you get a very asymmetrical looking shape (see the graph below). This leads some people who are familiar with parametric statistics and normal theory to get concerned when using confidence intervals and t-tests to make statistical inferences as the distribution isn’t symmetrical or bell-shaped.
The figure above shows what 311 SUS scores from a single study look like when graphed in a histogram (similar to a bar graph).
While the normal distribution is the reference distribution used in most of the statistical procedures we recommend, it is the distribution of the sample mean which needs to be normally distributed. The graphs below show what the sample mean looks like for sample sizes ranging from 8 to 30. In all cases, the distribution of the sample mean is bell-shaped and symmetrical and allows us to have accurate confidence intervals and p-values, even at small sample sizes.
The figure above shows 1000 sample means taken from the dataset shown above at sample sizes of 8, 20 and 30. These sample means show a symmetrical bell shape even at small sample sizes, which makes the use of parametric statistics legitimate and accurate.
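If you want to see this effect for yourself, the resampling idea can be sketched in a few lines of Python. The 311 scores below are synthetic, built from a made-up, left-skewed frequency table; they are not the study data graphed above. Drawing the samples with replacement and summarizing shape with a skewness statistic are also assumptions of this sketch.

```python
import random
import statistics

# Synthetic, left-skewed set of 311 scores (hypothetical frequencies,
# NOT the study data described above).
FREQUENCIES = {30: 4, 40: 8, 50: 15, 60: 28, 70: 45, 80: 70, 90: 85, 100: 56}
population = [score for score, count in FREQUENCIES.items() for _ in range(count)]


def skewness(values):
    """Population skewness: about 0 for a symmetrical distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


def sample_means(data, n, repeats=1000, seed=1):
    """Means of `repeats` random samples of size n, drawn with replacement."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(repeats)]


print(f"raw scores: skewness = {skewness(population):.2f}")
for n in (8, 20, 30):
    means = sample_means(population, n)
    print(f"means of n={n}: skewness = {skewness(means):.2f}")
```

The raw scores should show a clearly negative skewness, while the skewness of the sample means should sit much closer to zero at every sample size. The exact values depend on the synthetic data and the random seed.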
- You can use SUS on small sample sizes: One common question I get when using the SUS (or when measuring usability in general) is about the lowest acceptable sample size. Technically you need at least two users to have some measure of variability (the standard deviation) and to generate confidence intervals. We have never done a test using the SUS with only two users. We will, however, report the SUS score with just five users.
Five is often a magic number for early-phase usability studies. Confidence intervals will be rather wide, but the average SUS score will be surprisingly stable. We did several computer simulations and showed that at a sample size of 5, the sample mean is within six points of a very large sample SUS score 50% of the time (see the graph below).
The figure above shows the difference between the average SUS score and the mean from a sample size of just 5, repeated 1000 times. In 50% of the samples, the SUS score from a sample size of 5 was within 6 points of the true SUS score. Not bad for such a small sample size.
In other words, if the actual SUS score was a 74, average SUS scores from five users will fall between 68 and 80 half of the time. Seventy-five percent of the time, the score was within 10 points of the actual score and 95% of the time, within about 17 points. That means you get within the ballpark of the actual SUS score in more than half of the cases with very small sample sizes. For more precise measures of sample sizes, use the SUS Guide and Calculator.
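To get a feel for how wide the confidence interval is with five users, here is a minimal Python sketch of a standard 95% t-interval around the mean. The five scores are made up, the function name is hypothetical, and the sketch hard-codes the t critical value for a sample of five (2.776 at 4 degrees of freedom). It is a textbook calculation, not necessarily the method used in the SUS Guide and Calculator.

```python
import math
import statistics

# Two-sided 95% t critical value for 4 degrees of freedom (n = 5).
T_CRIT_95_DF4 = 2.776


def sus_ci_five_users(scores):
    """Return (mean, lower, upper) for a 95% t-interval from five SUS scores."""
    if len(scores) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    mean = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    margin = T_CRIT_95_DF4 * std_err
    # SUS is bounded, so clip the interval to the 0-100 scale.
    return mean, max(0.0, mean - margin), min(100.0, mean + margin)


# Made-up SUS scores from five participants.
mean, low, high = sus_ci_five_users([72.5, 80.0, 65.0, 90.0, 77.5])
print(f"mean = {mean:.1f}, 95% CI = {low:.1f} to {high:.1f}")
```

For these made-up scores the mean is 77 and the interval runs from roughly 65.5 to 88.5, a spread of about 23 points. That is the "rather wide" interval you should expect to report alongside the mean at this sample size.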
- SUS scores were not meant to be diagnostic: One surprise that first-time users of SUS sometimes encounter is the lack of diagnostic information they receive. At best, the SUS will provide a measure of usability and learnability, which can be compared to some industry benchmarks. You can look at the individual items in the SUS but none of them really tell you what to fix in the interface. That’s because the SUS, like most questionnaires, was never meant to be diagnostic. It would take too many items and still probably be too vague to determine if the labels, the search engine results page or the product descriptions need improvement. Fortunately, by having participants attempt a few realistic tasks and noting problems in their behavior you can quickly identify areas that are affecting SUS scores.
- SUS is technology agnostic: The items of the SUS are phrased in a way that allows it to be administered on any type of system a user interacts with. That means a company that develops hardware, software or voice response systems can use SUS as a flexible internal benchmark. This flexibility comes at a price, however. When you need to get more specific measures for a technology (for example, trust or visual appeal) the SUS is probably not the best tool for the job.
- SUS might not always be the best questionnaire: While SUS is technology agnostic and relatively short, we use other instruments depending on the job:
  - For measuring website usability, we use the 13-item SUPR-Q. Four of the items can generate a reliable SUS equivalent score. The other items provide measures of credibility/trust, appearance and loyalty.
  - For measuring task-level usability, we use the Single Ease Question (SEQ).
  - For measuring perceived usefulness of mobile apps, we use the item “The application’s capabilities meet my requirements” which has a five-point rating scale (more on this item in future blogs).
One thing all these scales have in common is that we can compare a raw score to a larger dataset to generate relative rankings and percentile ranks to provide more meaning to the metrics.
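The comparison itself is simple to sketch. The Python example below turns a raw score into a percentile rank against a benchmark list. The ten benchmark values are invented for illustration and are far smaller than a real reference dataset. The function name and the choice to count ties as half are also assumptions of this sketch.

```python
def percentile_rank(score, benchmark_scores):
    """Percent of benchmark scores below `score`, counting ties as half."""
    if not benchmark_scores:
        raise ValueError("benchmark_scores must not be empty")
    below = sum(1 for b in benchmark_scores if b < score)
    equal = sum(1 for b in benchmark_scores if b == score)
    return 100.0 * (below + 0.5 * equal) / len(benchmark_scores)


# Hypothetical benchmark of ten product-level SUS scores (invented values).
benchmark = [45, 52, 58, 63, 68, 68, 72, 78, 84, 91]

print(percentile_rank(68, benchmark))
print(percentile_rank(72, benchmark))
```

Against this invented benchmark, a raw score of 68 lands at the 50th percentile and a 72 at the 65th, which shows why the raw score and the percentile rank should not be read as the same number.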