- This website was easy to use.
- It was difficult to find what I needed on this website.
The major reason for alternating item wording is to minimize extreme response bias and acquiescent bias.
However, recent research [pdf] that Jim Lewis and I conducted found little evidence for these biases. We found that response bias effects are small at best and are outweighed by the real effects of researchers miscoding and users misinterpreting items.
Usability Questionnaires Mostly Alternate
The popular System Usability Scale (SUS) has items that alternate between positive and negative wording. In fact, of the most frequently used questionnaires for measuring attitudes about usability, all but one use a mix of positive and negative items.
- System Usability Scale (SUS): 10 Items (half positive & half negative)
- Post-Study System Usability Questionnaire (PSSUQ [pdf]): 19 Positive items
- Software Usability Measurement Inventory (SUMI): 50 Items with a mix of positive and negative
- Questionnaire for User Interaction Satisfaction (QUIS): 27 items with a mix of positive and negative
Advantages to Alternating
There are two major reasons for alternating item wording:
- Reducing Acquiescent Bias: This is what happens when users go on auto-pilot and agree with all statements. On a 5-point scale these would be all 4’s and 5’s.
- Reducing Extreme Response Bias: This is what happens when participants provide all high or all low ratings (all 5’s or all 1’s on a 5-point scale). It is related to acquiescent bias, except that respondents pick the most extreme rating and give it to many or all items.
A mix of positive and negative items forces respondents to consider each question and (hopefully) provide a more meaningful response, which should reduce these biases.
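As a rough illustration of the two response patterns, the sketch below counts agreement responses (4 or 5) and extreme responses (1 or 5) in one respondent's raw answers on a 5-point scale. The function names and the sample answers are hypothetical, and counting on raw (unreversed) answers is an assumption made for illustration, not necessarily how any particular study defines these measures.

```python
def count_agreement(responses):
    """Count raw responses of 4 or 5 (agree / strongly agree) on a 5-point scale."""
    return sum(1 for r in responses if r >= 4)


def count_extreme(responses):
    """Count raw responses at either endpoint (1 or 5) of a 5-point scale."""
    return sum(1 for r in responses if r in (1, 5))


# Hypothetical respondent who agrees with every statement regardless of wording.
autopilot = [4, 5, 4, 4, 5, 4, 5, 4, 4, 5]
# Hypothetical respondent who reads each item and disagrees with the negative ones.
attentive = [4, 2, 5, 1, 4, 2, 4, 2, 5, 1]

print(count_agreement(autopilot), count_extreme(autopilot))
print(count_agreement(attentive), count_extreme(attentive))
```

On an alternating questionnaire, the first pattern stands out because agreeing with both "easy to use" and "difficult to find what I needed" is contradictory, which is the reasoning behind mixing item wording.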
Despite published concerns about acquiescence bias, there is little evidence that the “common wisdom” of including both positively and negatively worded items solves the problem. To our knowledge there is no research documenting the magnitude of acquiescence bias in general, or whether it specifically affects the measurement of attitudes toward usability.
Disadvantages to Alternating
There is a dark side to alternating items. We are aware of at least three.
- Misinterpret: Users may respond differently to negatively worded items, such that reversing responses from negative to positive doesn’t account for the difference. There is evidence that this lowers internal reliability, distorts the factor structure and is more problematic in cross-cultural settings.
- Mistake: Users might not intend to respond differently, but may forget to reverse their score, accidentally agreeing with a negative statement when they meant to disagree. We have observed participants who acknowledged forgetting to reverse their score, or who commented that they had to go back and correct some scores because they had forgotten to adjust them.
- Miscode: Researchers might forget to reverse the scales when scoring and consequently report incorrect data. Even though software makes it easy to record user input, researchers still have to remember to reverse the scales. Forgetting to do so is not an obvious error: the improperly scaled scores are still plausible values, especially when the system being tested is of moderate usability (in which case many responses will be neutral or close to neutral). While this may seem like an easily avoidable problem, we found 3 of 27 SUS datasets (11%) to be miscoded, suggesting that the harried life of a researcher, marketer or product manager could lead to miscoding in between 3% and 28% of all datasets (the 95% confidence interval).
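To show how an interval like the one quoted for the miscoding rate can be obtained from 3 miscoded datasets out of 27, here is a minimal sketch using the adjusted-Wald (Agresti-Coull) method. The article does not say which interval method was used, so the choice of method is an assumption, and the function name is hypothetical. The bounds it produces are close to, but not exactly the same as, the reported range.

```python
import math


def adjusted_wald_interval(successes, n, z=1.96):
    """Adjusted-Wald (Agresti-Coull) confidence interval for a proportion.

    Adds z^2/2 successes and z^2 trials before applying the usual Wald
    formula, which behaves better than plain Wald for small samples.
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


low, high = adjusted_wald_interval(3, 27)
print(f"Observed {3 / 27:.0%}; 95% interval roughly {low:.1%} to {high:.1%}")
```

The width of the interval reflects the small sample: with only 27 datasets, the true miscoding rate could plausibly be much lower or much higher than the observed 11%.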
Is it worth the trouble?
Does alternating item wording outweigh the real negatives of misinterpreting, mistaking and miscoding? To find out, we created an all positively worded version of the SUS and tested it against the original alternating SUS in a series of remote unmoderated usability studies.
|#||All Positive SUS||Original SUS|
|1||I think that I would like to use the website frequently.||I think that I would like to use this system frequently.|
|2||I found the website to be simple.||I found the system unnecessarily complex.|
|3||I thought the website was easy to use.||I thought the system was easy to use.|
|4||I think that I could use the website without the support of a technical person.||I think that I would need the support of a technical person to be able to use this system.|
|5||I found the various functions in the website were well integrated.||I found the various functions in this system were well integrated.|
|6||I thought there was a lot of consistency in the website.||I thought there was too much inconsistency in this system.|
|7||I would imagine that most people would learn to use the website very quickly.||I would imagine that most people would learn to use this system very quickly.|
|8||I found the website very intuitive.||I found the system very cumbersome to use.|
|9||I felt very confident using the website.||I felt very confident using the system.|
|10||I could use the website without having to learn anything new.||I needed to learn a lot of things before I could get going with this system.|
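The scoring difference between the two versions can be sketched as follows. For the original SUS, the standard procedure is to score odd (positive) items as the response minus 1 and even (negative) items as 5 minus the response, then multiply the sum by 2.5 to get a 0 to 100 score. Scoring every item of the all-positive version as the response minus 1 is an assumption that follows from none of its items being reversed. The function name and the sample responses are hypothetical.

```python
def sus_score(responses, all_positive=False):
    """Score ten raw 1-5 responses on the 0-100 SUS scale.

    Original SUS: odd items are positive (response - 1); even items are
    negative and must be reversed (5 - response).
    All-positive SUS: no item is reversed, so every item is (response - 1).
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if all_positive or item_number % 2 == 1:
            total += response - 1
        else:
            total += 5 - response
    return total * 2.5


# Hypothetical respondent with a favorable view, answering the original SUS.
original_answers = [4, 2, 5, 1, 4, 2, 4, 2, 5, 1]

correct = sus_score(original_answers)
# Simulates a miscode: the even items are scored without being reversed.
miscoded = sus_score(original_answers, all_positive=True)
print(correct, miscoded)
```

The miscoded result is still a valid-looking SUS score, which is why forgetting to reverse the even items is easy to miss.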
We had 213 users in the US attempt two representative tasks on one of seven websites (third-party automotive or primary financial services websites: Cars.com, Autotrader.com, Edmunds.com, KBB.com, Vanguard.com, Fidelity.com and TDAmeritrade.com).
The tasks included finding the best price for a new car, estimating the trade-in value of a used car and finding information about mutual funds and minimum required investments. At the end of the study users were randomly assigned to complete either the standard or the positively worded SUS. There were between 15 and 17 users for each website and questionnaire type. The mix of gender, age and education levels was not statistically different between groups.
We found little evidence that the purported advantages of the alternating items outweighed the disadvantages.
- Differences in scores were negligible: The mean SUS scores, the means for the even items and the means for the odd items were statistically indistinguishable (see Figures 1 and 2 below).
Figure 1: Mean SUS scores for both versions (p > .39).
Figure 2: Mean scores (scaled from 0 to 4) for odd (p > .54) and even (p > .2) items.
- No difference in acquiescent bias: The mean number of agreement responses was nearly identical on both questionnaires: 1.64 for the standard and 1.66 for the all positive (p > .95).
- No difference in extreme response bias: The mean number of extreme responses was 1.68 for the standard SUS and 1.36 for the positive version (SD = 2.23, n = 106), a non-significant difference (t(210) = 1.03, p > .30).
- No difference in reliability: The internal reliability of both questionnaires was high (Cronbach’s alpha of .92 for the original and .96 for the positive).
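For readers unfamiliar with the reliability measure, here is a minimal sketch of Cronbach's alpha using the standard formula: k/(k-1) times one minus the ratio of the summed item variances to the variance of the total scores. The function name and the small dataset are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    Items are assumed to be scored in the same direction already
    (negative items reversed), as they would be before computing SUS.
    """
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical scaled item scores (0-4) for five respondents on four items.
scores = [
    [4, 4, 3, 4],
    [3, 3, 3, 2],
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [4, 3, 4, 4],
]
print(round(cronbach_alpha(scores), 2))
```

Alpha approaches 1 when respondents who score high on one item tend to score high on the others, so miscoded or misread negative items would be expected to pull it down.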
Negatives Outweigh the Positives: There is little evidence that the purported advantages of including negative and positive items in usability questionnaires outweigh the disadvantages. This finding certainly applies to the SUS when evaluating websites using remote unmoderated tests. It also likely applies to usability questionnaires with similar designs in unmoderated testing of any application. Future research with a similar experimental setup should be conducted using a moderated setting to confirm whether these findings also apply to tests when users are more closely monitored.
New Usability Questionnaires Shouldn’t Alternate Wording: Researchers interested in designing new questionnaires for use in usability evaluations should avoid the inclusion of negative items.
No Reason to stop using the original SUS (just watch your coding!): Researchers who use the standard SUS have no need to change to the all positive version, provided that they verify the proper coding of scores (for example by using the error-checking spreadsheet included in the SUSPackage).
- In moderated testing, researchers should include procedural steps to ensure error-free completion of the SUS (such as when debriefing the user).
- In unmoderated testing, it is more difficult to correct the mistakes respondents make, although it is reassuring that despite these inevitable errors, the effect is unlikely to have a major impact on overall SUS scores.
All Positive SUS generates similar results: Researchers who do not have a current investment in the standard SUS can use the all positive version with confidence because respondents are less likely to make mistakes when responding, researchers are less likely to make errors in coding, and the scores will be similar to the standard SUS.
Is there ever a good reason to alternate items?
We only examined questionnaires that measure usability or system satisfaction. Usability is an analysis at the group level (we’re not testing users, but rather applications that users use) so we care about differences between groups. It could be that in other areas of behavioral research where the emphasis is on the individual (e.g. clinical or counseling psychology) alternating item wording provides benefits that outweigh the problems. Until other research identifies net benefits to alternating item wording, it’s best to stay positive.
For more detail on the experiments and related research into this topic see the full paper [pdf] (to be presented at CHI in May 2011).