How satisfied are you with your life?
How happy are you with your job or your marriage?
Are you extroverted or introverted?
It’s hard to capture the fickle nature of attitudes and constructs in any measure. It can be particularly hard to do that with just one question or item.
Consequently, psychology, education, marketing, and user experience have a long history of recommending multiple items to measure a construct.
Using only a single item to measure a construct is often greeted with skepticism—the Net Promoter Score being a recent example. It’s based on responses to only a single item (how likely are you to recommend to a friend) to measure loyalty.
But is it ever acceptable to use just a single item to measure a construct like loyalty, satisfaction, or ease of use?
History of Multi-Item Scales
The classic text that has influenced much of scale development in the behavioral sciences is Psychometric Theory. Jum Nunnally, its author, recommended multi-item instruments because “measurement error averages out when individual scores are summed to obtain a total score” (p. 67).
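One common way to put a number on this averaging-out argument is the Spearman-Brown prediction formula from classical test theory. The sketch below is an illustration rather than anything taken from Nunnally's text: it assumes the items are parallel (equally reliable and measuring the same construct), and the single-item reliability of 0.5 is a made-up value.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a sum of n parallel items (Spearman-Brown)."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


# Hypothetical single-item reliability of 0.5, lengthened to 2, 5, and 10 items.
for k in (1, 2, 5, 10):
    print(k, round(spearman_brown(0.5, k), 2))
```

Under these assumptions the predicted reliability rises quickly at first and then flattens, so each added item buys less than the one before it.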
This thinking has influenced standardized testing and personality assessments. For example, a well-known assessment of personality, the 16PF, has 185 items.
Marketers also have generally followed the advice of Churchill and his influential 1979 paper. In it Churchill advises:
“marketers are much better served with multi-item than single-item measures of their constructs, and they should take the time to develop them.”
For example, the ServQual questionnaire uses 22 items to gauge the quality of service received from a company. And even the American Customer Satisfaction Index (ACSI) uses three items to measure company satisfaction.
This thinking has also influenced UX measurements; all the following instruments use multiple items:
- QUIS: The Questionnaire for User Interaction Satisfaction uses 27 items
- SUS: The System Usability Scale (SUS), a measure of perceived usability, uses 10 items
- SUMI: The Software Usability Measurement Inventory, a measure of software quality, uses 50 items
- PSSUQ: The Post-Study System Usability Questionnaire: 16 items
- SUPR-Q (UX Website Quality): 8 items
- SUPR-Qm (Mobile App UX Quality): 16 items using Item Response Modeling
Concerns with Single Item Scales
Generally, three major concerns are raised about single item scales: they don’t capture the construct (low content validity), have fewer points of discrimination (sensitivity), and lack a measure of internal-consistency reliability (reliability).
Low content validity: Content validity refers to how well the content of the items used in a questionnaire addresses the topic. With only one item describing the construct, it can be difficult to adequately address it. McIver and Carmines (1981) say, “It is very unlikely that a single item can fully represent a complex theoretical concept or any specific attribute for that matter” (p. 15).
Sensitivity: Single items are also limited in their capability to provide enough points of discrimination. For example, a single 5-point Likert item has only five possible scores. In contrast, a 10-item 5-point scale has 41 possible scores (summed scores range from 10 to 50). With fewer points of discrimination, a single item generally needs a larger sample size to detect the same difference.
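The arithmetic behind points of discrimination can be sketched in a few lines. This assumes each item is scored from 1 up to its number of points and that the scale score is the simple sum of the items. The function name is hypothetical. Counting both endpoints, a range of 10 to 50 spans 40 points and contains 41 possible values.

```python
def distinct_scores(n_items: int, points_per_item: int) -> int:
    """Number of possible summed scores when each item is scored 1..points_per_item."""
    lowest = n_items * 1
    highest = n_items * points_per_item
    return highest - lowest + 1


print(distinct_scores(1, 5))    # a single 5-point item
print(distinct_scores(10, 5))   # ten 5-point items, summed scores from 10 to 50
print(distinct_scores(1, 11))   # a single 11-point item
```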
Reliability: The most common measure of reliability is Cronbach’s alpha, a measure of internal consistency. It’s based on how consistently respondents answer items. You need at least two items to compute Cronbach’s alpha, though, so it can’t be computed for a single item.
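To show why alpha needs at least two items, here is a minimal sketch of the standard formula: alpha = k / (k - 1) × (1 − sum of item variances / variance of total scores). The response data is invented for illustration, and the sketch assumes complete data with some variation in the total scores.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per respondent, each row a list of item scores."""
    k = len(responses[0])
    if k < 2:
        # With one item, k / (k - 1) divides by zero: alpha is undefined.
        raise ValueError("Cronbach's alpha needs at least two items")
    item_columns = list(zip(*responses))
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Hypothetical data: five respondents answering three 5-point items.
data = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 3, 2], [1, 2, 1]]
print(round(cronbach_alpha(data), 2))
```

With a single item, the k / (k - 1) term is undefined, which is the mechanical reason internal consistency cannot be reported for one-item measures.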
Research on Single Items
For every “rule” in measurement, there’s usually a long list of exceptions. In some cases, these rules are conventions that can be broken. For example, we found that the long-standing practice of alternating the tone of items (a “rule” for making good questionnaires) to reduce acquiescence bias actually does more harm than good. What’s more, we found little evidence of the acquiescence bias.
To see whether the conventional wisdom that single items are insufficient holds up, I examined the literature for studies comparing single and multi-item scales. Many papers advocate for multi-item scales (especially to measure complex constructs), but quite a few show cases in which single items are sufficient.
Scarpello and Campbell in 1983 found a single 5-point measure of job satisfaction was sufficient. This suggests at least one important measure of satisfaction can be captured with a single item. It failed to catch on as a measure, though.
Later, Wanous et al. (1997) conducted a meta-analysis of 17 studies of job satisfaction and found single item measures performed sufficiently well (correlating highly with total scores). They even concluded that “single-item measures are more robust than the scale measures of overall job satisfaction” and should not be dismissed outright. They used a correction for attenuation formula to estimate the internal reliability of a single item.
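The classic correction for attenuation states that the true correlation between two measures equals the observed correlation divided by the square root of the product of their reliabilities. One way to use it, sketched below, is to assume a true correlation between the single item and the scale and then solve for the single item's reliability. The function name and all input values are hypothetical and are not figures from the meta-analysis. The exact procedure the authors followed may differ.

```python
def single_item_reliability(observed_r: float,
                            scale_reliability: float,
                            assumed_true_r: float = 1.0) -> float:
    """Solve r_true = r_obs / sqrt(r_xx * r_yy) for r_xx (the single item).

    observed_r        : observed correlation between the item and the scale
    scale_reliability : reliability of the multi-item scale (r_yy)
    assumed_true_r    : assumed true correlation between the two constructs
    """
    return observed_r ** 2 / (assumed_true_r ** 2 * scale_reliability)


# Hypothetical inputs: observed r = .60, scale reliability = .90.
print(round(single_item_reliability(0.60, 0.90), 2))
# A lower assumed true correlation implies a higher estimated item reliability.
print(round(single_item_reliability(0.60, 0.90, assumed_true_r=0.9), 2))
```

Because the assumed true correlation cannot exceed 1.0, setting it to 1.0 gives the smallest reliability consistent with the observed correlation, so the result reads as a lower-bound estimate.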
Hyland and Sodergren (1996) compared 12 items of self-reported quality of life (0=might as well be dead to 100=perfect quality of life) from undergraduates and an elderly population. They found items within the most preferred four-item scale correlated very highly with each other (r > .84) and had a level of inaccuracy of less than 3% (suggesting any one item is a sufficient measure).
Cunny and Perri (1991) found that a single item of the Medical Outcome Study Survey can serve as a substitute for the full 20-item measure. The item (In general, would you say your health is Excellent, Very Good, Good, Fair, or Poor) correlated very highly (r = .86) with the overall score of the health-related quality of life measure.
McKenzie and Marks (1999) showed that a single item measure of depression was a reasonable substitute for a longer 21-item assessment and saved clinicians and patients time.
Bergkvist and Rossiter (2007) had 92 undergraduate students evaluate four advertisements and rate how much they liked the advertisement, attitude toward the brand, purchase intention, and brand beliefs. They found no difference in predictive capability between multiple item measures and single item measures.
Drolet and Morrison (2001) conducted a study varying the number of items and found that as the number of synonymous items grows, respondents are more likely to engage in “mindless response behavior.” They suggested that not only do multiple items take more time, multiple items may actually increase response error (potentially offsetting the benefits advocated by Nunnally and Churchill).
Ittner & Larcker (1998) found that a single 10-point item of overall satisfaction with a company’s service performed as well as a multi-item measure of satisfaction in predicting financial performance.
Van Doorn et al. (2013) also found a single 5-point measure of customer satisfaction was an adequate predictor of future business performance. It was not the best measure but was statistically indistinguishable from the multi-item measures, which is encouraging considering it has only five points of discrimination.
One of the principal criticisms of using single items is that internal consistency reliability cannot be computed (you need at least two items). But that doesn’t mean you can’t measure reliability. While Cronbach’s alpha is the most common measure of reliability, it’s not the only measure.
Another way to measure reliability is test-retest reliability (the “test” comes from the legacy of psychometrics, which had its start building standardized tests in education). In short, to measure how reliable a questionnaire with 10, 5, or 1 items is, data should be collected from the same participants at two points in time and correlated. The two time points can be hours, days, weeks, or years apart, depending on the content being measured.
There is, in fact, some discussion that test-retest may be a better measure of participant consistency (or at least as informative as Cronbach’s alpha).
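Test-retest reliability as described above comes down to a correlation between two sets of responses from the same people. The sketch below uses a Pearson correlation on made-up ratings for a single 7-point item. The data and variable names are hypothetical, and other coefficients (such as an intraclass correlation) are sometimes used instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical ratings on one 7-point item from the same six participants,
# collected at two points in time (same participant order in both lists).
time_1 = [6, 4, 7, 3, 5, 6]
time_2 = [6, 5, 7, 3, 4, 6]
print(round(pearson_r(time_1, time_2), 2))
```

The same function works whether each score is a single item or the total of a multi-item questionnaire, which is why this approach applies to scales of any length.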
A Place for Single Items
Even proponents of multi-item measures agree that there is a place for single item measures. For example:
Pollack and Alexandrov (2013) examined the criticism of the Net Promoter Score’s single item and while they found other measures were a better predictor of growth than the NPS, they found sufficient evidence supporting a single item’s adequacy.
Sarstedt and Wilczynski (2009) questioned the approach used by Bergkvist and Rossiter using measures of customer satisfaction and customer loyalty. They still found that while single-item measures are not appropriate for complex constructs, they “perform acceptable with regard to reliability” for simple constructs.
While Grapentine (2001) argued for the importance of multi-item scales for measuring complex constructs, he also conceded that single items would be sufficient if “it does not significantly affect its reliability and validity, and if the client does not want to know how the company’s product performs on an item that could be part of the multiple-item measure.”
Freed (2013), a strong critic of the NPS, also agreed a single item has a place:
“If the construct being measured is sufficiently narrow or is unambiguous to the respondent (e.g., the measurement of subjective probabilities, such as future behaviors), a single item measure may suffice.
“But for more complex psychological constructs (especially those based on attitudes), it is usually recommended that scales with multiple items be used.” (Appendix D)
We’ve found single items perform as well as or better than multi-item instruments or are a reasonable substitute for more items. For example:
- SEQ: The Single Ease Question performed as well as or better than multi-item measures.
- SIUM: A Single Item Usability Measure was shown to be a reasonably valid and reliable measure for online store usability.
- SMEQ: The Subjective Mental Effort Questionnaire is also a single item, using a visual analog scale, that performs well in measuring mental effort of a task.
- UMUX-Lite: Offers a reasonable substitute for the SUS, with only two items.
- SUS: Individual items in the 10-item SUS have a strong correlation (r > .9) to the total SUS score (at the product level), suggesting a single item is a reasonable substitute and may adequately cover the content of system usability (similar to the meta-analysis from Wanous et al.). I’ll cover more on this in future articles.
Summary & Takeaways
- Single item measures are likely adequate for some constructs. For simple (one-dimensional) or concrete constructs that are well understood, a single item may suffice. What little is gained in internal consistency reliability may be offset by the burden of additional items and possibly additional response error. This likely includes customer satisfaction, job satisfaction, likelihood to recommend, advertisement favorability, brand favorability, and perceived ease of use.
- Multi-item measures are better for more complex constructs. More items mean more coverage of content (higher content validity). For example, the SUPR-Q is a measure of website UX quality and taps into multiple constructs (trust, usability, appearance). It’s not possible to ask about all these concepts in a single item.
- Multi-item measures increase the number of points of discrimination. Having more multi-point items will likely lead to detecting differences with smaller sample sizes and/or better correlations with current and future business metrics.
- Use more points for single items. Few points of discrimination mean you need large sample sizes to differentiate changes over time or between products—a justification for the NPS’s underlying 11-point scale (but not necessarily scoring system). When using a single item, consider using more points (7, 9, or 11) rather than fewer points (5, 3, or 2).
- Consider test-retest as an alternate measure of reliability. Internal consistency reliability (Cronbach’s alpha) cannot be computed on single items. To measure reliability on single items, use test-retest reliability (correlating responses from the same participants taken at different times). I’ll cover this topic in a future article.
Thanks to Jim Lewis for commenting on an earlier version of this article.