A poorly worded question, an unclear response item, or an inadequate response scale can create additional error in measurement.
Worse yet, the error may be systematic rather than random, so it would result in unmodeled bias rather than just increased measurement variability.
Rating scales have many forms, with variations in points, labels, and numbers. In earlier articles, and from our own primary research, we’ve examined the effects of adding colors to scale points, using only three points, presenting items in a grid, and labeling the neutral response.
The results of changing scales can often be counterintuitive. Unfortunately, much “conventional wisdom” on response scales is unsubstantiated, incorrect, or exaggerated, or may be highly dependent on the context in which questions and scales are used.
It’s hard to know whether small changes to item wording or response formats will have a big or small impact on scores. For example, when we changed half the items of the System Usability Scale from a negative to positive tone, the change didn’t significantly affect the overall SUS score (Sauro & Lewis, 2011, Experiment 2).
One core question about multipoint rating scales is whether to label all points or just the endpoints.
Proponents of labeling all points argue that it makes it clear what people are responding to. Adding labels adds clarity for the respondent. After all, what exactly does 5 mean on the following scale?
While labeling points does provide more information to respondents, the labeling of the scales may also introduce a new level of interpretation and ambiguity. What exactly does “Somewhat Satisfied” mean?
While numbers may lack the potentially increased clarity of verbal labels, there is universal understanding that 5 is greater than 4. The inherent ambiguity in language (not to mention problems in translation) may introduce more problems than solutions. For example, on something such as the Technology Acceptance Model’s seven-point scale (see Lewis, 2019), is the difference between “Slightly” and “Quite” universally understood?
What’s more, there’s also the practical problem of coming up with the right labels for scales with more than five or seven points. For example, coming up with labels for each of the points on the eleven-point Likelihood to Recommend item would likely be impractical; endpoints and a neutral label may be all that can be done. By contrast, labeling each response option for five- and seven-point scales is practical and fairly common. But even when it’s easy to do, should you do it?
Research on Labeling Response Scales
There has been a fair amount of research on the effect of labeling scales. Much of it comes from the context of public opinion polling, which can result in outcomes that are both high-stakes and controversial.
One of the more comprehensive discussions of assessing the impacts of full versus partial labeling of response scales comes from a chapter by Krosnick and Fabrigar in their 1997 book, Survey Measurement and Process Quality. They lay out the evidence for and against labeling response options, focusing on scale reliability and validity.
Early Studies Found Labeling Was Not Better
The first studies Krosnick and Fabrigar cited are relatively old and found no impact on reliability for full versus partially labeled scales (e.g., Finn, 1972; Madden & Bourdin, 1964). Finn (1972) varied the number of response options and labeling strategies for scales to rate different types of jobs, finding no significant difference in scale reliability for labeled versus unlabeled response options. Madden and Bourdin (1964) studied nine different configurations of scales with nine response options, also in the context of rating different types of jobs, but the configurations they studied and their method of assessing reliability were unusual, making their results difficult to integrate into this literature review.
A study by Andrews (1984) found, contrary to the author’s expectations, that data quality was actually below average when all scale categories were labeled compared to when the scale was only partially labeled (including just the endpoints). The Andrews study used data from six North American surveys, totaling 7,706 respondents. The surveys were administered over the phone and in person, asking about a variety of topics including quality of life, business activities, and lifestyle behaviors.
However, these three studies seem to be the exception in Krosnick and Fabrigar (1997), who cited more studies showing higher reliability and validity when all points were labeled, especially for respondents with low to moderate education. They cited a few studies supporting full labeling that we will summarize in more detail.
Other Studies Did Find Labeling Was Superior
A study by Dickinson and Zellinger (1980) had 86 veterinary students rate faculty performance on six dimensions using three scale variations. The scales had five points fully labeled: Always, Very Often, About As Often As Not, Very Seldom, and Never; two scales contained behavioral examples to help the students judge the faculty’s performance. The scales with examples were generally preferred and performed better than the less labeled version. This study didn’t include a partially labeled (or numeric-only) scale, so the context was more akin to a rubric, suggesting that more information for each scale point led to improved measurement quality.
Another study by Wallsten et al. (1993) had 442 college students (including statistics and MBA students) answer questions about preferred communication using verbal or numerical information regarding judgments of uncertainty. For example, when asked “Which mode do you usually prefer to use when communicating your opinion to others?” 65% of respondents indicated “Verbal” over the “Numerical” choice. However, when asked whether there were times when they’d prefer the opposite, 93% indicated so, suggesting the context of the question may dictate preference. It’s unclear how this stated preference translates into the sort of rating scales administered in web surveys.
A study by Krosnick and Berent (1993) examined political party identification and ratings of governmental policies across eight experiments that compared endpoint-labeled scales to fully labeled scales. Surveys were administered over the phone, in-person, or on paper (self-administered), although a number of the studies had confounding effects, making it hard to isolate the effects of labeling.
One study compared response scales by asking respondents over the phone to judge the right amount of U.S. defense spending. The fully labeled scale version for this policy question was
In only two of the studies in the Krosnick and Berent (1993) paper (Study 5, face-to-face interview; Study 6, telephone interview) was there any attempt to separate the effects of labeling and branching. The experimental designs of those studies were fractional rather than full factorials, so there was still some confounding of results. Specifically, the conditions were partially labeled nonbranching, fully labeled nonbranching, and fully labeled branching. Study 6 suffered from a different methodological issue because participants were initially exposed to a partially labeled nonbranching format, then divided into three groups exposed to the three different formats.
For the purpose of this review, we focus on Study 5, which was the only one in which there was a longitudinal comparison of partially labeled and fully labeled nonbranching conditions. The primary conclusion of Study 5 was “To our surprise, the combined reliability of the five partially labeled nonbranching items (58.9%) was not significantly different from the combined reliability of the fully labeled nonbranching items (57.8%)” (p. 957).
In another study, Alwin and Krosnick (1991) used data from five U.S. national surveys for 13 items assessing political attitudes. They reported higher reliability when seven-point scales were fully labeled (mean reliability of .783) versus partially labeled (mean reliability of .570).
Wedell et al. (1990) conducted two experiments with UCLA undergrads using different scales to gauge clinical judgment. Across two studies they had students read 36 case histories of psychiatric patients using different scales with anchors including “Very, Very Mild Disturbance” to “Very, Very Severe Disturbance” and from “Superior” to “Very Poor.” They found that fully labeled scales performed better. Their findings suggest that including clear and calibrated categories will improve the reliability of judgments. But the use of the rating scale categories in this context also may be considered more similar to a rubric and less like a rating scale of attitudes.
More Recent Research Found Mixed Results of Labeling
Research on this topic continued after the publication of the Krosnick and Fabrigar chapter.
Weijters et al. (2010) found that fully labeled scales evoke more acquiescence response style (ARS) and less extreme response style (ERS) than scales that have only endpoint labeling. They hypothesized that in the case of a fully labeled scale, the center categories become more salient to respondents than they are in scales with only endpoint labeling. Based on a set of complex outcome metrics, they concluded that endpoint labeling was better for studies that had the primary purpose of any sort of general linear modeling, and full labeling was better for opinion measurement (see their Figure 3).
Lau (2007) found no significant effect of endpoint labeling versus full labeling on the incidence of extreme responding. He found that more absolute descriptors (e.g., Completely Disagree/Completely Agree) yielded more extreme responses than did less absolute descriptors. Ultimately, however, he concluded that the extreme response style’s effect on substantive findings was negligible, with no differences in estimates of effect sizes in a study of the relationship between individualism/collectivism and satisfaction.
Tourangeau et al. (2007) conducted two studies that varied the style of rating scales in a web-based survey with over 5,000 U.S.-based panel respondents answering 16 seven-point items, including seven dietary habits items with endpoints of “Strongly Oppose” and “Strongly Favor” and nine mood-related items with frequency endpoints of “None of the Time” to “All of the Time.” They manipulated color, labels, and the type of numeric label (e.g., negative numbers vs. positive numbers). They found that fully labeling the scale increased mean scores in some conditions (as was first reported by Schwarz et al., 1991), and this effect was larger than the effect of changing colors.
They found that respondents took longer to answer the items with fully labeled scales than items with verbal endpoints only (about 0.7 seconds longer per item). The authors hypothesized that there may be a hierarchy of features that respondents attend to, with verbal labels taking precedence over numerical labels and numerical labels taking precedence over purely visual cues such as color. In reference to Krosnick and Fabrigar’s recommendation to label every point in a scale, the authors speculated that the benefit of full labeling may partly reflect the added time respondents give to the question.
Moors et al. (2014) conducted a study with 3,266 U.S.-based panel respondents to examine the effects of extreme response styles (picking the highest and lowest options) and acquiescent response styles (agreeing). Participants were randomly assigned to one of five formats that asked about the environment and attitudes toward risky driving, each consisting of four items for each construct, two positively worded and two negatively worded. All used seven-point scales.
The five formats were:
- Full labeling with numerical values
- Full labeling without numerical values
- End labeling with numerical values
- End labeling without numerical values
- End labeling with bipolar numerical values
The authors found that all scale formats exhibited extreme response bias, but end labeling evoked more ERS than full labeling. However, they felt that ERS, similar to what they found in an earlier study they conducted, is a stable trait that holds across different questionnaires and time (Kieruj & Moors, 2013).
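The two response styles discussed above can be illustrated with simple count-based indices. This is a minimal sketch, not the modeling approach used in the cited studies: the function names, the seven-point default, and the example respondent are all assumptions made for illustration. ERS is taken here as the share of answers on either endpoint, and ARS as the share of answers above the midpoint on raw (not reverse-scored) items.

```python
from typing import Sequence


def extreme_response_share(responses: Sequence[int], low: int = 1, high: int = 7) -> float:
    """Share of a respondent's answers that fall on either scale endpoint."""
    if not responses:
        raise ValueError("responses must not be empty")
    extremes = sum(1 for r in responses if r in (low, high))
    return extremes / len(responses)


def acquiescent_response_share(responses: Sequence[int], midpoint: int = 4) -> float:
    """Share of raw answers above the midpoint (agreement), regardless of item wording."""
    if not responses:
        raise ValueError("responses must not be empty")
    agreements = sum(1 for r in responses if r > midpoint)
    return agreements / len(responses)


# Hypothetical respondent: eight seven-point items, not reverse-scored.
respondent = [7, 6, 7, 1, 5, 7, 6, 2]
ers = extreme_response_share(respondent)      # four endpoint answers out of eight
ars = acquiescent_response_share(respondent)  # six answers above the midpoint out of eight
print(f"ERS index: {ers:.2f}, ARS index: {ars:.2f}")
```

Because the ARS index is computed on raw responses to a balanced set of positively and negatively worded items, a high value suggests agreement with statements regardless of their content.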
Hjermstad et al. (2011) conducted a meta-analysis of self-reported pain scales that varied in scale points and use of labels (and included Visual Analogue Scales). They concluded it was unclear to what extent and in which direction the actual scores were influenced by labeling, so this remains an open question.
A study by van Beuningen et al. (2014) involved three experiments comparing five-point fully labeled scales with ten- and eleven-point endpoint-only labeled scales across two experiments with over 10,000 respondents on life happiness and satisfaction measures. They didn’t disentangle the effects of labels and number of points but ultimately recommended that Statistics Netherlands use a ten-point scale with only the endpoints labeled for both international comparability and the increased ability to discriminate between happy and unhappy respondents gained from adding more points.
Coromina and Coenders (2006) had 383 PhD students from three European cities in Spain, Slovenia, and Belgium rate how frequently they were able to accomplish tasks such as asking colleagues for information and engaging in social activities outside of work with their colleagues in the past year.
The authors found the fully labeled variation generated higher validity scores than with only endpoints labeled. The type of labeling used here (specific frequencies) differs from other scales that measure more abstract concepts such as agreement. With regard to full labeling versus endpoint only, they concluded: “Our results for factor 2 (all categories or only endpoints of the scale labeled) show that a higher validity is obtained when all labels are used. The reason why extra labels are helpful may be that in our questionnaire, labels indicate precise social contact frequencies and not vague or unclear quantifiers like ‘agree’, ‘not much agree’, ‘undecided’ and so on. This is typical of any frequency of contact question (a most common type of question in social network research), and we hypothesize that it may generalize to any data collection mode” (pp. 227-228). It does seem reasonable that this would be the case for frequency-of-contact questions in general, but it’s not clear whether it would be the case for the measure of sentiments (e.g., perceived usability).
More Recent Evidence Found Labels Aren’t Measurably Better
Schneider et al. (2008), including Krosnick as a co-author, conducted two studies using intent-to-recommend scales such as those used to compute the Net Promoter Score, varying the number of response options from 5 to 7 to 11. Contrary to their expectations, they concluded that assigning full labels did not improve scale validity; instead it produced weaker relationships between the scales and the validity criteria (stated historical recommendations). The partially labeled eleven- and seven-point scales were almost identical and better predictors of stated historical recommendations than the fully labeled scales for customers and non-customers.
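For readers unfamiliar with how the Net Promoter Score is derived from an eleven-point (0 to 10) likelihood-to-recommend item, here is a minimal sketch using the standard cutoffs (promoters rate 9 or 10, detractors rate 0 through 6). The function name and the ratings are hypothetical and are not data from Schneider et al.

```python
from typing import Sequence


def net_promoter_score(ratings: Sequence[int]) -> float:
    """Percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical sample of ten respondents on the 0-10 item.
ratings = [10, 9, 9, 8, 7, 7, 6, 5, 3, 10]
print(net_promoter_score(ratings))  # four promoters, three detractors, ten respondents
```

Note that the score depends only on the two ends of the distribution, which is one reason the number of response options matters for this kind of item.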
Lewis (2019) investigated the effect of manipulating item formats for a revised version of the Technology Acceptance Model (TAM) questionnaire, originally designed to assess likelihood of use with fully labeled seven-point items (from left to right: Extremely Likely, Quite Likely, Slightly Likely, Neither, Slightly Unlikely, Quite Unlikely, Extremely Unlikely). To modify the items for the assessment of user experience for full labeling, the word “Likely” was changed to “Agree” and “Unlikely” was changed to “Disagree.”
The experimental design of the study was a full factorial based on crossing two independent variables: response format (full labeling with no numbers or numbers only) and response order (increasing in level of agreement from right to left or from left to right).
Respondents used the items to rate their experience with their company’s email application. With n = 546 and roughly equal numbers of participants in each of the four conditions, there was no significant effect of either format or order on mean ratings. The results indicated that the item format differences didn’t lead to any important differences in the magnitude or structure of measurement, but there were significantly more response errors when the magnitude of agreement increased from right to left.
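The 2 × 2 full factorial design described above can be sketched by crossing the two independent variables and spreading participants evenly across the resulting cells. The condition labels paraphrase the description of the study, but the assignment procedure (shuffle, then round-robin) and the function name are assumptions for illustration; the source reports only that the four conditions had roughly equal numbers of participants.

```python
import random
from itertools import product

# Two independent variables, two levels each (labels paraphrased from the study description).
FORMATS = ("full labels, no numbers", "numbers only")
ORDERS = ("agreement increases left to right", "agreement increases right to left")

# A full factorial crosses every level of one variable with every level of the other.
CONDITIONS = list(product(FORMATS, ORDERS))


def assign_conditions(participant_ids, seed=0):
    """Shuffle participants, then deal them round-robin into the factorial cells."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(ids)}


assignment = assign_conditions(range(546))
for condition in CONDITIONS:
    cell_size = sum(1 for c in assignment.values() if c == condition)
    print(condition, cell_size)
```

Because every combination of format and order appears, the effect of each variable can be estimated separately, which is what distinguishes this design from the fractional designs discussed earlier.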
Summary and Discussion
Our comprehensive (although not exhaustive) review of 17 studies of scale labeling found:
There is no clear superiority for full labeling. Despite some recommendations and “best practice” wisdom, we didn’t find a clear pattern of fully labeled scales being measurably superior. When comparisons were well controlled (no confounding), labeling differences didn’t matter much. The most direct and unconfounded comparisons of full and endpoint-only labeling were reported by Lewis (2019) and Krosnick and Berent (1993, Study 5 only), both of which reported no significant effects due to differences in labeling formats.
Context matters. As is often the case with survey research, knowing the topic, how the survey and rating scale was administered (e.g., over the phone, in-person, or via web survey), and other contextual variables is important to understanding the generalizability of the findings. We found significant variation in methods and conclusions across the literature. Scales with specific frequencies (e.g., daily vs. monthly) have objective meanings, whereas user experience measurement is usually focused more on measuring sentiments (e.g., subjective concepts such as agreement or satisfaction). Full labeling of objective items seems like it should lead to better data quality than endpoint-only labeling, but it’s not clear whether this is the case for subjective items.
How do you judge which format is better? There are a number of ways response scales have been evaluated in the published literature: comparing differences in scores, different distributions, reliability, correlations to other measures (validity), measures of extreme response bias, participant response time, and preference. The strongest arguments for or against a response format would be based on validity (does it better predict or measure what it’s intended to?) and reliability (are responses consistent?). The literature has been mixed.
Extreme responses alone are not necessarily bad. We’ve found that measures of extreme responses (e.g., the top-box score) tend to be better predictors of behavior, and unless there are a sufficient number of response options, you can’t measure the extreme responses.
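As a concrete illustration of a top-box score, the sketch below computes the percentage of ratings that fall in the highest response option (or the highest two, for a top-two-box score). The function name, parameters, and ratings are hypothetical.

```python
from typing import Sequence


def top_box_score(ratings: Sequence[int], scale_max: int, boxes: int = 1) -> float:
    """Percentage of ratings that fall in the top `boxes` response options."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    threshold = scale_max - boxes + 1
    hits = sum(1 for r in ratings if r >= threshold)
    return 100 * hits / len(ratings)


# Hypothetical ratings on a seven-point scale.
ratings = [7, 7, 6, 5, 7, 4, 6, 7, 3, 6]
print(top_box_score(ratings, scale_max=7))           # share of 7s
print(top_box_score(ratings, scale_max=7, boxes=2))  # share of 6s and 7s
```

With only three response options, the top box would absorb everyone with a broadly positive view, so the measure could no longer separate the most extreme respondents from the merely favorable ones.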
More points likely increase scale reliability, validity, and sensitivity. Furthermore, there is good evidence that scales with seven, ten, or eleven points increase the sensitivity, reliability, and in some cases the validity of the measure (e.g., Lewis & Erdinç, 2017), especially when there is a known relationship between extreme responses and behavior. It’s impractical to label all points when there are more than seven response options. The mixed results we found from the literature suggest that keeping the number of points lower merely to enable full labeling will usually not be worth it.
Is it a rubric or a rating scale? The Krosnick and Fabrigar chapter was written before the proliferation of web-based surveys. On closer examination of the contexts of the studies cited by Krosnick and Fabrigar, some of the scales evaluated are similar to grading rubrics (e.g., Wallsten et al., 1993), an issue similar to the distinction between measurement of objective and subjective items.
Stay tuned for more research. Previous research into these types of controversies when conducting user experience research has often found little to no effect of differences in item formats (e.g., Lewis, 2018; Lewis, 2019; Lewis & Erdinç, 2017). But if you know MeasuringU, you know that we like to find out for ourselves in the context of user experience measurement. To investigate the effects of labeling on the more common five- and seven-point rating scales you’re likely to see on UX web surveys, we recently conducted two studies that we’ll report on in an upcoming article.