Your computer crashes at the worst times.
Your friend doesn’t show up to your dinner party.
If something or someone isn’t reliable, it’s not only a pain but it makes your life less effective and less efficient.
And what is true for people and products is true for measurement. The wording of items and the response options we use to gauge the attitudes and expected behaviors of people need to be reliable.
Reliability assesses how consistently people respond to the same items. A measure needs to be reliable to be valid—measuring what it intends to measure. That is, you can’t say you have a valid measure of satisfaction, ease of use, or loyalty if the measure isn’t AT LEAST consistent each time you use it.
There isn’t a single way to measure reliability. Instead, the four most common ways of measuring reliability are:
- Inter-rater reliability: Different evaluators rate the same phenomenon.
- Test-retest reliability: The same people respond to the same items at different times.
- Parallel forms reliability: Slightly different versions of items are administered to the same or different people.
- Internal consistency reliability: The correlations among the items intended to measure the same construct are examined.
By far the most common way to measure the reliability of questionnaires is using internal consistency reliability as measured by Cronbach’s alpha. It’s not because this is necessarily the best method. It’s quite simply because it’s the easiest to derive for two reasons: you only need one set of data from one set of respondents, and it’s computed relatively easily using statistical packages like SPSS or R (the SUS and SUPR-Q calculators include it).
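To show why one set of data is enough, here is a minimal sketch of the standard Cronbach’s alpha formula in plain Python. The function name and the ratings are hypothetical, made up purely for illustration; a statistical package would normally do this for you.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per respondent)."""
    k = len(responses[0])  # number of items in the questionnaire
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    items = list(zip(*responses))  # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(scores) for scores in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical data: 5 respondents answering 3 items on a 5-point scale
ratings = [
    [5, 4, 5],
    [3, 3, 4],
    [4, 4, 4],
    [2, 3, 2],
    [5, 5, 4],
]
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

Note that the function refuses a single-item input, which is exactly the limitation that affects single-item measures.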
But internal consistency reliability, by definition, needs more than one item to compute. In fact, the best way to increase internal consistency reliability is to add more items. This makes it difficult to gauge the reliability of the Net Promoter Score, the Single Ease Question, and other single-item measures from a single sample. For this reason, there’s no published measure of the Net Promoter Score’s reliability (from what I could find). The best way to measure its reliability is to correlate participant responses after some amount of time, that is, to compute the test-retest reliability.
Test-Retest Reliability of the NPS
To measure the test-retest reliability of the Net Promoter Score, we asked 2,529 U.S.-based online participants to reflect on their experiences with one of the top 50 brands and to rate their attitude toward the brand (7-point scale), satisfaction with the brand (7-point scale), and likelihood to recommend it (11-point scale used to compute the Net Promoter Score).
We then selected 19 of those brands for retesting. A total of 981 participants were invited to answer the exact same survey about the same brands starting 17 days later. Of these, we received 259 completed responses, which works out to a response rate of 26% of those invited and 10% of the original sample. The average time between surveys was 30 days, with a range of 17 to 47 days. For our analysis, we used the mean scores of items and didn’t use the Net Promoter scoring system of promoters minus detractors (a topic for another article).
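The response rates quoted above follow directly from the three counts in the text. A quick check of the arithmetic (the variable names are ours):

```python
original_sample = 2529  # participants in the first survey
invited = 981           # participants invited to the retest
completed = 259         # completed retest responses

rate_of_invited = completed / invited
rate_of_original = completed / original_sample
print(f"{rate_of_invited:.0%} of invited, {rate_of_original:.0%} of original sample")
```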
In general, we found reasonably high correlations between participant responses over time. The correlations in responses between the two survey times for the three measures are:
- Likelihood to recommend: r = .75
- Satisfaction with the brand: r = .70
- Attitude toward the brand: r = .69
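A test-retest correlation like these is a Pearson correlation between each person’s first and second response. The sketch below shows the calculation in plain Python. The ratings are hypothetical (not our survey data), and the function name is ours.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of paired scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical likelihood-to-recommend ratings (0 to 10) from the same
# eight people at two points in time, listed in the same order
ltr_time1 = [10, 8, 7, 9, 3, 6, 10, 5]
ltr_time2 = [9, 8, 6, 10, 4, 5, 9, 7]

r = pearson_r(ltr_time1, ltr_time2)
print(f"test-retest r = {r:.2f}")
```

The pairing matters: each position in the two lists must belong to the same respondent, which is why retest studies need a way to match people across surveys.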
The mean likelihood-to-recommend score changed by a negligible and not statistically significant 1.2%. That’s encouraging considering that brand attitude, and consequently the tendency to recommend, is likely to shift over a 30-day period. In fact, the brand attitude scores also increased by the same 1.2% between periods. Satisfaction increased a bit more at 3.4%, which was statistically significant.
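For readers who want to run this kind of comparison on their own data, here is a rough sketch of a paired comparison between two time points. It rests on two assumptions that the text above doesn’t spell out: that a paired t-test is the significance test, and that percent change is expressed relative to the first mean. The data and function name are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_change(time1, time2):
    """Return mean difference, percent change, t statistic, and degrees of freedom."""
    if len(time1) != len(time2) or len(time1) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [b - a for a, b in zip(time1, time2)]  # change per respondent
    mean_diff = mean(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd / sqrt(len(diffs)))
    pct_change = mean_diff / mean(time1)  # assumption: relative to the first mean
    return mean_diff, pct_change, t_stat, len(diffs) - 1

# Hypothetical ratings from the same eight people at two points in time
first = [10, 8, 7, 9, 3, 6, 10, 5]
second = [10, 9, 7, 9, 4, 6, 10, 6]

mean_diff, pct_change, t_stat, df = paired_change(first, second)
print(f"mean diff = {mean_diff:.3f}, change = {pct_change:.1%}, t({df}) = {t_stat:.2f}")
```

The t statistic is then compared against a t distribution with n - 1 degrees of freedom to judge significance.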
There were larger fluctuations within each brand (see Table 1). Three brands had mean likelihood-to-recommend (LTR) scores change by more than 20% (American Express, Louis Vuitton, and Nissan). The sample sizes within each brand were relatively small, though, ranging from 6 to 25. Only the mean LTR score difference for American Express was statistically significant. Interestingly, there was no correlation between the time between surveys and the difference in LTR values.
| Brand | Sample Size | Mean Difference | Mean LTR 1 | Mean LTR 2 | % Change |
As a check on the reliability of our participant responses, we compared the responses to demographic questions from both points in time. We found that responses to gender, education, and age were almost identical between the two surveys, matching 99%, 97%, and 96% of the time respectively (deviations of 4% or less). That’s good because it’s unlikely these demographics would change. Reported income, however, did change a bit, with 85% of respondents reporting the same income bracket in both time periods. This 15% difference in reported income is likely a combination of incomes actually changing, variability in people’s estimates, and participants being less consistent (and honest) about their income.
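A consistency check like this one is simply the share of respondents who gave the same answer both times. A minimal sketch, with made-up income brackets and a hypothetical function name:

```python
def percent_agreement(first, second):
    """Share of respondents who gave the identical answer at both time points."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)

# Hypothetical income brackets reported by the same five people
income_time1 = ["<50k", "50-100k", "50-100k", ">100k", "<50k"]
income_time2 = ["<50k", "50-100k", ">100k", ">100k", "<50k"]

agreement = percent_agreement(income_time1, income_time2)
print(f"{agreement:.0%} reported the same bracket")
```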
What Is Good Test-Retest Reliability?
The magnitude of the correlation (r = .75) in LTR scores after a month seems relatively high compared to other correlations in the behavioral sciences. But is it high enough as a measure of reliability? Unfortunately, there are few standards for judging the minimum acceptable value for test-retest reliability. For example, retest reliability coefficients of r = .70 have been described in the healthcare literature as “acceptable,” “good,” “adequate,” “highly reliable,” and “satisfactory.” An early review of customer satisfaction research on single-item scales found typical test-retest reliability between r = .55 and r = .84.
The reason it’s difficult to understand what’s acceptable or excellent is a function of at least three things:
- The impact of the measure: A college entrance exam has a higher impact than a brand questionnaire and you’d want more reliability.
- The time between measurements: For example, ten minutes versus ten weeks.
- How stable the attribute being measured is: Some things will inherently change over time.
Of course, compared to the other two single items used in this analysis (brand attitude and satisfaction with the brand), the likelihood to recommend question exhibited the same or better reliability. So while there are valid concerns about using the NPS as a measure of loyalty, the reliability of the underlying item “How likely are you to recommend the brand to a friend or colleague” asked on an 11-point scale should be of less concern.
When thinking about the reliability of the Net Promoter Score (or any single item measure), consider the following from this analysis of consumer attitudes toward popular brands over time:
- Use test-retest reliability for single items. To measure the reliability of single items, correlate the same person’s responses to the same items after some period of time (called test-retest reliability).
- The Net Promoter Score’s test-retest reliability was relatively high (r = .75) and higher than that of the single-item measures of satisfaction and brand attitude.
- There’s no standard but r = .75 seems reasonable. There is little consensus on what constitutes “adequate” or “good” reliability, but a correlation of r = .75 certainly places it as high as many other measures in the literature.
- Likelihood to recommend isn’t static. People’s attitude toward a brand isn’t fixed. Some amount of change is expected (from news events or direct experiences with the brand). The generally small differences in LTR scores and high correlation were, therefore, encouraging after one month of time.
- Mean LTR scores differed by only 1.2%. The difference was not statistically significant; it was smaller than the change in satisfaction with the brand and the same as the change in brand attitude. Differences within brands varied by as much as 30%, but the per-brand samples were smaller and only one brand’s LTR score (American Express) changed by a statistically significant amount.
- A future analysis can compare the single LTR item to a multi-item measure of loyalty, compare their reliability, and examine the consequences of using the top-two-box minus bottom-seven-box scoring system.
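To make the difference between the two scoring approaches concrete, the sketch below computes both the mean LTR (what this analysis used) and the Net Promoter Score from the same ratings. It assumes the standard cutoffs on the 0 to 10 scale: promoters are 9 to 10 and detractors are 0 to 6. The ratings and function name are hypothetical.

```python
from statistics import mean

def net_promoter_score(ratings):
    """Percent promoters (9-10) minus percent detractors (0-6) on a 0-10 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)

# Hypothetical likelihood-to-recommend ratings from eight respondents
ltr_ratings = [10, 9, 9, 8, 7, 6, 10, 3]

nps = net_promoter_score(ltr_ratings)
mean_ltr = mean(ltr_ratings)
print(f"NPS = {nps:.0f}, mean LTR = {mean_ltr:.2f}")
```

Because the Net Promoter scoring collapses eleven response options into three groups, a respondent who moves from 8 to 9 changes the score while one who moves from 0 to 6 does not, which is one reason the two approaches may differ in reliability.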