Morgan and Rego are back, this time in Pingitore et al. (2007).
The authors acknowledged that asking Likelihood to Recommend has merits; it’s just that businesses may mistakenly think it is the ONLY measure that predicts financial performance (they offer no citation for this concern). In examining JD Power’s proprietary data, the authors claimed that “net metrics of any kind are usually not the strongest voice-of-the-customer metrics.”
The authors examined JD Power data from 2005–2006 in multiple industries, each with at least nine companies: auto insurance (24 companies), full-service investment (21 firms), airline (10 carriers), and rental car (9 brands).
They looked at several measures, scored using either the standard mean or “net” scoring: a ten-point net “delighted” item (anchored from outstanding to displeased), net satisfaction (top-two box minus bottom-five box), a four-item net committed scale, and a four-point version of the NPS.
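To make the scoring distinction concrete, here is a minimal Python sketch of mean versus “net” scoring. The response data are hypothetical; the 9–10 promoter and 0–6 detractor cut points are the standard 11-point NPS convention, and the top-two/bottom-five thresholds follow the authors’ description of their net satisfaction measure.

```python
# Sketch: mean scoring vs. two "net" scoring schemes on hypothetical ratings.
def mean_score(ratings):
    """Standard mean of the raw ratings."""
    return sum(ratings) / len(ratings)

def nps_net(ratings):
    """Standard 11-point NPS: % promoters (9-10) minus % detractors (0-6)."""
    n = len(ratings)
    promoters = sum(r >= 9 for r in ratings) / n
    detractors = sum(r <= 6 for r in ratings) / n
    return 100 * (promoters - detractors)

def net_satisfaction(ratings):
    """Net satisfaction as the authors describe it: top-two box
    minus bottom-five box, here applied to a 10-point scale."""
    n = len(ratings)
    top_two = sum(r >= 9 for r in ratings) / n
    bottom_five = sum(r <= 5 for r in ratings) / n
    return 100 * (top_two - bottom_five)

ratings = [10, 9, 9, 8, 7, 6, 3, 10]  # hypothetical 0-10 LTR responses
print(mean_score(ratings))        # 7.75
print(nps_net(ratings))           # 25.0 (50% promoters - 25% detractors)
print(net_satisfaction(ratings))  # 37.5
```

Note how the same responses yield quite different numbers depending on the scheme, which is part of why comparing metrics across scoring methods is tricky.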
We’ve tested five- and ten-point versions of the NPS and found that they provide relatively similar scores; however, a three-point will-recommend item performed noticeably worse. It’s unclear how well a four-point version would perform. The authors noted this weakness but included a full 11-point version for the full-service investment firms.
The authors collected financial metrics for each industry using historical data (2002–2005). They used a mix of sources and metrics: For auto insurance companies they used retention rates and customer acquisition costs. For airlines they used three-year change in revenue. For full-service investment firms they used self-reported metrics, including “Share of Wallet” data and self-reported amount/frequency of investing.
The correlation between NPS 4 and NPS 11 in the full-service investment industry was high (r = .94, calculated by me); however, the differences in scores were quite large. The mean difference between the two scores was 22 percentage points, with the 11-point NPS always scoring lower, by between 10 and 44 points. This is more than five times as large as the differences we observed when comparing 5-point to 11-point and 10-point to 11-point versions, where the average difference was 4 points, with a high of 10. It is, however, similar to our analysis of three-point recommend scales, which had an average difference of 22 points in two large-sample studies. This suggests the four-point, like the three-point, is not a good substitute for the 11-point version.
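The pattern of a high correlation despite consistently lower scores can be reproduced with a small sketch. The firm-level scores below are hypothetical, chosen only to mimic an offset between the two versions, not the article’s data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical firm-level scores: the 4-point NPS runs ~20 points
# higher than the 11-point NPS, yet the two track each other closely.
nps4 = [60, 55, 50, 45, 40]
nps11 = [38, 35, 28, 24, 18]
r = pearson_r(nps4, nps11)
# r comes out high (≈ .99) even though every nps11 score is much lower:
# a strong correlation alone doesn't make one version a substitute
# for the other when the absolute scores diverge this much.
```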
For full-service investment, the standard 11-point NPS performed comparably to, or in some cases better than, several other measures, including multi-item satisfaction measures. In many cases, using the mean of the 11-point LTR performed slightly better, with a modest mean absolute difference in correlations of .07 (without applying the Fisher transformation). The NPS also performed slightly better than a multi-item satisfaction index, again with a modest mean absolute difference of .06.
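For readers unfamiliar with it, the Fisher transformation mentioned above maps correlations onto a scale where differences between them are more comparable in size. A minimal sketch, using two illustrative correlations rather than the article’s exact values:

```python
import math

def fisher_z(r):
    """Fisher's z-transformation of a correlation: arctanh(r).
    It stretches the ends of the -1..1 range, so differences between
    strong correlations look bigger on the z scale than on the raw scale."""
    return math.atanh(r)

# Illustrative values: a raw difference of .07 between two strong
# correlations grows to about .17 on the z scale.
r1, r2 = 0.80, 0.73
raw_diff = r1 - r2                    # 0.07
z_diff = fisher_z(r1) - fisher_z(r2)  # ≈ 0.17
```

Averaging absolute differences on the raw scale (as I did above) slightly understates how different strong correlations are; the z scale corrects for this.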
In the rental car industry, the authors concluded that the best predictor of revenue growth was Net Satisfaction, and in the airline industry a multi-item satisfaction index performed best. However, an examination of their correlations in Exhibit 3 shows that the mean of the four-point Likelihood-to-Recommend item had the highest correlation in the rental car industry (r = .773, versus r = .691 for Net Satisfaction). The airline data does match the prose, with a correlation of r = .875 for the Sat Index compared to r = .824 for the four-point NPS. They did not collect the 11-point LTR for these two industries.
For the auto insurance industry, the raw LTR (r = -.52) was the best predictor, followed by the NPS 4, overall satisfaction, and the satisfaction index (all at r = -.43). For actual retention rates, LTR 4 (r = .60) and NPS 4 (r = .59) were the worst relative predictors, underperforming the best satisfaction measure (r = .77) by .17.
The authors point out that net scores require much larger sample sizes to achieve equivalent margins of error (something we discussed earlier, too). However, it could be that whatever precision is lost by not using the mean is worth shedding.
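The sample-size point can be illustrated with approximate 95% margins of error. This is a rough sketch with assumed proportions, standard deviation, and sample size, not a reanalysis of the article’s data.

```python
import math

def nps_margin_of_error(p_promoter, p_detractor, n):
    """Approximate 95% margin of error for a net score
    (p_promoter - p_detractor), using the multinomial variance of a
    difference of proportions estimated from the same sample."""
    nps = p_promoter - p_detractor
    var = (p_promoter + p_detractor - nps ** 2) / n
    return 1.96 * math.sqrt(var)

def mean_margin_of_error(sd, n):
    """Approximate 95% margin of error for a sample mean with
    standard deviation sd."""
    return 1.96 * sd / math.sqrt(n)

n = 400  # assumed sample size
moe_net = nps_margin_of_error(0.45, 0.20, n)  # ≈ 0.075, i.e. ±7.5 NPS points
moe_mean = mean_margin_of_error(2.0, n)       # ≈ ±0.20 on a 0-10 scale
# Relative to each metric's full range (200 NPS points vs. 10 scale
# points), the net score's margin of error is roughly twice as large
# here, so matching its precision takes a substantially larger sample.
```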
While one reading of this article is that the NPS isn’t universally the best predictor of all growth metrics, another interpretation is that many metrics, in both mean and “net” form, will predict future performance. There is variation in which one is best, but rarely is one so different from another that it’s worth switching measures.
Note that Gina Pingitore, at the time of writing, was the chief research officer at JD Power and Associates, a competitor to Satmetrix, and would have had a strong interest in promoting its own satisfaction metrics.
Takeaways: Their 4-point and 11-point versions of the NPS correlated with historical business metrics, but neither was always the best predictor. In some cases, the NPS was the best; in others, it was the worst; and in some, it performed better than a multi-item measure. Using only four points for their NPS item may have confounded their findings.