When we wrote Quantifying the User Experience, we put confidence intervals before tests of statistical significance. We generally find that fluency with confidence intervals is easier to achieve, and more valuable, than fluency with formal hypothesis testing. We also teach confidence intervals in our workshops on statistical methods.
Most people, even non-researchers, have been exposed to the concept of margins of error—political polls include them. Any estimate of a statistic, such as a percentage or a mean, is approximate due to random measurement error, and the margin of error quantifies that uncertainty with a given level of confidence.
Significance testing is much more difficult for people to wrap their heads around, even after many hours (or semesters) of training. Problems with the interpretation of the results of significance tests by professional scientists are frequent enough that over the past few decades there have been occasional calls to drop significance testing and to focus instead on effect sizes and confidence intervals.
Although we acknowledge their problems, we do not advocate eliminating significance testing because it can be a useful technique in the UX researcher’s toolkit. Tests of significance provide a principled way to divide results into two groups given the data at hand—those that could plausibly have happened by chance (not significant) and those that are not likely to have happened by chance (significant).
Despite its popularity, there is no well-founded method for significance testing of differences between Net Promoter Scores (NPS). The NPS is a popular loyalty metric calculated with a single likelihood-to-recommend (LTR) question (“How likely is it that you would recommend our company to a friend or colleague?”) that has 11 scale steps from 0 (Not at all likely) to 10 (Extremely likely). Respondents who select 9 or 10 on the LTR question are “Promoters,” those who select 0 through 6 are “Detractors,” and all others are “Passives.” The NPS is the percentage of Promoters minus the percentage of Detractors.
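In code, the score is straightforward to compute from raw 0–10 LTR ratings (a minimal sketch; the function name is our own):

```python
def net_promoter_score(ratings):
    """NPS as a proportion: share of promoters (9-10) minus share of
    detractors (0-6) among 0-10 likelihood-to-recommend ratings."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return (promoters - detractors) / len(ratings)

# Example: 4 promoters, 2 passives, 2 detractors -> NPS = (4 - 2)/8 = 0.25 (25%)
nps = net_promoter_score([10, 9, 9, 10, 8, 7, 6, 0])
```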
In two recent articles, we presented a relatively new adjusted-Wald method for computing NPS confidence intervals (Rocks, 2016) and demonstrated that it worked well with real-world NPS datasets. Fortunately, once you have an established method for computing confidence intervals for a measure, it’s possible (with a little algebra) to convert that into a test of significance.
In this article, we present a new significance test for the NPS based on the NPS adjusted-Wald confidence interval and describe how to compute a confidence interval around the difference between scores.
The New Test of Significance
Here are computational details for the test with a fully worked-out example. Note that to keep computations as simple as possible, we usually work with the NPS as proportions and only convert to percentages when reporting scores.
To use this method, you need to know the number of detractors, passives, and promoters for each NPS (which are usually available in company dashboards). The computational steps are
- Add 3 to the sample sizes: n1.adj = n1 + 3; n2.adj = n2 + 3.
- Add ¾ to the number of detractors: ndet1.adj = ndet1 + ¾; ndet2.adj = ndet2 + ¾.
- Add ¾ to the number of promoters: npro1.adj = npro1 + ¾; npro2.adj = npro2 + ¾.
- Compute the adjusted proportion of detractors: pdet1.adj = ndet1.adj/n1.adj; pdet2.adj = ndet2.adj/n2.adj.
- Compute the adjusted proportion of promoters: ppro1.adj = npro1.adj/n1.adj; ppro2.adj = npro2.adj/n2.adj.
- Compute the variances: var1.adj = ppro1.adj + pdet1.adj − (ppro1.adj − pdet1.adj)²; var2.adj = ppro2.adj + pdet2.adj − (ppro2.adj − pdet2.adj)².
- Compute the adjusted NPS: NPS1.adj = ppro1.adj − pdet1.adj; NPS2.adj = ppro2.adj − pdet2.adj.
- Compute the difference in adjusted NPS: NPS.diff = NPS1.adj − NPS2.adj.
- Combine the variances to get the standard error of the difference: se.diff = (var1.adj/n1.adj + var2.adj/n2.adj)½.
- For a two-tailed test, divide the absolute value of the difference by the standard error to get a Z score: Z = abs(NPS.diff)/se.diff.
- Assess the significance of the difference by getting the p-value for Z; in Excel, you can use the formula: =2*(1-NORM.S.DIST(Z,TRUE)).
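The steps above can be sketched in Python (our own sketch using only the standard library; the normal CDF is computed from the error function rather than Excel's NORM.S.DIST):

```python
import math

def nps_z_test(det1, pas1, pro1, det2, pas2, pro2):
    """Two-tailed Z-test for the difference between two Net Promoter
    Scores, using the adjusted-Wald steps described above."""
    n1 = det1 + pas1 + pro1 + 3                # add 3 to each sample size
    n2 = det2 + pas2 + pro2 + 3
    pdet1, pdet2 = (det1 + 0.75) / n1, (det2 + 0.75) / n2  # add 3/4 to detractors
    ppro1, ppro2 = (pro1 + 0.75) / n1, (pro2 + 0.75) / n2  # add 3/4 to promoters
    nps1, nps2 = ppro1 - pdet1, ppro2 - pdet2  # adjusted NPS for each sample
    var1 = ppro1 + pdet1 - nps1 ** 2           # adjusted variances
    var2 = ppro2 + pdet2 - nps2 ** 2
    se = math.sqrt(var1 / n1 + var2 / n2)      # standard error of the difference
    z = abs(nps1 - nps2) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-tailed p-value
    return z, p

# GoToMeeting (8 det, 13 pas, 15 pro) vs. WebEx (12, 12, 7) from the example below
z, p = nps_z_test(8, 13, 15, 12, 12, 7)
```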
For example, in a UX survey of online meeting services conducted in 2019, we collected likelihood-to-recommend ratings. For GoToMeeting (GTM), there were 8 detractors, 13 passives, and 15 promoters, making an NPS of 19% (n = 36). For WebEx, there were 12 detractors, 12 passives, and 7 promoters, for an NPS of −16% (n = 31). Table 1 shows the steps to compute the significance of their difference.
| Compute NPS and Variance | n.adj | ppro.adj | pdet.adj | NPS.adj | Var.adj |
|---|---|---|---|---|---|
| GTM | 39 | 0.404 | 0.224 | 0.179 | 0.596 |
| WebEx | 34 | 0.228 | 0.375 | −0.147 | 0.581 |

| Compute Z and p | NPS.diff | se.diff | Z | p |
|---|---|---|---|---|
| GTM vs WebEx | 0.33 | 0.180 | 1.815 | 0.07 |
With p = .07, the statistical significance of the test depends on the criterion determined before testing. The most common significance criterion in scientific publishing is p < .05, but it’s not uncommon in industrial research to use p < .10. This example illustrates that large differences can be differentiated from sampling error even at modest sample sizes for NPS surveys (the difference here was considerable: 33 percentage points).
As a sanity check on the value of p from this new Z-test, we also conducted (1) a standard t-test on the NPS trinomial mean computed after assigning −1 to each detractor, 0 to each passive, and +1 to each promoter, and (2) a randomization test on the same data (similar to the bootstrapping we did when comparing confidence intervals). For the t-test, the result was t(65) = 1.86, p = .07 (the same as the Z-test); for the randomization test, the result was p = .085 (a bit higher than the Z test, but still significant at p < .10).
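A randomization test of the kind mentioned above can be run by repeatedly permuting the group labels of the trinomial (−1, 0, +1) scores. Here is a rough sketch (the resample count and seed are our choices, so the exact p-value will vary slightly):

```python
import random

def nps_randomization_test(scores1, scores2, resamples=10_000, seed=1):
    """Two-tailed randomization (permutation) test on trinomial NPS scores
    (-1 = detractor, 0 = passive, +1 = promoter)."""
    rng = random.Random(seed)
    observed = abs(sum(scores1) / len(scores1) - sum(scores2) / len(scores2))
    pooled = scores1 + scores2
    n1 = len(scores1)
    extreme = 0
    for _ in range(resamples):
        rng.shuffle(pooled)  # randomly reassign group membership
        diff = sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / (len(pooled) - n1)
        if abs(diff) >= observed:
            extreme += 1
    return extreme / resamples  # proportion of resamples at least as extreme

# GTM: 8 detractors, 13 passives, 15 promoters; WebEx: 12, 12, 7
gtm = [-1] * 8 + [0] * 13 + [1] * 15
webex = [-1] * 12 + [0] * 12 + [1] * 7
p = nps_randomization_test(gtm, webex)
```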
Constructing a Confidence Interval Around the Difference
As we mentioned in the introduction, a test of significance is often a reasonable first step in an analysis, but it has limited utility for assessing the practical significance of the result. For that, you need a confidence interval around the difference. The steps are
- Find Z for the desired level of confidence: Common values are 1.96 for 95% confidence and 1.645 for 90% confidence.
- Compute the margin of error for the interval: MoE = Z(se.diff).
- Compute the confidence interval: NPS.diff ± MoE.
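The interval computation is a one-liner once the adjusted difference and its standard error are in hand (the values below are rounded figures from the GTM vs. WebEx example; this sketch is ours):

```python
def nps_diff_confidence_interval(nps_diff, se_diff, z=1.645):
    """Confidence interval around a difference in adjusted NPS.
    z = 1.645 gives 90% confidence; use z = 1.96 for 95%."""
    moe = z * se_diff                    # margin of error
    return nps_diff - moe, nps_diff + moe

# Adjusted difference and standard error from the GTM vs. WebEx example
lower, upper = nps_diff_confidence_interval(0.3265, 0.1799)
```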
Table 2 shows the results for a 90% confidence interval around the GTM and WebEx difference, and Figure 1 shows a graph of the interval.
| | NPS.diff | se.diff | Z | MoE | Lower limit | Upper limit |
|---|---|---|---|---|---|---|
| GTM vs WebEx | 0.33 | 0.180 | 1.645 | 0.30 | 0.03 | 0.62 |
Consistent with the obtained value of p = .07 from the test of significance, the 90% confidence interval doesn’t include 0, so 0 is not a plausible difference. The difference could plausibly be as low as 3% or as high as 62%.
A common strategy when working with confidence intervals is to ask if the practical decisions you’d make if the lower limit was true are the same as those you’d make if the upper limit was true. If that is the case, your confidence interval is precise enough for your purposes. If not, then you need to collect more data to increase its precision.
Summary and Discussion
Based on earlier research investigating the best way to construct confidence intervals for the NPS, we have developed a new way to assess the statistical significance of the difference between two Net Promoter Scores.
We applied the method using data collected in a 2019 UX survey of online meeting services, focusing on GoToMeeting (n = 36; NPS = 19%) and WebEx (n = 31; NPS = −16%), and found the difference to be statistically significant at the p < .10 criterion (p = .07).
That result was consistent with the p-values from two other methods that have been used to compare Net Promoter Scores. A t-test on the trinomial mean after assigning −1, 0, and +1 to detractors, passives, and promoters, respectively, also had p = .07. The p-value from a randomization test was a bit higher (p = .085) but close to the other methods.
Because a test of significance by itself has limited utility for assessing practical significance, we also demonstrated how to construct a confidence interval around the difference in the Net Promoter Scores for GoToMeeting and WebEx.
In a future article, we plan to draw upon our real-world NPS datasets to explore how well this new method works with a greater variety in sample size and magnitude of the difference in NPS.
Appendix: The Algebra Behind the Test
Here’s the algebra behind this new test of significance for the NPS.
Our starting point is the known formula for the variance of the difference of two proportions applied to Net Promoter Scores (see Rocks, 2016):
var = ppro + pdet − (ppro − pdet)², where ppro and pdet are the observed proportions of promoters and detractors in one sample.
And the adjustments for Rocks’ (3, T) version of the adjusted-Wald:
n.adj = n + 3
ppro.adj = (npro + ¾)/n.adj
pdet.adj = (ndet + ¾)/n.adj
nps.adj = ppro.adj − pdet.adj
To get the variance after adjustments, substitute the adjusted values in the standard formula for the variance of the difference in two proportions:
var.adj = ppro.adj + pdet.adj − (ppro.adj − pdet.adj)²
The standard error for a difference between two samples is the square root of the sum of the variances divided by their respective sample sizes:
se.diff = (var1.adj/n1.adj + var2.adj/n2.adj)½
Finally, the Z-score is the ratio between an observed difference and the standard error of the difference:
nps.diff = nps1.adj − nps2.adj
Z = nps.diff/se.diff