The Net Promoter Score (NPS) is widely used by organizations. It’s often used to make high-stakes decisions on whether a brand, product, or service has improved or declined.
Net Promoter Scores are often tracked on dashboards, and any changes (for better or worse) can have significant consequences: adding or removing features, redirecting budgets, even affecting employee bonuses. Random sampling error, however, can explain many of these changes, and organizations don’t want to be fooled by randomness. Two approaches that help differentiate the signal (meaningful changes) from the noise of sampling error are confidence intervals and significance tests.
We recently described three methods for computing confidence intervals for Net Promoter Scores. (For full computational details, see “Confidence Intervals for Net Promoter Scores.”) One was based on adjusted-Wald estimates, one was based on trinomial means, and one used bootstrapping. The adjusted-Wald method had accurate coverage and—especially for smaller sample sizes—produced more precise (narrower) intervals.
After completing our research on NPS confidence intervals, we worked out a method for testing the statistical significance of the difference between two NPS datasets based on the formula for the adjusted-Wald confidence interval. (For computational details, see “How to Statistically Compare Two Net Promoter Scores.”) Our preliminary evaluation of that method was promising, but it was limited to one test case.
For this article, we compare this new statistical test with two other methods using real-world NPS data that we collect every two years in UX surveys of consumer applications.
Comparison of the Three Test Methods
The three methods we used were:
- Adjusted-Wald: For each set of NPS data, add a constant of 3 to the sample size (n), ¾ to the number of detractors, and ¾ to the number of promoters. Then compute the standard error based on the variance of the difference in two proportions. Combine the standard errors for each adjusted NPS to get the standard error of the difference, and then divide the difference in the two adjusted NPS by that standard error to get a Z-score. Use the Z-score to determine the two-tailed p-value for the test.
- Trinomial means: For each set of NPS data, assign −1 to detractors, 0 to passives, and +1 to promoters. Use the means and standard errors for each NPS to compute a t-score. Use the t-score to determine the two-tailed p-value for the test.
- Randomization test: For each set of data, assign −1 to detractors, 0 to passives, and +1 to promoters. Pool the two samples, then for at least 1,000 iterations (preferably more), shuffle the pooled scores, split them into two groups of the original sample sizes, and recompute the difference in the NPS, storing each difference in an array. The two-tailed p-value for the test is the proportion of differences in the array whose absolute value is greater than the absolute value of the observed difference.
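The first two procedures above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, each sample is passed as a (promoters, passives, detractors) tuple, the variance term is the usual multinomial variance of a difference between two proportions from the same sample, and the Welch degrees of freedom returned with the t-score are an assumption because the description does not say which degrees of freedom to use.

```python
import math
from statistics import NormalDist


def adjusted_wald_nps(promoters, passives, detractors):
    """Adjusted NPS and its standard error for one sample."""
    n_adj = promoters + passives + detractors + 3  # add 3 to n
    p_pro = (promoters + 0.75) / n_adj             # add 3/4 to promoters
    p_det = (detractors + 0.75) / n_adj            # add 3/4 to detractors
    nps_adj = p_pro - p_det
    # Assumed variance form: difference of two proportions from one sample.
    variance = p_pro + p_det - nps_adj ** 2
    return nps_adj, math.sqrt(variance / n_adj)


def adjusted_wald_test(sample1, sample2):
    """Z-score and two-tailed p-value for the difference in adjusted NPS."""
    nps1, se1 = adjusted_wald_nps(*sample1)
    nps2, se2 = adjusted_wald_nps(*sample2)
    z = (nps1 - nps2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def trinomial_means_t(sample1, sample2):
    """t-score and Welch degrees of freedom (an assumption) for -1/0/+1 scores."""
    def mean_var_n(promoters, passives, detractors):
        n = promoters + passives + detractors
        mean = (promoters - detractors) / n
        # Sum of squared scores is promoters + detractors (passives add 0).
        var = (promoters + detractors - n * mean ** 2) / (n - 1)
        return mean, var, n

    m1, v1, n1 = mean_var_n(*sample1)
    m2, v2, n2 = mean_var_n(*sample2)
    se1_sq, se2_sq = v1 / n1, v2 / n2
    t = (m1 - m2) / math.sqrt(se1_sq + se2_sq)
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    return t, df
```

For example, with 11 promoters, 16 passives, and 8 detractors in one sample and 13, 7, and 13 in the other, `adjusted_wald_test((11, 16, 8), (13, 7, 13))` gives a p-value of roughly 0.67. The t-score from `trinomial_means_t` can be converted to a two-tailed p-value with any t-distribution routine, for example `2 * scipy.stats.t.sf(abs(t), df)` if SciPy is available.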
Every two years we research select business and consumer software, most recently in 2020. As part of that research, we collected likelihood-to-recommend ratings and computed the NPS for each product. Table 1 shows the sample sizes; number of detractors, passives, and promoters; and the NPS for 17 of the most recently evaluated consumer products (with data from over 1,000 respondents). The NPS ranged from −53% to 43%, and the sample sizes ranged from 29 to 111. We organized the products into three groups based on their sample sizes (small: 29–35; medium: 49–50; and large: 101–111) and paired them so we would have three comparisons in each sample-size group with variation in the magnitude of the NPS differences (|d| in Table 1).
| Product 1 | Promoters | Passives | Detractors | n | NPS | Product 2 | Promoters | Passives | Detractors | n | NPS | \|d\| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Video Editor A | 11 | 16 | 8 | 35 | 9% | Music Service A | 13 | 7 | 13 | 33 | 0% | 9% |
| Language App | 16 | 11 | 8 | 35 | 23% | Video Editor B | 9 | 11 | 10 | 30 | −3% | 26% |
| Tax Prep | 14 | 7 | 8 | 29 | 21% | Email A | 6 | 6 | 19 | 31 | −42% | 63% |
| Music Service B | 22 | 15 | 13 | 50 | 18% | PDF Program | 19 | 17 | 14 | 50 | 10% | 8% |
| Browser A | 29 | 12 | 8 | 49 | 43% | Music Service B | 22 | 15 | 13 | 50 | 18% | 25% |
| Finance App | 13 | 20 | 17 | 50 | −8% | Email B | 7 | 9 | 33 | 49 | −53% | 45% |
| Browser B | 46 | 33 | 32 | 111 | 13% | App Suite B | 47 | 39 | 22 | 108 | 23% | 10% |
| App Suite A | 53 | 34 | 14 | 101 | 39% | Slides App | 48 | 24 | 30 | 102 | 18% | 21% |
| Email C | 61 | 30 | 16 | 107 | 42% | Word Processor | 39 | 38 | 32 | 109 | 6% | 36% |
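As a quick check on Table 1, each NPS is the percentage of promoters minus the percentage of detractors. The sketch below assumes the three count columns are promoters, passives, and detractors in that order (the order that reproduces the printed scores); the function name is illustrative.

```python
def nps(promoters, passives, detractors):
    """Net Promoter Score as a proportion: share of promoters minus share of detractors."""
    n = promoters + passives + detractors
    return (promoters - detractors) / n


# Tax Prep vs. Email A, using the counts from Table 1.
tax_prep = nps(14, 7, 8)
email_a = nps(6, 6, 19)
abs_diff = abs(tax_prep - email_a)

print(f"{tax_prep:.0%} {email_a:.0%} {abs_diff:.0%}")  # 21% -42% 63%
```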
Table 2 shows the results (p-values) for the three methods and nine comparisons.
| Comparison | n1, n2 | \|d\| | Adjusted-Wald | Trinomial Means | Randomization Test |
|---|---|---|---|---|---|
| Video Editor A/Music Service A | 35, 33 | 9% | 0.67 | 0.67 | 0.77 |
| Language App/Video Editor B | 35, 30 | 26% | 0.20 | 0.20 | 0.22 |
| Tax Prep/Email A | 29, 31 | 63% | 0.00 | 0.00 | 0.01 |
| Music Service B/PDF Program | 50, 50 | 8% | 0.63 | 0.63 | 0.72 |
| Browser A/Music Service B | 49, 50 | 25% | 0.13 | 0.12 | 0.14 |
| Finance App/Email B | 50, 49 | 45% | 0.00 | 0.00 | 0.01 |
| Browser B/App Suite B | 111, 108 | 10% | 0.33 | 0.33 | 0.35 |
| App Suite A/Slides App | 101, 102 | 21% | 0.06 | 0.06 | 0.07 |
| Email C/Word Processor | 107, 109 | 36% | 0.00 | 0.00 | 0.00 |
We expected that small differences would generally not be statistically significant and that large differences generally would, with larger sample sizes having more power than smaller sample sizes. The patterns of p-values in Table 2 were consistent with this expectation. For example, differences of 36 and 45 percentage points were statistically significant at respective sample sizes around 100 and 50 per product, while differences of 8 to 10 points weren’t statistically significant at any of the sample sizes, even those over 100.
The p-values produced by the adjusted-Wald and Trinomial Means methods were surprisingly close. For seven of the nine comparisons, they were the same. For one comparison, the adjusted-Wald was .01 larger than the Trinomial Means method, and for the other comparison, the adjusted-Wald was .01 smaller than the Trinomial Means method.
The p-values for the Randomization Test method were about .10 larger than the others when sample sizes were small to medium and the difference in the NPS was less than 10%. For larger differences and sample sizes, its p-values were closer to those of the other methods, but it was only equal to the others when both the sample size and the NPS difference were large or when the NPS difference was very large (63%).
Even though the p-values for the Randomization Test method tended to be larger than the others when sample sizes or the NPS differences were smaller, all three methods produced consistent decisions regarding whether a difference was statistically significant (p < .05 or p < .10) for these nine comparisons.
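For readers who want to try the randomization approach themselves, the procedure described in the list of methods can be sketched as follows. This is an illustrative sketch rather than the implementation used for Table 2: the function name, the default iteration count, and the `seed` parameter are assumptions, and each sample is passed as a (promoters, passives, detractors) tuple.

```python
import random


def randomization_test(sample1, sample2, iterations=10_000, seed=None):
    """Two-tailed randomization p-value for the difference between two NPS."""
    rng = random.Random(seed)

    def scores(promoters, passives, detractors):
        # Code promoters as +1, passives as 0, detractors as -1.
        return [1] * promoters + [0] * passives + [-1] * detractors

    x1, x2 = scores(*sample1), scores(*sample2)
    n1, n2 = len(x1), len(x2)
    observed = abs(sum(x1) / n1 - sum(x2) / n2)

    pooled = x1 + x2
    diffs = []
    for _ in range(iterations):
        rng.shuffle(pooled)  # reassign respondents to the two groups at random
        diffs.append(sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / n2)

    # Count shuffled differences strictly more extreme than the observed one,
    # following the description above (some implementations use >= instead).
    extreme = sum(1 for d in diffs if abs(d) > observed)
    return extreme / iterations
```

Because the shuffles are random, the p-value changes slightly from run to run unless a seed is fixed, and it need not match the published values exactly. More iterations reduce that run-to-run variation at the cost of computing time.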
Summary and Discussion
Using real-world NPS data to make nine comparisons, we evaluated three methods of testing the differences between two NPS datasets. Data came from surveys of 17 consumer applications and 1,079 respondents.
We found that a new method based on the formula for adjusted-Wald NPS confidence intervals worked as well as a method based on NPS trinomial means. Both of these methods had better performance than a randomization test, especially when sample sizes and the NPS differences were smaller. The randomization test makes fewer assumptions about the distribution of the NPS scores, but that seems to make it more conservative than the other methods.
For researchers who need only tests of significance, the method based on NPS trinomial means works surprisingly well. Because randomization tests are more conservative in many cases and otherwise have similar results, we do not recommend routinely using them to assess the NPS differences.
Based on these results, we recommend that UX researchers who need to conduct a test of significance on pairs of NPS scores use the adjusted-Wald method, especially when used in conjunction with adjusted-Wald confidence intervals.