Sample size estimation is a critical step in research planning, including when you’re trying to detect differences in measures like Net Promoter Scores.

Too small of a sample and you risk not being able to differentiate real differences from sampling error. Too large of a sample and you risk wasting resources—researchers’ and respondents’ time and, likely, participant costs.

Until recently, there has been no well-defined method for estimating sample sizes for Net Promoter Score (NPS) studies. In a previous article, we provided a sample size estimation method for NPS precision studies. These studies focus on adjusted-Wald confidence intervals for NPS with a specified margin of error.

Drawing on articles we’ve published that describe and evaluate a significance test for NPS that uses adjusted-Wald proportions, this article provides a method for the more complex problem of sample size estimation when the research plan includes statistical comparison of two independent NPS.

## Sample Size Estimation for Comparison of Two NPS

We’ve written descriptions of statistical hypothesis testing, covering the basics and what can go wrong. The primary goal of sample size estimation is to ensure a large enough sample to be able to detect a difference if one exists. This step is necessary to control Type I and Type II errors.

The sample size estimation formula for comparing two NPS is

n = s^{2}Z^{2}/d^{2} − 3

In the formula,

- d is the minimum difference with which you can reject the null hypothesis.
- s is an estimate of the standard deviation.
- Z is a Z-score whose value is the sum of two component Z-scores: one controls the likelihood of Type I errors (Z_{α}, where 1−α is the confidence level) and the other controls the likelihood of Type II errors (Z_{β}, where 1−β is the level of power).

This formula calculates the sample size for one group (n), so the total sample size requirement for a study comparing two NPS will be 2n. It’s similar to other formulas for computing sample sizes. (See Chapter 6 in *Quantifying the User Experience*.)

The process for computing a sample size starts with identifying a reasonable minimum difference you hope to detect. Then we work backward from the adjusted-Wald formula, which we used for the confidence interval around an NPS difference, by solving for the sample size algebraically. (If you’re interested and want to check our math, see the appendix.) The NPS is usually reported as a percentage, but to keep computations simple in this article, sometimes we use mathematically equivalent proportions.

At a high level, the sample size estimation process has four steps:

- Start by deciding the minimum difference you need to detect (d).
- Estimate the standard deviation (s = the square root of the variance).
- Determine the appropriate level of confidence (Z_{α}).
- Determine the appropriate level of power (Z_{β}).

After completing these steps, compute the sample size or use lookup tables to avoid messy math. Next, we’ll go over each of the four steps in detail, including three ways to estimate the standard deviation (variance).
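The four steps feed directly into the sample size formula. Here's a minimal sketch in Python (the function name and parameter defaults are ours, not from the article):

```python
import math

def nps_comparison_n(d, variance, z_alpha, z_beta=0.0):
    """Per-group sample size for comparing two NPS: n = s^2 * Z^2 / d^2 - 3.

    d        -- minimum difference to detect, as a proportion (e.g., 0.10)
    variance -- estimated s^2 (2 for maximum variance, 1.34 for maximum
                realistic variance, or var1.adj + var2.adj from prior data)
    z_alpha  -- Z for the confidence level (e.g., 1.96 for 95% confidence)
    z_beta   -- Z for power (0 for 50% power, 0.842 for 80% power)
    """
    z = z_alpha + z_beta
    return math.ceil(variance * z**2 / d**2 - 3)  # always round up
```

Doubling the result gives the total sample size for both groups.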

## 1. Decide the Minimum Difference to Detect (d).

This step is the most important, and it often has the biggest effect on the sample size needed. You need to decide on the minimum difference in percentage points between whatever two Net Promoter Scores you’re comparing (e.g., Q1 vs. Q2 NPS; your company’s NPS vs. a competitor’s). It can be helpful to think in terms of small, medium, and large differences. While there’s no official designation, as a rough guide, large differences would be something like 25 percentage points or more (e.g., 50% vs. 25%), medium differences between 10 and 25 percentage points, and small differences less than 10 percentage points. Of course, context matters.

The smaller the difference you hope to detect, the larger the sample size you will need. If you need to show that your NPS improved by as little as 3%, you’re going to need a huge sample size (see table and discussion below).

## 2. Estimate the Standard Deviation (s).

The standard deviation (the square root of the variance) is the most common way of measuring variability. Usually, people don’t know the standard deviation ahead of time. However, because the NPS is bound between −100% and 100%, and we have a growing NPS database that includes standard deviations, we can come up with good estimates. There are three approaches:

- A conservative approach assumes the greatest possible variability (max variance) and will require a large sample size.
- A less conservative approach based on a more realistic estimate of maximum variance requires a smaller sample size.
- The most accurate approach uses variances estimated from prior data.

### 2a. Simple Approximate Formula Assuming Maximum Variance

If you have no idea about the variability of your NPS data, you could use a simple estimate that guarantees your sample size will be large enough to reject the null hypothesis for the target difference:

s^{2} = 2

This estimate is simple and guarantees an adequate sample size for the goals of the study, but because it assumes maximum variance, it likely will recommend a sample size much larger than needed.

### 2b. Simple Approximate Formula Assuming Maximum Realistic Variance

The next estimate is still simple but slightly modified from the one above, replacing 2 with 1.34. This reduces the estimated sample size by about 33%.

So, where does 1.34 come from? As shown in the appendix, before adjustment and simplification, the sample size formula is

n = (var1 + var2)Z^{2}/d^{2}

In this formula, var1 is the unadjusted variance of the first NPS and var2 is the unadjusted variance of the second NPS.

The variability of NPS is maximized to 1 when half of the respondents are detractors and the other half are promoters, so var1 + var2 = 2 (which is where the 2 came from when assuming maximum variance).

For this to happen, though, both sets of NPS results would have to be evenly split between promoters and detractors. In our experience, this scenario is unlikely to happen. In previous research, we computed adjusted variances for 18 sets of NPS data. Across those analyses, the variances ranged from .40 to .76 with a mean (and 99% confidence interval) of .61 ± .06. Given an upper limit of .67 for the 99% confidence interval, we settled on using .67 as a reasonable estimate of maximum realistic variance, so the 1.34 in the modified formula is the sum of two maximum realistic variances. (Note that this decision uses a mix of math and judgment, which is subject to change pending future research.)

### 2c. Maximum Accuracy Method Given Estimates of Variance

If you have some knowledge of the magnitude of adjusted variances for the two NPS (from previous research or a pilot study), then you can estimate the required sample size more accurately.

This process has two stages. In the first stage, compute adjusted-Wald confidence intervals for the previous/pilot data using the steps in our previous article, “Confidence Intervals for Net Promoter Scores.” In the second stage, add the two adjusted variances to get the estimate that will be the most accurate:

s^{2} = var1.adj + var2.adj
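If you have raw likelihood-to-recommend counts from a pilot study, you can compute the adjusted variances directly. Here's a minimal sketch of the adjusted-Wald computation as we understand it from the authors' confidence-interval article (add 3 to n and 3/4 to the promoter and detractor counts); the function name is ours, and the values it produces match Table 1 in Example 3 below:

```python
def adjusted_nps(promoters, passives, detractors):
    """Adjusted-Wald NPS statistics: adds 3 to n and 3/4 to the
    promoter and detractor counts, then computes the adjusted NPS
    and its variance. Returns (nps_adj, var_adj, n_adj)."""
    n_adj = promoters + passives + detractors + 3
    p_pro = (promoters + 0.75) / n_adj
    p_det = (detractors + 0.75) / n_adj
    nps_adj = p_pro - p_det
    var_adj = p_pro + p_det - nps_adj**2
    return nps_adj, var_adj, n_adj
```

The s^{2} for the sample size formula is then the sum of the two adjusted variances returned by this function.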

## 3. Set the Confidence Level to Control the Type I Error (Z_{α}).

Z_{α} is the value associated with statistical confidence and the α criterion used to determine statistical significance (control of the Type I error). Commonly used values for Z_{α} are 1.96 (for two-tailed tests with α = .05) and 1.645 (for two-tailed tests with α = .10).

## 4. Set the Level of Power to Control the Type II Error (Z_{β}).

Z_{β} is the value associated with statistical power and the β criterion used to control the Type II error. Even when comparing two NPS, always use one-tailed values of Z_{β}. Common values for Z_{β} are 0 (for 50% power, β = .5) and .842 (for 80% power, β = .2).

## Examples

The following examples show how to use the sample size formula:

n = s^{2}Z^{2}/d^{2} − 3

Note that the smaller the α criterion (larger value of Z_{α}), the greater the desired power (larger value of Z_{β}), and the smaller the difference you want to be able to detect (smaller value of d), the larger the sample size requirement will be.

### Example 1: Maximum Variance Method

If you’re planning a study with α = .05, β = .20, and d = .10, the sample size for one group will be 2(1.96 + .842)^{2}/.10^{2} − 3 = 1,568 (always round up), so the total sample size for two groups would be 3,136.

If you relax d to .15 and the respective α and β criteria to .10 and .50, the sample size would be 2(1.645)^{2}/.15^{2} − 3 = 238 per group for a total of 476.
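The first of these calculations is easy to verify with a few lines of plain Python arithmetic:

```python
import math

# Example 1: maximum variance (s^2 = 2), alpha = .05, beta = .20, d = .10
z = 1.96 + 0.842                       # Z_alpha + Z_beta
n = math.ceil(2 * z**2 / 0.10**2 - 3)  # per-group n, always rounded up
total = 2 * n                          # both groups
```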

### Example 2: Maximum Realistic Variance Method

Using the data from Example 1 but substituting 1.34 for 2, the estimated sample size requirement for a study with α = .05, β = .20, and d = .10 would be 1.34(1.96 + .842)^{2}/.10^{2} − 3 = 1,049, so the total sample size for the study would be 2,098 (instead of 3,136).

If you relax d to .15 and the respective α and β criteria to .10 and .50, the sample size would be 1.34(1.645)^{2}/.15^{2} − 3 = 159 per group for a total of 318 (instead of 476).
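The relaxed-criteria calculation can likewise be checked with plain Python arithmetic:

```python
import math

# Example 2: maximum realistic variance (s^2 = 1.34), alpha = .10, 50% power, d = .15
n = math.ceil(1.34 * 1.645**2 / 0.15**2 - 3)  # per-group n, rounded up
total = 2 * n                                  # both groups
```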

### Example 3: Maximum Accuracy Method

For an example of the maximum accuracy method, consider the analyses below of two NPS. In a UX survey of online meeting services conducted in 2019, we collected likelihood-to-recommend ratings. For GoToMeeting (GTM), there were 8 detractors, 13 passives, and 15 promoters, for an NPS of 19% (n = 36). For WebEx, there were 12 detractors, 12 passives, and 7 promoters, for an NPS of −16% (n = 31). The total sample size was 67 (36+31). Table 1 shows the steps to compute their 90% confidence intervals, and Table 2 shows the results from a significance test comparing the two NPS.

Service | n.adj | ppro.adj | pdet.adj | NPS.adj | Var.adj | se.adj | z90 | MoE90 | Lower90 | Upper90 |
---|---|---|---|---|---|---|---|---|---|---|
GTM | 39 | 0.40 | 0.22 | 0.18 | 0.596 | 0.124 | 1.645 | 0.203 | −0.02 | 0.38 |
WebEx | 34 | 0.23 | 0.38 | −0.15 | 0.581 | 0.131 | 1.645 | 0.215 | −0.36 | 0.07 |

Significance Test | NPS.diff | se.diff | Z | p(Z) |
---|---|---|---|---|
GTM vs. WebEx | 0.33 | 0.180 | 1.815 | 0.070 |

If the process works, we should be able to start with these results and then get an estimated sample size close to the actual n of the study, allowing for some uncertainty due to rounding errors.

To match the study, Z_{α} and Z_{β} should be, respectively, 1.815 (α = .07) and 0 (50% power), and d should be .33. From the confidence intervals, we get var1.adj = .596 and var2.adj = .581. With these values (carrying unrounded intermediate values through the calculation), the estimated sample size requirement for one group is n = 1.815^{2}(.596 + .581)/.33^{2} − 3 = 33.36. For two groups, multiply n by 2 and round up to get 67, which matches the total sample size from the source comparison.

That gives us some confidence that the process is working, so let’s explore a few additional hypothetical examples using these data as if they were from previous or pilot research.

Suppose the α criterion had been set to .05, so the resulting p of .07 would not be sufficient to reject H_{0} and claim statistical significance. It’s close, though, so assuming no other changes, what sample size would you need to achieve p < .05? Substituting 1.96 for 1.815 in the previous calculation, we get n = 1.96^{2}(.596 + .581)/.33^{2} − 3 = 38.53 for one group, which would be a total sample size of 78. Assuming nothing else changes, if we collect data from 11 more participants, the p-value should drop to .05.

But what if something does change? After collecting more data the variances might increase, or the observed difference might decrease, and if that happens, we won’t get to p < .05. This is where power comes into play, like an insurance policy or safety net. If we keep our study goals the same but increase power from 50% to 80%, then the new β criterion is .2, with an associated Z-score of .842. The revised sample size requirement is 2.802^{2}(.596 + .581)/.33^{2} − 3 = 81.9 for one group (a total sample size of 164). The cost of increasing the power of the study from 50% to 80% is the requirement to collect data from 97 more participants (164 − 67).
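The what-if calculations in this example are easy to reproduce. A minimal sketch (the helper name per_group_n is ours; the adjusted variances come from Table 1):

```python
def per_group_n(d, var_sum, z):
    """Unrounded per-group n for comparing two NPS: (var1 + var2) * Z^2 / d^2 - 3."""
    return var_sum * z**2 / d**2 - 3

var_sum = 0.596 + 0.581  # var1.adj + var2.adj from Table 1

# Raising alpha stringency from .07 to .05, still 50% power (Z_beta = 0):
n_alpha05 = per_group_n(0.33, var_sum, 1.96)          # ~38.5 -> total of 78

# Keeping alpha = .05 but increasing power to 80% (Z_beta = .842):
n_power80 = per_group_n(0.33, var_sum, 1.96 + 0.842)  # ~81.9 -> total of 164
```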

## Sample Size Lookup Tables

As shown in the previous section, it’s best if you have some idea about the expected variances, but that might not always be possible. Table 3 shows sample size estimates based on the maximum realistic variance method for a range of values of d (this time shown as percentages), for two α criteria (.10 and .05, i.e., 90% and 95% confidence) and two β criteria (.5 and .2, i.e., 50% and 80% power).

For example, if you want to detect a difference of 30% with 90% confidence and 50% power, you’d need a total sample size of 76 (38 in each group). If you want to detect a difference of 3% with 95% confidence and 80% power, you’d need to collect data from a total of 23,368 people (11,684 in each group).

Difference to Detect | n (90% Confidence, 50% Power) | n (95% Confidence, 50% Power) | n (90% Confidence, 80% Power) | n (95% Confidence, 80% Power) |
---|---|---|---|---|
70% | 10 | 16 | 28 | 38 |
60% | 16 | 24 | 42 | 54 |
50% | 24 | 36 | 62 | 80 |
40% | 40 | 60 | 98 | 126 |
30% | 76 | 110 | 180 | 228 |
25% | 112 | 160 | 260 | 332 |
20% | 176 | 252 | 410 | 520 |
15% | 318 | 452 | 732 | 930 |
10% | 720 | 1024 | 1652 | 2098 |
5% | 2896 | 4114 | 6622 | 8408 |
3% | 8052 | 11434 | 18406 | 23368 |
1% | 72504 | 102946 | 165688 | 210344 |

If you have no choice but to collect all data in a single administration, the table can help with decisions about confidence and power, even though the estimated sample sizes are likely to be overestimates. With resources for a total sample size of only 100 (about 50 in each group), you should probably set confidence to 90%, power to 50%, and expect to detect significance for differences around 25%. Should anyone suggest that you need 95% confidence, 80% power, and the ability to detect differences as small as 1%, showing that the total sample size estimate for those requirements is n = 210,344 should guide the conversation to more reasonable requirements.

If you can break the survey into multiple administrations, you can use the table for an initial sample size estimate for budgeting purposes. Plan to stop data collection after you have about 50 ratings for each group, and then use the observed variances to get a more accurate estimate. Check the results when you get to the new estimate of n. If you’ve achieved statistical significance, you can stop. If you’re not quite there, you still have a budget for continuing data collection.
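For readers who want to reproduce or extend the lookup table, the entries we checked are consistent with the maximum realistic variance formula computed with precise normal quantiles (rather than the rounded Z values used in the text). A sketch using only Python's standard library (the function name is ours):

```python
import math
from statistics import NormalDist

def table_total_n(d, confidence, power):
    """Total sample size (both groups) for the maximum realistic
    variance method (s^2 = 1.34), using precise normal quantiles."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed
    z_beta = NormalDist().inv_cdf(power)                      # one-tailed
    z = z_alpha + z_beta
    return 2 * math.ceil(1.34 * z**2 / d**2 - 3)  # round up per group, double
```

For example, table_total_n(0.30, 0.90, 0.50) reproduces the 76 in the 30% row, and table_total_n(0.10, 0.95, 0.80) reproduces the 2098 in the 10% row.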

## Summary and Discussion

Based on an adjusted-Wald method for constructing NPS confidence intervals and conducting tests of significance, we’ve developed several associated methods for estimating sample size requirements to support that important step when planning NPS research. The methods differ in their complexity.

The simplest approach, the maximum variance method, assumes no prior knowledge about likely NPS variances, but it is likely to overestimate the required sample size by a large amount.

The next simplest approach, the maximum realistic variance method, also assumes no prior knowledge about likely NPS variances, but by using an adjustment based on analyses of 18 real-world NPS datasets, this approach reduces the maximum variance estimates by about 33%.

The most complex approach, the maximum accuracy method, requires prior estimates of the NPS variances. Despite its complexity, this approach will be substantially more accurate than the other methods, so we recommend its use whenever possible.

When it isn’t possible to use the maximum accuracy method, the table above provides sample size estimates based on the maximum realistic method, which you can use for initial project budgeting. Ideally, researchers should stop collecting data after about 50 complete responses per group so they can use the initial data to compute confidence intervals and variances. Then they can use the maximum accuracy method to fine-tune the study’s sample size estimate.

## Appendix

To get to the formulas used in this article, we begin with the formula for the standard error of the difference between two independent NPS, which is based on the sum of their variances. To keep things as simple as possible, we assume equal sample sizes for both groups. Also, we start with unadjusted values and will substitute adjusted values later.

se = ((var1 + var2)/n)^{½}

The simplest significance test, the Z-test, is computed by dividing an observed difference by the standard error.

Z = d/se

Multiply both sides by se and replace se with its formula.

d = Z(((var1 + var2)/n)^{½})

Square both sides.

d^{2} = Z^{2}((var1 + var2)/n)

Multiply both sides by n and divide both sides by d^{2}.

n = Z^{2}(var1 + var2)/d^{2}

Substitute adjusted n and variances for unadjusted values (Z and d are not adjusted).

n.adj = Z^{2}(var1.adj + var2.adj)/d^{2}

Because for this type of adjusted-Wald procedure, n.adj = n + 3:

n + 3 = Z^{2}(var1.adj + var2.adj)/d^{2}

Subtract 3 from both sides to get the final formula for the maximum accuracy sample size estimate.

n = Z^{2}(var1.adj + var2.adj)/d^{2} − 3

When estimates of NPS variance are not available, we can take advantage of the fact that maximum NPS variance is 1 when half of the respondents are promoters and half are detractors. This is the maximum variance method.

n = Z^{2}(1 + 1)/d^{2} − 3

n = 2Z^{2}/d^{2} − 3

The maximum variance formula is simple, but it will almost always produce a substantial overestimate for n, because it is unlikely, in our experience, that any reasonable sample size will have an even split between promoters and detractors. As described in the body of the article, we can modify this formula to one that uses an estimate of NPS variances that will overestimate n to a lesser extent because it’s based on analysis of 18 real-world NPS datasets. This is the maximum realistic variance method.

n = 1.34Z^{2}/d^{2} − 3