The Importance of Replicating Research Findings

Jeff Sauro, PhD

You’ve probably heard of the infamous Stanford Prison Experiment by Philip Zimbardo (especially if you took an intro psych class).

The shocking results had similar implications to the notorious Milgram experiment and suggested our roles may be a major cause for past atrocities and injustices.

You might have also heard about research from Cornell University that found, across multiple studies, that simply having larger serving plates makes people eat more!

It’s hard to question the veracity of studies like these and others that have appeared in peer-reviewed, prestigious journals coming from well-educated, often tenured professors from world-class academic institutions and often funded by the government or other non-profit grants.

Yet the methodology of the Stanford prison experiment has come into question and those serving portion studies have been retracted along with many others, and the author was forced to resign from Cornell.

Fortunately, it’s less common to find studies that are retracted based on overtly disingenuous methods and analysis (like egregious p-hacking). It’s more common, and likely more concerning, that a large number of studies’ findings can’t be replicated by others.

And this replication problem concerns only peer-reviewed studies. What about more applied methods and research findings that are subject to less academic rigor? For example, the Myers-Briggs personality inventory has been around for decades and by some estimates is used by 20% of Fortune 1000 companies. But the validity, and even the basic reliability, of the Myers-Briggs has been called into question. Some feel passionately that it should “die.” Others point out that many of the objections are unfounded and that it still has value.

The Net Promoter Score, of course, has also become a widely popular metric with many supporters and vocal critics in academia and industry.

To compound matters, there’s a natural friction between academia’s emphasis on internal validity from controlled settings and the need for applied research to be more generalizable, flexible, and simple.

It’s tempting to take an all-or-nothing approach to findings:

  • If there’s a flaw with a study, all the findings are wrong.
  • If it’s published in a peer-reviewed journal, it’s “science” and questioning it makes you a science denier.
  • If it’s not peer reviewed, it can’t be trusted.

But failing to replicate doesn’t necessarily mean all the original findings are worthless or made up. It often just means that the findings may be less generalizable due to another variable, or the effects are smaller than originally thought.

Thinking in terms of replication is not just an exercise for academics or journal editors. It’s something everyone should think about. It’s also something we do at MeasuringU to better understand the strengths and weaknesses of our methods and metrics.

Our solution is more replication to understand the findings carefully and then document our methods and limitations for others to also replicate. We publish the findings on our website and in peer-reviewed journals.

Here are nine examples of claims or methods that we’ve attempted to replicate. Some we found similar results for, and we failed to replicate others, but with all we learned something and hope to extend that knowledge for others to leverage.


Top-Tasks Analysis to Prioritize

Prioritizing tasks is not a new concept, and there are many methods for having users rate and rank items. But having users force-rank hundreds or thousands of features individually would be too tedious for even the most diligent of users (and too much for even a sophisticated choice-based conjoint study).

Gerry McGovern proposed a unique way of having users consider a lot of features in The Stranger’s Long Neck: Present users with a randomized (often very long) list and have them pick five tasks. I was a bit skeptical when I read it a few years ago (as are other survey experts, Gerry says). But I decided to try the method and compare it to other more sophisticated approaches. I found the results generally comparable. Since then, I’ve used his method of top-task ranking for a number of projects successfully.


Suggestions: It’s essential to randomize the order of the tasks and we found similar results whether you have respondents “pick” their top five tasks or “rank” their top five. And there’s nothing sacred about five—it could be three, two, or seven—but it should be a small number relative to the full list.
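As a rough illustration of the suggestions above, here is a minimal Python sketch of shuffling the task list for each respondent and tallying the picks. The task names, the responses, and the helper names (randomized_list, tally_top_tasks) are hypothetical and only there to show the mechanics; this is not the survey tooling used in the studies described here.

```python
import random
from collections import Counter

def randomized_list(tasks, rng):
    """Return a shuffled copy so list position does not bias what gets picked."""
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled

def tally_top_tasks(all_picks, max_picks=5):
    """Return (task, percent of respondents who picked it), most picked first."""
    votes = Counter()
    for picks in all_picks:
        unique_picks = set(picks)
        if len(unique_picks) > max_picks:
            raise ValueError(f"a respondent picked more than {max_picks} tasks")
        votes.update(unique_picks)
    n = len(all_picks)
    return [(task, 100 * count / n) for task, count in votes.most_common()]

# Hypothetical task list and responses, for illustration only.
tasks = ["Find pricing", "Contact support", "Download report", "Reset password"]
rng = random.Random(0)
presented = randomized_list(tasks, rng)  # the order one respondent would see

picks = [
    ["Find pricing", "Reset password"],
    ["Find pricing", "Contact support"],
    ["Find pricing"],
    ["Download report", "Contact support"],
]
ranking = tally_top_tasks(picks, max_picks=2)
for task, pct in ranking:
    print(f"{task}: {pct:.0f}% of respondents")
```

The max_picks argument reflects the point that five is not a magic number. With a real list of a hundred or more tasks, the same tally works unchanged.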


Reporting Rating Scales Using Top-Box Scoring

Marketers have long summarized rating scale data by using the percentage of responses that select the most favorable response (which before the web used to mean checking a box on paper). This “top-box” scoring though loses a substantial amount of information as 5, 7, and 11-point scales get turned into 2-point scales. I’ve been critical of this approach because of the loss of information. However, as I’ve dug into more published findings on the topic, I found that behavior is often better predicted from the most extreme “top-box” responses. This suggests that there’s something more to top-box (and bottom-box) scoring—more than just executive friendliness. For example, we’ve found that people who respond with a 10 (the top box) are the most likely to recommend companies and the top SUPR-Q responders are most likely to repurchase.


Caveat: While a top-box approach may predict behavior better, that doesn’t mean it’s always better than summarizing with the mean, and it’s certainly not a reason to reject the mean. If you’re looking to predict behavior, the extremes may be superior to the mean, but it’s not that hard to track both.
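To make "track both" concrete, here is a small Python sketch that reports the top-box percentage next to the mean. The two sets of ratings are made up so that they share a mean while differing in their top-box share; the function name top_box is an illustrative choice.

```python
from statistics import mean

def top_box(ratings, scale_max):
    """Proportion of responses at the most favorable point of the scale."""
    return sum(1 for r in ratings if r == scale_max) / len(ratings)

# Hypothetical 5-point ratings for two products, for illustration only.
product_a = [4, 4, 4, 4, 4]
product_b = [5, 5, 5, 3, 2]

for name, ratings in [("A", product_a), ("B", product_b)]:
    print(f"Product {name}: mean = {mean(ratings):.2f}, "
          f"top-box = {top_box(ratings, scale_max=5):.0%}")
```

Both products average 4, yet only product B has any respondents in the top box. The mean alone would hide that difference, and the top-box score alone would hide that B also has the lowest ratings.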


A Lostness Measure for Findability

Tomer Sharon breathed new life into a rarely used metric that was proposed in the 1990s when hypertext systems were coming of age. This measure of how lost users are when navigating is computed from the number of unique pages and the total number of pages visited, relative to the minimum number of pages on the “happy path.” The original validation data was rather limited, using only a few high school students to validate the measure on a college HyperCard application. How much faith should we have in such a small, hardly representative sample? To find out, we replicated the study using 73 new videos from US adults on several popular websites and apps. We found the original findability thresholds were, in fact, a reasonable proxy for getting lost.
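The paragraph above describes the ingredients of lostness but not the formula itself. The Python sketch below assumes the form commonly cited from the 1990s hypertext literature, L = sqrt((N/S - 1)^2 + (R/N - 1)^2), where N is the number of unique pages visited, S is the total number of pages visited, and R is the minimum number of pages needed. Treat that formula, the function name, and the example paths as assumptions for illustration.

```python
from math import sqrt

def lostness(path, min_pages):
    """Lostness for one task attempt (assumed formula, see lead-in).

    path      -- ordered list of page names the user visited (repeats included)
    min_pages -- pages on the shortest "happy path" for the task (R)
    """
    total = len(path)        # S: every page view, counting revisits
    unique = len(set(path))  # N: distinct pages seen
    return sqrt((unique / total - 1) ** 2 + (min_pages / unique - 1) ** 2)

# Hypothetical paths for a task whose happy path is three pages long.
direct = ["home", "products", "checkout"]
wandering = ["home", "products", "home", "search", "products", "checkout"]

print(f"direct:    {lostness(direct, min_pages=3):.2f}")
print(f"wandering: {lostness(wandering, min_pages=3):.2f}")
```

A user who follows the happy path exactly scores 0, and the score rises as users revisit pages or stray onto pages the task did not require.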


Suggestions: Asking the 1-item SEQ predicted lostness scores with 95% accuracy. If lostness isn’t computed automatically, it may not be worth the manual effort of logging paths.


Detractors Say Bad Things about a Company

The rationale behind designating responses of 0 to 6 as detractors on the 11-point Likelihood to Recommend item used for the NPS is that these respondents will be most likely to spread negative word of mouth. Fred Reichheld reported that 80% of the negative comments came from these responses in his research.

In our independent analysis we were able to corroborate this finding. We found 90% of negative comments came from those who gave 0 to 6 on the 11-point scale (the detractors).


Caveat: This doesn’t mean that 90% of comments from detractors are negative or that all detractors won’t purchase again, but it means that the bulk of negative comments do come from these lower scores.
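The distinction in this caveat is easy to blur, so here is a small Python sketch that computes both percentages from the same data. The responses are invented for illustration and do not come from our analysis; the 0 to 6, 7 to 8, and 9 to 10 cutoffs are the standard NPS segments.

```python
def segment(ltr):
    """Map a 0-10 Likelihood to Recommend score to its NPS segment."""
    if not 0 <= ltr <= 10:
        raise ValueError("LTR must be between 0 and 10")
    if ltr <= 6:
        return "detractor"
    if ltr <= 8:
        return "passive"
    return "promoter"

def negative_comment_shares(responses):
    """responses: list of (ltr_score, comment_is_negative) pairs.

    Returns two different proportions:
      - share of all negative comments that came from detractors
      - share of detractors' comments that were negative
    """
    negative_segments = [segment(ltr) for ltr, is_neg in responses if is_neg]
    detractor_flags = [is_neg for ltr, is_neg in responses
                       if segment(ltr) == "detractor"]
    from_detractors = negative_segments.count("detractor") / len(negative_segments)
    detractors_negative = sum(detractor_flags) / len(detractor_flags)
    return from_detractors, detractors_negative

# Hypothetical (LTR score, comment was negative) pairs, for illustration only.
responses = [
    (3, True), (5, True), (6, True), (2, False), (6, False),  # detractors
    (7, True), (8, False),                                    # passives
    (9, False), (10, False), (10, False),                     # promoters
]
from_detractors, detractors_negative = negative_comment_shares(responses)
print(f"Negative comments that came from detractors: {from_detractors:.0%}")
print(f"Detractor comments that were negative: {detractors_negative:.0%}")
```

In this made-up sample, most negative comments come from detractors even though a sizable share of detractors did not say anything negative. The two numbers answer different questions.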

The CE11 Is Superior to the NPS

Jared Spool has urged thousands of UX designers to reject the Net Promoter Score and instead has advocated for a lesser known 11-item branding questionnaire, the CE11. Curiously, one of the items is actually the same one used in the Net Promoter Score.

We replicated the study described in Jared’s talks but found little evidence to suggest this measure was better at diagnosing problems. In fact, we found the 11-point LTR item actually performed as well as or better than the CE11 at differentiating between website experiences.

Failed to Replicate

Suggestions: This may be a case of using a valid measure (Gallup validated the CE11) in the wrong context (diagnosing problems in an experience).


The NPS Is Superior to Satisfaction

One of the central claims of the NPS is that it’s better than traditional satisfaction measures that were often too lengthy, didn’t correlate with future behavior, and were too easy to game.

We haven’t found any compelling research to suggest that the NPS is always—or in fact ever—better than traditional satisfaction measures. In fact, we’ve found the two are often highly correlated and indistinguishable from each other.

Failed to Replicate

Caveat: Even if the NPS isn’t “superior” to satisfaction, most studies show it performs about the same. If a measure is widely used and short, it’s probably a good enough measure for understanding attitudes and behavioral intentions. Any measure that is high stakes (like NPS, satisfaction, quarterly numbers, or even audited financial reports) increases the incentive for gaming.


SUS Has a Two-Factor Structure

The System Usability Scale (SUS) is one of the most used questionnaires to measure the perceived ease of a product. The original validation data on the SUS came from a relatively small sample and consequently offered little insight into its factor structure. It was therefore assumed the SUS was unidimensional (measuring just the single construct of perceived ease). With much larger datasets, Jim Lewis and I actually found evidence for two factors, which we labeled usability and learnability. This finding was also replicated by Borsci using an independent dataset.

However, in another attempt to replicate the structure, we found the two factors were actually artifacts of the positive and negative wording of the items and not anything interesting for researchers. We published the results of this replication.

Replicated then Failed to Replicate

Caveat: This is a good example of a replication failing to replicate another replication study!


Promoters Recommend More

Fred Reichheld reported that 80%+ of customer referrals come from promoters (9s and 10s on the 11-point LTR item). But some have questioned this designation and there was little data to corroborate the findings. We conducted a 90-day longitudinal study and found that people who reported being most likely to recommend (9s and 10s) did indeed account for the majority of the self-reported recommendations.


Caveats: Our self-reported recommendations accounted for less than the 80%+ referral data Reichheld reported and we didn’t use actual referrals, only self-reported recommendations. While promoters might recommend, we found stronger evidence that the least likely to recommend are a better predictor of those who would NOT recommend.


The NPS Predicts Company Growth

When we reviewed Fred Reichheld’s original data on the predictive ability of the NPS we were disappointed to see that the validation data actually used the NPS to predict historical, not future, revenue (as other papers also pointed out). We used the original NPS data and extended it into the future and did find it predicted two-year and four-year growth rates. But it’s unclear how representative this data was and we considered it a “best-case” scenario.

Using third-party NPS benchmark data and growth metrics for 269 companies in 14 industries, we again found a similar, albeit weaker, relationship. The average correlation between NPS and two-year growth was r = .35 for 11 of the 14 industries and r = .44 when using relative ranks.
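For readers who want to run this kind of check on their own benchmark data, here is a minimal Python sketch of a Pearson correlation on raw values and the same correlation computed on ranks. It assumes that "relative ranks" can be approximated by rank-ordering companies within one industry, which may not match the exact procedure we used. The NPS and growth figures are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks (smallest value gets 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def rank_correlation(x, y):
    """Pearson correlation computed on ranks (Spearman's rho)."""
    return pearson(ranks(x), ranks(y))

# Hypothetical NPS and two-year growth (%) for five companies in one industry.
nps = [10, 25, 40, 55, 70]
growth = [2.0, 1.0, 4.0, 3.0, 20.0]

print(f"r on raw values: {pearson(nps, growth):.2f}")
print(f"r on ranks:      {rank_correlation(nps, growth):.2f}")
```

Ranks dampen the influence of a single fast-growing company, which is one reason a rank-based correlation can differ from the raw one.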


Caveats: The NPS does seem to correlate with (predict) future growth in many industries most of the time. It’s not a universal predictor across industries or even in the same industry across long periods. The size of the correlation ranges from non-existent to very strong but on average is modest (r = .3 to .4). It’s unclear whether likelihood to recommend is a cause or effect of growing companies (or due to the effects of other variables).


Thanks to Jim Lewis PhD for commenting on this article.
