You go to your doctor for a checkup.
You’re feeling fine but as a matter of procedure, your doctor orders a battery of tests and scans to be sure all is well. She runs 30 tests in total.
A few days later she calls and tells you one of the tests came back positive–an indication something is potentially wrong.
Now it could be that a medical issue has gone undetected. After all, many diseases have no early symptoms. But it could also be that this result is a false positive–a result that indicates a problem when in fact nothing is wrong.
While the false positive rate for any single medical test is usually low (to minimize patient agony and unnecessary tests!), the chance you’ll see at least one false positive in multiple tests turns out to be rather high.
For example, if the false positive rate for any test is 5%, the probability of at least one false positive in 30 tests is almost 80% (use the binomial probability distribution to work this out).
The danger of false positives applies to more than just blood work and pet-scans; it applies to p-values used in data science and customer experience research, too. With the proliferation of big data, the number of statistical tests we can perform seems endless. But the number of fluke discoveries we’re likely to detect has increased as well.
Managing the False Positive Rate
Fortunately, there are techniques to help manage the false positive rate when making multiple comparisons.
One of the better known is the Bonferroni correction. The idea is that you lower your threshold for significance by adjusting your p-value downward based on the number of tests you perform. It’s done by dividing the original p-value by the number of tests. If your goal is a finding (such as differences between two designs or survey questions) that’s statistically significant at p < .05 for 20 tests, your new corrected p-value becomes .05/20 = .0025. That means only p-values below .0025 become statistically significant.
This correction is intuitive and simple to compute and is used extensively by researchers in many fields. Unfortunately, it tends to overcorrect the problem by making it too difficult to declare differences as statistically significant! In protecting against the false positive rate, it increases the false negative rate! This is akin to making a medical test with such a low false positive threshold that it tends not to flag a disease.
A relatively new approach that balances both false positives and false negatives is the Benjamini-Hochberg method. To use the Benjamini-Hochberg method, follow these steps:
- Rank the p-values from all your comparisons from lowest to highest.
- Create a new threshold for statistical significance for each comparison by dividing the rank by the number of comparisons and then multiplying this number by the initial significance threshold.
For example, for six comparisons using an initial significance threshold of .05, the lowest p-value, with a rank of 1, is compared against a new threshold of (1/6)*.05 = .008, the second is compared against (2/6)*.05 = .017, and so on, with the last one being (6/6)*.05 = .05. If the observed p-value is less than or equal to the new threshold, it’s considered statistically significant.
To see how this works in practice, we collected SUPR-Q scores from 10 popular U.S.-based retail websites in 2013. The raw values along with 90% confidence intervals are shown in the figure below (I used a statistical significance level (alpha) of .10 for this research).
Figure 1: Raw SUPR-Q scores for 10 retail websites. Data collected by MeasuringU in 2013.
Amazon had the highest SUPR-Q score and Walgreens the lowest, but which websites were statistically different from each other? With 10 websites, there are 45 total paired comparisons and plenty of opportunity to falsely identify a difference! I computed the p-values using an unadjusted method, the Bonferroni method, and the Benjamini-Hochberg method as shown in the table below.
With the p-values ranked lowest to highest, the results were:
- Unadjusted method: Without any adjustments, 22 of the 45 comparisons were statistically significant at p < .10 (the “Unadjusted Result” column in the table).
- Benjamini-Hochberg method: Using the Benjamini-Hochberg method, 17 of the 45 comparisons were statistically significant (the “BH Result” column in the table below). The significance threshold for the Benjamini-Hochberg method is shown in the “BH Threshold” column. For example, the difference between Amazon and Etsy’s SUPR-Q score (row 11) generated a p-value of .009, which is below the Benjamini-Hochberg threshold of (11/45)*.10 = .024. Because the p-value of .009 is less than .024, we declare it statistically significant.
- Bonferroni method: With the Bonferroni method, only 9 of the 45 comparisons were statistically significant (the “Bonferroni Result” column). For example, the threshold for the more conservative Bonferroni method for the Amazon and Etsy’s SUPR-Q score (row 11) was p < .002 (.10/45), which did not flag the difference between Amazon and Etsy as statistically significant.
Table 1: 45 paired comparisons between the 10 retail website SUPR-Q scores sorted by lowest p-values. “Sig” indicates which method flagged the comparison as statistically significant.
|Rank||Site 1||Site 2||p-value||BH Threshold||Unadjusted Result||BH Result||Bonferroni Result|
This illustrates how the Benjamini-Hochberg method strikes a good balance between the unadjusted p-values (which will give too many false positives over time) and the more conservative Bonferroni method (which generates too many false negatives over time) and is why I recommend it for managing false positives.
The next time you find yourself making a number of statistical comparisons (especially unplanned ones) or the subject of a number of medical tests, consider the Benjamini-Hochberg correction to help ensure your conclusions aren’t just false positives.
Learn More: UX Measurement Boot Camp
Intensive Training on UX Methods, Metrics and Measurement
|Denver: Aug. 5th-7th, 2020|