It’s often said that you can get statistics to show anything you want.
And while this is true to an extent, it’s equally true of any method, qualitative or quantitative.
With statistics, however, it’s a bit harder because there is a numeric audit trail that allows others to better understand your assumptions and conclusions.
Unfortunately, this audit trail can also bring additional scrutiny. Additional scrutiny is good for testing the veracity of claims, but not so good when it’s done mean-spiritedly.
Some people use statistics and math as a weapon against those who are less quantitatively inclined. This can have the unfortunate effect of reducing the use of quantitative methods in places where they are needed most.
Over the years I’ve seen many UX practitioners shy away from quantifying any of their findings, often because of poor prior experiences. If they just plaster “qualitative” on all the slides then it’s more difficult for people to challenge the findings.
Qualitative methods play an important role in UX research, but when you need to do things like extrapolate your findings onto the user population, it’s best to use quantitative methods and statistics to understand the precision and uncertainty in your estimates.
You don’t need to go overboard quantifying everything and using unnecessarily complicated procedures. But when you do use appropriate quantitative methods, here are some thoughts on managing some critical questioning, and hopefully not too much uncivil criticism.
You can’t do that!
This is a broad criticism you may hear and it can take on many forms. It usually says less about the accuracy of your quantitative claims and more about how comfortable the person expressing the concern feels about your results.
You can’t use statistics on small sample sizes: This is a common concern. All things being equal, a larger sample is better than a smaller one. But a small sample doesn’t prevent you from using statistics. The math allows you to compute statistics with sample sizes as small as two. The t-distribution, the adjusted Wald interval, the geometric mean, and reporting the upper and lower boundaries of a confidence interval are all techniques that make the most of small sample sizes.
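Two of the techniques above can be sketched in a few lines of Python. The task times and completion counts below are hypothetical, and the t-critical value is taken from a standard t-table (95% confidence, df = 4):

```python
import math
from statistics import mean, stdev

# Hypothetical task times (seconds) from just 5 users
times = [34, 41, 29, 52, 38]
n = len(times)
m = mean(times)
se = stdev(times) / math.sqrt(n)  # standard error (don't forget the square root!)
t_crit = 2.776                    # t-table value for 95% confidence, df = n - 1 = 4
ci = (m - t_crit * se, m + t_crit * se)
print(f"mean = {m:.1f}s, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")

# Adjusted Wald interval for a completion rate: 4 of 5 users succeeded.
# Add z^2/2 successes and z^2 trials before computing the usual Wald interval.
x, trials, z = 4, 5, 1.96
p_adj = (x + z**2 / 2) / (trials + z**2)
moe = z * math.sqrt(p_adj * (1 - p_adj) / (trials + z**2))
print(f"completion rate 95% CI = ({p_adj - moe:.2f}, {p_adj + moe:.2f})")
```

The intervals are wide, as they should be at n = 5, but they are still legitimate, quantified statements about the user population.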
Your sample size is too small: A corollary to the claim that you can’t use statistics on small sample sizes is that while you can, the sample size is too small to draw conclusions. Unlike the previous statement, which is rarely true, this can be a legitimate criticism, but it can also be a knee-jerk reaction. If you are making a comparison between two designs and found the difference to be statistically significant (say, a p-value of less than .05), then your sample size isn’t too small. You had a large enough sample size to differentiate the difference from chance occurrences!
If you compared two designs and found no statistical difference, however, one of the first questions you should ask is if your sample size is too small. With small sample sizes you are limited to finding statistical significance for relatively big differences. If your conclusion is no significant difference then you need to be prepared to address the sample size question. For that see some tips on working with small sample sizes.
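To build intuition for why a non-significant result with a small sample is inconclusive, here is a rough sketch (assuming equal group sizes, a common standard deviation of 1, and the usual z ≈ 1.96 cutoff) of how the smallest detectable difference shrinks as the sample grows:

```python
import math

# The smallest difference between two means that can reach statistical
# significance shrinks in proportion to 1/sqrt(n).
z, sd = 1.96, 1.0
detectable = {}
for n in (5, 20, 100, 500):
    se_diff = sd * math.sqrt(2 / n)   # standard error of a difference in means
    detectable[n] = z * se_diff       # roughly the gap needed to reach p < .05
    print(f"n = {n:3d} per group -> detectable difference \u2248 {detectable[n]:.2f} SDs")
```

With only 5 per group you would need a difference of more than a full standard deviation to call it significant; with 100 per group, about a quarter of one. A "no difference" conclusion at n = 5 mostly tells you the difference wasn’t huge.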
You can’t use ordinal data with statistics: An instance of the “you can’t” variety, this complaint often arises when we use rating scale data (like the Net Promoter Score or SEQ) in a statistical test to tell if there was an improvement or significant difference. Ordinal data are data that have a clear order (e.g. an 8 is greater than a 7) but it’s unclear whether the distances between points are equal (e.g. moving from a 4 to a 5 may take more than moving from a 3 to a 4).
Data with equal distances between points are called interval data, and the ordinal-versus-interval argument is an old controversy in statistics. Most scales you encounter in user research, marketing or psychology are ordinal (like the SEQ, SUS and SUPR-Q) and rarely will you find interval scales (like the SMEQ). Some measurement purists (usually theoretical or academic) take issue with using ordinal data to compute even means and standard deviations, and therefore any statistical test.
Most applied statisticians, however, take no issue with using ordinal data to compute statistical tests because they provide very useful results. In fact, even SS Stevens[pdf], the person credited with coming up with the ordinal and interval hierarchy, says it’s OK to perform these “illegal” computations if they provide “fruitful results.” Even those of us in applied stats agree, though, that you should be careful not to make interval-type statements about ordinal data. For example, you shouldn’t claim “users were twice as satisfied” just because the average satisfaction rating is twice as high.
With statistics there are of course mistakes, errors and erroneous conclusions (e.g. if you forget to take the square root of the sample size when computing the standard error). But statistics is more philosophy than math. That is, it’s a field dedicated to developing methods of interpreting data. And like politics and philosophy, well informed people can come to different conclusions and disagree.
In fact, one of the central frameworks in statistics, null hypothesis testing, is the result of years of hateful criticism between two schools of thought (Neyman-Pearson vs. Fisher[pdf]). Today we’re left with a model that neither would like but is generally useful. But one lesson we should learn is that very smart and informed people can bicker like children. Do us all a favor and refrain from the childishness next time you find yourself in disagreement.
Your Data Isn’t Normal: If you know even a little about statistics then you know how central a role the normal distribution plays. Seeing graphs of data that are not at all normal-looking may raise eyebrows and lead some to question your techniques (means, confidence intervals and t-tests, for example). Fortunately, the assumption of normality is more of a “nice to have” than a show-stopper. Technically speaking, for most of the data you encounter in applied research, the sampling distribution of the mean will be quite normal, even at small sample sizes. And when it’s not, most statistical tests still work quite well.
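A quick simulation, using hypothetical and heavily right-skewed “task time” data, illustrates why the sampling distribution of the mean behaves so well even when the raw data don’t:

```python
import random
from statistics import mean

random.seed(1)

# Simulate a heavily right-skewed measure (task times usually are):
# an exponential distribution with mean 30 seconds.
population = [random.expovariate(1 / 30) for _ in range(100_000)]

# Draw 2,000 small samples (n = 10) and look at where their means land.
sample_means = [mean(random.sample(population, 10)) for _ in range(2_000)]

center = mean(sample_means)
below = sum(sm < center for sm in sample_means) / len(sample_means)
print(f"population mean = {mean(population):.1f}")
print(f"mean of sample means = {center:.1f}")
print(f"share of sample means below center = {below:.2f}")
```

Even though the raw data are far from normal, the sample means pile up around the population mean in a roughly symmetric heap, which is exactly what t-based procedures lean on.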
There’s No Difference: While convention has placed the threshold for declaring a difference statistically significant at p < .05, there’s nothing sacred about this level. In applied settings, having p < .10 or p < .20 can be enough evidence to conclude there’s a difference. Relaxing the significance criterion is especially appropriate when the consequences of being wrong are small and you aren’t looking to publish your results, but instead want to pick the better of two alternative designs. There can be a legitimate discussion around how much evidence is really needed.
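As an illustration with hypothetical completion counts, here is a sketch of a two-proportion z-test where the verdict depends entirely on the threshold you choose, not on the math:

```python
from statistics import NormalDist

# Hypothetical data: Design A, 28 of 40 users completed the task;
# Design B, 35 of 40.
x1, n1, x2, n2 = 28, 40, 35, 40
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
print(f"p = {p_value:.3f}")

# The same result reads differently under different alphas:
for alpha in (0.05, 0.10, 0.20):
    verdict = "difference" if p_value < alpha else "no difference"
    print(f"alpha = {alpha:.2f} -> {verdict}")
```

Here the p-value falls between .05 and .10: a journal reviewer would say “no difference,” while a team simply choosing the better design could reasonably act on it.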
You may find that the dispute is not around the statistics but around the time and effort committed toward a project or product that your results are showing as inferior. When this happens it’s more politics than p-values.
If you find yourself being told you are wrong, try to get to the heart of the disagreement. It’s rarely computational, but it could be. If it’s on the interpretation of results, remember that statistics and quantitative interpretation are like philosophy and politics. It’s likely that the people you are working with have differing attitudes toward how much evidence is needed to reject or accept your claim.
For some help on the matter, Robert Abelson wrote an influential book called Statistics as Principled Argument in which he described four styles an audience member can take: brash, stuffy, liberal and conservative. The first two he considered unreasonable and the latter two reasonable but differing in their willingness to explore possibilities.
I’ve never seen this before
Around 10 years ago I was presenting on the idea of combining usability metrics into a single score. It was to a large group of usability engineers, designers, and information architects inside the company where I worked at the time. About three quarters through the presentation someone in the audience raised his hand and said “I have a degree in mathematics and I’ve been in the field for 15 years and I’ve never seen anything like this.”
If something is innovative, it’s either genuinely new or, more likely, an established method or technique from one field repurposed and applied elsewhere. If someone points out they’ve never heard of a quantitative method or technique before, acknowledge that it may be new to them, but encourage them to dig deeper and see how effective it is. If you’re presenting a new metric, method or even visualization, be prepared for skeptical questions, but don’t get derailed by them.
Do We Really Disagree?
Because statistics involves math, it’s easy to think that conclusions should be black and white, right and wrong, like the square of the hypotenuse of a right triangle always equaling the sum of the squares of the legs. A different answer means something was done wrong. Statistics, however, is probabilistic, and there’s disagreement on both the formulas and the results. Interpretation and judgement are essential with quantitative methods, and with interpretation comes disagreement.
Dealing with quantitative objections is like dealing with disagreements in general. Well meaning people will disagree, but when you get to the source of the disagreement, you may find that you agree on more points than you think. Or, you may find you disagree about semantics and the appropriate statistical test, but you both agree that one design is the better solution.