But how do we know?
Smoking precedes cancer (mostly lung cancer). People who smoke cigarettes tend to get lung and other cancers more than those who don’t smoke. We say that smoking is correlated with cancer. Carefully rule out other causes and you have the ingredients to make the case for causation.
Correlation is a necessary but not sufficient ingredient for causation. Or as you’ve no doubt heard: Correlation does not equal causation. A correlation quantifies the association between two things. But correlation doesn’t have to prove causation to be useful. Often just knowing one thing precedes or predicts something else is very helpful. For example, knowing that job candidates’ performance on work samples predicts their future job performance helps managers hire the right candidates. We’d say that work sample performance correlates with (predicts) work performance, even though work samples don’t cause better work performance.
A common (but not the only) way to compute a correlation is the Pearson correlation (denoted with an r), made famous (but not derived) by Karl Pearson in the late 1800s. It ranges from a perfect positive correlation (+1) through no correlation (r = 0) to a perfect negative correlation (−1). In practice, a perfect correlation of 1 means the two measures carry completely redundant information, so you’re unlikely to encounter it.
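To make the definition concrete, here is a minimal sketch of the Pearson r computed from scratch: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the scores are made up for illustration and do not come from any study cited here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r: co-variation of x and y scaled by their individual variation."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # deviations from the mean of x
    dy = [y - mean_y for y in ys]  # deviations from the mean of y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("r is undefined when a variable has no variance")
    return sum_xy / sqrt(sum_xx * sum_yy)

# Made-up scores, for illustration only
work_sample = [1, 2, 3, 4, 5]
job_rating = [2, 1, 4, 3, 5]

r = pearson_r(work_sample, job_rating)
print(round(r, 2))  # 0.8
```

In practice you would more likely call a library routine (for example, `statistics.correlation` in Python 3.10 and later) than write this by hand.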
The correlation coefficient has its shortcomings and is not considered “robust” against things like non-normality, non-linearity, different variances, influence of outliers, and a restricted range of values. Shortcomings, however, don’t make it useless or fatally flawed. It’s still widely used across many scientific disciplines to describe the strength of relationships because it’s often meaningful. It’s sort of the common language of association, as correlations can be computed on many kinds of measures (for example, between two binary measures or ranks).
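The sensitivity to outliers can be illustrated with a small sketch. It compares the Pearson r on raw values with the same formula applied to ranks (the Spearman rank correlation) before and after adding one extreme point. The data are invented for illustration, and the helper `ranks` assumes there are no tied values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from sums of squared deviations and cross-products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)

def ranks(values):
    """Rank from 1 (smallest) to n (largest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Invented data: a strong positive relationship...
x_clean = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2, 1, 4, 3, 6, 5, 8, 7]
# ...and the same data with a single extreme observation added
x_out = x_clean + [9]
y_out = y_clean + [-30]

r_clean = pearson_r(x_clean, y_clean)     # about .90
r_outlier = pearson_r(x_out, y_out)       # about -.39: one point flips the sign
rho_outlier = spearman_rho(x_out, y_out)  # about .33: ranks dampen the outlier

print(round(r_clean, 2), round(r_outlier, 2), round(rho_outlier, 2))
```

The rank-based value still drops, but far less than the raw Pearson r, which is one reason to inspect the data behind a correlation before interpreting it.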
Returning to the smoking and cancer connection, one estimate from a 25-year study on the correlation between smoking and lung cancer in the U.S. is r = .08, a correlation barely above 0. You may have known a lifelong smoker who didn’t get cancer. That illustrates the point (and the low magnitude of the correlation): not everyone who smokes (even a lot) gets cancer.
By some estimates, 75%–85% of lifelong heavy smokers DON’T get cancer. In fact, 80%–90% of people who DO get lung cancer aren’t smokers or never smoked!
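A rough sketch of why a consequential risk can still produce a small correlation: when the outcome is rare, even a large difference in risk between groups yields a small r. The counts below are entirely hypothetical (they are not from the studies cited above) and are chosen only to show the arithmetic. The phi coefficient used here is the Pearson r computed on two yes/no variables.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]; equals Pearson r on two 0/1 variables."""
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator

# HYPOTHETICAL cohort of 10,000 people (invented counts, not real data)
smokers_with, smokers_without = 40, 1960         # 2,000 smokers
nonsmokers_with, nonsmokers_without = 16, 7984   # 8,000 non-smokers

risk_smokers = smokers_with / (smokers_with + smokers_without)
risk_nonsmokers = nonsmokers_with / (nonsmokers_with + nonsmokers_without)
relative_risk = risk_smokers / risk_nonsmokers   # smokers have 10 times the risk

phi = phi_coefficient(smokers_with, smokers_without,
                      nonsmokers_with, nonsmokers_without)

print(round(relative_risk, 1), round(phi, 2))  # 10.0 0.1
```

In this made-up example the risk is ten times higher for smokers, yet the correlation is only about .10, because most people in both groups never develop the disease.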
But one study is rarely the final word on a finding, and certainly not on a correlation. There are many ways to measure the smoking-cancer link, and the correlation varies somewhat depending on who is measured and how.
For example, in another study of developing countries, the correlation between the percent of the adult population that smokes and life expectancy is r = .40, which is certainly larger than the .08 from the U.S. study, but it’s far from the near-perfect correlation conventional wisdom and warning labels would imply.
While correlations aren’t necessarily the best way to describe the risk associated with activities, they’re still helpful in understanding the relationship. But importantly, understanding the details upon which a correlation was formed and understanding their consequences are the critical steps in putting correlations into perspective.
Validity vs. Reliability Correlations
While you probably aren’t studying public health, your professional and personal life are filled with correlations linking two things (for example, smoking and cancer, test scores and school achievement, or drinking coffee and improved health). These correlations are called validity correlations. Validity refers to whether something measures what it intends to measure. We’d say that a set of interview questions that predicts job performance is valid. Or a usability questionnaire is valid if it correlates with task completion on a product. The strength of the correlation speaks to the strength of the validity claim.
At MeasuringU we write extensively about our own and others’ research and often cite correlation coefficients. However, not all correlations are created equal and not all are validity correlations. Two other common types are reliability correlations (the consistency of responses) and monomethod correlations (correlations between measures that come from the same sample of participants). Monomethod correlations are easier to collect (you only need one sample of data), but because the data comes from the same participants the correlations tend to be inflated. Reliability correlations are commonly reported in peer-reviewed papers and are typically much higher, often r > .7. The availability of these higher correlations can contribute to the idea that correlations such as r = .3 or even r = .1 are meaningless.
For example, we found the test-retest reliability of the Net Promoter Score is r = .7. Examples of monomethod correlations are the correlation between the SUS and NPS (r = .62), between individual SUS items and the total SUS score (r = .9), and between the SUS and the UMUX-Lite (r = .83), all collected from the same sample of participants. These are also legitimate validity correlations (called concurrent validity) but tend to be higher because the criterion and prediction values are derived from the same source.
Interpreting Validity Correlation Coefficients
Many fields have their own convention about what constitutes a strong or weak correlation. In the behavioral sciences the convention (largely established by Cohen) is that correlations (as a measure of effect size, which includes validity correlations) above .5 are “large,” around .3 are “medium,” and .10 and below are “small.”
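As a rough illustration, the convention can be written as a small lookup. The hard cutoffs below are a simplifying assumption for this sketch (the benchmarks are approximate guidelines, not sharp boundaries), and the function name `cohen_label` is made up here.

```python
def cohen_label(r):
    """Label the size of a correlation using approximate Cohen-style benchmarks.

    Assumption for this sketch: |r| >= .5 is "large", |r| >= .3 is "medium",
    and anything smaller is "small". The sign is ignored because the
    benchmarks describe strength, not direction.
    """
    if not -1 <= r <= 1:
        raise ValueError("a correlation must be between -1 and 1")
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    return "small"

# The two smoking correlations discussed earlier
for r in (0.08, 0.40):
    print(f"r = {r:.2f} -> {cohen_label(r)}")
# r = 0.08 -> small
# r = 0.40 -> medium
```

A label like this is only a starting point for interpretation; it says nothing about how consequential the outcome is.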
Using Cohen’s convention, though, the link between smoking and lung cancer is weak in one study and perhaps medium in the other. But even within the behavioral sciences, context matters. Even a small correlation with a consequential outcome (such as the effectiveness of psychotherapy) can still have life-and-death consequences.
Squaring the correlation (called the coefficient of determination) is another common way to interpret the correlation (and effect size), but it may also understate the strength of a relationship between variables, and using the standard r is often preferred. We’ll explore more ways of interpreting correlations in a future article.
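A quick sketch of the arithmetic shows why squaring can make a relationship look weaker than r suggests. It uses the two smoking correlations discussed earlier; the function name `coefficient_of_determination` is just for illustration.

```python
def coefficient_of_determination(r):
    """r squared: the proportion of variance in one variable associated with the other."""
    return r ** 2

for r in (0.08, 0.40):
    r_squared = coefficient_of_determination(r)
    print(f"r = {r:.2f} -> r squared = {r_squared:.4f} ({r_squared:.2%} of variance)")
# r = 0.08 -> r squared = 0.0064 (0.64% of variance)
# r = 0.40 -> r squared = 0.1600 (16.00% of variance)
```

Because r is at most 1 in absolute value, squaring always shrinks it (except at 0 and ±1), so an r of .40 becomes "16% of variance," which can sound less impressive than the relationship warrants.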
I’ve collected validity correlations across multiple disciplines from several published papers (many meta-analyses) that include studies on medical and psychological effects, job performance, college performance, and our own research on customer and user behavior to provide context to validity correlations. Many of the studies in the table come from the influential paper by Meyer et al. (2001).
For example, the first entry in Table 1 shows that the correlation between taking aspirin and reducing heart attack risk is r = .02. This is the smallest correlation in the table and barely above 0. Yet aspirin has been a staple of recommendations for heart health for decades, although it is now being questioned.
The blockbuster drug (and TV commercial regular) Viagra has a correlation of r = .38 with “improved performance.” Psychotherapy has a correlation of “only” r = .32 with future well-being. Height and weight, which are traditionally thought of as strongly correlated, have a correlation of r = .44 when objectively measured in the US or r = .38 in a Bangladeshi sample. That’s not that different from the validity of ink-blots in one study. The connection between the “pulse-ox” sensors you put on your finger at the doctor and actual oxygen in your blood is r = .89. All these can be seen in context with the two smoking correlations discussed earlier, r = .08 and r = .40.
Table 1 shows correlations for several indicators of job performance, including college grades (r = .16), years of experience (r = .18), unstructured interviews (r = .38), and general mental ability (r = .51). The best predictor of job performance is work samples (r = .54). See How Google Works for a discussion of how Google adapted its hiring practices based on this data.
Like smoking, the link between aptitude tests and achievement has been extensively studied. Table 1 also contains several examples of correlations between standardized testing and actual college performance: for White and Asian students at the Ivy League University of Pennsylvania (r = .20), college GPA for students in Yemen (r = .41), GRE quantitative reasoning and MBA GPAs from 10 state universities in Florida (r = .37), and SAT scores and cumulative GPA for all students at the Ivy League Dartmouth College (r = .43).
Customer and User Behavior
I’ve included several validity correlations from the work we’ve done at MeasuringU, including the correlation between intent to recommend and 90-day recommend rates for the most recent purchase (r = .79), SUS scores and software industry growth (r = .74), the Net Promoter Score and growth metrics in 14 industries (r = .35), and evaluators’ PURE scores and users’ task-ease scores (r = .67). Similar correlations are also seen in published studies on people’s intent to purchase and purchase rates (r = .53) and intent to use and actual usage (r = .50), as we saw with the TAM.
The lesson here is that while the value of some correlations is small, the consequences can’t be ignored. And that’s what makes general rules of correlations so difficult to apply. My hope is the table of validity correlations here from disparate fields will help others think critically about the effort to collect and the impact of each association.
Summary and Takeaways
This discussion about the correlation as a measure of association and an analysis of validity correlation coefficients revealed:
Correlations quantify relationships. The Pearson correlation r is the most common (but not only) way to describe a relationship between variables and is a common language to describe the size of effects across disciplines.
Validity and reliability coefficients differ. Not all correlations are created equal. Correlations obtained from the same sample (monomethod) or reliability correlations (using the same measure) are often higher (r > .7) and may lead to an unrealistically high correlation bar.
Correlations can be weak but impactful. Even numerically “small” correlations are both valid and meaningful when the contexts of impact (e.g., health consequences) and effort and cost of measuring are accounted for. The smoking, aspirin, and even psychotherapy correlations are good examples of what can be crudely interpreted as weak to modest correlations, but where the outcome is quite consequential.
Don’t set unrealistically high bars for validity. Understanding the context of a correlation helps provide meaning. If something can be measured easily and at low cost yet has even a modest ability to predict an impactful outcome (such as company performance, college performance, life expectancy, or job performance), it can be valuable. The “low” correlation between smoking and cancer (r = .08) is a good reminder of this.
Thanks to Jim Lewis for providing comments on this article.