You often hear that research results are not “valid” or “reliable.”

Like many scientific terms that have made it into our vernacular, these terms are often used interchangeably.

In fact, validity and reliability have different meanings with different implications for researchers.

Validity refers to how well the results of a study measure what they are intended to measure. Contrast that with reliability, which means consistent results over time.

For example, if you weigh yourself four times on a scale and get the values 165, 164, 165, and 166, then you can say that the scale is reasonably reliable since the weights are consistent. If, however, you weigh 175 pounds and not 165, the scale measurement has little validity!
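The scale example can be put into numbers with a short Python sketch. It uses the four readings and the true weight from the paragraph above; treating the sample standard deviation as a stand-in for reliability and the gap between the average reading and the true weight as a stand-in for validity is a simplifying assumption made here for illustration, not a formal psychometric definition.

```python
from statistics import mean, stdev

readings = [165, 164, 165, 166]  # the four weigh-ins from the example, in pounds
true_weight = 175                # the actual weight from the example, in pounds

# Reliability (illustrative proxy): how tightly the repeated readings cluster.
spread = stdev(readings)

# Validity (illustrative proxy): how far the average reading is from the truth.
bias = mean(readings) - true_weight

print(f"spread (sample SD): {spread:.2f} lb")
print(f"bias (mean reading - true weight): {bias:+.1f} lb")
```

The spread comes out under one pound while the bias is ten pounds: the scale is consistent, and consistently wrong.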

Reliability is necessary, but not sufficient to establish validity.

In a similar vein, if we ask 500 customers at various times during a week to rate their likelihood of recommending a product (assuming that no relevant variables have changed during that time) and we get scores of 75%, 76%, and 74%, we could call our measurement reliable.

The likelihood-to-recommend question is the one used to compute the Net Promoter Score (NPS). The NPS is intended to predict two things. First, it’s intended to predict how many customers will recommend in the future based on what customers say now. Second, because customer recommendations in turn predict company growth, it’s intended to predict growth. If the NPS doesn’t differentiate between high-growth and low-growth companies, then the score has little validity.

Test Validity versus Experimental Validity

Don’t confuse this type of validity (often called test validity) with experimental validity, which is composed of internal and external validity. Internal validity indicates how much faith we can have in cause-and-effect statements that come out of our research. External validity indicates the extent to which findings can be generalized.

Test validity gets its name from the field of psychometrics, which got its start over 100 years ago with the measurement of intelligence vs school performance, using those standardized tests we’ve all grown to loathe. Even though we rarely use tests in user research, we use their byproducts: questionnaires, surveys, and usability-test metrics, like task-completion rates, elapsed time, and errors.

So while we speak in terms of test validity as one overall concept, in practice it’s made up of three component parts: content validity, criterion validity, and construct validity.

To determine whether your research has validity, you need to consider all three types of validity using the tripartite model developed by Cronbach & Meehl in 1955, as shown in Figure 1 below.

Figure 1: The tripartite view of validity, which includes criterion-related, content and construct validity.

Content Validity

The idea behind content validity is that the questions administered in a survey, questionnaire, usability test, or focus group come from a larger pool of relevant content. For example, if you’re measuring the vocabulary of third graders, your evaluation includes a subset of the words third graders need to learn.

There’s no direct measure of content validity. To establish content validity, you consult experts in the field and look for a consensus of judgment. Measuring content validity therefore entails a certain amount of subjectivity (albeit with consensus).

When I developed the SUPR-Q, a questionnaire that assesses the quality of a website user experience, I first consulted other experts on what describes the quality of a website. This consensus of content included aspects like usability, navigation, reliable content, visual appeal, and layout.

Criterion-Related Validity

The next part of the tripartite model is criterion-related validity, which does have a measurable component. Usually, customer research is conducted to predict an outcome—a better user experience, happier customers, higher conversion rates, more customers recommending, more sales. We can think of these outcomes as criteria. We want our measures to properly predict these criteria.

To assess criterion-related validity, we correlate our measure with a criterion using the correlation coefficient r. The higher the correlation, the higher the criterion validity. We typically want to compare our measure against a gold-standard criterion rather than against another measure (as is done for convergent validity, discussed below).
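As a minimal sketch of that calculation, the Python below computes the Pearson correlation coefficient r between a measure and a criterion. The function name pearson_r and both data series are hypothetical, made up purely for illustration: they stand in for a questionnaire score per website and the conversion rate observed for the same websites.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: one questionnaire score and one conversion rate (%) per website.
questionnaire_scores = [62, 70, 75, 81, 88, 93]
conversion_rates = [1.1, 1.9, 1.6, 2.4, 2.8, 3.1]

r = pearson_r(questionnaire_scores, conversion_rates)
print(f"criterion validity r = {r:.2f}")
```

With real data, the same calculation is available in standard statistics packages; the hand-written version is shown only to make the arithmetic visible.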

The two types of criterion validity, concurrent and predictive, differ only by the amount of time elapsed between our measure and the criterion outcome.

Concurrent Validity measures correlations with criteria that are collected at about the same time as our measure. For example, we can ask customers how likely they are to renew a service within a few days of the renewal period and correlate their answers with whether they renew. Concurrent validity is often used in education, where a new test of, say, mathematical ability is correlated with other math scores held by the school.

Predictive Validity measures correlations with criteria that are separated from our measure by a determined period. Using the same example, we can measure customers’ likelihood to renew at the beginning of the year, and then correlate that with which customers did renew at the end of the year.


Construct Validity

Constructs, like usability and satisfaction, are intangible and abstract concepts. We want to be sure, when we declare a product usable, that it is in fact easy to use. When we say that customers are satisfied, we must have confidence that we have in fact met their expectations.

Construct validity measures how well our questions yield data that measure what we’re trying to measure. Like criterion-related validity, construct validity uses a correlation to assess validity. Construct validity comes in two flavors: convergent and discriminant.

Convergent Validity indicates how well a measure correlates with other measures that ostensibly measure the same thing. This is similar to concurrent validity except that we’re correlating against other measures and not a gold-standard criterion, such as observed usage, sales, or support calls. To measure convergent validity, have participants in a study answer your questions along with a previously validated instrument. For example, when I validated the SUPR-Q, participants also answered the System Usability Scale (SUS) as a measure of convergent validity. The correlation between the usability factors on the two questionnaires is high (r > 0.8), showing high convergent validity.

Discriminant Validity establishes that one measure is not related to another measure. For example, we don’t expect usability scores to correlate with the power consumption of mobile apps. We often create a new measure of, say, customer excitement. If our measure of customer excitement is highly correlated with customer satisfaction, then we are probably measuring much of the same thing and don’t have evidence for discriminant validity.

Ideally, you can show both discriminant and convergent validity with your measures to establish construct validity.
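The two checks can be sketched side by side in Python. Everything in this example is hypothetical: the scores for a new usability measure, the SUS scores from the same participants, and the power-consumption figures are invented for illustration, and the helper pearson_r is a plain implementation of the correlation coefficient rather than anything taken from a specific library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

# Hypothetical scores from the same six participants (or apps).
new_usability = [55, 62, 70, 74, 81, 90]        # the new measure being validated
sus_scores = [50, 65, 68, 77, 80, 92]           # previously validated instrument
power_draw_mw = [310, 290, 335, 300, 325, 305]  # a measure we expect to be unrelated

# Convergent check: the new measure should track the validated instrument.
r_convergent = pearson_r(new_usability, sus_scores)

# Discriminant check: the new measure should not track the unrelated measure.
r_discriminant = pearson_r(new_usability, power_draw_mw)

print(f"convergent r (vs. SUS): {r_convergent:.2f}")
print(f"discriminant r (vs. power draw): {r_discriminant:.2f}")
```

A high first correlation and a near-zero second one is the pattern we would hope to see; how high and how low is good enough remains a judgment call that depends on the measures involved.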


Summary & Conclusion

Although the tripartite model of validity is itself under constant scrutiny, it has endured and has been the standard for decades. It is a staple in determining the validity of research findings.

To establish a method of measurement as valid, you’ll want to use all three validity types.

  • Content validity: Consult with other experts to refine the measures and to ensure that you’re covering all aspects.
  • Criterion-related validity: Correlate the measure with some external gold-standard criterion that your measure should predict, such as conversion rates, sales, recommendation rates, or actual usage by customers.
  • Construct validity: Correlate the measure with other known measures. Correlate a new measure of usability with the SUS. Correlate a new measure of loyalty with the Net Promoter Score. High correlations indicate convergent validity. If your measure is supposed to measure something different—delight versus satisfaction—then look for low or no correlation to establish discriminant validity.