Like a late-night infomercial, some are even touted as the next BIG thing, like the NPS was.
New questionnaires and measures are a natural part of the evolution of measurement (especially measuring difficult things such as human attitudes). It’s a good thing.
I’ll often help peer review new questionnaires published in journals and conference proceedings such as the JUS, IJHCI, and CHI. Such questionnaires, usually administered in a survey, are often referred to simply as measures. Not only do I review, I’m partially to blame for new measures. In the last 10 years I’ve introduced a few: the SUPR-Q, SUPR-Qm, and SEQ.
But how do you know whether a new measure is good, adequate, or fails to live up to its hype? All too often “experts” on social media seem to prefer one measure or consider other measures “harmful” or a “waste of time” based more on opinion than data. These debates become more like arguments over which color of Starburst candy is “best” (red of course).
Unfortunately, when you dig into the hype it’s often based on flimsy reasons such as not liking the number of scale points or the wording of an item. But there’s more to assessing the quality of a questionnaire than just examining the content of the items or labels on a scale. How do you separate all the social media noise from real reasons to be concerned or encouraged by a measure?
Here are both psychometric and practical considerations to use to assess the quality of a new questionnaire or measure.
- Is it reliable? A questionnaire needs to be reliable to be valid, so you start by measuring reliability. Respondents need to understand the questions and response options and respond consistently to the same stimulus or concepts. Three measures of reliability are popular: test-retest (measuring at two different points in time), alternate forms (assessing different versions), and internal consistency reliability (a sort of correlation between items). The third is the most popular because it requires only one dataset, but it also requires multiple items. It’s measured using Cronbach’s alpha, and you want higher values: above .7 at minimum and ideally above .9.
- Is it valid? This is a big one. We want the questionnaire to measure what we intend it to measure. There isn’t a single way to measure validity (psychometricians disagree on how to measure validity). Instead, the current approach is to use multiple methods. Items should describe the construct (content validity) and be able to predict outcomes (predictive validity) including better experiences, more purchases, repeat business, reduced support calls, and even other measures (convergent validity). The more valid the measures, the better.
- Does it leverage existing questionnaires? There’s no need to reinvent the wheel. Items should be selected based on prior research. In many applied domains, someone has likely thought of how to ask a question about ease of use, visual appeal, satisfaction, or loyalty. How does a “new” questionnaire build on what’s already been measured?
- Is it sensitive enough to detect differences? You don’t want a blunt instrument. The measure should be able to differentiate as small a difference as possible. While not all small differences are meaningful (e.g. what’s a .1 difference on an 11-point scale?), you want to let the researcher interpret the impact and not be limited by your instrument. Often very small differences can have very practical implications. Sensitivity itself can be a form of validity.
- Are there reference scores? One of the hardest things with new measures is interpreting the scores. Is 6 good? What about 55? Publishing reference scores (often called normed scores) that others can compare against helps provide an immediate context for “good,” “ok,” or “bad.” Usually questionnaires with a professionally normed database of scores will not be free (e.g. SUPR-Q, SUMI, QUIS). But even free questionnaires such as the SUS, ACSI, and NPS have available reference scores to act as benchmarks.
- How efficient is it? Somewhat related to length is what you get out for what your respondents put in. Questionnaires can measure more than one construct and typically the minimum you need is two items per construct (to assess internal reliability and factor loadings). For example, the SUS, while relatively short at 10 items, isn’t terribly efficient. It measures only a single construct (ease of use), where a smaller subset of items would likely be sufficient. The SUPR-Q has two items per factor, so with eight items you get four constructs. What’s more, with the same eight items you can estimate the SUS highly accurately and generate the NPS, making the SUPR-Q very efficient. The UMUX-Lite is hyper efficient, providing a measure of two constructs (use and usefulness) with two items.
- Does the actual factor structure match the proposed structure? Many questionnaires claim to measure multiple things. Is there empirical evidence that the items support multiple dimensions? This can be established in one study using an exploratory factor analysis (EFA) or a Rasch model (for item response theory) and then verified in a follow-up study using a confirmatory factor analysis (CFA) as we did with the SUS factor structure. Just because a measure contains items that describe different content (e.g. delight or satisfaction) doesn’t mean it’s actually measuring different things.
- Has it been road-tested, and not just piloted? While all you need is one dataset to establish the reliability and validity of a new measure, I like to see the measure used across more than one dataset and in different contexts. This is especially the case if the pilot group were students or a small sample. You want to see that the factor structure still holds and the data is still reliable and valid.
- Is it too long? The more items you have, the more reliable your instrument will be. Of course, the participant pays the price for this higher reliability. Outside of forced compliance (e.g. government surveys or school tests), you’ll generally want shorter questionnaires. A 50-item questionnaire is rarely practical in applied settings. If it’s that long, there should be at least a short version that may exclude some constructs.
- How flexible is it? Will it be limited to only intranets, certain products, countries, or participant populations? There is a balance between being specific but also generalizable enough to allow researchers to use it in multiple contexts. For example, we updated our SUPR-Q trust items to assess trust on sites that don’t have clear ecommerce capabilities (e.g. Information and Company websites), but the SUPR-Q is intended for public-facing websites, not products.
- Does it require special administration? I like the potential of visual analog scales but digital versions require software to administer. Adaptive measures using IRT also show promise for reducing questionnaire length, but require specialized software to administer, too. We also found the administration overhead with magnitude estimation probably wasn’t worth the effort.
- Is there backward compatibility? One of the biggest “costs” associated with changing measures is the loss of historical data. Often what little is gained in changing measures can be lost by not being able to compare to your older scores that provide context. Is there a way to provide an estimate to other measures? For example, with the SUPR-Q we wanted to keep continuity with other popular UX measures and both SUS estimates and Net Promoter Scores can be generated from the SUPR-Q’s eight items. Adding one more item will also generate the UMUX-Lite (5-point version) making the SUPR-Q compatible and efficient.
- Is it “hard” to answer? Related to length, you don’t want complicated questions or response scales. For example, visual analog scales (“sliders”) may be more difficult for the very elderly or physically disabled to answer. As another example, we liked the idea of the Usability Magnitude Estimate (UME) but found participants really struggled to use it. You can gauge whether respondents have difficulty answering by conducting a cognitive walkthrough with participants while they answer. But don’t dismiss a new measure just because you don’t like the number of response options or the scale; that doesn’t necessarily make it harder to answer.
- Is the scoring complex or problematic? As with special software needed to administer a measure, a difficult scoring system makes a measure less appealing. Most measures simply average together responses, but some add scores for a total or use box-scoring (as the NPS does). While this isn’t necessarily a problem, it can introduce complexities. For example, we found about 10% of SUS scores were miscalculated by researchers due to the complex scoring method. What’s more, a measure that sums items (instead of averaging them) becomes harder to score if a respondent skipped or missed an item, which is why I recommend averaging instead of summing items.
- Are you realistic about what the measure does? Don’t expect a measure to do everything perfectly all the time. If we set impossible criteria for a measure, then all will come up short. You shouldn’t expect a measure to work outside the context in which it was developed, to show the same outcomes no matter who collected the data or how it was collected, or to have no error. There’s always measurement error; it’s a question of how much. For example, the Net Promoter Score and SUS were never intended to tell you what to fix in an interface. But that’s not a shortcoming of the measure; it’s misplaced expectations for the measure.
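To make the internal consistency check from the reliability item above concrete, here is a minimal sketch of Cronbach’s alpha using the standard formula (the number of items, the item variances, and the variance of respondents’ total scores). The function name `cronbach_alpha` and the sample data are illustrative assumptions, not taken from any published questionnaire.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per respondent)."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item (column) across respondents
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's summed score
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents answering 3 items on a 5-point scale
data = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(data)
print(f"alpha = {alpha:.2f}")
```

With a real dataset you would compare the result against the thresholds mentioned above (at least .7, ideally above .9). A sample this small is only for illustration.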
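The scoring item above mentions three approaches: averaging, a transformed sum (SUS), and box-scoring (NPS). The sketch below shows each one, assuming the standard published scoring rules: SUS odd items contribute the answer minus 1, even items contribute 5 minus the answer, and the total is multiplied by 2.5; NPS is the percentage of 9s and 10s minus the percentage of 0s through 6s. The function names and the use of `None` for a skipped item are illustrative choices.

```python
def sus_score(responses):
    """Score one respondent's 10 SUS answers (each 1-5, in item order)."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += answer - 1   # odd items are positively worded
        else:
            total += 5 - answer   # even items are negatively worded
    return total * 2.5            # rescale the 0-40 total to 0-100


def net_promoter_score(ratings):
    """Box-score 0-10 likelihood-to-recommend ratings."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


def mean_of_answered(responses):
    """Average the answered items, skipping missing ones (None)."""
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no answered items")
    return sum(answered) / len(answered)


print(sus_score([3] * 10))                      # all-neutral answers
print(net_promoter_score([10, 10, 9, 8, 7, 6, 3, 9, 10, 2]))
print(mean_of_answered([4, None, 5, 3]))        # one skipped item
```

Note how the averaged score stays on the original response scale when an item is skipped, whereas a summed score would be understated; the SUS function simply refuses an incomplete set of answers, which is one way the missing-data problem shows up in practice.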