Good intentions? Because someone influential said to use it online?
A measure is valid if it can be demonstrated that it measures what it is intended to measure, has the expected alignment of items with factors, and has the expected statistical relationships with other metrics. Its usage also depends on its practicality.
So how do you demonstrate validity? It takes data and disclosure.
At MeasuringU, we originally benchmarked websites using the SUS. Enough data were publicly available for us to generate percentile rankings from raw SUS scores, making the perceived usability data more interpretable.
But we knew that the quality of the website user experience was more than just usability.
We started to develop what’s come to be known as the Standardized User Experience Percentile Rank Questionnaire (SUPR-Q®) in 2011 and published our findings in 2015.
The SUPR-Q is a short (eight-item) questionnaire that measures perceptions of Usability, Trust, Appearance, and Loyalty for websites. The combined score provides an overall measure of the quality of the website user experience.
We wanted to maintain the percentile ranking we had built from the SUS data, so the SUPR-Q also provides relative rankings expressed as percentiles. A SUPR-Q percentile score of 50 is average (roughly half the websites evaluated in the past with the SUPR-Q have received better scores and half received worse). The normative database contains responses from more than 10,000 participants and 150 websites (updated on an ongoing basis, about once per quarter). Its compactness and normed database made it practical, but we needed to show it also had strong psychometric properties.
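A percentile rank of this kind is simply the percentage of scores in the normative database that fall below a given raw score (with ties counted as half, a common mid-rank convention). As a minimal sketch with made-up scores, not the actual SUPR-Q norms:

```python
def percentile_rank(score, norm_scores):
    """Percentage of normed scores below `score`,
    counting ties as half (mid-rank convention)."""
    below = sum(s < score for s in norm_scores)
    ties = sum(s == score for s in norm_scores)
    return 100 * (below + 0.5 * ties) / len(norm_scores)

# Hypothetical normative scores for illustration only
norms = [3.1, 3.4, 3.6, 3.8, 3.9, 4.0, 4.1, 4.3, 4.5, 4.7]
print(round(percentile_rank(4.0, norms), 1))  # → 55.0
```

A score of 4.0 here beats five of the ten normed scores and ties one, landing at the 55th percentile.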
During its development, the final version of the SUPR-Q was informed by psychometric analysis of over 4,000 responses across 100 website experiences. Iterative item selection led to an efficient questionnaire with two items per construct, with validity established using exploratory factor analysis and acceptable reliability (coefficient α > .70) for the overall scale and most subscales (Overall: α = .86, Usability: α = .88, Trust: α = .85, Appearance: α = .78, Loyalty: α = .64). In a study of 40 websites (n = 2,513), the global SUPR-Q and its subscales discriminated well between the lowest- and highest-quality websites, providing evidence of its sensitivity.
In this article, we report the results of a confirmatory factor analysis (CFA) to validate the SUPR-Q questionnaire and a multiple regression analysis of the basic SUPR-Q measurement model (how well the Usability, Trust, and Appearance metrics account for variation in the Loyalty metric).
The SUPR-Q Questionnaire
Shown in Figure 1, the SUPR-Q measures four website UX factors with eight questions: Usability (easy to use, easy to navigate), Trust (trustworthy, credible), Appearance (attractive; clean and simple), and Loyalty (likelihood to revisit, likelihood to recommend). The score for each subscale is the average of its two items (after dividing the 0–10-point likelihood-to-recommend (LTR) rating by 2). The overall score is the average of the four subscale scores.
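These scoring rules can be sketched in a few lines of Python (the item keys below are hypothetical names for illustration, not official SUPR-Q identifiers):

```python
def score_suprq(responses):
    """Score one SUPR-Q response: seven items on a 1-5 scale plus
    a 0-10 likelihood-to-recommend (LTR) item, which is divided
    by 2 before averaging. Item keys are hypothetical."""
    usability  = (responses["easy_to_use"] + responses["easy_to_navigate"]) / 2
    trust      = (responses["trustworthy"] + responses["credible"]) / 2
    appearance = (responses["attractive"] + responses["clean_simple"]) / 2
    loyalty    = (responses["revisit"] + responses["recommend"] / 2) / 2
    overall = (usability + trust + appearance + loyalty) / 4
    return {"usability": usability, "trust": trust,
            "appearance": appearance, "loyalty": loyalty, "overall": overall}

example = {"easy_to_use": 4, "easy_to_navigate": 5, "trustworthy": 4,
           "credible": 4, "attractive": 3, "clean_simple": 4,
           "revisit": 5, "recommend": 8}
scores = score_suprq(example)  # overall = 4.125
```

Note that only the LTR item is rescaled; the other seven items are already on a 1–5 scale.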
Figure 1: The SUPR-Q questionnaire (created in MUiQ®).
Confirmatory Factor Analysis of the SUPR-Q
How did we know the eight items we selected measured our intended constructs? We used a statistical technique called exploratory factor analysis (EFA). This approach shows how well the data we collect (what we can observe) measure what we can’t see but want to measure (e.g., usability, loyalty). As a measure is used and more data are collected, it’s good practice to show that the original factors still provide good measures of the constructs.
Now that we have used the SUPR-Q for over a decade, we decided to conduct a confirmatory factor analysis (CFA). As their names suggest, researchers use EFA in the early stages of research to explore different plausible factor structures (e.g., items to retain, number of factors), then use CFA on an independent set of data to assess the model fit of the most promising factor structure found during EFA.
There are many ways to assess the quality of fit of a CFA model. We focused on the combination of Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Bayesian Information Criterion (BIC). There are guidelines for good levels of model fit for CFI (> 0.90) and RMSEA (< 0.08), but not for BIC, which is used to compare models (smaller is better).
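As a rough sketch of how these guidelines are applied, the snippet below computes RMSEA from a model chi-square and checks the CFI and RMSEA cutoffs (BIC is omitted because it is only meaningful when comparing models; this is one common RMSEA formulation, and some software divides by N rather than N − 1):

```python
import math

def rmsea(chi2, df, n):
    """RMSEA from the model chi-square, its degrees of freedom,
    and sample size (common N - 1 formulation)."""
    return math.sqrt(max((chi2 - df) / (df * (n - 1)), 0.0))

def meets_fit_guidelines(cfi, rmsea_value, cfi_cutoff=0.90, rmsea_cutoff=0.08):
    """True when both indices meet the conventional guidelines:
    CFI above its cutoff and RMSEA below its cutoff."""
    return cfi > cfi_cutoff and rmsea_value < rmsea_cutoff

# Fit statistics reported for the SUPR-Q CFA in this article
print(meets_fit_guidelines(cfi=0.993, rmsea_value=0.05))  # → True
```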
For this analysis, we used SUPR-Q data from eight retrospective consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and, in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors).
| Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%) |
|---|---|---|---|---|---|---|---|
| Real Estate | 269 | Apr 2022 | 5 | 48 | 51 | 48 | 52 |
| Travel Aggregator | 452 | May 2022 | 9 | 48 | 51 | 48 | 52 |
| Business Info | 183 | Jul 2022 | 3 | 46 | 53 | 42 | 58 |
| Domestic Air | 350 | May 2022 | 7 | 48 | 49 | 58 | 42 |
| International Air | 200 | May 2022 | 5 | 53 | 46 | 61 | 39 |
| Ticketing | 234 | Jun 2022 | 5 | 45 | 52 | 40 | 60 |
| Clothing | 550 | Dec 2022 | 13 | 52 | 45 | 48 | 52 |
| Wireless | 523 | Jan 2023 | 10 | 47 | 50 | 40 | 60 |
| Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
Table 1: Summary of participant gender and age for eight consumer surveys.
The eight surveys shown in Table 1 were retrospective studies of the UX of websites in their respective sectors. Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q and basic demographic items. For each survey, we used screening questions to identify respondents who had used one or more of the target websites within the past year, then invited them to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
Figure 2 shows the results of the CFA. The loadings (link weights) for each item with respective factors were very strong (from .74 to .89) and statistically significant (p < .0001). The model had excellent fit statistics (CFI: .993, RMSEA: .05, BIC: 284.6). The reliability of the overall and all subscales exceeded .70 (Overall: α = .90, Usability: α = .88, Trust: α = .87, Appearance: α = .80, Loyalty: α = .73).
Figure 2: Confirmatory factor analysis of the SUPR-Q (n = 2,761, CFI = .993, RMSEA = .05, BIC = 284.6). The ovals are what we want to measure but can’t observe. The rectangles are the items in the SUPR-Q that attempt to measure the constructs.
The Basic SUPR-Q Measurement Model
When we developed the SUPR-Q model, we knew from both the published literature and our own data that the four factors (Usability, Trust, Appearance, and Loyalty) were correlated. Correlation, of course, does not mean causation, and it can be difficult to establish causal direction without controlled experimental manipulation. However, some work has shown that attitudes toward usability affect attitudes toward appearance. We had reason to believe that UX quality and its components affect intent to use and likelihood to recommend (Loyalty).
In addition to its usefulness as a single measure of the UX of websites, the components of the SUPR-Q can be used in a framework in which Usability, Trust, and Appearance predict (are antecedents of) Loyalty. The model shown in Figure 3 is based on the data set described in Table 1. Values on the links from Usability, Trust, and Appearance to Loyalty are multiple regression beta weights (all statistically significant with p < .0001, beta weights ranging from .22 to .32), with the three predictors accounting for almost half (46%) of the variation in Loyalty—a highly significant model.
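The kind of analysis behind Figure 3 can be sketched with synthetic data: standardize the predictors and the outcome, and the least-squares coefficients are then the standardized beta weights. This simulation uses made-up coefficients for illustration only; it is not the article's data or analysis code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic correlated predictors standing in for Usability, Trust, Appearance
usability  = rng.normal(size=n)
trust      = 0.5 * usability + rng.normal(size=n)
appearance = 0.5 * usability + rng.normal(size=n)
# Loyalty driven by all three plus noise (coefficients are arbitrary)
loyalty = 0.3 * usability + 0.4 * trust + 0.25 * appearance + rng.normal(size=n)

def standardize(x):
    return (x - x.mean()) / x.std()

X = np.column_stack([standardize(v) for v in (usability, trust, appearance)])
y = standardize(loyalty)
betas, *_ = np.linalg.lstsq(X, y, rcond=None)  # standardized beta weights
r_squared = 1 - np.sum((y - X @ betas) ** 2) / np.sum(y ** 2)
```

Because the variables are standardized, no intercept is needed and the beta weights are directly comparable in magnitude, which is how the relative importance of the three drivers in Figure 3 can be read.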
Figure 3: The basic SUPR-Q measurement framework (n = 2,761).
In the future, we plan to use this basic framework to model additional consequences, such as brand attitude, and additional antecedents, such as perceived clutter and usefulness.
Summary and Discussion
The SUPR-Q has over a decade of data and usage. Our psychometric analyses (CFA and regression model) of the basic SUPR-Q model using data from retrospective studies of eight sectors (n = 2,761 across 57 websites) found:
The SUPR-Q exhibits strong evidence of validity. The CFA showed that all items loaded strongly on their intended factors (loadings from .74 to .89, all p < .0001), and the fit statistics were excellent (CFI: .993, RMSEA: .05, BIC: 284.6). These findings strongly support the construct validity of the SUPR-Q.
The SUPR-Q exhibits acceptable to good reliability. For these analyses, the SUPR-Q scale reliabilities, assessed with coefficient alpha, all exceeded the commonly used criterion of .70 (Overall: α = .90, Usability: α = .88, Trust: α = .87, Appearance: α = .80, Loyalty: α = .73). These estimates of reliability were very close to those reported in the original SUPR-Q publication, but this time the estimate for Loyalty, originally .64, was .73. Some analysts have suggested that the Spearman-Brown method provides better reliability estimates than coefficient alpha when scales have just two items, but there were no meaningful differences in the reliability estimates for these data.
SUPR-Q components predict loyalty. The antecedents of the basic SUPR-Q measurement model account for almost half of the variation in Loyalty. All three antecedents are significant key drivers of Loyalty with beta weights in roughly the same range (.22 for Usability, .26 for Appearance, and .32 for Trust), accounting for 46% of the variation in Loyalty.
Bottom line: The basic SUPR-Q measurement model is psychometrically strong, making it an excellent starting point for investigating how its components relate to additional constructs such as brand attitude, usefulness, and perceived clutter.