How to Score and Interpret the Five-Item SUPR-Qm V2

Jim Lewis, PhD • Jeff Sauro, PhD

We developed the SUPR-Qm® to measure the uniqueness of the mobile app user experience.

You can measure mobile apps using technology-agnostic questionnaires such as the UX-Lite® and SUS. But our research and experience suggest that the mobile app experience warrants a tailored questionnaire, much as the SUPR-Q is tailored to websites.

People have different expectations for mobile apps than for websites, especially given how an app can integrate with their phone’s features and use personal data like their face, fingerprints, and location.

We published the first version of the SUPR-Qm in 2017. It was a 16-item questionnaire developed using Rasch analysis. Since its introduction, we’ve been analyzing data to create benchmarks and to see whether we could reduce the number of items.

We’ve published several articles documenting our research to improve the SUPR-Qm, including:

Figure 1: The SUPR-Qm V2 (created in our MUiQ® platform).

The research was promising, suggesting the five-item version was reliable and stable. While the SUPR-Qm has a simple raw scoring system (16 to 80 for the sixteen-item version and 5 to 25 for the five-item version), we wanted to create an easier way to interpret it by scaling scores from 0 to 100 (like the SUS and UX-Lite) and applying a grade scale.

In this article, we describe the development of curved grading scales to support easy interpretation of the SUPR-Qm V1 and V2.

Method

We’re all familiar with letter grades from our time as students. While they certainly can bring back bad memories, letter grades (A to F) are almost universally understood, not just in school, but in business and UX metrics, too. Letter grades were first applied to the System Usability Scale (SUS) in 2008, and we also created a grading scale for the UX-Lite.

To create the grading scale, we pooled data collected from February 2019 through May 2023. We used our MUiQ platform to collect UX data for 23 industries (such as dating, pets, and office supplies) from a total of 155 websites. The primary purpose of these surveys was to refresh a normative database for the interpretation of SUPR-Q® scores, but we also collected SUPR-Qm data from respondents who indicated that they used the mobile app for the company or service they were rating.

All participants were members of a professional online consumer panel, all from the United States. Suspicious cases were removed before analysis using standard methods (such as inspection of completion times, responses in free text fields, and person fit statistics). The total sample size was 4,149 (48% male, 50% female, 42% less than 30 years old, and 58% 30 years or older).

We used the logit scales for V1 and V2 of the SUPR-Qm to estimate the probabilities for each possible SUPR-Qm score, then used those probabilities to create curved grading scales for the interpretation of the SUPR-Qm V1 and V2.

Results

One of the advantages of Rasch scaling is the alignment of all possible scale scores on a logit scale, which enables conversion of scores to percentile-like probabilities with the formula p = exp(logit)/(1 + exp(logit)). When there are 16 five-point items, as in the SUPR-Qm V1, summed scores can range from 16 (if respondents select 1 for each item) to 80 (if respondents select 5 for each item). When there are five items, as in the SUPR-Qm V2, the summed scores range from 5 to 25.
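The logit-to-probability conversion is the standard logistic function. A minimal Python sketch follows; the function name is illustrative rather than taken from the research, and the branch on the sign of the logit is only there to avoid overflow for extreme values.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Logistic transform: p = exp(logit) / (1 + exp(logit)).

    Written in a numerically stable form so that very large positive
    or negative logits do not overflow math.exp.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


if __name__ == "__main__":
    # A logit of 0 sits at the midpoint of the probability scale.
    print(logit_to_probability(0.0))
```

A logit of 0 maps to a probability of 0.5, and logits of equal size but opposite sign map to probabilities that sum to 1.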

For all possible SUPR-Qm V1 and V2 summed scores, we converted the summed score to a five-point mean score by dividing the summed score by the number of items, converted five-point scales to a 0–100-point scale for easier reporting and interpretation, computed the associated logits, transformed logits to probabilities, and assigned standard letter grades and grade points to those probabilities using common probability ranges for curved grading scales. We used those analyses to create curved grading scales for both versions of the SUPR-Qm (Table 1).
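The first two steps of that sequence (summed score to five-point mean, then mean to 0–100) can be sketched in Python. This sketch assumes the usual linear interpolation from a 1–5 scale, (mean − 1) × 25, which the text above does not spell out. The function name is hypothetical.

```python
def summed_to_0_100(summed_score: float, n_items: int) -> float:
    """Convert a SUPR-Qm summed score to a 0-100 score.

    Assumes five-point items (1 to 5) and linear interpolation:
    a mean of 1 maps to 0 and a mean of 5 maps to 100.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not n_items <= summed_score <= 5 * n_items:
        raise ValueError("summed score is outside the possible range")
    mean_score = summed_score / n_items  # five-point mean, 1 to 5
    return (mean_score - 1) * 25         # 0-100 scale


if __name__ == "__main__":
    print(summed_to_0_100(15, 5))   # V2: all items rated 3
    print(summed_to_0_100(80, 16))  # V1: all items rated 5
```

Under this assumption, a V2 respondent who selects 3 for every item (summed score of 15) lands at the midpoint, 50, and the maximum summed score for either version maps to 100.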

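| SUPR-Qm V1 Score | SUPR-Qm V2 Score | Curved Grade | Grade Point | Probability Range |
|---|---|---|---|---|
| 89.4–100.0 | 87.0–100.0 | A+ | 4.0 | 96–100% |
| 81.2–89.3 | 79.0–86.9 | A | 4.0 | 90–95% |
| 76.2–81.1 | 74.5–78.9 | A− | 3.7 | 85–89% |
| 71.9–76.1 | 70.5–74.4 | B+ | 3.3 | 80–84% |
| 64.8–71.8 | 63.5–70.4 | B | 3.0 | 70–79% |
| 61.5–64.7 | 60.5–63.4 | B− | 2.7 | 65–69% |
| 58.4–61.4 | 57.5–60.4 | C+ | 2.3 | 60–64% |
| 46.3–58.3 | 46.5–57.4 | C | 2.0 | 41–59% |
| 42.2–46.2 | 42.8–46.4 | C− | 1.7 | 35–40% |
| 25.0–42.1 | 27.5–42.7 | D | 1.0 | 15–34% |
| 0.0–24.9 | 0.0–27.4 | F | 0.0 | 0–14% |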

Table 1: Curved grading scale for interpreting SUPR-Qm V1 (sixteen-item) and V2 (five-item) scores after interpolation from five-point to 0–100-point scales.

Note that the score ranges for the V1 and V2 grades are similar but not identical, so be sure to use the right one.
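For readers who want to apply Table 1 programmatically, here is a minimal Python sketch of a grade lookup. The lower bounds are copied from Table 1, while the names `SCALES` and `curved_grade` are illustrative. The sketch assumes a score falling between two printed ranges (for example, 86.95) belongs to the lower grade, and it writes minus grades with an ASCII hyphen.

```python
# (lower bound on the 0-100 scale, grade, grade point), highest first.
# Values are the lower ends of the ranges in Table 1.
SCALES = {
    "V1": [(89.4, "A+", 4.0), (81.2, "A", 4.0), (76.2, "A-", 3.7),
           (71.9, "B+", 3.3), (64.8, "B", 3.0), (61.5, "B-", 2.7),
           (58.4, "C+", 2.3), (46.3, "C", 2.0), (42.2, "C-", 1.7),
           (25.0, "D", 1.0), (0.0, "F", 0.0)],
    "V2": [(87.0, "A+", 4.0), (79.0, "A", 4.0), (74.5, "A-", 3.7),
           (70.5, "B+", 3.3), (63.5, "B", 3.0), (60.5, "B-", 2.7),
           (57.5, "C+", 2.3), (46.5, "C", 2.0), (42.8, "C-", 1.7),
           (27.5, "D", 1.0), (0.0, "F", 0.0)],
}


def curved_grade(score: float, version: str = "V2") -> tuple[str, float]:
    """Return (letter grade, grade point) for a 0-100 SUPR-Qm score."""
    if version not in SCALES:
        raise ValueError("version must be 'V1' or 'V2'")
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, grade, grade_point in SCALES[version]:
        if score >= lower_bound:
            return grade, grade_point
    raise AssertionError("unreachable: the last lower bound is 0.0")


if __name__ == "__main__":
    print(curved_grade(71.3, "V2"))
    print(curved_grade(71.3, "V1"))
```

The same 0–100 score can earn different grades on the two versions: 71.3 is a B+ on the V2 scale but a B on the V1 scale, which is why the version needs to be passed explicitly.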

Because the SUPR-Qm V2 is more efficient than V1, we now use it exclusively in our mobile app research. We included V1 for UX practitioners who want to interpret past data collected with the original questionnaire or who simply prefer to continue using it.

Applying the Scoring and Grading with Data

To illustrate the scoring system with real data, we’ll use SUPR-Qm V2 ratings we collected of Spotify’s free (n = 46) and paid (n = 60) services (results shown in Figure 2). After interpolating the raw scores to a 0–100-point scale, the mean rating for the free service was 62.4 (a grade of B−) and for the paid service was 71.3 (a grade of B+). This difference was statistically significant (t(104) = 2.28, p = .025). For the free service, the 95% confidence interval ranged from 56.3 (C) to 68.5 (B), and for the paid service, the interval ranged from 66.3 (B) to 76.2 (A−). It is therefore implausible that the population mean for the free service has a grade lower than C or higher than B, or that the population mean for the paid service has a grade lower than B or higher than A−.

Figure 2: SUPR-Qm V2 comparison of Spotify free and paid services (error bars are 95% confidence intervals, with the paid version scoring significantly higher than the free version).

Summary and Discussion

One of the final steps in the development of a standardized metric is to obtain reference data to enable the interpretation of the metric—what’s a good score, what’s an average score, what’s a poor score?

SUPR-Qm is scored from 0 to 100 and uses letter grades. To simplify score interpretation, we transformed SUPR-Qm scores (both V1 and V2) onto a 0–100 scale and applied a curved letter-grade system, like those used for the SUS and UX-Lite.

Grades were based on a large database of participants. The grading scale is based on Rasch-scaled logit probabilities from over 4,000 participants across 155 websites and 23 industries, offering percentile-like guidance and intuitive letter grades from A+ to F.

Real-data example: A SUPR-Qm V2 score of 71.3 (from Spotify Premium users in our dataset) translates to a B+ grade, significantly higher than the free version’s score of 62.4 (B−).

Recommended use: Practitioners can use the new curved grading tables to interpret current and historical data across both SUPR-Qm versions, with V2 now recommended for its brevity and stability.

For more details about this research (e.g., the wording of all SUPR-Qm items and the websites/industries we measured), see the paper we published in the Journal of User Experience (Lewis & Sauro, 2025). Our new SUPR-Qm calculator is available for purchase — check out the video tour.
