Verifying the Stability of the Five-Item SUPR-Qm V2

Jim Lewis, PhD • Jeff Sauro, PhD

August 26, 2025

We developed the SUPR-Qm® in 2017 to measure the quality of the mobile app experience.

Its original form had 16 items. That is long for a UX questionnaire (e.g., the SUS has ten and the SUPR-Q® has eight). It had 16 items because we developed it using a technique called Rasch analysis, which, among other things, enables the dynamic presentation of a subset of the 16 items.

During the dynamic presentation, if participants strongly agreed with the first item presented (e.g., “I like to use the app frequently”), the software would present an item that should be harder to agree with (e.g., “I would never delete the app”). Using this technique, most participants would only need to answer between four and eight items to get a final score.
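The idea behind dynamic presentation can be illustrated with a minimal sketch. It is not the MUiQ implementation: the item names, difficulty values, step-size update, and the simplification of responses to agree/disagree (the dichotomous Rasch model) are all assumptions made for illustration.

```python
import math

# Hypothetical item difficulties in logits (NOT the published SUPR-Qm calibrations).
ITEM_DIFFICULTY = {
    "easy_item": -1.5,
    "mid_item": 0.0,
    "hard_item": 1.5,
    "hardest_item": 2.5,
}


def rasch_probability(theta, difficulty):
    """Dichotomous Rasch model: probability that a person at theta agrees with an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def next_item(theta, asked):
    """Pick the unasked item whose difficulty is closest to the current estimate."""
    remaining = {k: b for k, b in ITEM_DIFFICULTY.items() if k not in asked}
    if not remaining:
        return None
    return min(remaining, key=lambda k: abs(remaining[k] - theta))


def update_theta(theta, agreed, step=1.0):
    """Crude assumed update: move up after agreement, down after disagreement."""
    return theta + step if agreed else theta - step


# Start at the middle of the scale, then follow an agreement with a harder item.
theta = 0.0
first = next_item(theta, set())
theta = update_theta(theta, agreed=True)
second = next_item(theta, {first})
print(first, second)
```

In this sketch, a participant who agrees with the first item is next shown the item closest to the raised estimate, which is harder to agree with. A production system would typically use a proper estimation method and a stopping rule rather than a fixed step.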

But presenting the items dynamically requires specialized software (like our MUiQ® platform), which many researchers don’t have. Instead, when the SUPR-Qm is used nondynamically, researchers present participants with all 16 items (typically in two eight-item grids). That’s not bad for measurement; it just takes longer to answer and score. So, as part of our program to streamline the SUPR-Qm, we developed a second version of the questionnaire (SUPR-Qm V2) that has five items (a carefully selected subset of the original set of 16 items).

In a previous article, we demonstrated the stability of the original 16-item version of the SUPR-Qm over an eight-year period, but until recently, we did not have the data needed to assess the stability of the new five-item SUPR-Qm V2.

In this article, we describe research conducted to verify the stability of the SUPR-Qm V2 and to reassess the stability of the original SUPR-Qm. How stable are those five items compared to the original 16?

Method

From February 2019 through May 2023, we used our MUiQ platform to collect UX data for 23 industries (like dating, pets, and office supplies) from a total of 155 websites. The primary purpose of these surveys was to refresh a normative database for the interpretation of SUPR-Q scores, but over this time, we also collected SUPR-Qm data from respondents who indicated that they used the mobile app for the company or service they were rating.

All participants were members of a professional online consumer panel, and all were from the United States. Suspicious cases were removed before analysis using standard methods (such as inspection of completion times, responses in free text fields, and person fit statistics). The total sample size was 4,149 (48% male, 50% female, 42% less than 30 years old, and 58% 30 years or older).

An advantage of Rasch scaling is the theoretical stability of scales across changes in time, with some empirical estimates of Rasch scales being stable for as long as 15 years. To investigate the stability of the original SUPR-Qm and SUPR-Qm V2 scales, we divided the data into two parts, Group A and Group B (see the Appendix).

The data in Group A were collected from February 2019 through August 2021, covering 11 industries and 58 websites with n = 2143. Group B included data collected from February 2022 through May 2023, covering 12 industries and 97 websites with n = 2006. The only industry included in both groups was Airlines.

Results

To check the stability of the SUPR-Qm V1 (original version with 16 items) and V2 (streamlined version with five items), we superimposed Rasch logit scales for Groups A and B for each version. As shown in Figures 1 and 2, the locations of scores on the logit scales were nearly identical for both the original SUPR-Qm and the SUPR-Qm V2, demonstrating their scale stability across varying dates and industries.

Figure 1: Stability of Rasch scale for the original SUPR-Qm, indicated by the overlap of lines for Groups A and B.

Figure 2: Stability of Rasch scale for SUPR-Qm V2, indicated by the overlap of lines for Groups A and B.
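The comparison in the figures is visual. As a hedged complement, the overlap of two calibrations could also be summarized numerically, for example by the largest absolute difference in logit location. The function name and the placeholder values below are assumptions for illustration only; they are not results from this study.

```python
def compare_calibrations(group_a, group_b):
    """Return per-key logit differences (B minus A) and the largest absolute difference."""
    shared = sorted(set(group_a) & set(group_b))
    diffs = {key: group_b[key] - group_a[key] for key in shared}
    max_abs = max(abs(d) for d in diffs.values())
    return diffs, max_abs


# Placeholder logit locations for illustration only (not study data).
example_a = {"item_1": -1.0, "item_2": 0.0, "item_3": 1.0}
example_b = {"item_1": -0.9, "item_2": 0.1, "item_3": 0.9}

diffs, max_abs = compare_calibrations(example_a, example_b)
print(f"largest shift: {max_abs:.2f} logits")
```

A small maximum difference corresponds to the near-complete overlap of the two lines in each figure.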

Summary and Discussion

Is the five-item version of the SUPR-Qm reliable and stable? In short, yes. Scores computed from the five-item SUPR-Qm V2 look just as stable as those computed from the 16-item original.

We split the sample into two time periods (2019–2021 vs. 2022–2023). To investigate the stability of the original SUPR-Qm and the SUPR-Qm V2, we divided the SUPR-Qm scores in our sample into two groups that shared a common method (retrospective UX surveys) but differed in their time periods (Group A: February 2019 through August 2021; Group B: February 2022 through May 2023) and industries (for details, see the Appendix).

The results over the two time periods showed remarkable similarity. As shown in Figures 1 (SUPR-Qm V1 with 16 items) and 2 (SUPR-Qm V2 with five items), the locations of the groups’ scale scores on the underlying logit scales were almost indistinguishable. These results show that the scales for both versions of the SUPR-Qm have been stable for over four years (February 2019 through May 2023) and should remain stable for years to come.

In a future article, we’ll discuss the research we’ve conducted to establish norms and curved grading scales for the interpretation of SUPR-Qm scores.

For more details about this research, see the paper we published in the Journal of User Experience (Lewis & Sauro, 2025).

Appendix

The Appendix table provides details about the industries included in our analysis. It also shows the division of the data into the two groups (A and B), which we used to analyze the stability of the original SUPR-Qm and SUPR-Qm V2 scales over differences in time and industries.

The table shows the data collected over two time periods: February 2019 through August 2021 (n = 2143) and February 2022 through May 2023 (n = 2006). This grouping divides the large dataset roughly in half, allowing us to investigate the stability of Rasch measurement over differences in time and industry (the only industry in common across the time periods was Airlines). The total demographics are the averages over industries weighted by the sample sizes.

Group A (2/19 – 8/21)	Apps	n	Male	Female	< 30 years	≥ 30 years
Airlines	5	105	54%	44%	52%	48%
Auto	4	49	59%	39%	51%	49%
Dating	7	277	46%	52%	43%	57%
Dieting	5	135	41%	58%	53%	47%
Food Delivery	4	159	47%	53%	49%	51%
Job Search	4	38	48%	50%	57%	43%
Mass Merchants	9	182	33%	66%	31%	69%
Meeting Software	4	73	58%	41%	73%	27%
Music	7	1058	49%	50%	50%	50%
Pets	4	33	43%	56%	47%	53%
Outdoors Stores	5	34	57%	41%	48%	52%

Group B (2/22 – 5/23)	Apps	n	Male	Female	< 30 years	≥ 30 years
Airlines	12	242	47%	51%	61%	39%
Business Information	3	92	53%	46%	26%	74%
Clothing	13	144	45%	52%	28%	72%
Electronics	9	131	62%	37%	18%	82%
Grocery	8	251	40%	59%	31%	69%
News	14	133	41%	57%	30%	70%
Office Supplies	4	62	58%	38%	19%	81%
Real Estate	5	93	51%	48%	49%	51%
Seller Marketplaces	6	238	44%	54%	60%	40%
Ticketing	5	203	52%	45%	40%	60%
Travel Aggregators	8	133	51%	48%	48%	52%
Wireless	10	284	50%	47%	25%	75%

Total (23 industries)	155	4149	48%	50%	42%	58%

Appendix table: Industries, sample sizes, and gender/age demographics for retrospective UX data collected from February 2019 through May 2023.
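The sample-size-weighted totals described in the Appendix can be checked with a short sketch. The rows below are copied from the Appendix table (industry, n, percent male); the variable and function names are ours and only illustrate the arithmetic.

```python
# (industry, n, percent male) rows copied from the Appendix table.
GROUP_A = [
    ("Airlines", 105, 54), ("Auto", 49, 59), ("Dating", 277, 46),
    ("Dieting", 135, 41), ("Food Delivery", 159, 47), ("Job Search", 38, 48),
    ("Mass Merchants", 182, 33), ("Meeting Software", 73, 58),
    ("Music", 1058, 49), ("Pets", 33, 43), ("Outdoors Stores", 34, 57),
]
GROUP_B = [
    ("Airlines", 242, 47), ("Business Information", 92, 53),
    ("Clothing", 144, 45), ("Electronics", 131, 62), ("Grocery", 251, 40),
    ("News", 133, 41), ("Office Supplies", 62, 58), ("Real Estate", 93, 51),
    ("Seller Marketplaces", 238, 44), ("Ticketing", 203, 52),
    ("Travel Aggregators", 133, 51), ("Wireless", 284, 50),
]


def weighted_percent(rows):
    """Average the percentages across rows, weighting each row by its sample size."""
    total_n = sum(n for _, n, _ in rows)
    return sum(n * pct for _, n, pct in rows) / total_n


all_rows = GROUP_A + GROUP_B
total_n = sum(n for _, n, _ in all_rows)
male_pct = weighted_percent(all_rows)
print(total_n, round(male_pct))  # prints: 4149 48
```

Because the row percentages in the table are already rounded, recomputing the other demographic columns this way may differ from the published totals by a point or two.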
