
We developed the SUPR-Qm® to measure the mobile app user experience and published our findings in 2017. Our more recent analysis revealed that the items we used were remarkably stable after eight years, despite clear changes in technology.
Although tech has advanced in those years, our collective attention span has not. The original published version contained 16 items—long for a questionnaire, especially one that you take on your phone. But those 16 items were originally meant to be adaptively administered based on people’s responses, so they would only need to answer a few items to generate a stable measure of their experience.
In practice, however, administering an adaptive questionnaire is hard; you need specialized software (like the MUiQ® platform) that allows dynamic presentation. An alternative to adaptive presentation, suitable for easily answered agreement items, is to build a short version. The shorter form should not be built by just lopping off items. Instead, it should be generated by looking at common response patterns and quantitatively determining whether a few items can replicate the overall SUPR-Qm score. After eight years of research, we have a lot of data to support this kind of analysis.
In this article, we describe the process we used to determine how many items we could remove while still ensuring the remaining subset produces scores comparable to the 16-item version.
Method
We started by reviewing the 16-item data we had collected over the last eight years with our MUiQ platform. We compiled two datasets with participants from a U.S.-based professional online consumer panel. Suspicious cases were removed before analysis using standard methods (such as inspection of completion times, responses in free text fields, and person fit statistics).
Dataset 1: Participants and Procedure
The first dataset was a compilation of data collected in retrospective UX surveys from February 2019 through May 2023 for 23 industries (like dating, pets, and office supplies), using a total of 155 websites. The total sample size was 4,149 (48% male, 50% female, 42% less than 30 years old, and 58% 30 years or older).
The primary purpose of these surveys was to refresh a normative database for the interpretation of SUPR-Q® scores, but over this time, we also collected SUPR-Qm data from respondents who indicated that they used the mobile app for the company or service they were rating. In these surveys, all SUPR-Qm items had been randomly assigned to one of two eight-item grids and then randomized within those grids for each participant.
Dataset 2: Participants and Procedure
The second dataset was collected to investigate two proposed subsets of SUPR-Qm items. The total sample size of this group was 454 (41% male, 57% female, 34% less than 30 years old, and 66% 30 years or older). The sample was divided between investigations of a three-item version of the SUPR-Qm (SUPR-Qm03, n = 200) and a five-item version (SUPR-Qm05, n = 254).
The participants in the second dataset were users of at least one mobile music service app (Amazon paid, Apple paid, Pandora free, Spotify free, Spotify paid, and YouTube free). These apps were selected based on their frequency of occurrence and ratings in Dataset 1 to increase the likelihood that members of the online panel would be users of at least one service and because the services covered a range of SUPR-Qm ratings (poorest for YouTube free, best for Spotify paid).
We conducted these additional surveys to investigate two proposed subsets of SUPR-Qm items in support of our research goal to streamline the questionnaire: SUPR-Qm03 and SUPR-Qm05 (with 3 and 5 of the 16 SUPR-Qm items selected to provide full coverage of the underlying logit scale). As detailed in the Results section, we found that these subsets produced scores that closely corresponded to scores obtained from all 16 items when analyzing subsets of Dataset 1. We were concerned, however, that the manner of collecting the items in two eight-item grids might have influenced the scores we were getting from the subsets due to the influence of the other items in the grids.
For Dataset 2, we varied the assignment of items to grids to get SUPR-Qm03 and SUPR-Qm05 scores that were not influenced by the other items. For SUPR-Qm03, the first grid showed only the three items selected for that version, followed by two more grids, one with six randomly assigned items and one with the remaining seven items. For SUPR-Qm05, the first grid showed only the five items selected for that version, followed by a grid with the remaining eleven items. The order of presentation of items within grids was randomized for all participants.
Results
We used Rasch analysis of Dataset 1 to create a Wright map.
Interpreting a Wright Map
One key output of Rasch analysis is a Wright map (also called an item-person map), which places the difficulty of the items (how hard it was for respondents to agree with them) on the same measurement scale as the participants’ ratings. On the left side of the map, each # represents a number of participants; on the right side, labels show each item’s location.
A Wright map is organized as two vertical histograms with the items and respondents (persons) arranged from easiest (most likely to agree) on the bottom to most difficult (least likely to agree) on the top. For example, most participants agreed or strongly agreed (4s and 5s) with the items “Easy” and “EasyNav.” In contrast, few participants agreed or strongly agreed with “AppBest.”
On the left side, the Wright map shows the mean (M) and two standard deviation points (S = one SD and T = two SD) for the measurement of participants’ tendency to agree. On the right side of the map, the mean difficulty of the items (M) and two standard deviation points (S = one SD and T = two SD) for the items are shown.
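To make the layout concrete, here is a minimal sketch that prints a simplified Wright map from person measures and item difficulties on a shared logit scale. The values and the `wright_map` function are illustrative assumptions for this example, not output from actual Rasch software or the published SUPR-Qm estimates.

```python
from collections import Counter

def wright_map(person_measures, item_difficulties, lo=-3.0, hi=3.0, step=0.5):
    """Print a simplified Wright map: persons (#) on the left,
    item labels on the right, hardest-to-agree items at the top."""
    # Bin person measures into logit intervals of width `step`.
    bins = Counter(round((m - lo) / step) for m in person_measures)
    n_bins = int((hi - lo) / step)
    lines = []
    for i in range(n_bins, -1, -1):  # top row = most difficult
        logit = lo + i * step
        persons = "#" * bins.get(i, 0)
        items = " ".join(lbl for lbl, d in item_difficulties.items()
                         if logit <= d < logit + step)
        lines.append(f"{logit:5.1f} | {persons:<12}| {items}")
    return lines

# Illustrative (not actual SUPR-Qm) measures and difficulties:
persons = [-0.5, 0.0, 0.2, 0.4, 0.5, 1.0, 1.1, 1.5]
items = {"CantLiveWithout": 1.5, "AllEverWant": 0.0, "Easy": -1.5}
print("\n".join(wright_map(persons, items)))
```

Because persons and items share the logit scale, a harder-to-agree-with item (like CantLiveWithout) prints nearer the top, and easier items (like Easy) print nearer the bottom.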
Looking for Redundant Items on the 16-Item Wright Map
Figure 1 shows the Wright map for the 16 items collected in Dataset 1. Table 1 shows the mapping between the labels that appear in Figure 1 and the item wording.
Figure 1: Wright map of the SUPR-Qm.
| Wright Map Label | 3 Item | 5 Item | Item Wording |
|---|---|---|---|
| CantLiveWithout | x | x | I can’t live without this app on my phone. |
| AppBest | | | The app is the best app I’ve ever used. |
| CantImagineBetter | | | I can’t imagine a better app than this one. |
| NeverDelete | | x | I would never delete the app. |
| EveryoneHave | | | Everyone should have the app. |
| Discover | | | I like discovering new features on the app. |
| AllEverWant | x | x | The app has all the features and functions you could ever want. |
| Delightful | | | The app is delightful. |
| Integrates | | | The app integrates well with the other features of my mobile phone. |
| UseFreq | | | I like to use the app frequently. |
| DefFuture | | | I will definitely use this app many times in the future. |
| AppAttractive | | | I find the app to be attractive. |
| FindInfo | | x | The design of this app makes it easy for me to find the information I’m looking for. |
| AppMeetsNeeds | | | The app’s features meet my needs. |
| EasyNav | | | It is easy to navigate within the app. |
| Easy | x | x | The app is easy to use. |
Table 1: Mapping between Wright map labels and item wording. The 3 Item and 5 Item columns indicate which items were retained for the two shorter versions.
Our examination of the Wright map in Figure 1 showed opportunities to streamline the SUPR-Qm by removing redundant items. Redundant items are those that are located around the same place on the y-axis (having similar logit positions). Note that there are no statistical methods for deciding which of a set of redundant items should be excluded or retained. Therefore, these decisions become part of the craft of standardized questionnaire development. For example, the three items at the exact center of the scale (AllEverWant, Delightful, and Discover) have the same measurement properties, so it doesn’t matter which one is selected for inclusion during the streamlining process; the choice is simply up to the questionnaire developer.
The most extreme reduction was the retention of just three items, one item in the middle of the scale (AllEverWant) and the two most extreme items (Easy and CantLiveWithout), referred to as the SUPR-Qm03 in this article. We were concerned that three items might not provide sufficient coverage of the scale, so we also defined a five-item version (SUPR-Qm05) by adding one item between the middle of the scale and each extreme (Easy, FindInfo, AllEverWant, NeverDelete, and CantLiveWithout).
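The selection logic can be sketched as choosing items whose difficulties fall nearest evenly spaced targets across the logit range. This is a minimal illustration, not the procedure we actually used (which also involved the judgment calls described above), and the difficulty values below are hypothetical, not the published estimates.

```python
def pick_coverage(difficulties, k):
    """Choose k items spread across the logit range: the two extremes,
    plus items nearest the evenly spaced interior targets."""
    ordered = sorted(difficulties, key=difficulties.get)
    lo, hi = difficulties[ordered[0]], difficulties[ordered[-1]]
    targets = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    chosen = []
    for t in targets:
        best = min((lbl for lbl in ordered if lbl not in chosen),
                   key=lambda lbl: abs(difficulties[lbl] - t))
        chosen.append(best)
    return chosen

# Hypothetical difficulties (logits), not the published SUPR-Qm estimates:
diffs = {"Easy": -1.8, "EasyNav": -1.6, "FindInfo": -0.9,
         "AllEverWant": 0.0, "NeverDelete": 0.9, "CantLiveWithout": 1.9}
print(pick_coverage(diffs, 3))  # ['Easy', 'AllEverWant', 'CantLiveWithout']
print(pick_coverage(diffs, 5))
```

With these hypothetical difficulties, the three-item pick recovers the extremes plus the center, and the five-item pick adds one item on each side, mirroring the SUPR-Qm03 and SUPR-Qm05 compositions.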
As described in the Methods section, we collected Dataset 2 to enable the comparison of data collected with the designated three and five items in isolation or, as in our large Dataset 1, mixed with the other SUPR-Qm items. Figures 2 and 3 show the results of those comparisons (for clarity, the full SUPR-Qm is labeled SUPR-Qm16).
Figure 2: Comparison of SUPR-Qm03 and SUPR-Qm16 collected in standard (8 × 8) and alternate (3 × 6 × 7) grids (n = 200; error bars are 95% confidence intervals; the difference is significant for SUPR-Qm03 but not for SUPR-Qm16).
Figure 3: Comparison of SUPR-Qm05 and SUPR-Qm16 collected in standard (8 × 8) and alternate (5 × 11) grids (n = 254; error bars are 95% confidence intervals. The difference is not significant for SUPR-Qm05 or SUPR-Qm16).
In Figures 2 and 3, the means for the full SUPR-Qm were not significantly different (Figure 2: t(198) = .44, p = .66, d = .05 ± .23; Figure 3: t(252) = 1.04, p = .30, d = .10 ± .20). In Figure 2, the means for the streamlined SUPR-Qm03 were significantly different as a function of allocation to grids (t(198) = 2.7, p = .009, d = .32 ± .24), but in Figure 3, the means for the SUPR-Qm05 were not significantly different (t(252) = .60, p = .55, d = .07 ± .21). Based on these findings, we found the SUPR-Qm05 (but not the SUPR-Qm03) to be a suitable short version of the SUPR-Qm.
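For readers who want to run similar comparisons, here is a minimal sketch of the independent-samples statistics reported above: the t statistic and Cohen’s d computed from the pooled standard deviation (equal-variance form). The ratings are made-up illustrative data, not the study’s.

```python
from math import sqrt
from statistics import mean, stdev

def t_and_d(a, b):
    """Independent-samples t statistic and Cohen's d,
    using the pooled standard deviation (equal-variance form)."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (mean(a) - mean(b)) / pooled_sd
    t = d / sqrt(1 / na + 1 / nb)
    return t, d

# Illustrative ratings (not the study's data):
grid_first = [4, 5, 3, 4, 5, 4, 3, 5]
grid_mixed = [3, 4, 3, 3, 4, 4, 2, 4]
t, d = t_and_d(grid_first, grid_mixed)
print(f"t = {t:.2f}, d = {d:.2f}")
```

A p-value would then come from the t distribution with na + nb − 2 degrees of freedom (e.g., via `scipy.stats`), and the ± values reported above are the half-widths of 95% confidence intervals around d.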
Summary and Discussion
The original SUPR-Qm has 16 items. Examination of the Wright map in Figure 1 revealed opportunities to streamline the questionnaire by removing redundant items. We tested two short versions of the SUPR-Qm, one with just three items (SUPR-Qm03) and one with five items (SUPR-Qm05).
The key variable in those tests was whether the retained items were shown to respondents before the other items or were embedded in two eight-item grids (which is how the data were collected for Dataset 1). To justify using the large amount of data in Dataset 1 to establish norms for the short forms, we had to estimate the extent to which respondent behavior differed when retained items were rated in an initial grid versus rated in the context of the other items.
The three-item version had order effects. As shown in Figures 2 and 3, SUPR-Qm03 ratings were significantly affected by the presentation variable, but SUPR-Qm05 ratings were not. Consequently, we do not recommend using the SUPR-Qm03. Practitioners can confidently use the full SUPR-Qm, or when a short form would be advantageous, the SUPR-Qm05.
Most respondents really only needed five items. Note that, although we have developed an adaptive program for the SUPR-Qm, we now do not recommend that approach for attitudinal questionnaires like the SUPR-Qm. While working on the program, we found that measurement converged (the program stopped) after the presentation of four or five items, so it’s more efficient to simultaneously present a set of five good items in a grid. Furthermore, the primary advantage of adaptive testing for a high-stakes test like the SAT is the dramatic reduction in the amount of time required to complete the test (from three to two hours) because there are dozens of questions in each test section, and each SAT item takes a fair amount of time to complete. That is not the case with the type of agreement items used in the SUPR-Qm, which respondents typically complete in a few seconds per item.
The SUPR-Qm05 achieved the dual goals of consistent measurement and increased efficiency. This new five-item version enhances the usefulness of the SUPR-Qm for UX practitioners and researchers who need a standardized questionnaire. It provides a quick measure of the UX of mobile apps that is easy to interpret with norms that should remain stable for many years. For these reasons, in our practice, we have replaced the first version of the SUPR-Qm with this new version, SUPR-Qm V2 (Figure 4).
Figure 4: The SUPR-Qm V2 (we recommend randomizing the order of presentation of items).
For easy copy/paste of the text, the items are:
- I can’t live without this app on my phone.
- The app has all the features and functions you could ever want.
- The app is easy to use.
- I would never delete this app.
- The design of this app makes it easy for me to find the information I’m looking for.
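As a rough illustration of administering the five items, the sketch below computes a simple mean-agreement summary from 1–5 ratings. This is only a convenience summary for this example; the published SUPR-Qm score is derived from the Rasch model, not from a raw mean.

```python
# The five SUPR-Qm V2 items (order should be randomized when administered).
ITEMS_V2 = [
    "I can’t live without this app on my phone.",
    "The app has all the features and functions you could ever want.",
    "The app is easy to use.",
    "I would never delete this app.",
    "The design of this app makes it easy for me to find the information I’m looking for.",
]

def mean_agreement(responses):
    """Average 1-5 agreement ratings across the five items.
    A rough summary only; the published SUPR-Qm score is Rasch-based."""
    assert len(responses) == len(ITEMS_V2), "expected one rating per item"
    assert all(1 <= r <= 5 for r in responses), "ratings must be 1-5"
    return sum(responses) / len(responses)

print(mean_agreement([2, 3, 5, 3, 4]))  # 3.4
```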
In future articles, we will discuss the research we’ve conducted to verify the stability of full and streamlined versions of the SUPR-Qm and develop norms for the interpretation of SUPR-Qm scores.
For more details about this research, see the paper we published in the Journal of User Experience (Lewis & Sauro, 2025).




