
Clutter can lead to a poor user experience. Poor experiences repel users.
So how does one measure clutter?
Earlier, we did a deep dive into the literature to see how clutter has been defined and measured. We found the everyday concept of clutter has two components: a disorganized collection of relevant objects and the presence of irrelevant objects.
But most measures of clutter were objective, based on attributes such as grouping and layout complexity. The only questionnaire we found measured clutter in airplane cockpit displays, which didn’t seem relevant to website clutter.
So, we began to build our own questionnaire for measuring website clutter.
In this article, we briefly review the exploratory research we conducted and then analyze new data to validate what we found using a statistical technique called confirmatory factor analysis.
Review of Clutter Questionnaire Research
In the first iteration of that exploratory research, we started with a preliminary clutter questionnaire that measured two aspects of clutter based on the literature: content clutter (e.g., irrelevant ads and videos) and design clutter (e.g., too much text, illogical layout).
The first iteration of the Perceived Website Clutter Questionnaire (PWCQ) included one item for overall clutter, six for content clutter, and ten for design clutter (see Figure 1 for the entire questionnaire used in our surveys).
The format for overall clutter was an 11-point agreement item (“Overall, I thought the website was too cluttered,” 0: Strongly disagree, 10: Strongly agree). The format for content and design clutter used five-point agreement items (1: Strongly disagree, 5: Strongly agree). The short labels and item wording for the content and design clutter items were:
- Content_ALot: These types of content made up a lot of the clutter.
- Content_TooMany: There were too many ads or videos.
- Content_Space: These types of content took up too much space.
- Content_Distracting: These types of content were distracting.
- Content_Irrelevant: These types of content were irrelevant.
- Content_Annoying: These types of content were annoying.
- Design_HardToRead: The text was hard to read.
- Design_SmallFont: The font size was too small.
- Design_DistractingColors: The colors were distracting.
- Design_UnpleasantLayout: The layout was unpleasant.
- Design_WhiteSpace: There wasn’t enough white space.
- Design_TooMuchText: There was too much text.
- Design_NotLogical: The content was not logically organized.
- Design_Disorganized: The layout was disorganized.
- Design_VisualNoise: There was too much visual noise.
- Design_HardToStart: It was hard for me to find what I needed to get started.
Figure 1: First iteration of a standardized questionnaire for the measurement of perceived clutter of websites (overall question presented first, then, on separate screens, content and design clutter grids with item order randomized in each grid).
After applying the exploratory techniques of parallel analysis, factor analysis, item analysis, and item retention, the revised version of the PWCQ had two items for content clutter (Content_ALot, Content_Space) and three for design clutter (Design_UnpleasantLayout, Design_TooMuchText, and Design_VisualNoise). Using regression analysis, these five items accounted for 45% of the variation in the one-item measure of overall clutter (highly significant) with excellent scale reliabilities (ranging from .88 to .91 overall and for the two subscales).
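Scale reliabilities like those reported above are commonly computed as coefficient alpha (Cronbach’s alpha). As a rough sketch in Python, with made-up ratings rather than the study data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) rating matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative (made-up) five-point ratings for two items
ratings = np.array([
    [4, 5], [2, 2], [3, 3], [5, 4],
    [1, 2], [4, 4], [2, 3], [5, 5],
])
print(round(cronbach_alpha(ratings), 2))  # → 0.93
```

Alpha approaches 1 as items covary strongly relative to their individual variances, which is why correlated subscale items like these produce the high reliabilities reported above.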
When developing a standardized questionnaire, however, exploratory research is just the first step. To have confidence in the questionnaire, it’s important to follow exploratory research with confirmatory research.
Validating the Clutter Questionnaire with New Data
We used three approaches to validate the clutter questionnaire: confirmatory factor analyses, sensitivity analyses, and range analyses. The data for these analyses came from eight retrospective SUPR-Q® consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and, in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors). Participants were members of an online consumer panel, all from the United States.
| Sector | N | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%) |
|---|---|---|---|---|---|---|---|
| Real Estate | 269 | Apr-2022 | 5 | 48 | 51 | 48 | 52 |
| Travel Aggregator | 452 | Apr-2022 | 9 | 48 | 51 | 48 | 52 |
| Business Info | 183 | Jul-2022 | 3 | 46 | 53 | 42 | 58 |
| Domestic Air | 350 | May-2022 | 7 | 48 | 49 | 58 | 42 |
| International Air | 200 | May-2022 | 5 | 53 | 46 | 61 | 39 |
| Ticketing | 234 | Jun-2022 | 5 | 45 | 52 | 40 | 60 |
| Clothing | 550 | Dec-2022 | 13 | 52 | 45 | 48 | 52 |
| Wireless | 523 | Jan-2023 | 10 | 47 | 50 | 40 | 60 |
| Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
Table 1: Summary of participant gender and age for eight consumer surveys.
Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q, basic demographic items, and the first iteration of the perceived clutter questionnaire. For each survey, we screened for respondents who had used one or more of the target websites within the past year, then invited them to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
To support independent exploratory and confirmatory analysis, we split the sample into two datasets by assigning every other respondent to an exploratory (n = 1,381) or confirmatory (n = 1,380) sample by sector and website in the order in which respondents completed the surveys. These sample sizes ensured that we far exceeded the recommended minimum sample sizes for exploratory and confirmatory factor analysis.
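The alternating assignment described above can be sketched as follows (a minimal illustration, assuming respondents are already ordered by sector, website, and completion order):

```python
def split_alternating(respondents):
    """Assign every other respondent to the exploratory or confirmatory sample."""
    exploratory = respondents[0::2]   # 1st, 3rd, 5th, ...
    confirmatory = respondents[1::2]  # 2nd, 4th, 6th, ...
    return exploratory, confirmatory

ids = list(range(10))
expl, conf = split_alternating(ids)
print(expl, conf)  # → [0, 2, 4, 6, 8] [1, 3, 5, 7, 9]
```

Because the interleaving happens within sector and website, both halves retain roughly the same composition of sectors and websites.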
Confirmatory Factor Analysis of the Initial Item Set
Figure 2 shows the item loadings for a confirmatory factor analysis (CFA) assuming no structure in the items (i.e., a one-factor model, left panel) and the same items in a two-factor model (Content and Design, right panel).
Figure 2: One-factor (left panel) and two-factor (right panel) CFA models of the initial 16-item set.
There are many ways to assess the quality of CFA. Following the recommendations of Jackson et al. (2009), we focused on Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Bayesian Information Criterion (BIC). There are guidelines for good levels of model fit for CFI (> 0.90) and RMSEA (< 0.08), but not for BIC, which is used for the relative comparison of models (smaller is better).
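For readers who want the mechanics, CFI and RMSEA are both functions of a model’s chi-square statistic and degrees of freedom. A minimal sketch using their standard formulas (the chi-square values below are hypothetical, not from this study):

```python
import math

def rmsea(chisq: float, df: int, n: int) -> float:
    """Root Mean Square Error of Approximation from the model chi-square."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def cfi(chisq: float, df: int, chisq_null: float, df_null: int) -> float:
    """Comparative Fit Index: model misfit relative to the null (baseline) model."""
    d_model = max(chisq - df, 0.0)
    d_null = max(chisq_null - df_null, d_model)
    return 1.0 - d_model / d_null if d_null > 0 else 1.0

# Hypothetical values for illustration only
print(round(rmsea(chisq=20.0, df=4, n=1380), 3))                     # → 0.054
print(round(cfi(chisq=20.0, df=4, chisq_null=500.0, df_null=10), 3)) # → 0.967
```

Both indices penalize the amount by which the chi-square exceeds its degrees of freedom; RMSEA additionally shrinks with sample size, while CFI compares that misfit to a baseline model with no correlations among items.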
For the one-factor model, the CFI was 0.74, RMSEA was 0.20, and BIC was 6,166. For the two-factor model, the CFI was 0.92, RMSEA was 0.11, and BIC was 2,144. Thus, accounting for the Content/Design two-factor structure led to better fit statistics including an acceptable level of CFI, but RMSEA was greater than 0.08.
Confirmatory Factor Analysis of the Final Item Set
Figure 3 shows the CFA for the five items retained during the exploratory analyses. The fit statistics for this model were excellent, with a CFI of 0.997, RMSEA of 0.047, and BIC of 96. This CFA model confirmed the construct validity of the two-factor structure identified in the exploratory analyses. For the final version of the PWCQ, see Figure 4.
Figure 3: Two-factor CFA model with the final five-item subscale set.
Figure 4: Final version of the perceived clutter of websites questionnaire.
Sensitivity Analyses
Using the full dataset (n = 2,761), we conducted ANOVAs to check the sensitivity (significance of the main effect of website) of the three clutter metrics, all of which were statistically significant:
- Content Clutter: Mean of Content_ALot and Content_Space (F(55, 2760) = 6.3, p < 0.0001, η² = 0.11)
- Design Clutter: Mean of Design_UnpleasantLayout, Design_TooMuchText, and Design_VisualNoise (F(55, 2760) = 9.5, p < 0.0001, η² = 0.16)
- Overall Clutter: The one-item measure of overall clutter (F(55, 2760) = 3.9, p < 0.0001, η² = 0.07)
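The effect sizes above are eta squared (η²), the proportion of total variance in a metric attributable to the website effect. A minimal sketch with toy data (not the survey ratings):

```python
import numpy as np

def eta_squared(groups):
    """Eta squared for a one-way ANOVA: SS_between / SS_total."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_vals - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Toy clutter ratings from three hypothetical websites
site_a = [2, 3, 3, 4]
site_b = [3, 4, 4, 5]
site_c = [1, 2, 2, 3]
print(round(eta_squared([site_a, site_b, site_c]), 2))  # → 0.57
```

An η² of 0.11, for example, means the website being rated accounts for about 11% of the variation in Content Clutter scores.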
Range Analyses
We assessed the range of these metrics across websites (after rescaling to a common 0–100-point scale for ease of comparison) to get a sense of the extent to which the dataset included websites with different levels of clutter. The distributions are shown in Figure 5 and summarized in Table 2.
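We used a standard linear transformation to put each metric on a common 0–100-point scale. A sketch, assuming simple linear interpolation from each item format’s native range (five-point items run 1–5; the overall item runs 0–10, per the formats described earlier):

```python
def rescale_0_100(x: float, lo: float, hi: float) -> float:
    """Linearly map a score from its native range [lo, hi] to 0-100 points."""
    return 100.0 * (x - lo) / (hi - lo)

print(rescale_0_100(3, 1, 5))   # five-point item midpoint → 50.0
print(rescale_0_100(7, 0, 10))  # eleven-point overall item → 70.0
```

Rescaling changes nothing about the shape of the distributions; it simply makes metrics with different response formats directly comparable.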
Design Clutter scores tended to run lower than Content Clutter scores, with a ten-point difference in medians (50th percentiles). For Content Clutter and Design Clutter, the range of scores was slightly more than half of the possible range of the metric. The range for Overall Clutter was a little more restricted, covering about 40% of the possible range of the metric. The 5th–95th percentiles for the metrics were from 20 to 51 for Content Clutter, 12 to 41 for Design Clutter, and 20 to 45 for Overall Clutter. None of the websites had a mean score on these metrics higher than 65.
Figure 5: Dotplots of the distributions of Content Clutter, Design Clutter, and Overall Clutter across the websites included in the consumer survey.
| Clutter Metric | Min | 5th | 10th | 25th | 50th | 75th | 90th | 95th | Max | Range |
|---|---|---|---|---|---|---|---|---|---|---|
| Content | 11 | 20 | 21 | 26 | 33 | 37 | 44 | 51 | 62 | 51 |
| Design | 9 | 12 | 16 | 20 | 23 | 29 | 36 | 41 | 65 | 56 |
| Overall | 11 | 20 | 23 | 29 | 32 | 38 | 44 | 45 | 50 | 39 |
Table 2: Summary of distributions for Content Clutter, Design Clutter, and Overall Clutter after conversion to a 0–100-point scale.
For the eight surveys we conducted, our focus was to gather information about top websites in their sectors, so we did not focus on including websites with unusually high levels of clutter. There is some possibility that including very cluttered websites might have led to different analytical solutions. That said, our exploratory and confirmatory analyses are appropriate for the types of websites we typically study in our consumer surveys and, because we saw no evidence of ceiling or floor effects with these clutter metrics, may also work well when assessing very cluttered websites.
Summary and Discussion
Confirmatory analysis of over 1,000 ratings of the perceived clutter of 57 websites found:
Confirmatory factor analysis of the five subscale items of the PWCQ indicated excellent fit. The fit statistics (CFI = 0.997, RMSEA = 0.047, BIC = 96) were better than those of a similar two-factor CFA of the 16-item version (CFI = 0.92, RMSEA = 0.11, BIC = 2,144).
Clutter questionnaire scores varied across websites but with possible range restriction. The sensitivity analyses of Content Clutter, Design Clutter, and Overall Clutter showed significant variation in the means of these metrics by website. However, after rescaling values to 0–100-point scales, no website had a mean clutter score greater than 65; observed scores covered about half the possible range for Content Clutter and Design Clutter and about 40% of the range for Overall Clutter.
Bottom line: We expect UX researchers and practitioners to be able to use this version of the clutter questionnaire when the research context is similar to the websites we studied in our consumer surveys. We don’t anticipate serious barriers to using the clutter questionnaire in other similar contexts including task-based studies, mobile apps, and very cluttered web/mobile UIs, but because that research has not yet been conducted, UX researchers and practitioners should exercise due caution.
For more details about this research, see the paper we published in the International Journal of Human-Computer Interaction (Lewis & Sauro, 2024).




