5 Techniques to Identify Clusters In Your Data

Jeff Sauro, PhD

Understanding who your users are and what they think about an experience is an essential step for measuring and improving the user experience.

Part of understanding your users is understanding how they are similar and different with respect to demographics, psychographics, and behaviors. These groupings are often called clusters or segments to refer to the shared characteristics within each group.

Clusters play an important role in both marketing and product decisions and they don’t only apply to people. You can use them to organize content on websites, features in software, or items in a questionnaire.

As with many methods in data science and statistics, there is more than one approach to uncovering clusters. The process involves examining observed and latent (hidden) variables to identify the similarities and number of distinct groups. Here are five ways to identify segments.

1. Cross-Tab

Cross-tabbing is the process of examining more than one variable in the same table or chart (“crossing” them). It lets you see to what extent groups differ on variables. For example, the graph below shows which activities participants reported doing online, crossed by device type (laptop/smartphone). Some activities cluster on smartphones (e.g. taking photos/videos with the camera, downloading apps, and listening to music) rather than on laptops.

You can also cross-tab using more than two variables and then create a visualization to better see the clusters. For example, the graph below shows three variables that describe clusters with smartphone usage: task importance, task frequency, and the stage in which the activities were performed.

Cross-tabbing is most commonly done using observed variables (like in the examples above) to illustrate similarities and differences between groups. You can also cross-tab using created and latent variables but first need to create a way to represent hidden variables using one of the subsequent methods.
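As a minimal sketch of the idea, the snippet below crosses two observed variables (device and activity) and counts how often each pair occurs. The response rows are hypothetical and invented for illustration; they are not the survey data behind the graphs.

```python
from collections import Counter

# Hypothetical survey rows: (device, activity reported on that device).
responses = [
    ("smartphone", "take photos/videos"),
    ("smartphone", "download apps"),
    ("smartphone", "listen to music"),
    ("smartphone", "take photos/videos"),
    ("laptop", "listen to music"),
    ("laptop", "download apps"),
    ("laptop", "write documents"),
    ("laptop", "write documents"),
]

def crosstab(rows):
    """Count how often each (device, activity) pair occurs."""
    counts = Counter(rows)
    devices = sorted({d for d, _ in rows})
    activities = sorted({a for _, a in rows})
    # Counter returns 0 for pairs that never occurred.
    return {a: {d: counts[(d, a)] for d in devices} for a in activities}

table = crosstab(responses)

# Print a simple text table: one row per activity, one column per device.
devices = sorted({d for d, _ in responses})
print(f"{'activity':<22}" + "".join(f"{d:>12}" for d in devices))
for activity, row in table.items():
    print(f"{activity:<22}" + "".join(f"{row[d]:>12}" for d in devices))
```

Rows where one column dominates (here, taking photos/videos on smartphones) are the candidates for a cluster. With small counts like these, the usual limits of small samples apply.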

2. Cluster Analysis

Cluster analysis groups related items together using different algorithms to identify the “clusters.” These clusters are latent variables, meaning they aren’t directly measured but instead are inferred from the relationship items have with each other. Cluster analysis is the approach used in card sorting when you want to know how closely products, content, or functions relate from the users’ perspective.

For example, the graph below—a dendrogram—shows a visualization of the similarities (from a similarity matrix) in ratings participants provided with respect to smartphone usage. It revealed six clusters and how participants grouped items together (e.g. looking at user reviews and detailed specifications on consumer electronics products).
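The sketch below shows one way a similarity matrix can be turned into clusters: agglomerative clustering with average linkage, which is one of several algorithms a dendrogram can be built from. The items and similarity values are hypothetical, and average linkage is an assumption here, not necessarily the algorithm used for the dendrogram described above.

```python
# Hypothetical card-sort similarities: share of participants who put two
# items in the same pile (1.0 = always together, 0.0 = never).
items = ["user reviews", "detailed specs", "store hours", "store locations"]
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

def average_linkage(sim, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Distance between items is 1 - similarity; distance between clusters is
    the mean distance over all cross-cluster item pairs (average linkage).
    Returns the final clusters and the merge history, which is the
    information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(sim))]
    merges = []

    def dist(a, b):
        return sum(1 - sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [
            (dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        d, p, q = min(pairs)  # closest pair of clusters
        merges.append((clusters[p], clusters[q], round(d, 3)))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = average_linkage(similarity, n_clusters=2)
for cluster in clusters:
    print([items[i] for i in cluster])
```

The number of clusters to stop at is a judgment call; in a dendrogram it corresponds to where you cut the tree.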

3. Factor Analysis

Factor analysis is a staple of quantitative research and has a history dating back to some of the earliest research into measuring intelligence. In an exploratory factor analysis (EFA), a researcher looks to identify underlying latent groups of variables, called factors, by using software to examine the intercorrelations between many variables. The researcher then increases or decreases the number of factors and variables in an iterative fashion to identify both the number of factors and which variables “load” on each factor.
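The starting point for an EFA is the matrix of intercorrelations between items. The sketch below computes that matrix for three hypothetical rating-scale items; the item names and ratings are invented for illustration, and the factor extraction itself is left to statistical software.

```python
from math import sqrt

# Hypothetical 1-5 ratings: each row is a respondent, each column an item.
items = ["easy_to_navigate", "easy_to_use", "attractive"]
ratings = [
    [5, 5, 2],
    [4, 5, 3],
    [2, 1, 4],
    [1, 2, 2],
    [3, 3, 5],
    [4, 4, 1],
]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Transpose so each entry is one item's ratings across respondents.
columns = list(zip(*ratings))
corr = [[pearson(ci, cj) for cj in columns] for ci in columns]

for name, row in zip(items, corr):
    print(f"{name:<18}" + "".join(f"{r:>7.2f}" for r in row))
```

Items that correlate strongly with each other and weakly with the rest are the ones likely to load on the same factor.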

Factor analysis is often used in questionnaire development to identify underlying constructs from many items participants respond to. When we created the SUPR-Q we used a factor analysis that identified four factors. We later confirmed the four factors in a new dataset using confirmatory factor analysis (CFA).

Factor analysis works best with continuous or ordinal data. The table below shows the typical output for an exploratory factor analysis, which shows items that group or “load” together.

For example, the items “It is easy to navigate within the website” and “The website is easy to use” both load highly on the first factor. Based on which items load together, the researcher names the factor accordingly. We named this one the Usability factor, because both items address the underlying concept of website usability (ease and findability).

| Item | Usability | Trust | Loyalty | Appearance |
|---|---|---|---|---|
| It is easy to navigate within the website. | 0.88 | 0.01 | 0.00 | 0.00 |
| The website is easy to use. | 0.87 | 0.01 | 0.02 | -0.01 |
| I am able to find what I need quickly on the website. | 0.58 | 0.09 | 0.12 | 0.11 |
| I feel comfortable purchasing from the website. | 0.02 | 0.86 | -0.07 | 0.01 |
| I feel confident conducting business on the website. | 0.06 | 0.84 | 0.05 | -0.04 |
| I can count on the information I get on the website. | -0.01 | 0.35 | 0.31 | 0.20 |
| I will likely return to the website in the future. | 0.05 | -0.01 | 0.78 | -0.03 |
| How likely are you to recommend the website to a friend or colleague? | 0.04 | -0.02 | 0.77 | 0.04 |
| The website keeps the promises it makes to me. | -0.01 | 0.35 | 0.39 | 0.16 |
| I find the website to be attractive. | -0.01 | 0.01 | 0.03 | 0.76 |
| The website has a clean and simple presentation. | 0.34 | -0.01 | -0.03 | 0.56 |
| Extraction Sums of Squared Loadings | 5.90 | 0.91 | 0.44 | 0.20 |
| % of Variance | 53.67 | 8.25 | 3.97 | 1.80 |
| Cumulative % | 53.67 | 61.92 | 65.88 | 67.68 |
| Rotation Sums of Squared Loadings | 4.82 | 3.89 | 4.54 | 4.62 |
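To illustrate how such a table is read, the sketch below takes the loadings from the table above, picks the factor with the largest absolute loading for each item, and flags items with a sizable second loading. The 0.30 cutoff for a cross-loading is an assumption chosen for illustration, and the item labels are shortened. This mechanical rule is only a first pass; it does not replace the researcher's judgment about where an item belongs.

```python
factors = ["Usability", "Trust", "Loyalty", "Appearance"]

# Loadings copied from the table above (item text shortened).
loadings = {
    "easy to navigate": [0.88, 0.01, 0.00, 0.00],
    "easy to use": [0.87, 0.01, 0.02, -0.01],
    "find what I need quickly": [0.58, 0.09, 0.12, 0.11],
    "comfortable purchasing": [0.02, 0.86, -0.07, 0.01],
    "confident conducting business": [0.06, 0.84, 0.05, -0.04],
    "can count on the information": [-0.01, 0.35, 0.31, 0.20],
    "likely to return": [0.05, -0.01, 0.78, -0.03],
    "likely to recommend": [0.04, -0.02, 0.77, 0.04],
    "keeps its promises": [-0.01, 0.35, 0.39, 0.16],
    "attractive": [-0.01, 0.01, 0.03, 0.76],
    "clean and simple": [0.34, -0.01, -0.03, 0.56],
}

CROSS_LOADING = 0.30  # assumed cutoff for flagging a secondary loading

def primary_factor(row):
    """Return (factor with the largest absolute loading, cross-load flag)."""
    ranked = sorted(range(len(row)), key=lambda k: abs(row[k]), reverse=True)
    top, second = ranked[0], ranked[1]
    flagged = abs(row[second]) >= CROSS_LOADING
    return factors[top], flagged

assignment = {item: primary_factor(row) for item, row in loadings.items()}
for item, (factor, flagged) in assignment.items():
    note = "  (cross-loads)" if flagged else ""
    print(f"{item:<32}{factor}{note}")

# With standardized items, total variance equals the number of items,
# so each factor's share is its sum of squared loadings divided by 11.
extraction_ss = [5.90, 0.91, 0.44, 0.20]
pct_variance = [100 * ss / len(loadings) for ss in extraction_ss]
print([round(p, 1) for p in pct_variance])
```

The computed shares come close to the "% of Variance" row; the small differences arise because the sums of squared loadings in the table are rounded to two decimals.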


4. Latent Class Analysis (LCA)

Latent class analysis is another method that identifies latent variables to segment customers, content, and ideas. We use it as part of our process for creating a customer segmentation analysis and the process of making personas more scientific.

An LCA can handle both nominal and ordinal data well. The process is iterative, as a researcher has software identify the correlations between responses to uncover segments. These segments are called classes and are analogous to the factors in a factor analysis. The researcher often starts with four or five classes and adjusts the number of classes and variables, retaining the variables that differentiate groups.

The number of classes the researcher settles on balances two things: the best statistical fit and how well the classes match theories about what differentiates the segments. For example, in an LCA we conducted as part of a segmentation on home buying and selling, we suspected variables like whether participants had kids or were budget conscious would differentiate the respondents. After the iterative process of adding and removing variables, we identified four classes using twelve variables, as shown in the graph below.

We then used the four classes as new variables in a cross-tab, as shown in the graph below, which crosses class membership and gender (showing disproportionally more women in Classes C and D).

Classes, like factors, can be named based on the dominance of certain variables.
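For readers curious about what the software does, here is a minimal sketch of a latent class model for yes/no items, fit with the expectation-maximization (EM) algorithm. It is an illustration under stated assumptions, not the procedure used in the home-buying study: the data are hypothetical, the items are binary only, the variable first_time_buyer is invented, and the number of classes is fixed at two.

```python
import random

# Hypothetical yes/no answers per respondent:
# [has_kids, budget_conscious, first_time_buyer]
data = (
    [[1, 1, 0]] * 6 + [[1, 1, 1]] * 2 + [[1, 0, 0]] * 2
    + [[0, 0, 1]] * 6 + [[0, 0, 0]] * 2 + [[0, 1, 1]] * 2
)

def fit_lca(data, n_classes, n_iter=200, seed=0):
    """Fit a latent class model to binary items with EM.

    Returns class weights, per-class probabilities of answering "yes" to
    each item, and each respondent's class membership probabilities.
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    weights = [1.0 / n_classes] * n_classes
    probs = [[rng.uniform(0.25, 0.75) for _ in range(m)]
             for _ in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: probability of each class given a respondent's answers.
        resp = []
        for row in data:
            like = []
            for c in range(n_classes):
                p = weights[c]
                for x, t in zip(row, probs[c]):
                    p *= t if x else 1.0 - t
                like.append(p)
            total = sum(like)
            resp.append([v / total for v in like])
        # M-step: re-estimate weights and item probabilities.
        for c in range(n_classes):
            mass = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = mass / n
            for j in range(m):
                yes = sum(r[c] * row[j] for r, row in zip(resp, data))
                # Clamp so probabilities never reach exactly 0 or 1.
                probs[c][j] = min(max(yes / mass, 1e-6), 1 - 1e-6)
    return weights, probs, resp

weights, probs, resp = fit_lca(data, n_classes=2)
# Assign each respondent to the most probable class.
membership = [max(range(2), key=lambda c: r[c]) for r in resp]
print([round(w, 2) for w in weights])
print([[round(p, 2) for p in row] for row in probs])
```

The per-class item probabilities are what you would inspect to name a class, and the membership list is the new variable you could carry into a cross-tab. In practice you would fit several class counts and compare fit statistics before settling on one.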

5. Multidimensional Scaling (MDS)

Multidimensional scaling is another technique related to cluster analysis and latent class analysis that groups items or responses into latent variables. It’s often used to transform participants’ judgments or preferences for products or experiences into distances in multidimensional space. Participants typically rate the same products or websites and an MDS provides a visualization of the similarities in two or three dimensions.

We recently did an analysis on how participants differentiate retail websites on a number of dimensions, including price, variety, quality, and website ease. The graph below shows how participants in the study perceive Amazon and Best Buy as similar on price for consumer electronics compared to Walmart, HSN, and hhgregg.
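The sketch below shows how dissimilarities can be turned into a two-dimensional map using classical (Torgerson) MDS, one of several MDS variants. The four sites and their dissimilarity values are placeholders, not the retailers or ratings from the study, and the power-iteration eigen solver is used only to keep the example free of dependencies; statistical software would normally handle this step.

```python
import math
import random

# Hypothetical symmetric dissimilarities between four placeholder sites.
sites = ["site A", "site B", "site C", "site D"]
r10 = math.sqrt(10)
dissimilarity = [
    [0.0, 3.0, r10, 1.0],
    [3.0, 0.0, 1.0, r10],
    [r10, 1.0, 0.0, 3.0],
    [1.0, r10, 3.0, 0.0],
]

def classical_mds(dist, n_dims=2, n_iter=500, seed=0):
    """Return coordinates whose pairwise distances approximate dist."""
    n = len(dist)
    # Double-center the squared distances (dist is assumed symmetric).
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        # Power iteration converges to the dominant eigenvector of b.
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next-largest component.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

coords = classical_mds(dissimilarity)
for name, (x, y) in zip(sites, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Sites that land close together on the resulting map are the ones participants judged as similar. The axes have no built-in meaning and the map can be rotated or flipped freely; interpreting the dimensions is up to the researcher.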

Summary and Recommendations

Identifying clusters plays an important role in both marketing and product decisions. Here are some things to consider when identifying clusters.

  1. Clustering techniques generally require larger sample sizes. Statistical techniques like factor analysis and LCA generally need a minimum of 100 responses (and ideally a lot more) for the algorithms to provide stable clusters. You can use cross-tabbing on any sample size but the limits of small sample sizes still apply.
  2. You need specialized software. You’re going to need specialized software like R, SPSS, or Minitab to do most of the advanced clustering techniques.
  3. You need someone with good statistical knowledge. In addition to software, you’ll need to have access to someone with advanced statistical knowledge on how to interpret the statistical output and make recommendations on the right number of clusters (this is when you contact MeasuringU).
  4. Data types matter. The type of data you collect (e.g. categorical data versus rating-scale data) will help you determine the best clustering technique to use. So don’t wait until you’ve collected your data to consult the stats person; you’ll want their input on how to provide response options before you collect data.
  5. You can use combinations of methods. You can combine clustering techniques (especially cross-tabbing) with other techniques to help answer your research questions.