Rarely is a customer population made up of a homogeneous group of customers who share the same attributes.
Consequently, our samples contain a mix of customers who may or may not reflect the composition of the customer population.
There are a number of variables that affect how customers think and behave toward products and services. One of the most common variables that impacts our measurements is prior experience. More than gender, age, income, and occupation, prior experience with products, software, and websites has a major impact on customer attitudes and behavior.
We see this in usability tests and surveys measuring brand attitudes. In general, the more experience study participants have had, the better their performance on tasks and the more positive their attitudes toward the product or service being tested.
So in any sample of participants in a research study, you’ll want to at least measure participants’ prior experience. Even if you aren’t planning to use this measure, you should collect it, as it often comes in handy at analysis time.
One way researchers control for prior experience is to match the experience level of the sample with the experience level of the population. If you believe, for example, that 60% of your website visitors use the site weekly and the other 40% use it less, you can recruit participants to match that composition. You can then compute confidence intervals and run statistical comparisons (between, say, two design alternatives) and draw conclusions as to which design users perform better on or prefer. Most of our clients choose this method—matching the sample to the population—because, when you explain it to stakeholders, it makes sense to them.
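As a rough illustration of matching a sample to the population, here is a minimal sketch that turns population shares into recruiting quotas. The function name `recruiting_quotas`, the segment labels, and the sample size of 20 are our assumptions for illustration; only the 60%/40% split comes from the example above.

```python
def recruiting_quotas(population_shares, sample_size):
    """Split sample_size across segments in proportion to population_shares.

    Hypothetical helper: floors each quota, then hands any leftover seats
    to the segments with the largest fractional remainders so the quotas
    always add up to sample_size.
    """
    total = sum(population_shares.values())
    exact = {seg: sample_size * share / total
             for seg, share in population_shares.items()}
    quotas = {seg: int(value) for seg, value in exact.items()}
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda seg: exact[seg] - quotas[seg],
                          reverse=True)
    for seg in by_remainder[:leftover]:
        quotas[seg] += 1
    return quotas


# 60% of visitors use the site weekly, 40% less often (from the example above).
quotas = recruiting_quotas({"weekly": 0.60, "less_often": 0.40}, sample_size=20)
print(quotas)
```

With these inputs you would recruit 12 weekly users and 8 less frequent users.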
You can’t always recruit a sample that matches the population. Even if, for example, your data shows that 30% of your mobile website users have not accessed your website in the last year, it may be difficult to find these users to participate in a study. When you need to determine which design is preferred, or to make any comparison, you don’t want the decision to rest on a sample with the wrong composition.
With unbalanced samples, two approaches can mitigate and control for the effects of prior experience on your outcome measures: a weighted t-test and a Type I ANOVA. The Analysis of Variance (ANOVA) is the statistical procedure you use to compare more than two means at once. More importantly, it enables you to see the effects of multiple variables simultaneously. The ANOVA is more computationally intensive than the t-test and usually requires specialized software, such as SPSS, R, or Minitab. You’ll also generally want a statistician’s help with the setup and analysis of ANOVA results.
About the Weighted t-Test
A relatively simple method for handling weighted data is the aptly named weighted t-test. When comparing two groups with continuous data, the t-test is the recommended approach. The t-test works for large and small sample sizes and uneven group sizes, and it’s resilient to non-normal data. (We cover it extensively in Chapter 5 of Quantifying the User Experience.) While the t-test is a “workhorse” of statistical analysis, it only considers one variable when determining statistical significance. This means that you can’t compare participants’ attitudes on Design A vs Design B AND factor in their prior experience (say low experience and high experience) with your product.
However, the weighted version of the t-test does factor in a second variable. It adjusts the means and standard deviations according to the weight assigned to each respondent. Participants who should account for, say, 60% of the population have scores that are weighted at 60%, even if they make up, say, only 20% of your sample. You can see the computation notes in the paper by Bland and Kerry.
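To make the weighting idea concrete, here is a minimal sketch using the 60%-of-population, 20%-of-sample example above. It assumes each respondent’s weight is the population share divided by the sample share for their group, and it uses a simple weighted variance (dividing by the sum of the weights). Both are common conventions but not necessarily the exact computation in Bland and Kerry or in our analysis. The respondent scores and function names are hypothetical.

```python
from math import sqrt


def poststratification_weights(population_share, sample_share):
    """Weight per group = population share / sample share (assumed convention)."""
    return {g: population_share[g] / sample_share[g] for g in population_share}


def weighted_mean_sd(scores, weights):
    """Weighted mean and a simple weighted SD (variance divided by sum of weights)."""
    total_w = sum(weights)
    mean = sum(w * x for x, w in zip(scores, weights)) / total_w
    variance = sum(w * (x - mean) ** 2 for x, w in zip(scores, weights)) / total_w
    return mean, sqrt(variance)


# High-experience users: 60% of the population but only 20% of the sample.
population_share = {"high": 0.60, "low": 0.40}
sample_share = {"high": 0.20, "low": 0.80}
weights_by_group = poststratification_weights(population_share, sample_share)

# Hypothetical respondents: (experience group, rating on a ten-point scale).
respondents = [("high", 9), ("low", 7), ("low", 8), ("low", 6), ("low", 7)]
scores = [score for _, score in respondents]
weights = [weights_by_group[group] for group, _ in respondents]

unweighted_mean = sum(scores) / len(scores)
weighted_mean, weighted_sd = weighted_mean_sd(scores, weights)
print(round(unweighted_mean, 2), round(weighted_mean, 2), round(weighted_sd, 2))
```

In this made-up sample the single high-experience respondent counts three times as much as in the raw data, which moves the mean from 7.4 (unweighted) to 8.2 (weighted).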
Using the Weighted t-Test
Here’s how the weighted t-test works.
We recently examined how users of an online retail website would react to a different design of product information. We presented two variants and wanted to see which one was statistically preferred on a number of dimensions, including comprehension and ease. A total of 857 qualified participants were randomly assigned to Design A or Design B. We assessed comprehension and ease of use on ten-point scales.
The mean, standard deviation, and sample size for both groups on a confidence question are shown in Table 1 below.
| Design Variant | Mean | StDev | N |
Table 1. Unweighted mean scores for two design variants tested.
Although Design A had a nominally higher mean score (8.58 vs 8.37), a standard t-test comparing the means shows no significant difference at the alpha = .05 level of significance (p = 0.095).
However, we know that prior experience has a major impact on attitudes toward interfaces, and packed within both samples are four groups of participants, each with progressively more experience with the website.
Not only did the sample contain a heterogeneous mix of experience levels, it was also not proportionally representative of the population’s experience breakdown. Table 2 shows the breakdown of the sample in Design A and Design B compared to the makeup of the user population.
| Experience Level | Design A | Design B | Population |
Table 2. Experience level for the sample of customers assigned to Design A and B, compared to the population composition.
The biggest difference is seen with Experience Level 4. While this group makes up half of the population, it accounts for only 41% to 42% of the samples for Design A and B.
These groups also have differing opinions about the designs they were exposed to. Table 3 shows that one of the biggest differences in attitudes was for Experience Level 4, which rated Design A 0.39 points higher than B. What’s more, the smallest subgroup preferred Design B over A.
| Experience Level | Design A | Design B | Difference | Population |
Table 3. The mean responses to a confidence question (higher is better), the difference in means by experience level (1 to 4) and the population composition of that experience level.
The weighted t-test creates a composite mean and standard deviation to proportionally account for the subgroup size. The updated means and standard deviations are shown in Table 4 with the original data.
Table 4. Weighted and unweighted means and standard deviations for Design A and Design B.
The weighted t-test generates a p-value of .03, which is statistically significant at the alpha = .05 level of significance. You won’t always see differences in significance values between the weighted and unweighted approaches: it depends both on how disproportionate your sample is and on how much the lower-weighted groups differ from the higher-weighted groups.
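For readers who want to see the mechanics, here is a minimal sketch of one way to run a weighted comparison of two means. It is not the exact procedure behind the p-value reported above. It assumes per-respondent weights, uses Kish’s effective sample size for the standard error, and approximates the p-value with a normal distribution, which is only reasonable for large samples (for small samples, use a t distribution from a statistics package). The function names, scores, and weights are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def weighted_summary(scores, weights):
    """Weighted mean, its standard error, and Kish's effective sample size."""
    total_w = sum(weights)
    mean = sum(w * x for x, w in zip(scores, weights)) / total_w
    variance = sum(w * (x - mean) ** 2 for x, w in zip(scores, weights)) / total_w
    n_eff = total_w ** 2 / sum(w * w for w in weights)
    return mean, sqrt(variance / n_eff), n_eff


def weighted_t_test(scores_a, weights_a, scores_b, weights_b):
    """Test statistic and two-sided p-value (normal approximation, an assumption)."""
    mean_a, se_a, _ = weighted_summary(scores_a, weights_a)
    mean_b, se_b, _ = weighted_summary(scores_b, weights_b)
    t = (mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical ratings and weights; under-sampled respondents get weights above 1.
scores_a = [9, 8, 9, 7, 8, 6]
weights_a = [1.2, 1.2, 1.2, 0.8, 0.8, 0.8]
scores_b = [8, 8, 7, 7, 6, 7]
weights_b = [1.2, 1.2, 1.2, 0.8, 0.8, 0.8]

t_stat, p_value = weighted_t_test(scores_a, weights_a, scores_b, weights_b)
print(round(t_stat, 3), round(p_value, 3))
```

When every weight is 1, the effective sample size equals the number of respondents and the weighted mean reduces to the ordinary mean.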
With these results we can conclude both that Design A had higher ratings and that the rating difference wasn’t attributable to a disproportionate sample. You can also use the approach for any mediating variable (such as geography, gender, occupation), and not just for prior experience.
A quick note of caution: you should have a good reason and actual data to support using weights. Don’t just weight your data to achieve statistical significance. While many variables in your sample will differ from the population, many won’t have a large enough effect (if any effect at all) to justify weighting.
A few things to remember about weighted data:
- While many variables could affect our measures, participants’ prior experience with a product is one of the most salient.
- The weighted t-test is a statistical test that re-balances your sample: it adjusts means and standard deviations to generate p-values based on the correct representation.
- Using the weighted statistical test versus an unweighted statistical test doesn’t necessarily yield different conclusions.
- Have a good reason and actual data to support weighting your sample.