Conducting a benchmark study is an excellent way to understand the quality of the website user experience.

Adding a competitive component makes the benchmarking effort even more valuable.

A comparison makes the data you collect easier to interpret, because you can immediately see how good or bad the experience was relative to relevant alternatives.

In a comparative study, the focus is less on absolute scores and more on relative comparisons. You can argue the tasks are artificial, but if the performance and attitudes for a competitor are consistently higher than for your website, you have a compelling argument for change.

Here are ten best practices we share with our clients, which you should consider the next time you’re ready to run a competitive UX benchmark study. I’ll also be presenting these during a webinar on October 21st, 2015.

Consider Prior Experience

Prior experience with a website is one of the biggest influences on task metrics and overall attitudes. In general, participants with more experience rate the experience more highly and perform better (faster times, higher completion rates). This is especially important in competitive studies. You don’t want to declare one website as having a superior experience if its participants simply had more prior experience with it. You can control for prior experience by setting a quota of participants who have the same level of experience. An alternative is to statistically control for prior experience.
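One simple way to statistically control for prior experience is to compare websites within each experience level and then combine the results. The sketch below is only an illustration under assumptions of ours: the two sites are labeled "A" and "B", experience is recorded as a category, and the data are made up. A regression-based adjustment would be another reasonable option.

```python
from collections import defaultdict

def raw_difference(records):
    """Completion rate of site A minus site B, ignoring prior experience."""
    totals = {"A": [0, 0], "B": [0, 0]}  # site -> [successes, attempts]
    for site, _level, completed in records:
        totals[site][0] += int(completed)
        totals[site][1] += 1
    return totals["A"][0] / totals["A"][1] - totals["B"][0] / totals["B"][1]

def stratified_difference(records):
    """A-minus-B completion rate difference, averaged across experience
    levels and weighted by the number of participants in each level."""
    counts = defaultdict(lambda: {"A": [0, 0], "B": [0, 0]})
    for site, level, completed in records:
        counts[level][site][0] += int(completed)
        counts[level][site][1] += 1

    weighted_sum, total_weight = 0.0, 0
    for by_site in counts.values():
        (sa, na), (sb, nb) = by_site["A"], by_site["B"]
        if na == 0 or nb == 0:
            continue  # this level has no participants on one site; skip it
        weight = na + nb
        weighted_sum += weight * (sa / na - sb / nb)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no experience level has participants on both sites")
    return weighted_sum / total_weight

# Hypothetical data: (site, prior experience, completed the task?)
# Site A happens to have more high-experience participants than site B.
records = (
    [("A", "high", True)] * 6 + [("A", "high", False)] * 2
    + [("A", "low", True)] * 1 + [("A", "low", False)] * 1
    + [("B", "high", True)] * 3 + [("B", "high", False)] * 1
    + [("B", "low", True)] * 3 + [("B", "low", False)] * 3
)

print(round(raw_difference(records), 2))  # 0.1: site A looks better
print(stratified_difference(records))     # 0.0: no difference within levels
```

In this made-up example, the apparent 10-point advantage for site A disappears once participants are compared against others with the same level of experience.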

Collect Data at the Task and Study Level

You’ll want to collect data about the overall experience (macro view) and the detailed task-level experience (micro view). Having participants attempt tasks, instead of just looking around the website, is the most effective way to explore the nooks and crannies of your website. What’s more, task-level metrics like completion rates, time, difficulty, and errors help diagnose interaction problems and generate ideas on what to improve.

Study-level metrics like the SUS and SUPR-Q provide an overall impression of the website. While this impression is shaped by the task experiences encountered during the benchmark, participants also bring with them their experiences prior to the study and this is usually reflected in the study-level metrics. Both provide insights into the quality of the experience.

Have Task Success Criteria

There’s a lot you can learn from open-ended task scenarios where users are asked to search for products or information of their own volition. However, in unmoderated studies, such open-ended tasks often show little differentiation between websites. You’ll want to include a closed-ended task that has clear success criteria, such as finding the right product, right price, or store location. If you’re running a competitive benchmark, not all websites will have the same success criteria, but ensure that the level of difficulty is equivalent when creating the task and success criteria.

Choose Between- vs. Within-Subjects Studies

While a between-subjects approach (different participants on each website) is the more familiar one to researchers, the within-subjects approach (same participants on all websites) has some important advantages. The right choice, however, depends on a few factors.

By far the biggest advantage of a within-subjects approach is that you can use a much smaller sample size to detect the same differences as a between-subjects approach. Recruitment and honorariums are usually among the biggest costs of a study, so reducing that time and cost is a strong appeal of within-subjects studies. The disadvantage of a within-subjects approach is that you’ll have carryover effects, including an impact on the attitude metrics as participants make relative judgments.

If you can’t decide whether you need a within- or between-subjects approach, you can compromise by using a combination of the two. For example, all participants get your website and one of three randomly assigned competitors to compare. The analysis will consequently be more complicated, so give us a call if you get stuck.
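A combined design like this needs an assignment plan. The sketch below is one possible approach, not a prescribed procedure: it shuffles participants, rotates them through the competitors so each gets an equal share, and alternates which website is seen first to spread out carryover effects. The site names, the function name, and the choice to counterbalance order are our assumptions for illustration.

```python
import random

def assign_conditions(participant_ids, own_site, competitors, seed=0):
    """Give every participant the client's site plus one competitor.

    Competitors are rotated so each gets an equal share of participants,
    and the presentation order is flipped for every other block so that
    each site is seen first about half the time.
    """
    rng = random.Random(seed)  # fixed seed keeps the plan reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)           # random order decides who gets which slot

    plan = {}
    for i, pid in enumerate(ids):
        competitor = competitors[i % len(competitors)]
        order = [own_site, competitor]
        if (i // len(competitors)) % 2 == 1:
            order.reverse()    # competitor first in alternating blocks
        plan[pid] = order
    return plan

# Hypothetical study: 12 participants, 3 competitors.
plan = assign_conditions(range(12), "our-site", ["comp-1", "comp-2", "comp-3"])
for pid in sorted(plan):
    print(pid, plan[pid])
```

With 12 participants and three competitors, each competitor is seen by four participants, and half of all participants start on your own website.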

Measure Preference

Asking which website participants prefer is an excellent arbiter of choice. It’s most intuitive to ask this question in a within-subjects study where participants encounter all websites. However, you can still ask participants in a between-subjects study which website they prefer, assuming they have had some experience with the others. You can then see how much the recent experience affected their preference. We like to measure both selection (which website a user preferred) and intensity (how much more the user preferred it).

Measure Website and Brand Attitudes

Like prior experience, existing attitudes toward the website and brand have a lot to do with the measures you collect. Negative press can strongly influence people’s attitudes, and those attitudes affect UX metrics. Collect them at the beginning of the study and you can control for attitudes just as you would for prior experience, which allows you to hold existing attitudes constant while assessing actions and attitudes. With before and after data you can also measure brand lift to see whether the experience hurt or helped customer attitudes.
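Brand lift can be summarized as the average change in each participant's attitude rating from before the study to after it. The following is a minimal sketch with made-up ratings on an assumed 7-point scale; the function name and the use of a standard error to express uncertainty are our choices for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def brand_lift(pre, post):
    """Return (mean change, standard error) for paired before/after ratings."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need at least two paired before/after ratings")
    diffs = [after - before for before, after in zip(pre, post)]
    lift = mean(diffs)                          # positive means attitudes improved
    std_error = stdev(diffs) / sqrt(len(diffs))
    return lift, std_error

# Hypothetical brand attitude ratings from five participants.
pre_ratings = [5, 6, 4, 7, 5]
post_ratings = [6, 6, 5, 7, 7]

lift, std_error = brand_lift(pre_ratings, post_ratings)
print(f"lift = {lift:.2f}, standard error = {std_error:.2f}")
```

A lift that is small relative to its standard error should not be read as evidence that the experience helped or hurt attitudes.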

Use Standardized Measures

While it’s OK to come up with your own questions to ask participants, you get more accurate results when you use existing standardized questionnaires at the task and study level. Research has shown that standardized questionnaires provide a more reliable and valid view of the user experience than homegrown questionnaires.

Instruments like the SUPR-Q at the study level provide a picture of the quality of the website user experience in just eight items. It also correlates highly with the SUS, which provides a view of website usability using 10 items and can be compared with 500 other product experiences. The Single Ease Question (SEQ), asked after each task, has been shown to discriminate well between poor and excellent tasks.
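Part of the value of a standardized questionnaire is that it is scored the same way every time. As an illustration, here is a short sketch of the commonly published SUS scoring rule, which is not spelled out above: the ten items are answered on a 1 to 5 scale, odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a score from 0 to 100. The function name is ours.

```python
def sus_score(responses):
    """Score one participant's SUS questionnaire on a 0-100 scale.

    `responses` holds the ten item responses in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("the SUS has exactly 10 items")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5

print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0, best possible
print(sus_score([3] * 10))                        # 50.0, all neutral
```

Average the per-participant scores for each website to get the study-level SUS score you compare across competitors.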

Compare to Other Reported Metrics

There’s a plethora of published data on websites in many industries, such as airlines, healthcare, and retail. Don’t reinvent the wheel when coming up with tasks, metrics, or findings; your findings will end up being redundant. Use these existing data sources as a point of corroboration with your findings or to help take your study to a more focused level.

Have a Sufficient Sample Size

To differentiate real differences from random variation in scores, you need a sufficient sample size. Just because you find no difference doesn’t mean website UX quality is the same. All too often I see competitive studies with insufficient sample sizes to detect even a large difference. Our sample sizes for competitive studies are usually between 150 and 300 participants per website when using a between-subjects approach.

With a sample size this large we can detect differences of about 15 percentage points for completion rates (and smaller differences for continuous metrics). You can detect the same size of a difference with just 50 participants using a within-subjects approach. Before launching your study, run a power analysis to see if your sample size is sufficient to detect a meaningful difference.
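For a between-subjects comparison of completion rates, a rough power analysis can be done with the standard normal-approximation formula for two independent proportions. The sketch below is only an approximation under assumptions of ours: a two-sided test at alpha = 0.05, 80% power, and illustrative completion rates of 50% and 65%. It does not cover the within-subjects case, which also depends on how correlated each participant's results are across websites.

```python
from math import ceil
from statistics import NormalDist

def participants_per_website(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per website needed to detect a difference
    between two completion rates in a between-subjects study."""
    if p1 == p2:
        raise ValueError("the two completion rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Assumed scenario: 50% vs. 65% completion, a 15-point difference.
n = participants_per_website(0.50, 0.65)
print(n)
```

Under these assumptions the result comes out near 167 participants per website, which is in line with the 150 to 300 range above. Smaller differences push the required sample size up quickly.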

Rinse and Repeat: Compare Over Time

Conducting a benchmark involves a lot of effort and coordination. To make a line, you need at least two points. The same can be said for benchmark studies. If you conduct an initial benchmark study, even a competitive one, it becomes a lot more valuable when you can compare future data to it. Plan to conduct benchmarks at regular intervals (e.g. every year or quarter). One of the hallmarks of measuring the user experience is seeing whether design efforts actually made a quantifiable difference over time. A regular benchmark study is a great way to institutionalize that.

We conduct benchmarks for clients in many industries, including retail, travel, B2B, and consumer electronics. If you want to learn more, I’m hosting a webinar on competitive UX benchmarking on October 21st.