When you’re planning a study to compare multiple interfaces, one of the first choices to consider is whether to use a within-subjects or between-subjects approach.
The interfaces can include anything you want to compare: design mockups, competing websites, or a new mobile app design versus an old one.
The choice comes down to whether you’ll use the same participants (within-subjects) or different participants (between-subjects) for each interface.
While the between-subjects approach is the more familiar one to researchers, you’ll see that the within-subjects approach has some important advantages. The right choice, however, depends on a few factors.
Sample Size and Power
By far the biggest advantage of a within-subjects approach is that you can detect differences between design metrics with a fraction of the users a between-subjects approach requires. In other words, you can use a much smaller sample size. Recruitment, honorariums, and facilitator time are usually the biggest costs of a study, so reducing them is a strong appeal of within-subjects studies.
In measuring human behavior, the differences between people often outweigh the differences between designs. But a within-subjects study design effectively eliminates the differences between people. For example, if you happen to have a few particularly slow participants in a study, that same slowness applies equally to every design they interact with—essentially “controlling” for it.
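To see why this matters, here’s a small simulation (the numbers are hypothetical, assuming task times in seconds): each participant has their own baseline speed, and design B adds a constant 5-second penalty. When the same people attempt both designs, each person’s baseline speed cancels out of their own difference score; when different groups attempt each design, the person-to-person spread stays in the comparison.

```python
import random
import statistics

random.seed(1)  # reproducible illustration


def noise():
    """Trial-to-trial measurement noise."""
    return random.gauss(0, 3)


# Each participant has their own baseline task time (person-to-person spread is large).
baseline = [random.gauss(60, 15) for _ in range(30)]

# Within-subjects: the SAME people attempt both designs; design B is 5 s slower.
times_a = [b + noise() for b in baseline]
times_b = [b + 5 + noise() for b in baseline]
paired_diffs = [tb - ta for ta, tb in zip(times_a, times_b)]

# Between-subjects: a DIFFERENT group of people attempts design B.
other_group = [random.gauss(60, 15) + 5 + noise() for _ in range(30)]

# The paired differences vary far less than the raw times, because each
# participant's baseline speed cancels out of their own difference.
print(round(statistics.stdev(paired_diffs), 1))  # small: mostly measurement noise
print(round(statistics.stdev(other_group), 1))   # large: includes person-to-person spread
```

The smaller variability of the paired differences is exactly what lets a within-subjects study get by with fewer participants.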
The following table shows the approximate sample size you need when comparing a binary metric (like a completion rate or an agree/disagree statement) between two designs with a within-subjects design relative to a between-subjects design. Depending on the difference you want to detect, a within-subjects study requires just 2% to 33% of the sample size that a between-subjects study does.
| Difference to Detect (90% Confidence &amp; 80% Power) | Within-Subjects Sample Size | Between-Subjects Sample Size |
|---|---|---|
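As a rough sketch of where such savings come from (a normal-approximation power calculation for a standardized mean difference, not the exact method behind the binary-metric table), here’s how the required sample sizes compare at 90% confidence and 80% power. The correlation `rho` between a participant’s scores on the two designs is what buys the within-subjects savings; `rho = 0.5` is an assumption for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def total_n_between(d, alpha=0.10, power=0.80):
    """Total participants for a two-group (between-subjects) comparison
    of a standardized mean difference d, using a normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    per_group = 2 * (z / d) ** 2
    return 2 * ceil(per_group)


def total_n_within(d, rho=0.5, alpha=0.10, power=0.80):
    """Participants for a paired (within-subjects) comparison; each person
    sees both designs, so the variance of the difference scores shrinks
    by a factor of 2 * (1 - rho)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    d_paired = d / sqrt(2 * (1 - rho))  # effect size on the paired differences
    return ceil((z / d_paired) ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, total_n_within(d), total_n_between(d))
```

With `rho = 0.5`, the within-subjects study needs roughly a quarter of the total participants; higher correlations shrink the requirement further.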
Carryover Effects

People learn and get better with practice. However, you usually don’t want participants applying what they learn (also called sequence or carryover effects) from one design to the next. Participants get faster and more accurate, and their first impressions change with more exposure. Consequently, the first designs often have poorer metrics than later designs (see recency and primacy effects). This is usually the biggest concern researchers have when implementing a within-subjects approach.
Fortunately, there’s an effective way to reduce many (but not all) of the negative consequences of carryover effects: counterbalancing. Counterbalancing systematically varies the presentation order of the designs so that not every participant sees them in the same order. For example, if you’re testing two designs (A versus B), half the participants get A first and half get B first. Counterbalancing ensures that carryover effects apply equally to both designs.
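A minimal sketch of this assignment (the function name and rotation scheme are illustrative, not a standard library): cycle participants through the possible presentation orders so each order is used equally often.

```python
from itertools import permutations


def counterbalanced_orders(designs, n_participants):
    """Assign each participant a presentation order, cycling through all
    possible orders so every order appears (near-)equally often."""
    orders = list(permutations(designs))
    return [orders[i % len(orders)] for i in range(n_participants)]


# Two designs: half of 10 participants see A first, half see B first.
schedule = counterbalanced_orders(["A", "B"], 10)
print(schedule.count(("A", "B")), schedule.count(("B", "A")))  # → 5 5
```

Note that full permutations grow factorially; with more than three or four designs, researchers typically fall back on a balanced Latin square, which controls order and adjacency with far fewer distinct sequences.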
Impact on Attitudes
Counterbalancing can minimize many of the unwanted sequence effects, but it doesn’t erase the participants’ memory. If you want to benchmark how people think about a brand or design experience, exposure matters. Participants’ ratings are impacted by what you expose them to, and not always in predictable ways.
For example, we often see exaggerated ratings in within-subjects designs. If you give participants one relatively mediocre design (or website) and one really bad design (or website), participants tend to rate the mediocre design much higher than if it were rated in isolation. They also tend to rate the lesser of the designs as much worse.
We saw this effect when we did a within-subjects benchmark of enterprise.com and budget.com. Budget scored much higher than Enterprise on both task and study metrics like the SUPR-Q. When we tested both sites in isolation (using a between-subjects approach), Budget and Enterprise actually scored pretty similarly.
If you want to benchmark how people think about a brand or design without being impacted by another design or brand they just experienced, a between-subjects approach is likely the better way to go (if you can handle the larger sample size!).
Having an impact on attitudes isn’t necessarily a bad thing. People have an easier time making relative versus absolute judgments. It’s a lot easier for participants to answer how satisfied they are with a design if they can say, “well, it’s a lot better than the other design you just showed me.” If you’re looking to identify a winner between alternative designs (even bad designs), a within-subjects approach is usually the way to go.
Study Duration

All other things being equal, a within-subjects study takes longer. If you have five designs to test and want participants to attempt multiple tasks and answer many questions, the study duration might simply be too long, even for the most patient participants. If you can’t cut the number of designs or reduce the number of tasks, you may need a between-subjects study to fit everything in.
The following table summarizes the pros and cons of within-subjects and between-subjects studies.
| Factor to Consider | Within-Subjects | Between-Subjects |
|---|---|---|
| Sample Size &amp; Power | + | – |
| Impact on Attitudes | – | + |
A Compromise

If you can’t decide, as with many research methods, there is a compromise: use a combination of between-subjects and within-subjects approaches. For example, all participants can get a baseline design along with one of three alternate designs.
This sort of design is a bit more complicated to analyze, as you’ll need to switch statistical tests depending on the combination of designs you’re comparing. It does, however, strike a balance: you can keep the test short while getting data on multiple designs, and still allow for many questions and tasks.
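Which test applies follows directly from who saw which design. Assuming a hypothetical setup where every participant sees baseline design “B” plus one alternate (the function and design names are illustrative), a small helper makes the rule explicit:

```python
def comparison_type(design_x, design_y, baseline="B"):
    """In a mixed design where all participants see the baseline plus one
    alternate, any comparison involving the baseline uses the SAME people
    (a paired test); a comparison between two alternates uses DIFFERENT
    people (an independent, two-sample test)."""
    if baseline in (design_x, design_y):
        return "paired"       # e.g., paired t-test, or McNemar for binary data
    return "independent"      # e.g., two-sample t-test, or two-proportion test


print(comparison_type("B", "A1"))   # → paired
print(comparison_type("A1", "A2"))  # → independent
```

Keeping this bookkeeping straight is the main analytical cost of the hybrid design; the payoff is the shorter session length described above.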