Many market and UX research studies rely on paid participants, usually recruited from online panels.
While these large pools of participants help fill big-sample studies quickly, there’s a major drawback: poor-quality respondents. Reliable and valid results come only when your data is properly cleaned.
In our experience, around 10% of respondents (ranging from 3% to 20%) “cheat” in surveys and unmoderated UX studies, and their responses need to be tossed out or “cleaned.” These respondents are usually a mix of cheaters, speeders, people misrepresenting themselves, or participants simply not putting forth effort. All of this threatens the validity of a study’s findings.
There’s no simple rule for excluding participants. Instead, we use a combination of the following methods to flag poor-quality respondents, progressing from the more to the less obvious indicators. This process helps ensure we’re obtaining higher-quality results and valid findings from our studies.
Most Obvious and Easiest Detection Methods
First, you’ll want to start with the easiest methods for eliminating responses. These methods help you remove the most obvious poor-quality responses. Note that if a response falls into only one of these categories, you don’t automatically throw it out; but if it falls into two or more (or is also flagged by one of the advanced methods), you should think twice about including it.
- Poor verbatim responses: Multiple verbatim responses consisting of gibberish (“asdf ksjfh”) or terse, repetitive answers (“good,” “idk”) often indicate that a participant is not taking the study seriously and may be speeding through rather than answering thoughtfully. Answers to open-ended questions are one of the first and easiest ways to identify a poor-quality respondent. While multiple poor responses are usually grounds for removal, a single gibberish or nonsense response is not necessarily a big problem, provided the other verbatim responses are answered thoughtfully. For example, we’ve found that in some cases it came from requiring a respondent to answer a question he or she was unable to answer meaningfully.
- Irrelevant responses: Occasionally, respondents will provide responses that do not match the question asked but are not gibberish. These could be lines copied and pasted from other places, or even the question itself pasted into the answer. At first glance these responses look legitimate (because they are long and contain common words you’d expect in surveys), but on closer examination they may indicate participants gaming the study. Unfortunately, there are even cases of automated “bots” providing random but plausible responses to open-ended questions and answering closed-ended questions. (It also happens with scientific publications.) As with poor responses, multiple nonsense responses are a bigger concern than a single suspicious response.
- Cheater questions: If a “cheater question” was included (e.g., “Select 3 for this response”) and the participant answered it wrong, that’s cause for additional examination. One wrong answer can simply be a mistake, so use caution when deciding to include or exclude participants based on this criterion. We’ve found that excluding everyone who incorrectly answers a single cheater question may remove too many participants (those who made an honest mistake when responding or were perhaps only temporarily distracted).
- Speeders: A participant who completes the study too quickly is cause for concern. For example, if a participant takes 2 minutes to finish a 50-question survey, it is highly unlikely that he or she is providing genuine, thoughtful responses. It’s more common for participants to speed through individual questions, pages, or tasks (if running an unmoderated study) than through the entire study. Two suggestions on speeding: First, don’t be too strict in defining your “too fast” threshold; we have been quite surprised by how quickly some people can answer survey questions and complete tasks. Second, where possible, look at the speed of individual tasks, pages, or questions rather than the entire study, which we’ve found is more sensitive for detecting speeders.
While flagging speeding participants for removal would seem an easy first step, earlier research and our own data have found that speed is not necessarily a good arbiter of poor versus high-quality responses in and of itself. The same goes for very slow participants: taking a long time may just indicate getting distracted while taking a survey, not poor quality. The primary exception is when you need to collect time-on-task in online usability studies; unrealistically long task times need to be removed. And the good news with speeders is that even if you don’t remove all of them, we’ve found little difference between speeder and non-speeder data, similar to earlier research [pdf] suggesting that speed alone isn’t a good indicator of poor-quality responses.
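To make the per-question speeding idea concrete, here is a minimal sketch in Python. The data, the 0.3×-median cutoff, and the three-question minimum are all illustrative assumptions, not established thresholds; the point is flagging respondents who are fast on *many* questions rather than on the study overall.

```python
import statistics

# Hypothetical per-respondent completion times (seconds) for each question.
question_times = {
    "r1": [12, 15, 9, 14],   # plausible pacing
    "r2": [2, 1, 2, 1],      # suspiciously fast on every question
    "r3": [11, 2, 13, 12],   # one fast answer: not flagged on its own
}

def flag_speeders(times_by_respondent, fraction=0.3, min_fast_questions=3):
    """Flag respondents who answer many questions far faster than the median.

    A respondent is flagged only when at least `min_fast_questions` of their
    per-question times fall below `fraction` of that question's median time,
    since a single fast answer is rarely meaningful on its own.
    """
    n_questions = len(next(iter(times_by_respondent.values())))
    # Median time per question across all respondents
    medians = [
        statistics.median(t[q] for t in times_by_respondent.values())
        for q in range(n_questions)
    ]
    flagged = set()
    for rid, times in times_by_respondent.items():
        fast = sum(1 for q, t in enumerate(times) if t < fraction * medians[q])
        if fast >= min_fast_questions:
            flagged.add(rid)
    return flagged

print(flag_speeders(question_times))  # -> {'r2'}
```

Note that `r3`, who rushed only one question, survives the check, which matches the advice above not to be too strict with a single fast answer.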
Less Obvious and More Difficult Detection Methods
Like hackers who learn to evade anti-spam detectors or speeding motorists who know where the speed traps are, some panel participants have found increasingly sophisticated ways to game the system (receive an honorarium without conscientiously responding). This is especially concerning given that a small number of participants belong to multiple panels.
- Inconsistent responses: Some questions in a study tap into a similar concept but are phrased with a positive or negative tone. The SUS is a good example of a questionnaire that uses this alternating tone. It doesn’t make much sense for a participant to strongly agree with the statement “The website is easy to use” and also strongly agree with the statement “I found the website very cumbersome to use.” This is an indication of an inconsistent response. Be careful when using this technique: we’ve found people can make a legitimate effort to complete a study but still forget to disagree with negatively toned statements. If you use this approach, it’s safer to require multiple inconsistent responses before flagging a participant.
- Missing data: When compensating participants, we’ll often make many, if not all, questions in our studies mandatory. When not all questions are required and participants neglect to answer many of them, this non-response is another symptom of poor-quality responses. Just as concerning as the non-response itself, though, is how your data might be biased if participants are systematically skipping certain questions. Consider examining whether the missing data is random or systematically biased.
- Pattern detection: Participants who respond in conspicuous patterns, such as straight-lining (all 5’s or all 3’s) or alternating between 5’s and 1’s on rating scales, may also indicate a bot or a disingenuous respondent. But again, be careful: if participants had a good experience on a website, it’s not surprising for them to rate the experience as exceptional on, say, 8 or 10 items. It’s more concerning if you see straight-lining or patterns on 20 or 30 questions in a row.
- Session recordings: If the study is task-based with screen recordings, you can observe what participants are doing while completing the study. For example, you may observe no activity on the screen during a task, participants distracted by Facebook, or, in some cases we’ve seen, haphazard clicking as if to fool fraud detection, similar to dodging speed traps.
- Disqualifying questions: For many studies we look for participants with particular characteristics/criteria. If a participant is admitted into a survey by answering a screening question a certain way but then reveals in the open-ended answers that he or she is not qualified, exclude that participant (e.g., a participant needs to have a particular credit card but reveals in an open-ended question that he or she doesn’t have any credit cards). This can also be combined with a cheater question if, for example, participants state they are familiar with fictitious brands or have bought products that don’t exist.
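Two of the checks above, inconsistent responses and pattern detection, can be sketched directly in code. The cutoffs here (agreement ≥ 4 on a 1–5 scale, two contradictions before flagging, a 20-item straight-line run) are illustrative assumptions chosen to mirror the cautions in the text, not fixed rules.

```python
def flag_inconsistent(ratings, pairs, threshold=2):
    """Flag a respondent who strongly agrees with both halves of
    positively/negatively worded item pairs (SUS-style alternating tone).

    `ratings` maps item names to 1-5 agreement scores; `pairs` lists
    (positive_item, negative_item) tuples. Requiring at least `threshold`
    contradictions avoids punishing a single honest mistake.
    """
    contradictions = sum(
        1 for pos, neg in pairs
        if ratings[pos] >= 4 and ratings[neg] >= 4
    )
    return contradictions >= threshold

def flag_straightlining(ratings, min_run=20):
    """Flag long runs of identical answers (e.g., all 5s across 20+ items).

    Short runs are normal: a genuinely good experience can earn 8-10
    identical top ratings, so only very long runs are suspicious.
    """
    values = list(ratings)
    run = best = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best >= min_run
```

For example, a respondent who rates both “easy to use” and “very cumbersome” a 5 on two separate item pairs would be flagged by `flag_inconsistent`, while someone giving 8 identical top marks would pass `flag_straightlining`.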
Applying These Methods to Clean Your Data
We use the above approaches to informally score each respondent on response quality, increasing the “strictness” of the criteria depending on the needs of the study. Respondents who fail multiple checks are the first to be flagged for removal.
For example, respondents who provided gibberish responses, answered cheater questions incorrectly, and completed the study implausibly fast are the first to be removed. A participant who answered a cheater question wrong but provided helpful responses to the open-ended questions is often worth retaining.
We also highly recommend not permanently deleting respondents from your dataset! Instead, flag them as poor quality in a way that allows you to unflag them in the future. We’ve found that in some cases we were too strict in applying our criteria and removed too many people and then needed to add them back into the results.
Keep in mind that you’re measuring people, not robots, and people get tired, bored, and distracted but still want to provide genuine input. Some level of poor quality responses is inevitable, even from paid respondents, but the goal is to winnow out those who don’t seem to provide enough effort from those who perhaps got a little distracted.