How much do they affect the quality of online research?
With in-person studies, we see each participant's engagement level. With data collected remotely, we need another way to determine whether participants have engaged with the study or whether, as may happen when they are in it strictly for the money, they have rushed through it.
We use three methods to weed out suspect respondents:
- cheater questions
- open-ended prompts
- duration tracking
Cheater Questions: Traditionally, "select this response" questions, scattered throughout a survey, have exposed participants who pay no attention, answer haphazardly, or give "straight-line" responses. Because more sophisticated paid participants have learned to look for such questions, these questions now tend to catch only the most egregiously reckless responders. Cheater questions have become the CAPTCHA of surveys: intended to keep out the robots but not doing the job very well. Instead, these questions often create false positives.
Open-Ended Prompts: Examining open-ended comments can provide insight into the quality of responses. Unfortunately, doing so is laborious and subjective. What's more, a participant can give a marginally useful answer, such as "no comment," and still not be excluded this way.
Duration Tracking: Online survey software tracks the time that participants spend in a study. This data can help flag speeders. This method is attractive because it’s more objectively applied and harder for a disingenuous participant to work around.
At MeasuringU we use all three methods when filtering our responses, since each method, on its own, has flaws. Duration tracking tends to be one of the more universally applied methods for winnowing out poor-quality responses, so we wanted to understand what effect speeders have on the quality of the findings. First we looked for extant research on the effects of speeders.
The effect of speeding on data quality was examined in a recent paper, "The Impact of Speeding on Data Quality," by researchers from Germany led by Robert Greszki. They examined responses in two large datasets (1,100+ and 1,500+ respondents). Participants were asked about their attitudes toward the heads of government in the US and Germany. Both surveys used paid participants, and the median survey-completion time was between 27 and 33 minutes.
As a first filter, the researchers removed the participants deemed to have completed the survey much too quickly. The fastest respondents took between 2 and 6 minutes. These durations were considered impossibly fast for a survey designed to take 30 minutes.
For the remaining respondents, the next question was, "How quick is too quick?" Participant attentiveness waxes and wanes throughout a study, so instead of looking at total study time, the researchers focused on response time per page (most likely 1-3 questions per page), with both surveys having around 50 pages of questions (yikes!).
They defined "typical time" as the median response time per page. Speeders were then defined using three buckets: those who completed the studies 30%, 40%, and 50% faster than the median time. The last group contained participants who answered the questions in less than half the median time; it accounted for between 3% and 11% of responses. In comparison, the 30% group contained 20%-25% of the sample.
To my surprise, the researchers found no significant differences in response patterns across all speeding categories and across a number of questions—even such politically charged questions as the president’s handling of the economy. They concluded, using a number of criteria, that excluding speeders didn’t impact the results in a substantial way.
Our experience with speeders is that they add noise to the data and likely respond haphazardly. But we never examined the data closely to see if the fastest responses systematically bias the results.
To see how well these findings applied to user research data, we examined 10 datasets from some recent unmoderated usability studies and surveys we conducted. Participants in all studies were obtained from paid panel respondents (Americans of various ages and both genders). These studies focused primarily on websites of retailers and software manufacturers. The median sample size for the studies was 253, with a range of 104 to 1763.
Following the delineation by the Greszki et al research, we identified participants who completed the studies in less than 50% of the median time. For example, if the median time to complete the study was 500 seconds, 50% speeders were those who took 250 seconds or less.
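This cutoff is simple to compute. A minimal sketch in Python (the function name and example durations are ours, for illustration):

```python
import statistics

def flag_speeders(durations, fraction=0.5):
    """Flag responses faster than `fraction` of the median duration.

    durations: study-completion times in seconds, one per participant.
    Returns a list of booleans, True where the response is a speeder.
    """
    cutoff = fraction * statistics.median(durations)
    return [t <= cutoff for t in durations]

# Matches the example in the text: median of 500 seconds -> cutoff of 250.
times = [500, 490, 510, 250, 240, 520, 600]
print(flag_speeders(times))
# -> [False, False, False, True, True, False, False]
```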
Unmoderated usability studies include a mix of task-based questions and study-based questions. Surveys include only study-based questions. We selected task- and study-based metrics to look for patterns among the fastest respondents. We made 84 comparisons on questions ranging from perceived task difficulty to brand favorability and found 9 differences to be statistically significant (p < .05). Given the large number of comparisons, however, we'd expect chance alone to yield 8 significant differences, so this finding alone isn't compelling. See Chapter 10 in *Quantifying the User Experience* for more discussion of making multiple comparisons.
The surveys Greszki et al examined enabled them to focus on the page-level response time. Our study data isn’t broken out by page, but we do have task-based data. We looked at 30 tasks across 7 of the datasets which contained task-based metrics of task-ease (SEQ), confidence, and, for a subset, completion rate.
As a first pass, we removed the fastest participants ("superspeeders") for each task (those who took less than 25% of the median time), keeping those who took between 25% and 50% of the median time. The percentage of speeders for each task then ranged from 4% to 16%, which represented between 14 and 294 people.
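Under these definitions, each task response falls into one of three bands relative to the median task time. A sketch of that bucketing, using the thresholds from the text (the function name is ours):

```python
import statistics

def bucket_by_speed(task_times):
    """Bucket each response as 'superspeeder' (25% or less of the median),
    'speeder' (between 25% and 50% of the median), or 'nonspeeder'."""
    median = statistics.median(task_times)
    buckets = []
    for t in task_times:
        if t <= 0.25 * median:
            buckets.append("superspeeder")
        elif t <= 0.5 * median:
            buckets.append("speeder")
        else:
            buckets.append("nonspeeder")
    return buckets

# Hypothetical task times in seconds (median is 95):
print(bucket_by_speed([100, 95, 105, 40, 20, 110, 90]))
```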
We made 74 comparisons between speeders and nonspeeders, of which 20 were statistically significant (p < .05). Given this number of comparisons, we'd expect 7 to be statistically significant from chance alone, suggesting that these differences might mean something (the number of statistically significant differences is greater than what we'd expect from chance). Digging deeper into the differences, we found that 10 had higher scores for speeders and 10 had lower, suggesting no pattern. Again, this includes a combination of task-completion rate, task ease, and confidence.
With smaller sample sizes in each group (some tasks had only 14 speeders) the power to detect a difference (if one exists) is low. Therefore, using a meta-analytic technique, we ignored the significance levels and looked at the raw differences for each task metric. The results are shown in Table 1.
| Speeder Score | Completion Rate | SEQ | Confidence |
|---|---|---|---|
| 95% CI Around Difference | (-.16 to .16) | (-.17 to .17) | (-.41 to .41) |
Table 1: Difference in task-level metrics between speeders and nonspeeders.
For example, 14 tasks had completion data; of those, 7 showed speeders with higher task-completion rates and 7 had lower task-completion rates. For task difficulty, 18 of the 30 tasks we examined had speeders rating the task slightly easier. For confidence, 17 of the 30 tasks showed speeders rated confidence lower. Notice how the percentages hover around 50% in the “% Higher” column, suggesting speeders are just as likely to rate higher or lower than the rest of the responses.
Table 1 shows a subtle but not statistically significant pattern in responses for the speeders. In general, speeders have a slightly lower completion rate (1% lower), rate task ease slightly higher (.12 points higher) and confidence slightly lower (.1 point lower).
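The summaries in Table 1 follow this shape: take the per-task difference (speeder mean minus nonspeeder mean) for a metric, average it across tasks, and put a 95% confidence interval around that average. A minimal sketch, with made-up differences for illustration:

```python
import statistics

def mean_diff_ci(diffs, t_crit=2.0):
    """Mean of per-task differences with an approximate 95% CI.

    diffs: one (speeder mean - nonspeeder mean) value per task.
    t_crit: critical value; 2.0 roughly approximates t for moderate df.
    """
    n = len(diffs)
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / n ** 0.5  # standard error of the mean
    return mean, (mean - t_crit * se, mean + t_crit * se)

# Hypothetical SEQ differences for 5 tasks:
mean, (lo, hi) = mean_diff_ci([0.2, -0.1, 0.3, 0.0, 0.1])
# mean is 0.10; the CI is approximately (-.04 to .24), crossing zero
```

When the interval crosses zero, as all three intervals in Table 1 do, the data is consistent with no real difference between the groups.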
Our data also shows that speeders do not substantially affect the responses. We would have expected the task-completion rates, at least, to be a lot lower for speeders, as task completions provide objective answers. Surprisingly, there isn’t much of a pattern in the responses.
To see whether this non-pattern in responses holds for the most egregious respondents (most of whom would likely be cut due to impossibly fast times), we looked at the extreme group of speeders initially excluded ("superspeeders"). These took 25% or less of the median task time. If the median time was 100 seconds, superspeeders took 25 seconds or less to complete the task. The percentage of superspeeders for each task ranged from 6% to 19%, similar to the per-task percentage of speeders.
Superspeeder data (Table 2) reveals a subtle but not statistically significant pattern, slightly different from that of the speeders. In general, superspeeders have the lower completion rate we expected (25% lower but not statistically different from 0), rate task ease slightly lower (.05 points lower—in the opposite direction of the speeder group), and rate confidence slightly lower (.44 points lower).
| Superspeeder Score | Completion Rate | SEQ | Confidence |
|---|---|---|---|
| 95% CI Around Difference | (-.35 to .35) | (-.38 to .38) | (-.76 to .76) |
Table 2: Difference in task-level metrics between superspeeders and nonspeeders.
The lower task-completion rate provides some evidence that these superspeeding participants answer the task-verification question incorrectly more often than the general population of respondents. Future analyses can test whether this pattern holds.
Our research corroborated the results of the earlier study, which indicated no substantial impact of speeders on the results of survey data. We extended that finding to task-based usability data and user-research surveys. The patterns that did emerge could be attributed to chance. Typically, we would throw out data from superspeeders, but our data suggests that doing so would have little impact on the results (other than, perhaps, showing lower task-completion rates).
We will continue to examine the effects of speeding, looking for a linear relationship between the quality and duration of the response. For now, we can conclude and recommend the following:
- Speeders (those who take less than 50% of the median task time) and superspeeders (less than 25% of the median task time) account for between 5% and 20% of responses
- Responses from speeders and superspeeders appear to have little detectable impact (and no consistent pattern) on the metric averages (including task completion, perceived ease, and confidence)
- Nothing indicates that participants with fast overall study times give different responses from average-speed participants, suggesting that study-completion time alone isn't an effective filter. (Exception: a participant who spends 2 minutes answering 50 pages of questions can safely be excluded.)
- Effects of speeding are more likely to be detected at the task, page, or question level than at the study level, which suggests you should watch for fast responses rather than fast responders.
- When filtering data for quality, use multiple methods (cheater questions, duration tracking, and open-ended comments), since each has flaws on its own
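The last recommendation can be combined into a single screening pass. A sketch of a multi-signal filter, assuming each response record carries a cheater-question result, a duration, and an open-ended comment (the field names and comment-length threshold are hypothetical):

```python
import statistics

MIN_COMMENT_LENGTH = 10  # hypothetical cutoff for a substantive comment

def screen_responses(responses):
    """Keep responses that pass all three quality checks:
    cheater questions, duration tracking, and open-ended comments."""
    median_time = statistics.median(r["duration"] for r in responses)
    kept = []
    for r in responses:
        passed_cheater = r["cheater_ok"]                      # cheater questions
        not_speeding = r["duration"] > 0.5 * median_time      # duration tracking
        substantive = len(r["comment"].strip()) >= MIN_COMMENT_LENGTH  # open ends
        if passed_cheater and not_speeding and substantive:
            kept.append(r)
    return kept

responses = [
    {"cheater_ok": True, "duration": 500, "comment": "The checkout flow was confusing."},
    {"cheater_ok": True, "duration": 200, "comment": "Fast and easy overall site."},
    {"cheater_ok": False, "duration": 520, "comment": "Navigation felt cluttered to me."},
]
print(len(screen_responses(responses)))  # only the first response passes all checks
```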