It is common to think of time-on-task as data gathered only during summative evaluations because, during a formative evaluation, the focus is on finding and fixing problems, or at least finding the problems and delivering a report. For a variety of reasons, time-on-task measures often get left out of the mix. In this article, I show that time-on-task can be a valuable diagnostic and comparative tool during formative evaluations.
The three most common reasons I’ve heard for not using time-on-task in formative studies are:
- Using quantitative measures requires larger samples (>20).
- Average task-times are an inaccurate metric when users think-out-loud.
- Task times are only for benchmarking and not for identifying problems.
Below I discuss why these reasons should NOT prevent you from collecting time-on-task in your next formative evaluation.
Small Samples are Fine
One can collect time-on-task and use parametric statistics for the small (<10) sample sizes in usability tests. The major caveat is that small-sample statistical procedures should be used. For example, when calculating confidence intervals for task time, use t-statistics instead of the normal deviate (z-statistic) because t-statistics take the size of the sample into account in generating the interval. The smaller the sample, the larger the critical t-value (and so the wider the interval); as your sample gets larger (especially above 30), the t and z values converge. For task completion or problem occurrences, the Adjusted Wald procedure for computing confidence intervals around a proportion also performs well for small samples (Sauro & Lewis 2005). In short, sample size alone does not preclude collecting time-on-task or using statistics to describe it.
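As a rough illustration of the two small-sample intervals mentioned above, here is a minimal Python sketch using only the standard library. The function names and the task times are hypothetical. Computing the interval on log-transformed seconds and reporting the geometric mean follows the convention in the Figure 1 caption, and is an assumption rather than a requirement of the t-interval itself. The critical values are the standard two-sided 95% table values for Student's t.

```python
import math
from statistics import NormalDist, mean, stdev

# Two-sided 95% critical values of Student's t for df = 1..15 (standard table).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
             11: 2.201, 12: 2.179, 13: 2.160, 14: 2.145, 15: 2.131}


def task_time_ci(times_sec):
    """95% t-interval for task time, computed on log-transformed seconds.

    Returns (geometric_mean, lower, upper), back-transformed to seconds.
    Supports 2 to 16 users, the range covered by T_CRIT_95.
    """
    logs = [math.log(t) for t in times_sec]
    n = len(logs)
    t_crit = T_CRIT_95[n - 1]  # n - 1 degrees of freedom
    margin = t_crit * stdev(logs) / math.sqrt(n)
    center = mean(logs)
    return math.exp(center), math.exp(center - margin), math.exp(center + margin)


def adjusted_wald_ci(successes, n, confidence=0.95):
    """Adjusted Wald interval for a proportion (e.g., task completion rate)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1] because a proportion cannot fall outside that range.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Hypothetical task times in seconds for 8 users (illustrative only).
times = [42, 55, 61, 48, 120, 67, 53, 75]
gm, t_low, t_high = task_time_ci(times)
print(f"Geometric mean {gm:.1f}s, 95% CI {t_low:.1f}s to {t_high:.1f}s")

# Hypothetical completion data: 7 of 8 users completed the task.
p_low, p_high = adjusted_wald_ci(7, 8)
print(f"Completion rate 95% CI: {p_low:.2f} to {p_high:.2f}")
```

With only 8 users the critical t-value is 2.365 rather than the 1.96 used for the z-interval, which is why the small-sample interval comes out wider.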
Task-Time as a Benchmark between Designs
If you are doing a formative evaluation as part of an iterative testing plan and you have used think-aloud during all iterations, the mean time-on-task becomes a benchmark to help judge the efficacy of subsequent designs. Although small samples mean you will need larger differences between iterations, statistically significant differences are well within reach (for example, see Bailey 1993). That is, assuming you use the same tasks and have users think aloud concurrently with their task attempts, you can compare the mean completion times across iterations. For improving the usability of a system, practitioners should also strongly consider relaxing their Type I rejection criterion (Sauro 2006) from the conventional publication threshold of p < .05 to, say, p < .10 or .20. While this is always context dependent, in business applications one should look for a sufficient amount of evidence, not necessarily a preponderance of evidence, to conclude a design improves over its predecessor (Kirakowski 2005).
Getting an accurate and stable measure of the actual user time-on-task is more problematic than comparing designs. One would expect task times to increase when users are asked to think aloud while completing tasks. The published data, however, are mixed, with some studies actually showing faster performance while thinking aloud, possibly due to the invocation of cognitive processes that improve rather than degrade performance (Berry and Broadbent 1990). For a good summary of the evidence, see Lewis 2006, p. 1282. More research is needed to draw a conclusion on this point. Regardless, I recommend focusing on relative task-time improvements between designs because this avoids the issue altogether.
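To make the comparison between iterations concrete, here is a minimal sketch in standard-library Python. The article does not name a specific test. Welch's two-sample t-test on log-transformed times is one reasonable choice when different users test each iteration, and it is an assumption here. The function names, the task times and the alpha of .10 are illustrative only. The p-value is obtained by numerically integrating the t density, which avoids any third-party dependency.

```python
import math
from statistics import mean, variance


def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value for Student's t via Simpson integration of the density.

    `steps` must be even. Intended for the small df typical of usability tests.
    """
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    b = abs(t_stat)
    h = b / steps
    total = pdf(0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)


def welch_t_test(a, b):
    """Welch's t-test for two independent samples; returns (t, df, p)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df, t_two_sided_p(t_stat, df)


# Hypothetical think-aloud task times (seconds) from two design iterations.
iteration_1 = [95, 120, 88, 150, 110, 132, 101, 140]
iteration_2 = [80, 92, 75, 118, 99, 85, 105, 90]

# Compare on the log scale, where task times are closer to symmetric.
t_stat, df, p_value = welch_t_test([math.log(t) for t in iteration_1],
                                   [math.log(t) for t in iteration_2])

ALPHA = 0.10  # relaxed Type I criterion; choose per context
print(f"t = {t_stat:.2f}, df = {df:.1f}, p = {p_value:.3f}")
print("Evidence of improvement" if p_value < ALPHA else "Insufficient evidence")
```

If the same users test both iterations, a paired test would be the more appropriate choice than the independent-samples test sketched here.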
Task Times as Symptoms of UI Problems
While the absolute time might not be the best measure of the true task completion time, it allows analysis of outliers and patterns as a diagnostic tool. It might not tell you exactly what the problem is, but it can help tell you where there is a problem. For example, the following task data graphed in Figure 1 were taken from the publicly available CUE-4 (Molich 2004) data from Team M, which timed 15 users while they thought-out-loud as they completed tasks on a hotel reservation website. This task asked the users to cancel a reservation.
Figure 1: Time to cancel a reservation on a hotel website (in log-transformed seconds). One user took over 4 times the mean time to complete the task. The solid red line is the geometric mean and the dashed green lines are the upper and lower bounds of the 95% confidence interval.
In graphing the data we quickly see that one user took over 4 times longer than the mean time to cancel the reservation (I graphed the data using the Graph and Calculator for Confidence Intervals for Task Times). This simple graph of the task times allows the investigator and the reader of a report to zero in on potential causes of such a long task time (relative to the other users). While it’s unclear from the report what was occurring during this task, an analysis of this user’s profile shows that she had never visited a hotel website or made a reservation at one prior to the test. Her comments also reinforce her being a “novice” Internet user: “I feel that my inexperience with the web had a lot to do with difficulties.” Whether it was just the user’s inexperience or some specific interface problems, perhaps particularly damaging to a novice, it is clear this user had trouble during the task. A few pixels tell the story.
Time-on-task is an under-utilized tool for formative evaluations. It costs nothing (just start and stop a timer), is useful with any number of users, and can be a valuable tool for diagnosing problems as well as making objective comparisons between iterations. I encourage you to collect time-on-task during your next formative evaluation.
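The same screening can be done without a graph. The sketch below, in standard-library Python, flags users whose time is a large multiple of the geometric mean. The function name, the cutoff of 3 times the geometric mean and the participant times are all hypothetical; they are not the CUE-4 values, and the cutoff is a judgment call rather than a rule from this article.

```python
import math


def flag_slow_users(times_by_user, ratio=3.0):
    """Return (geometric_mean, flagged), where `flagged` maps each user whose
    time exceeds `ratio` times the geometric mean to that multiple."""
    logs = [math.log(t) for t in times_by_user.values()]
    geo_mean = math.exp(sum(logs) / len(logs))
    flagged = {user: t / geo_mean
               for user, t in times_by_user.items()
               if t > ratio * geo_mean}
    return geo_mean, flagged


# Hypothetical task times in seconds (illustrative only).
times_by_user = {"P1": 70, "P2": 85, "P3": 60, "P4": 95, "P5": 110,
                 "P6": 75, "P7": 600, "P8": 90, "P9": 80, "P10": 100}

geo_mean, flagged = flag_slow_users(times_by_user)
print(f"Geometric mean: {geo_mean:.0f}s")
for user, multiple in flagged.items():
    print(f"{user} took {multiple:.1f}x the geometric mean; review the session")
```

A flag like this only says where to look. The session notes and the user's profile are still needed to say what went wrong.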
- Berry, D. C., and Broadbent, D. E. (1990). The role of instruction and verbalization in improving performance on complex search tasks. Behaviour & Information Technology, 9, 175-190.
- Bailey, G. (1993). Iterative methodology and designer training in human-computer interface design. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems (Amsterdam, The Netherlands, April 24-29, 1993). CHI ’93
- Kirakowski, J. (2005). Summative Usability Testing: Measurement and Sample Size. In R. G. Bias and D. J. Mayhew (Eds.), Cost Justifying Usability: An Update for the Internet Age. Morgan Kaufmann Publishers, CA.
- Lewis, J. R. (2006). Usability testing. In G. Salvendy (Ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.
- Molich, R. (2004). Comparative Usability Evaluation CUE-4.
- Sauro, J. (2006). The User is in the Numbers. ACM Interactions, Volume 13, Issue 6, November-December.
- Sauro, J., and Lewis, J. R. (2005). Estimating Completion Rates from Small Samples using Binomial Confidence Intervals: Comparisons and Recommendations. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (HFES 2005), Orlando, FL.