How many jelly beans are in the jar?

Your best guess is probably wrong.

But if I were to ask a few hundred people their guesses and calculate the average, the average would turn out to be pretty accurate. Some guesses are in fact correct, but they are rare and you don’t know ahead of time which guess will be correct.
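To see why the average holds up so well, here is a minimal Python sketch. It assumes each guess is the true count plus independent random noise, which is a simplification rather than anything measured from a real jar. The function names, the 1,000-bean jar, and the spread of 400 are all invented for illustration.

```python
import random
import statistics


def simulate_guesses(true_count, n_people, spread, seed=0):
    """Hypothetical crowd: each guess is the truth plus independent noise."""
    rng = random.Random(seed)
    # Clamp at zero, since nobody guesses a negative number of jelly beans.
    return [max(0.0, rng.gauss(true_count, spread)) for _ in range(n_people)]


def crowd_vs_individual(guesses, true_count):
    """Return (error of the averaged guess, average error of a single guess)."""
    crowd_error = abs(statistics.mean(guesses) - true_count)
    typical_error = statistics.mean(abs(g - true_count) for g in guesses)
    return crowd_error, typical_error


# Assumed numbers for illustration only: 300 people, 1,000 beans in the jar.
guesses = simulate_guesses(true_count=1000, n_people=300, spread=400, seed=1)
crowd_error, typical_error = crowd_vs_individual(guesses, true_count=1000)
print(f"crowd average is off by {crowd_error:.0f} beans")
print(f"a typical individual is off by {typical_error:.0f} beans")
```

Whatever the guesses are, the error of the averaged guess can never exceed the average individual error. When the guesses are independent and not all skewed in the same direction, it is usually far smaller.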

If you’re familiar with the game show “Who Wants to be a Millionaire” (or the movie based on the Indian version of the show), you may know that when a contestant gets stuck trying to answer a question, they have an option to ask the audience.

Turns out the audience is right over 90% of the time (at least in the U.S.).

This track record doesn’t come from asking the smartest audience member. It comes from selecting the answer picked most often by the audience, which in essence uses the wisdom of the crowd.
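Picking the most popular answer is a plurality vote. A minimal Python sketch, with a made-up audience and a hypothetical function name, might look like this:

```python
from collections import Counter


def audience_answer(votes):
    """Return the most-picked answer and the share of the audience that chose it."""
    counts = Counter(votes)
    answer, n_votes = counts.most_common(1)[0]
    return answer, n_votes / len(votes)


# Invented audience of 100 people voting on a four-option question.
votes = ["B"] * 52 + ["A"] * 20 + ["C"] * 18 + ["D"] * 10
answer, share = audience_answer(votes)
print(f"The audience says {answer} ({share:.0%} of votes)")
```

Note that the top pick does not need a majority, only more votes than any other option.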

In many cases, the judgment from multiple people collected independently and then aggregated is better than even the best individual judgment. And this principle applies to more than trivial things: it applies to predictions with real consequences and from experts with advanced degrees.

Forecasting unemployment and the growth (or contraction) of an economy has major consequences for both public policy and the private investor. However, it’s usually the case that the average forecast from different economists is more accurate than the forecast of any individual economist. This has been called the Reverse Lake Wobegon effect, in reference to the fictional town where everyone is above average. In the reverse, everyone predicts worse than the average forecast, but it’s not fictional.

The idea of aggregating results is a powerful methodological tool that can smooth out unusual forecasts, scientific conclusions, and judgments from experts and novices alike. It’s the power behind meta-analysis and using the average of several polls to predict the winner of an election.

Aggregating Judgment in User and Customer Research

This same principle of aggregating can be applied to user and customer research as well. Here are five applications where aggregating is to your advantage.

  1. Problem Detection: When observing users interact with an app or website, what you get is often what you see, but people see different things. Different evaluators detect different problems, even when watching the exact same users. By aggregating the problem results, we benefit from the multiple perspectives by identifying more problems and a more diverse set of issues.
  2. Facilitation: The effect a trained facilitator can have on a participant or group of participants is subtle, but can influence their behavior and the data you collect. It could be the amount of time a facilitator gives a participant to respond or struggle with an application or the reaction to a comment that affects future comments. Such subtle influences can affect both the problems uncovered and the metrics collected. This is one of the reasons we use different evaluators in our sessions with clients. Using multiple evaluators can help offset subtle biases.
  3. Expert Reviews and Heuristic Evaluations: There is considerable judgment involved in examining an interface to identify potential problems for users. Using multiple independent evaluators, even junior ones, helps paint a more complete picture of the potential pitfalls. Aggregating the findings from multiple evaluators finds on average around a third of the problems found in a usability test. It also identifies plenty of other issues that if fixed will likely improve the user experience.
  4. Severity Ratings: Judging the severity of a usability problem is an inexact science. What makes a problem critical versus cosmetic varies depending on your perspective and experience. Using the principle of consensus forecasting here, averaging the severity ratings from multiple independent judges is likely to provide a more valid assessment of a problem’s impact.
  5. Multiple Participants: In-person testing, unmoderated testing, and surveying all share an important feature: they use the average results of multiple independent participants. While it may seem straightforward, averaging metrics like completion rates and responses to questions like those on the SUS and SUPR-Q provides a more accurate estimate of the unknown population. The average response from a sample, even a small one, is a surprisingly good estimate of the average from even very large populations.
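As a rough illustration of items 1 and 4, here is a minimal Python sketch of pooling problem lists and averaging severity ratings. The evaluator names, problem labels, and the 1-to-4 severity scale are all hypothetical, and the functions are one possible way to do the bookkeeping rather than a standard method.

```python
import statistics


def pooled_problems(findings_by_evaluator):
    """Combine problem lists and count how many evaluators reported each problem."""
    counts = {}
    for problems in findings_by_evaluator.values():
        for problem in set(problems):  # count each evaluator at most once
            counts[problem] = counts.get(problem, 0) + 1
    return counts


def mean_severity(ratings_by_judge):
    """Average each problem's severity across the judges who rated it."""
    scores = {}
    for ratings in ratings_by_judge.values():
        for problem, score in ratings.items():
            scores.setdefault(problem, []).append(score)
    return {problem: statistics.mean(s) for problem, s in scores.items()}


# Invented data: three evaluators watching the same sessions independently.
findings = {
    "evaluator_a": ["nav", "search"],
    "evaluator_b": ["search", "checkout"],
    "evaluator_c": ["search"],
}
# Invented severity ratings on an assumed scale of 1 (cosmetic) to 4 (critical).
ratings = {
    "judge_1": {"search": 3, "nav": 1},
    "judge_2": {"search": 2, "nav": 2},
    "judge_3": {"search": 4},
}
print(pooled_problems(findings))
print(mean_severity(ratings))
```

No single evaluator in this made-up data found all three problems, but the pooled list contains them all. The count per problem also gives a rough sense of which issues were most widely noticed.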

The Downsides of Aggregating

Of course, unlike polling, economic forecasts, jelly beans, and correct answers to factual questions, it’s often difficult to verify the accuracy of ideas and insights gleaned from most customer research. There isn’t an objectively correct severity rating or gold-standard list of usability problems to verify our methods with.

What’s more, relying on a group forecast isn’t a panacea; it has its pitfalls. For example, some audiences actually try to trick the contestant by giving the wrong answer! Judgments can also be based on false data or poor assumptions. Additionally, if people influence each other when making a judgment, it’s easy for groupthink to set in, which removes the advantage of pooling the judgments. This is why it’s essential to collect the judgments independently.
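A small Python sketch can show why independence matters. It assumes a simple model in which every judgment contains the same shared error (say, from hearing a confident first guess) plus that person's own independent error. The model, the function name, and the numbers are illustrative assumptions, not a description of how groupthink actually works.

```python
import random
import statistics


def crowd_error(true_value, n_people, own_noise, shared_noise, seed=0):
    """Error of the averaged judgment when everyone shares one common error."""
    rng = random.Random(seed)
    shared = rng.gauss(0, shared_noise)  # one error that everybody absorbs
    guesses = [
        true_value + shared + rng.gauss(0, own_noise)  # plus each person's own error
        for _ in range(n_people)
    ]
    return abs(statistics.mean(guesses) - true_value)


# Invented numbers: a true value of 100 judged by 500 people.
independent = crowd_error(100, n_people=500, own_noise=30, shared_noise=0)
influenced = crowd_error(100, n_people=500, own_noise=30, shared_noise=30)
print(f"independent crowd is off by {independent:.1f}")
print(f"influenced crowd is off by {influenced:.1f}")
```

In this model, adding more people shrinks the independent part of the error but leaves the shared part untouched, so a large crowd that has influenced itself can still miss by as much as a single person would.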

Despite the potential pitfalls and problems with checking the accuracy of judgments, the principle of aggregating independent judgments helps improve the reliability and validity of the assessments of the customer experience.