One box, two box, red box, blue box …

Box scoring isn’t just something they do in baseball.

Response options for rating scale data are often referred to as boxes because, historically, paper-administered surveys displayed rating scales as a series of boxes to check, like the one in Figure 1.

The most favorable response to an item, like “Strongly agree” in Figure 1, is called the *top box.*

When analyzing rating scale data, you can compute the average of the numeric responses from the box scores (usually the mean) or compute a frequency distribution for each response option like the one in Figure 2. The data in Figure 2 comes from our 2022 UX report on airline websites. Respondents were asked to rate several items, including the experience of selecting a seat (1 = Very Poor, 7 = Excellent), so the top-box score here is the number of times respondents selected 7 divided by the total number of respondents.
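To make the computation concrete, here is a minimal Python sketch of a top-box score; the response values are hypothetical, not data from the study:

```python
# Top-box score: share of respondents choosing the best option (7 = Excellent).
# These seven-point ratings are made up for illustration only.
responses = [7, 6, 7, 5, 7, 4, 6, 7, 3, 6]

top_box = responses.count(7) / len(responses)
print(f"Top-box score: {top_box:.0%}")  # 4 of 10 respondents chose 7 -> 40%
```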

Both the mean and box scores can be helpful in describing participants’ sentiments and tracking changes in attitudes, but we’ve found compelling data that extreme attitudes (positive or negative) tend to be better predictors of behavior than the mean.

In this article, we dive deep into data collected for a rating scale item to guide initial insights about whether it’s better for UX and CX researchers to report the mean, a box score, or both, and if reporting a box score, which is better (top box, top-two box, bottom box, or net box).

# Example of Mean and Box Scores from a Study of Airline Websites

Table 1 shows different ways of analyzing the responses (mean, top-box score, top-two-box score, bottom-box score, and net-box score) to a seven-point rating scale used to assess the ease of selecting seats on the websites of 12 airlines.

Airline | Mean | Mean100 | Top Box | Top-Two Box | Bottom Box | Net Box |
---|---|---|---|---|---|---|
Air Canada | 5.4 | 73% | 21% | 53% | 2% | 19% |
Air France | 5.5 | 75% | 22% | 58% | 0% | 22% |
Alaska Airlines | 6.1 | 85% | 35% | 83% | 0% | 35% |
American Airlines | 5.8 | 80% | 41% | 67% | 0% | 41% |
British Airways | 5.6 | 77% | 21% | 60% | 0% | 21% |
Delta | 5.7 | 78% | 29% | 67% | 2% | 27% |
Frontier | 5.0 | 67% | 24% | 42% | 3% | 21% |
JetBlue | 5.7 | 78% | 37% | 56% | 0% | 37% |
Lufthansa | 5.4 | 73% | 24% | 53% | 0% | 24% |
Ryanair | 5.0 | 67% | 16% | 37% | 0% | 16% |
Southwest | 5.7 | 78% | 39% | 64% | 3% | 36% |
United | 5.9 | 82% | 38% | 70% | 4% | 34% |
Average | 5.6 | 76% | 29% | 59% | 1% | 28% |

For this type of scale (the same as Figure 2, 1 = Very Poor, 7 = Excellent), the indicators of a better experience are higher means, higher top-box, top-two-box, and net-box scores, and lower bottom-box scores.

## Mean

The Mean column of Table 1 shows the arithmetic mean of the responses to the seven-point scale. For easier comparison with the box score percentages, the table also shows Mean100 values, which are the means linearly rescaled from the seven-point scale to a 0–100% scale. The poorest mean rating of ease of seat selection was 5.0 (67%) for Frontier and Ryanair; the best was 6.1 (85%) for Alaska Airlines.
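The Mean100 rescaling maps the lowest possible mean (1) to 0% and the highest (7) to 100%. A minimal sketch, with the function name `mean100` chosen here for illustration:

```python
# Rescale a mean on a 1..max_pt scale to a 0-100% scale.
def mean100(mean, max_pt=7):
    return (mean - 1) / (max_pt - 1) * 100

# Values from Table 1.
print(round(mean100(5.0)))  # Frontier and Ryanair -> 67
print(round(mean100(6.1)))  # Alaska Airlines -> 85
```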

## Top Box

The top-box score for this scale is the percentage of respondents selecting the best response option (7 = Excellent). It ranged from a low of 16% for Ryanair (consistent with the mean) to a best of 41% for American Airlines (rather than Alaska Airlines, which had the best mean).

## Top-Two Box

The top-two-box score for this scale is the percentage of respondents selecting 6 or 7, so it is always equal to or greater than the top-box score. The poorest top-two-box score was 37% for Ryanair (consistent with the mean and top-box scores). The best top-two-box score was 83% for Alaska Airlines (instead of American Airlines, consistent with the mean but not the top-box score).

## Bottom Box

The bottom-box score for this scale is the percentage of respondents selecting the worst response option (1 = Very Poor), so low percentages indicate better experiences. Air Canada and Delta had 2%, Frontier and Southwest had 3%, and United was the worst performer at 4%. Seven airlines were tied as the best performers at 0%.

## Net Box

There is usually a strong need to provide a single number for analyzing or communicating the attitude measured with a rating scale item, hence the use of the mean, top-box, top-two-box, or bottom-box scores. However, sometimes there is information contained in the bottom box that isn’t necessarily reflected in the top box (and vice versa). One strategy for packing more information into a single score is to use net scoring—subtracting bottom-box from top-box percentages (using any specified number of top and bottom boxes).

Three examples of standardized measures that use net scoring are the Net Promoter Score (NPS), the Microsoft Net Satisfaction score (NSAT), and Forrester’s Customer Experience Index (CxPi).

- Net Promoter Score: top-two-box score minus bottom-seven-box score for one eleven-point item.
- Microsoft NSAT: top-box score minus bottom-box score for one four-point item.
- Forrester’s CxPi: average of top-two-box minus bottom-two-box scores over three five-point items.

The net score in Table 1 is the top-box score minus the bottom-box score. Because the bottom-box scores were all equal or nearly equal to 0, the net scores were almost identical to the top-box scores in this example (in many cases they were identical). The poorest net score was 16% for Ryanair, and the best was 41% for American Airlines (the same as their top-box scores).
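Pulling the four box scores together, here is a minimal sketch; the `box_scores` helper and the ratings are illustrative, not part of the study:

```python
# Compute top-box, top-two-box, bottom-box, and net-box scores
# from raw seven-point ratings (hypothetical data for illustration).
def box_scores(ratings, scale_max=7):
    n = len(ratings)
    tops = sum(r == scale_max for r in ratings)
    bottoms = sum(r == 1 for r in ratings)
    top_two = sum(r >= scale_max - 1 for r in ratings)
    return {
        "top": tops / n,
        "top_two": top_two / n,
        "bottom": bottoms / n,
        "net": (tops - bottoms) / n,  # net box = top box minus bottom box
    }

scores = box_scores([7, 7, 6, 5, 1, 7, 6, 4, 7, 2])
print(scores)  # {'top': 0.4, 'top_two': 0.6, 'bottom': 0.1, 'net': 0.3}
```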

## Correlations and Profile Graph

Because we’re using different methods to analyze the same dataset, we expect the measures to be related. Table 2 and Figure 3 show, respectively, the correlations among the different metrics at the product level and a profile graph for the 12 airline websites.

| | Mean | Top Box | Top-Two Box | Bottom Box |
|---|---|---|---|---|
| Top Box | 0.76** | x | x | x |
| Top-Two Box | 0.97*** | 0.69** | x | x |
| Bottom Box | −0.01 | 0.25 | 0.05 | x |
| Net Box | 0.79** | 0.98*** | 0.70** | 0.07 |

The correlations in Table 2 were highly significant (*p* < .001) for means with top-two-box scores (.97) and for top-box scores with net-box scores (.98) and very significant (*p* < .01) for top-box scores with means (.76) and top-two-box scores (.69). For this item (ease of seat selection), bottom-box scores did not correlate significantly with any other metric.

The profile graph in Figure 3 illustrates these correlations and provides additional information about how much the metrics differed in magnitude. For these data, the bottom-box scores were not informative because their range was restricted to 0–4%; this range restriction prevented the bottom-box scores from correlating significantly with any other metric. Also, because the bottom-box scores were so small, there was virtually no difference in magnitude between top-box and net-box scores (average top-box score: 29%; average net-box score: 28%; difference: 1%). The correlation between means and top-two-box scores was also very high, but those lines were nearly parallel at different levels rather than essentially overlaid (average Mean100: 76%; average top-two-box score: 59%; difference: 17%).

# Discussion

These results are from analyses of one item (ease of airline seat selection) collected in a retrospective UX survey of websites from one business sector (airlines). Note, however, that the response distributions for most rating scale items used in UX research are skewed toward the positive end of the scale, like the one shown in Figure 2. In the future, we plan to expand the number of items in our analyses, but even with this limited scope, these results can provide some insights into two key questions:

- Should I report means, some type of box score, or both?
- If I report a box score, which is best?

## Mean or Box Score?

**Both means and top-box scores can be helpful in UX research because they answer different questions.** A common criticism of box scores relative to means is that box scores lose information because, unlike the mean, box scores are based on a fraction of the available responses. There are times when we have seen significant differences in the mean score of a rating scale but not in top-box scores. In other cases, however, there’s no significant mean difference, but there is a significant difference in measures of extreme responses (e.g., top-box scores). There are also times when their statistical outcomes are the same.

There is some evidence that the percentage of extreme responders tends to be a better predictor of future behavior than the mean, while the mean does a better job of characterizing changes in central tendency that might not be detected at the extremes (Sauro, 2018, May 2). It seems like the best way to cover both bases is to compute and report means and box scores … but which box score would be the best?

## Which Box to Choose: Top Box, Top-Two Box, Bottom Box, or Net Box?

**We prefer top-box over top-two-box scores when the research focus is on the prediction of future behavior.** Participants can have a very positive attitude, a tepid attitude, or a very negative attitude. Because measurements of extreme responses tend to be better predictors of future behavior than tepid responses, we prefer top-box to top-two-box measurements for most rating scale items.

A strong argument against top-two-box scores, especially when there are only five or seven response options, is that the second box “dilutes” the measure of the percentage of extreme respondents. This is logically less of an issue when there are, for example, eleven response options, as in the likelihood-to-recommend item used to collect data that are transformed into the Net Promoter Score.

The correlations in Table 2 are consistent with a preference for top-box over top-two-box scores for seven-point scales (and logically even more so for five-point scales) in this research context (prediction of future behavior). With a correlation of .76 between means and top-box scores, the shared variance of these metrics was the square of the correlation, 58%. This is a substantial amount of variance accounted for between the metrics. It still leaves 42% unaccounted for, which means these two metrics could differ in the extent to which they predict other metrics such as future behavior. In contrast, with a correlation of .97 between means and top-two-box scores, the shared variance is 94% (only 6% unaccounted for), so these metrics would differ very little in how well they predict other metrics.
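The shared-variance arithmetic above is just the square of each correlation, which a few lines of Python can confirm:

```python
# Shared variance between two metrics is the square of their correlation.
for r in (0.76, 0.97):
    print(f"r = {r}: shared variance = {r**2:.0%}")
# r = 0.76 -> 58% shared (42% unaccounted for)
# r = 0.97 -> 94% shared (6% unaccounted for)
```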

**Top-two-box scores can be an appealing, broad measure of the tendency of people to agree with an item.** If the research focus is on measuring the percentage of people who simply agree with a sentiment like brand attitude (without regard to the intensity of agreement), it is common to report top-two-box scores. It’s easier for stakeholders to understand a percentage agreement than to interpret the mean of a multipoint scale (e.g., top-two-box scores of 53% for Air Canada versus 83% for Alaska Airlines rather than their respective seven-point means of 5.4 and 6.1). It is possible, as we have done in this article, to express means of multipoint rating scales as percentages of their location on the scale from 0–100%, but this is not common and would be harder to explain to stakeholders than a simple top-two-box score.

We have seen top-two boxes in common practice for five- and seven-point items. A top-two box for a five-point scale covers 40% of the response options, all the levels of agreement above the neutral point. For a seven-point scale, the top-two boxes cover just 29% of the response options, excluding the weakest level of agreement captured in the box just above the neutral point, striking a balance between focusing on intensity and coverage.

Although it is less common than top-two-box scores, when there are seven response options a top-three-box score is also possible, including all levels of agreement above the midpoint. Three out of seven boxes cover 43% of the response options, making this closer to the properties of a top-two-box score for a five-point scale both conceptually (inclusive of all levels of agreement from tepid to extreme) and in percentage of coverage (40% for five-point scales, 43% for seven-point scales).

Figure 4 shows where a top-three-box score for the airline seat selection rating data falls relative to the mean expressed as a percentage, top-box, and top-two-box scores (copied from Figure 3). As expected, adding the third box increased the magnitude of the percentages, making it the highest line in the graph. The correlation between means and top-three-box scores (.96) was almost identical to the correlation between the mean and the top-two-box scores (.97), but its location was just above and almost overlaid the line of means expressed as percentages. That seems reasonable, because the top-three-box score is a summary measure of the ratings above the neutral point of the scale, while the mean is a summary measure of all the responses.

**Top-box scores will usually be more informative than bottom-box scores.** The decision to measure bottom-box rather than top-box scores depends on which seems to offer the best opportunity to discriminate between the products being rated. For certain questions, there may be few if any respondents who select the top box, but for most rating scales used in UX research, it is common for respondents to avoid selecting the bottom box (e.g., Figure 3). When the distribution for an item has more responses below the scale midpoint than above it, it might be more informative to report the bottom-box score.

**Net-box scores are sometimes useful but carry a lot of baggage.** It’s a bit harder to provide guidance for net-box scores. An advantage of net box is it uses information about both negative and positive extreme responses, so you don’t need to worry about whether top-box or bottom-box scores are more discriminating on a case-by-case basis. This can be especially helpful when analyzing and reporting many rating scales with different types of distributions (left skewed, centered, right skewed).

However, a disadvantage of the net-box score is that it is an uncommon type of metric—a trinomial. Unlike a binomial, which takes only two values (usually 0 and 1), a trinomial can take three values (−1 when a response is in the bottom box(es), +1 when a response is in the top box(es), and 0 for all other boxes). Our experience in working out statistical methods appropriate for trinomials indicates that the variability of trinomials is usually higher than for binomial metrics, so for a given sample size, the trinomial will be less precise, or alternatively, for a given precision, the sample size for trinomials will need to be larger than for binomial metrics.
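The variance claim can be checked analytically. Scoring responses as +1 (top box), −1 (bottom box), and 0 (everything else), the trinomial variance is E[X²] − E[X]² = (p_top + p_bottom) − (p_top − p_bottom)², versus p(1 − p) for a binomial. A sketch using illustrative proportions similar to the Table 1 averages (the helper names are ours, not standard API):

```python
# Variance of a trinomial scored -1/0/+1 vs. a binomial scored 0/1.
def trinomial_var(p_top, p_bottom):
    # E[X] = p_top - p_bottom; E[X^2] = p_top + p_bottom
    return (p_top + p_bottom) - (p_top - p_bottom) ** 2

def binomial_var(p):
    return p * (1 - p)

p_top, p_bottom = 0.29, 0.01  # illustrative proportions
print(round(trinomial_var(p_top, p_bottom), 3))  # 0.222
print(round(binomial_var(p_top), 3))             # 0.206 (top box alone)
```

With these proportions the trinomial variance exceeds the binomial variance, consistent with the lower precision described above.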

Another disadvantage of net-box scores is they are only useful in conveying information about both top and bottom boxes when there is variation in both top and bottom boxes. In our example item, few respondents selected the bottom box, so the net-box score was almost identical to the top-box score.

Furthermore, as shown in Figure 5, there are possible distributions where the net-box score will be 0 even though there are large differences in the distributions. We expect Panel A to be very rare while Panel B is more likely, though still not as common in UX research as the distribution in Figure 2. An important step when working with these types of rating scales is to examine their distributions for unusual patterns regardless of how you plan to summarize the data.

# Summary: Thinking Outside and Inside the Box

Because they measure different aspects of responses to rating scales, UX and CX researchers should compute means (walk) **and** at least one type of top-box score (chew gum).

In our internal research, we prefer to measure top-box scores because they are more independent of the mean than top-two-box scores and are more likely to predict behavior.

Researchers who report top-two box scores should have a clear rationale for that decision. An important part of that rationale is a focus on using an easily explained general measure of agreement while understanding that this reduces the intensity of the signal being measured. If the rating scales have seven response options (less common than five in most research contexts), researchers need to decide whether to stick with top-two-box scores (which exclude the most tepid level of agreement) or to use top-three-box scores to be more consistent with the measurement properties of top-two-box scores from five-point scales.

Bottom-box scores should be used only when there is enough variation for them to provide insight into important differences, either in place of or in addition to top-box scores depending on how informative the top-box scores are.

We find it hard to recommend the routine use of net-box scores due to their complex statistical properties and their potential for obscuring differences rather than revealing them. The NPS appears to be a useful net-box score, but it differs from the top-minus-bottom net-box scores computed from common five- and seven-point scales.

Finally, there’s a clear need for additional research to get a better estimate of how frequently the rating scale distributions collected in real-world UX research are similar to or different from the example we put under the microscope in this article.