Card Sorting + Tree Testing: The Science of Great Site Navigation

Jeff Sauro, PhD

Card sorting is a popular method for understanding the mental model of the user.

Instead of organizing a website by some byzantine corporate structure, you base it on how the users think by having them sort items into categories.

It’s a method used as often as lab-based usability testing with 52% of practitioners using it in 2011.

Variations of card sorting have been used in psychological research for around 100 years. Today most card sorts are conducted online using virtual “cards” in software such as Optimal Workshop.

Yet as helpful as card sorts are for understanding the users’ point of view, users don’t come to a website and sort cards; they look for products and information. To understand how easily users can find items in a taxonomy, you need to have them try to find items.

For this reason I recommend conducting a follow-up tree test with a card-sort. A tree-test is just another name for a reverse card sort. Closed card-sorts have predefined categories and open card-sorts have users define their own category names.

Tree-tests present the user with a stripped-down version of the navigation, so there are no visual cues or context to aid findability; it’s just pure taxonomy.

A good time to use these two methods together is when you’re looking to improve upon a website’s navigation. First get a baseline measure of how well the current structure matches users’ mental models, identify which items are harder to find and see where users would put new items you’re looking to introduce.

For example, if a retailer wanted to introduce a line of small appliances, where should it place that category? Would users look for, say, sewing machines and blenders in this category? What would they call it, and should it be at the top level or bottom level? Card sorting and tree-testing can help.

I hosted a webinar on July 12th with UserZoom and walked through an example using items from Target. Here are some high-level insights for conducting both a card sort and a tree test, as well as answers to the questions from the 200 participants who joined.

The Open Card Sort

Here is the procedure we used to conduct the open-card sort.


40 Items to Sort: We selected a range of what we suspected were easy and difficult items (from experience with other retailers) across a range of Target’s departments.

We recruited 50 Users: At this sample size this would give us a margin of error of around 10% around most of our metrics (with 90% confidence). We dropped two participants from the study whose sorts were haphazardly done.

We fixed the number of categories to 10: This is large enough to allow freedom to sort, but few enough that we could realistically place this number of categories at the top-level navigation.

It took users on average 11 minutes to sort the 40 items. We also had users select which items they thought were the most difficult to sort and why.
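As a rough check on the sample-size reasoning above, the margin of error for a proportion can be sketched with the simple normal-approximation (Wald) formula. This formula is an assumption for illustration, since the exact interval method behind the figures is not described here, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.90):
    """Half-width of a normal-approximation (Wald) interval for a proportion."""
    # Two-sided critical value, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


# Worst case (p = 0.5) and a more lopsided proportion, both with 50 participants.
worst_case = margin_of_error(0.5, 50)
lopsided = margin_of_error(0.84, 50)
print(f"p=0.50: +/- {worst_case:.1%}")
print(f"p=0.84: +/- {lopsided:.1%}")
```

Under these assumptions the worst case comes out a little under 12%, and proportions further from 50% come out lower, which is in line with "around 10%" for most metrics.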


Based on the below dendrogram, we found 12 categories with 3 single item categories (called runts).
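A dendrogram like this is typically built from how often each pair of items was sorted into the same group. The sketch below shows one way to compute that pairwise distance in plain Python. The function name, the item names and the three sorts are made up for illustration and are not the study data.

```python
from itertools import combinations


def cooccurrence_distance(sorts):
    """Return {(item_a, item_b): distance}, where distance is the share of
    participants who did NOT place the two items in the same group."""
    items = sorted({item for sort in sorts for group in sort for item in group})
    together = {pair: 0 for pair in combinations(items, 2)}
    for sort in sorts:
        for group in sort:
            # Sort each group so pair keys match the alphabetical keys above.
            for pair in combinations(sorted(group), 2):
                together[pair] += 1
    return {pair: 1 - count / len(sorts) for pair, count in together.items()}


# Hypothetical sorts from three participants; each sort is a list of groups.
example_sorts = [
    [{"blender", "sewing machine"}, {"polo", "belt"}],
    [{"blender"}, {"sewing machine", "polo", "belt"}],
    [{"blender", "sewing machine"}, {"polo"}, {"belt"}],
]
distances = cooccurrence_distance(example_sorts)
print(distances[("blender", "sewing machine")])
print(distances[("belt", "blender")])
```

A distance table of this kind is the usual input to a hierarchical clustering routine, which then produces the tree that the dendrogram displays.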

Users provided 269 unique names for the categories. Many were just variations on the same name (e.g. ladies, women and men’s and men’s items), while others differed only by plural form or misspelling.

We combined the similar categories to generate 44 categories. The graph below shows the percentage of participants that used the category names and 90% confidence intervals. A larger version is available on slideshare.

For example, 84% of participants had “Electronics” or some similarly named category in their sorted items and only 15% used “Appliances.”
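To make the merging step concrete, here is a minimal sketch of how raw category labels could be collapsed into merged names and then counted per participant. The lookup table, the labels and the participant IDs are all hypothetical; in practice the mapping is a judgment call made after reading the raw names.

```python
from collections import defaultdict

# Hypothetical lookup built by hand after reviewing the raw labels.
CANONICAL = {
    "electronic": "Electronics",
    "electronics": "Electronics",
    "electroncs": "Electronics",  # misspelling
    "appliance": "Appliances",
    "appliances": "Appliances",
}


def category_usage(labels_by_participant):
    """Percent of participants who used each merged category at least once."""
    users = defaultdict(set)
    for participant, labels in labels_by_participant.items():
        for label in labels:
            key = label.strip().lower()
            # Labels missing from the lookup are kept under their own name.
            merged = CANONICAL.get(key, label.strip())
            users[merged].add(participant)
    total = len(labels_by_participant)
    return {name: 100 * len(who) / total for name, who in users.items()}


# Made-up labels from four participants.
raw = {
    "p1": ["Electronics", "Appliances"],
    "p2": ["electroncs"],
    "p3": ["Electronic ", "Toys"],
    "p4": ["Toys"],
}
usage = category_usage(raw)
print(usage)
```

Counting participants rather than labels keeps someone who created two similarly named groups from being counted twice.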


We went through all 40 items, determined which categories were assigned to each and assessed whether agreement was strong or weak. For example, 65% of participants assigned the Black Griffin Case for Apple iPod touch to the “Electronics” category, with “Entertainment” as the second choice at 15% of participants. This item showed strong agreement for placement in the Electronics category. On the other hand, the “NBA Golden State Warriors Blue Water Bottle” had low agreement, with the first and second choice categories, “Misc” and “Accessories,” each receiving 21% of placements. These slides are available in the webinar or on slideshare.
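The per-item agreement check can be sketched as a simple tally of first and second choice categories. The function name and the 20 placements below are hypothetical (chosen only to mirror the 65% and 15% split in the example above), and the 40% cutoff is the admittedly arbitrary threshold discussed in the Q&A at the end of this article.

```python
from collections import Counter


def agreement(placements, cutoff=0.40):
    """Summarize where participants placed one item.

    placements: merged category names, one per participant.
    cutoff: share of first-choice placements treated as strong agreement.
    """
    total = len(placements)
    counts = Counter(placements).most_common(2)
    first_name, first_votes = counts[0]
    second_name, second_votes = counts[1] if len(counts) > 1 else (None, 0)
    return {
        "first": (first_name, first_votes / total),
        "second": (second_name, second_votes / total),
        "strong": first_votes / total >= cutoff,
    }


# Made-up placements for one item from 20 hypothetical participants.
case_placements = (
    ["Electronics"] * 13 + ["Entertainment"] * 3 + ["Accessories"] * 2 + ["Misc"] * 2
)
summary = agreement(case_placements)
print(summary)
```

Reporting the second choice next to the first makes it easy to spot items where two categories compete and cross-listing may be worth considering.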


We also asked participants which items they found the most difficult to sort. The graph below shows the items sorted by difficulty along with the 90% confidence intervals.

The gift card turned out to be the most difficult item to sort. We think this might be because participants were confused whether the $5-$1000 was for a puppy or if it referred to the puppy with the Target logo over its eye. A future test would use a more generic name for gift card to avoid this confusion (unless there was data that users searched for a gift card with the puppy).

The Tree Test

Here is the procedure we used to conduct the tree-test (reverse card sort).


We selected a combination of the most difficult and easy items to sort, across a range of departments. We limited it to 14 as tree testing takes longer than card-sorting.

We recruited another 50 Users: This sample size would also give us a margin of error of around 10% around our metrics (with 90% confidence).

It took users on average 17 minutes to locate the 14 items. After users selected a category where they thought an item would be located, we asked them how confident they were and how difficult they thought the task was to complete.


The primary measure in a tree test is the percent of users that found the item. The graph below shows the percent of participants that successfully located each item along with the 90% confidence intervals. For example, all users found the Wrangler Men’s reversible belt while only 2% found the Wildkin Kaleidoscope backpack.
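Success rates near 0% or 100% are where the choice of interval matters most, because the simple normal-approximation interval collapses to zero width when every participant succeeds. One common alternative is the adjusted-Wald (Agresti-Coull) interval, sketched below. Treat this as an illustration under assumptions: the interval method behind the graph is not stated here, and the counts assume all 50 recruited participants attempted each task.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_wald(successes, n, confidence=0.90):
    """Adjusted-Wald (Agresti-Coull) interval for a proportion, clipped to [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2/2 successes and z^2/2 failures before applying the Wald formula.
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Assumed counts: 50 of 50 found the item, and 1 of 50 (2%) found the item.
low_all, high_all = adjusted_wald(50, 50)
low_one, high_one = adjusted_wald(1, 50)
print(f"50/50: {low_all:.1%} to {high_all:.1%}")
print(f"1/50:  {low_one:.1%} to {high_one:.1%}")
```

With these assumed counts the intervals come out at roughly 94% to 100% and 0% to 9%, so even a perfect observed success rate still carries some uncertainty at this sample size.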


As in the open-sort, we asked participants which items they had the hardest time locating. The graph below shows the items that were most difficult to find along with the 90% confidence intervals.

For example, 64% of participants had the hardest time finding the Brother Sewing Machine, while only 6% had a hard time finding the Nano Tech Shaver. We asked participants why they selected the items as difficult and categorized the open ended responses. For example, for the Men’s Brown Swiss Gear Jackson Hiker, most participants who had trouble were unsure if this was a boot or other article of clothing related to hiking. See the webinar for more details.

Confidence & Success

By combining the percent of participants that found the item along with how confident they were we created a four-block diagram of success and confidence.

We want as many items as possible in the upper right quadrant (found and highly confident). The other three quadrants speak to some level of failure. The lower right quadrant shows “disasters,” where users are sure this is where an item is located but in fact it isn’t. Items in these three quadrants are good candidates for relocation or, at a minimum, cross-listing at the item page.
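The four-block logic can be sketched as a small classifier. The cut points, the 0-to-1 confidence scale, the quadrant labels and the example values are all assumptions for illustration; the thresholds used in the diagram are not given here.

```python
def quadrant(success_rate, confidence, success_cut=0.5, confidence_cut=0.5):
    """Place an item in one of four blocks.

    success_rate: share of participants who found the item (0 to 1).
    confidence: mean self-reported confidence rescaled to 0 to 1 (assumed scale).
    """
    found = success_rate >= success_cut
    sure = confidence >= confidence_cut
    if found and sure:
        return "upper right: found and confident"
    if found:
        return "upper left: found but unsure"
    if sure:
        return "lower right: disaster (confident but not found)"
    return "lower left: not found and unsure"


# Made-up values for three hypothetical items.
examples = {
    "item A": quadrant(0.95, 0.90),
    "item B": quadrant(0.10, 0.85),
    "item C": quadrant(0.20, 0.30),
}
for name, block in examples.items():
    print(name, "->", block)
```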

Open Sort and Tree-Test Agreement

When we look at the overlap in the most difficult items to categorize between the open sort and tree test we can see why I recommend conducting both. The graph below shows the percent of users that selected each item as difficult depending on the test.

There is certainly some agreement: the Shantung dress, Men’s belt and short-sleeve polo were all relatively easy to find. But look at the Brother Sewing Machine. Only 7% of users in the open-sort identified that as difficult to sort, whereas 64% had trouble finding it in the existing navigation.

The correlation between difficulty sorting and difficulty finding in this study was r = .4. Squaring the correlation gives the share of variance explained, so difficulty sorting predicts only about 16% of difficulty finding, which is a good reason to use both methods in your next analysis. This is a similar correlation to other combined studies we’ve conducted.
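For readers who want to run the same comparison on their own data, here is a minimal sketch of the Pearson correlation and the share of variance it explains. The two lists are made-up percentages for five hypothetical items, not the study data, so the result will not match r = .4.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical percent of users rating each of five items as difficult.
hard_to_sort = [7, 12, 30, 5, 22]
hard_to_find = [64, 10, 35, 6, 15]

r = pearson_r(hard_to_sort, hard_to_find)
print(f"r = {r:.2f}, variance explained = {r ** 2:.0%}")

# The figure quoted in the text: r = .4 corresponds to r squared = .16.
print(f"{0.4 ** 2:.2f}")
```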

Here are answers to questions that were asked at the end of the webinar. In case you missed it you can still view a recorded version and download the slides.

Questions and Answers


  1. Can you speak to the pros and cons of conducting studies like these in an unmoderated fashion vs. moderated fashion? For instance, moderated studies would allow for a think aloud to dig deeper into a user’s mental model.
     I think you touched on the major drawback. With an unmoderated study you can’t ask users about their thought processes. We do include open-ended questions, such as which items they had difficulty with, and this does provide some insights. Not all users are great at articulating their thoughts or want to type them out in a text box, so there will never be a replacement for in-person testing. However, we’ve found that with the larger sample sizes we can use, we tend to get enough insights from the users who do spend the time (some a lot of time) articulating their thoughts, problems and suggestions.
  2. Given your comment that a tree test should usually accompany a card sort, can you delve more into how you selected the subset to test in the tree test? For instance, the issues that surfaced with the sewing machine may have also surfaced with the umbrella.
     We selected a mix of items users had a hard time sorting, some easy ones (for comparison) and again some that crossed departments. When working with a client, it’s usually very easy to pick the ones you want to test as there are often internal debates about why sales are low: is it because users can’t find it or is it that they can find it but don’t want to purchase it for other reasons? The tree-test can add a lot of data to that debate.
  3. Are the confidence intervals calculated manually or automatically?
     MUIQ outputs the raw data for us and we generate the confidence intervals using another software program we developed. It takes in the formatted spreadsheets and then generates accurate confidence intervals based on the type of measure.
  4. Do brands require certain things to be labeled and categorized a certain way on a site? Could be good or bad.
     The brands might have a say in how their products get displayed and we certainly used the brand names in our study, but we’re not aware of any requirements. We like to include the brands because we know that for certain items users search by brand, while for others the brand is less well known.
  5. Is the tree test done on the current navigation at the site or the result of the open sort?
     We conducted the tree-test on the existing navigation structure.
  6. Do you ask participants what they felt was the most difficult to categorize, or do you just look at the #’s?
     We ask participants explicitly to identify which items they had a hard time with. UserZoom displays all the items they sorted on the previous screens so they don’t have to remember. We also look at the arrangement of the items across the categories. Where there is a low percentage of agreement, that’s also a flag. For the tree-test we look at confidence, task-level difficulty using the SEQ (Single Ease Question) and the percent of users that were able to find the item successfully (based on where it’s currently located).
  7. Great study setup and analysis. Wondering, when implementing changes, do you consider other requirements such as SEO and commercial factors such as seasonal variances (e.g. holiday clothing in summer) and ROI for each product? If so, how does this influence the final design?
     These are all very important factors when considering changing any website navigation. These are also factors the stakeholders at our client companies are very aware of when we conduct these studies. It’s one reason why we conduct a lot of them in the summer months. However, like any user research, it alone shouldn’t dictate business decisions, but rather provide data to help inform them.
  8. For the items with < 40%, was that based on a closed card sort?
     We conducted only an open card sort. After assigning the categories based on the close matches as described above, we looked for a split in the category agreement. We found that 40% was a reasonable (albeit arbitrary) cutoff for identifying high or low agreement in first choice assignments.
  9. Is it necessary to include the full product name (Kaleidoscope backpack) for sorting or would it be better to just have a generic name like “kid’s backpack”?
     Good question. It depends on the goal of the study. We did this as an example case study so we picked specific brands. There’s nothing wrong with going with just generic items, especially if that’s what the search data specific to the site suggests, or if it matches the specific business questions. With the backpack in particular, we had enough users get confused about the “kaleidoscope” part that it certainly would be worth following up to see if just “backpack” netted more confident and successful results.
  10. Relating to the “Items with <40% of Votes as Primary Category” page, what is your cutoff for adding secondary navigation? 20%? 15%? 10%?
     We simply reported the categories as we found them. So secondary categories were secondary only because fewer participants assigned that category to the item than assigned another category. As described above, for the “strong agreement items” we found there was typically a substantial break between the first and second categories. But your question raises the important issue of the often nebulous nature of card sorts. There are often a lot of “good-enough” categories where participants are largely mixed on where they’d sort. Having the tree test adds an additional data point, but it’s also the balancing of business needs (we need to promote a category) and technical constraints (we can only present so many categories) that provides a check on UX research.
  11. Would you recommend adding items that are not available within navigation to reduce confidence bias that is often inherent to Tree Sorts?
     Absolutely. We are often testing new items to see where users would look or place them. Keep in mind that in a tree-test participants are presented only with the categories, so they don’t have specific feedback on whether an item is placed correctly. Self-reported confidence still plays a role and is worth collecting. Unlike a task-based usability test where users have to locate an item and usually know if they didn’t find it, in a tree-test users won’t know if a new item is placed appropriately.
  12. Subject categorizations can be strongly affected by context. Within UserZoom how do you ensure that you’ve controlled or managed the context? What do you do to set the appropriate context when you’re running subjects remotely?
     For tree-testing we’ve removed most of the context from the analysis. We’ve removed design elements, heading styling and other visual cues that certainly can aid users in their search for information. This allows us to focus just on the taxonomy and not design. But to your point, users look for information in context, so conducting an additional follow-up test with design elements provides one of the best tests of findability. I call this type of test a Trick Test (for Tree/Click).
  13. Does UserZoom allow researchers to offer compensation through the system or are all participants separate from that process?
     UserZoom does have direct links within the application to several panel providers. For many public facing websites, panel providers can provide qualified users to take the tests, typically for between $20 and $30.
  14. What is the best way to phrase the task in a tree-test?
     We like to keep it simple and direct and just ask the user to locate the item. We don’t provide much background or a more detailed scenario like in a task-based usability test. We cut to the chase and ask the user where they would find a Brother Sewing Machine, for example.

