A sequence of letters that didn’t mean much to people a few months ago is now ubiquitous. It seems like every social media post mentions ChatGPT (especially the latest version, 4) and hypes it as a game changer.
It’s been included as a feature in some UX software, and some have even suggested that it can do the jobs of knowledge workers, including UX researchers, by generating usability test scripts and creating synthetic personas for use in usability tests.
But despite significant improvement relative to ChatGPT-3.5, some recent reports have suggested ChatGPT-4 may still generate inaccurate and inconsistent results, has trouble with math problems, makes things up, and can be a security risk as information used becomes publicly available to others. A critical skill in effectively using ChatGPT for anything beyond very straightforward requests is crafting good prompts.
We wanted to understand some of the capabilities and limitations of ChatGPT. We’re a bit dubious about it replacing UX researchers and conducting nuanced analysis, but we wanted to keep an open mind. We can’t test everything, so we started with a task we thought ChatGPT was designed for: rapidly summarizing large amounts of text. We then compared those summaries to similar output from UX researchers.
To compare humans and ChatGPT on the labor-intensive task of coding open-ended statements from UX research studies into themes, we had three researchers and three different runs of ChatGPT-4 pull themes out of three datasets containing roughly 50 statements each. We then compared the themes to see how consistently humans and machines categorized the statements.
We gathered raw data from our internal SUPR-Q benchmarking program (data we own that we were willing to share through ChatGPT to conduct this research). As part of these studies, participants described problems or frustrations they had with their most recent visit to one of the following websites:
- Office Depot (52 statements from Office Supplies Websites Survey)
- AT&T (52 statements from Wireless Service Provider Websites Survey)
- Food Lion (50 statements from Grocery Websites Survey)
Three UX researchers (human coders) separately took the verbatim comments, developed categories, and coded each statement as either a member or non-member of each category. We provided no other guidance as to how many categories or types of categories to create. This resulted in three presence/absence datasets of human-generated category themes for each of the three studies. Following the same process, we next asked ChatGPT-4 to code the same verbatim comments into themes, doing this three times to generate three additional datasets for a total of six (three human coders, three ChatGPT).
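As a rough illustration of what a presence/absence dataset can look like, the sketch below turns one coder's themes into one 0/1 vector per theme. The function name, the theme memberships, and the six-statement dataset are all hypothetical; the study's actual data files are not shown in this article.

```python
from typing import Dict, List, Set


def to_presence_absence(
    themes: Dict[str, Set[int]], n_statements: int
) -> Dict[str, List[int]]:
    """Return one 0/1 vector per theme, indexed by statement number (1-based)."""
    return {
        name: [1 if i in members else 0 for i in range(1, n_statements + 1)]
        for name, members in themes.items()
    }


# Hypothetical coding of six statements (not data from the study).
# A statement may belong to several themes (statement 5) or to none (statement 6).
example_themes = {
    "No Issues": {1, 4},
    "Cluttered": {2, 5},
    "Navigation": {3, 5},
}

matrix = to_presence_absence(example_themes, n_statements=6)
for name, row in matrix.items():
    print(f"{name:<12} {row}")
```

With one such matrix per coder or ChatGPT run, any two raters can be compared statement by statement on a matched theme.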
A brief note about ChatGPT prompts. Getting good results from ChatGPT takes some practice in inventing the right sequence of prompts. To generate themes, we experimented with different prompts and settled on a prompt that provided ChatGPT with the context that it was a UX researcher classifying statements about problems or frustrations with a website into common themes. The prompt also included details about allowing statements to fit multiple categories, listing a description of each category, and listing statements that did not fit any categories (see Appendix).
We compared the total number of themes identified by each coder and then investigated how similarly individual statements were grouped into themes.
Comparing the Number of Themes
Table 1 shows the number of themes each researcher came up with (labeled Coder 1 to Coder 3) by dataset compared to the three ChatGPT runs (labeled ChatGPT A to ChatGPT C). One initial takeaway from the differences in theme counts is that the human coders had several more themes with only one statement compared to ChatGPT (see columns labeled “Themes with 1 Statement” in Table 1). On average, the three human coders generated about twice as many themes as ChatGPT (13.3 vs. 7) across the three datasets.
For example, Coder 2 generated 25 themes for the Grocery dataset but 16 of those themes had only 1 statement (e.g., classifying the statement “It sometimes glitches” in a one-off theme labeled Glitches).
[Table 1: Number of Themes and Number of Themes with 1 Statement, by coder (Coder 1 to Coder 3, ChatGPT A to ChatGPT C) for each of the three datasets.]
After removing themes with only one statement, the average number of themes between humans and ChatGPT converged. Table 2 shows the coders averaged 7.1 themes versus 6.4 for ChatGPT across the three datasets.
[Table 2: Number of themes after removing themes with only one statement (columns: Grocery Themes, Wireless Themes, Office Supplies Themes, Average).]
This similarity in means suggests ChatGPT-4 produces a comparable number of categories/themes as a human researcher (after removing single statement themes). The next step was to assess the similarity of ChatGPT and human coder themes.
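The counting behind these comparisons can be sketched as follows, assuming each coder's output is stored as a mapping from theme name to the set of statement numbers in that theme. The function name and the example coder are hypothetical.

```python
from typing import Dict, Set, Tuple


def theme_counts(themes: Dict[str, Set[int]]) -> Tuple[int, int, int]:
    """Return (total themes, single-statement themes, multi-statement themes)."""
    total = len(themes)
    singles = sum(1 for members in themes.values() if len(members) == 1)
    return total, singles, total - singles


# Hypothetical coder output (not data from the study).
hypothetical_coder = {
    "No Issues": {1, 2, 3},
    "Slow Loading": {4, 5},
    "Glitches": {6},
    "Coupons": {7},
}

total, singles, multi = theme_counts(hypothetical_coder)
print(total, singles, multi)  # 4 2 2
```

Only the last count (themes with more than one statement) is being compared once single-statement themes are removed.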
Comparing Themes Based on Names
Comparing the content of the categories was more challenging than counting them because of subtle differences in the wording and in how human coders and ChatGPT named each category. We had to make judgments about whether theme names were similar enough. Table 3 shows examples of how we aligned the themes generated by the three human coders and ChatGPT for the Office Supplies dataset.
|Theme||Coder 1||Coder 2||Coder 3||ChatGPT A||ChatGPT B||ChatGPT C|
|1||Design||Appearance was plain||Plain/boring||Aesthetic issues||Website|
|2||Items Out of Stock/Unavailable||Incorrect Inventory||Out of stock/||Stock and Inventory issues||Stock & Availability||Availability|
|3||N/A||No Issues||None||No Issues/|
|4||Cluttered||Cluttered/ Overwhelming||Cluttered/ Overwhelming/ Busy||Navigation and Clutter||Clutter & Overwhelming Interface|
|5||Difficult to Navigate||Navigation Issues||Navigation||Website Navigation & Organization|
For example, themes named N/A, No Issues, and None were easy to match up despite slightly different wording, but other categories were ambiguous: ChatGPT A’s Navigation and Clutter could match up to a broader theme on issues related to clutter (Row 4) or to a broader theme of issues related to navigation (Row 5). Matching themes by name requires just as much judgment when comparing output from two humans (e.g., Design vs. Appearance was plain) as when comparing output from two runs of ChatGPT (e.g., Website Design vs. Aesthetic issues).
This is the same problem confronted by UX researchers who conduct analyses like cluster analysis of open card sorts and factor analysis of questionnaire items, who then need to name the clusters or factors based on the constituent items. In most cases, you’d expect different researchers to come up with at least somewhat different names.
We can ask UX researchers to explain the naming of themes, but it’s not currently possible to do this with ChatGPT, so we turned to another approach. Instead of comparing the names of the themes, we compared how similar the statements were in each theme, regardless of what it was called.
Comparing Themes Based on Overlapping Statements
To compare the overlap of statements in themes, we still needed a way to align themes. To do so, we used the original researcher’s themes (Coder 1) for each dataset and compared the number of overlapping statements with other coders and ChatGPT runs. For example, returning to Table 3, Coder 1 had a theme called N/A that had 18 statements from participants indicating they had no issues using the Office Depot website. ChatGPT A had a theme called Positive Experiences that had 19 statements, including all 18 of Coder 1’s statements. The number of overlapping statements strongly suggests that N/A and Positive Experiences are the same theme with different names.
We repeated this process of identifying the highest overlapping statements to identify the best-fitting theme across each coder and ChatGPT run. At the end of this process, we had five Office Supplies themes (No Issues, Appearance, Navigation, Cluttered, and Out of Stock), five Wireless Service Provider themes (No Issues, Hard to Find Info, Slow Loading, Cluttered, and Navigation), and four Grocery themes (No Issues, Slow Loading, Cluttered, and Hard to Use).
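One way to picture this matching step is the minimal sketch below, which picks the candidate theme that shares the most statements with a reference theme. The function name is hypothetical, and so are the statement numbers and the second ChatGPT theme's members; only the sizes of the N/A and Positive Experiences themes (18 and 19, with all 18 shared) come from the example above.

```python
from typing import Dict, Set, Tuple


def best_match(
    reference: Set[int], candidates: Dict[str, Set[int]]
) -> Tuple[str, int]:
    """Return the candidate theme sharing the most statements with `reference`."""
    name = max(candidates, key=lambda k: len(reference & candidates[k]))
    return name, len(reference & candidates[name])


# Coder 1's N/A theme had 18 statements; the statement numbers are made up.
coder1_na = set(range(1, 19))

# ChatGPT A's Positive Experiences theme had 19 statements, including all 18
# above. The membership of the second theme is hypothetical.
chatgpt_a = {
    "Positive Experiences": set(range(1, 20)),
    "Navigation and Clutter": {19, 20, 21, 22},
}

name, overlap = best_match(coder1_na, chatgpt_a)
print(name, overlap)  # Positive Experiences 18
```

In this sketch, ties go to whichever candidate appears first, so close calls would still need a human judgment.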
We computed kappa and percent agreement for all 14 themes and all 15 possible pairings of the human coders and ChatGPT runs (e.g., Coder 1 with Coder 2, Coder 1 with Coder 3, … Coder 1 with ChatGPT A, … ChatGPT B with ChatGPT C) as shown in Table 4.
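The 15 pairings are simply all two-rater combinations of the six raters. A short sketch (the `pair_type` helper and its labels are illustrative names, not from the study) enumerates them and labels each pairing by the kinds of raters in it.

```python
from itertools import combinations

raters = ["Coder 1", "Coder 2", "Coder 3", "ChatGPT A", "ChatGPT B", "ChatGPT C"]


def pair_type(a: str, b: str) -> str:
    """Label a pairing by the kinds of raters it contains."""
    kinds = {name.startswith("ChatGPT") for name in (a, b)}
    if kinds == {False}:
        return "human"
    if kinds == {True}:
        return "chatgpt"
    return "combined"


pairs = list(combinations(raters, 2))

# Tally how many pairings fall into each group.
counts = {}
for a, b in pairs:
    label = pair_type(a, b)
    counts[label] = counts.get(label, 0) + 1

print(len(pairs))  # 15
print(counts)
```

This gives three human/human pairs, three ChatGPT/ChatGPT pairs, and nine human/ChatGPT pairs.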
A brief note about kappa. There are different methods for assessing the magnitude of interrater agreement. One of the best-known is the kappa statistic (Fleiss, 1971). Kappa measures the extent of agreement among raters that exceeds estimates of chance agreement. Kappa values can be between −1 (perfect disagreement) and 1 (perfect agreement) and are often interpreted with the Landis and Koch guidelines (poor agreement: ≤ 0, slight: 0.01–0.20, fair: 0.21–0.40, moderate: 0.41–0.60, substantial: 0.61–0.80, almost perfect agreement: 0.81–1.00).
As indicated in the literature for the assessment of interrater agreement, we found discrepancies between kappa and simple percent agreement (the number of times two coders placed the same statement in a category and not in another category divided by the number of comments that were classified). For example, the percentage agreement between Coders 1 and 2 for the No Issues theme in the Office Supplies comments was 98.1% (51/52), with a corresponding kappa of .958—not much difference. However, their percent agreement for the Appearance theme in the Office Supplies comments was 96.2% (50/52), but the corresponding kappa was .731 (still a substantial level of agreement, but much smaller than .958). This happens when the distribution of classifications between two raters has a greater likelihood of agreement by chance. For this reason (plus the availability of interpretive guidelines), we focused our analyses on differences in kappa.
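To make the gap between percent agreement and kappa concrete, here is a minimal sketch that computes both from a 2×2 table of counts for one theme. It assumes the standard two-rater (Cohen-style) form of kappa, which may differ from the exact procedure used in the study. The cell counts are hypothetical: they were chosen to be consistent with the reported agreement figures (51/52 and 50/52) and kappas, and are not the study's data.

```python
def percent_agreement(a: int, b: int, c: int, d: int) -> float:
    """a = both raters put the statement in the theme, d = neither did,
    b = only rater 1 did, c = only rater 2 did."""
    return (a + d) / (a + b + c + d)


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Two-rater kappa: (observed - chance) / (1 - chance).
    Undefined (division by zero) when chance agreement equals 1."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from each rater's marginal proportions.
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts (assumed, not from the study), 52 statements each.
no_issues = (18, 0, 1, 33)   # theme is common: about a third of statements
appearance = (3, 2, 0, 47)   # theme is rare: most statements are "absent"

for label, table in [("No Issues", no_issues), ("Appearance", appearance)]:
    print(label,
          f"{percent_agreement(*table):.1%}",
          f"{cohen_kappa(*table):.3f}")
# No Issues 98.1% 0.958
# Appearance 96.2% 0.731
```

In the second table, both raters leave nearly every statement out of the theme, so most of the agreement would be expected by chance, and kappa discounts it accordingly.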
We averaged kappas across themes within products and then across products to get the overall results shown in Table 4. We saved the kappas computed at the theme level in an SPSS data file with a row for each of the 15 pairings of human coders and ChatGPT runs, treating rows as subjects in analyses of variance (ANOVA).
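The two-stage averaging can be sketched as below for a single rater pairing. All kappa values here are invented for illustration; only the number of themes per product (five, five, and four) follows the study.

```python
from statistics import mean

# Hypothetical theme-level kappas for ONE pairing of raters (made-up values).
kappas = {
    "Office Supplies": [0.96, 0.73, 0.60, 0.55, 0.70],
    "Wireless": [0.95, 0.50, 0.65, 0.60, 0.58],
    "Grocery": [0.94, 0.62, 0.57, 0.66],
}

# Stage 1: average across themes within each product.
product_means = {product: mean(values) for product, values in kappas.items()}

# Stage 2: average the product means, so each product counts equally.
overall = mean(product_means.values())

# For contrast: pooling all 14 themes weights products by their theme count.
pooled = mean(k for values in kappas.values() for k in values)

print(round(overall, 4), round(pooled, 4))
```

Because Grocery has four themes rather than five, the two-stage mean and the pooled mean differ slightly; the two-stage version gives each product equal weight.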
[Table 4: Mean kappas for each pairing of human coders and ChatGPT runs (columns: Coder 1, Coder 2, Coder 3, ChatGPT A, ChatGPT B).]
Human Coders vs. ChatGPT Runs vs. Combined
We were especially interested in the levels of agreement among human coders (the three values in Table 4 highlighted in blue), among ChatGPT runs (the three values in Table 4 highlighted in green), and between human coders and ChatGPT runs (the nine unhighlighted values for human/AI pairs). The mean kappas were
- Human Coders: .704
- ChatGPT Runs: .684
- Combined: .632
The difference among these means was marginally significant (F(2, 14) = 2.9, p = .09; significant at α = .10 but not at the conventional .05 level), with about the same level of agreement among the three pairs of human coders and three pairs of ChatGPT runs but a little lower for the nine human/ChatGPT pairs. Even though kappas were slightly depressed in the nine combined pairings, it’s hard to argue that this is a practically significant difference because all three mean kappas indicated substantial agreement according to the Landis and Koch guidelines.
No Issues vs. Other Themes
While compiling the kappas for each theme, we noticed that agreement for the No Issues theme was unusually high relative to the other themes. The No Issues theme was present for all three product types (Office Supplies, Wireless Service Providers, and Grocery), so we collapsed the kappas for the other themes into an Issues variable for each product type and ran an ANOVA to compare them. The difference in the main effect of No Issues vs. Issues was strongly significant (F(1, 12) = 284.5, p < .0001), with a mean kappa of .949 for No Issues and .631 for Issues. For the No Issues theme, mean kappas consistently exceeded .90 across all types of pairings (human coders, ChatGPT runs, or their combination), almost perfect agreement according to the Landis and Koch guidelines. We found no significant interactions between the No Issues/Issues effect and product type (Office Supplies, Wireless Service Provider, Grocery) or rater source (human coder, ChatGPT run, combined).
Summary and Discussion
In our comparison of comment coding between UX researchers and ChatGPT-4, we found that
Human coders produced more single-statement themes. From our prompts, ChatGPT rarely produced themes for individual statements (averaging 0.7 across the three studies in Table 1). The comparable mean for the human coders was 6.2, mostly produced by Coder 2 (11.3 compared to 5.7 for Coder 1 and 1.7 for Coder 3). Using the taxonomic terminology of lumpers and splitters, ChatGPT appears to be more of a lumper (putting more items into fewer categories), while some human coders showed splitter tendencies (putting fewer items into more categories). But the human tendencies in our small sample were more variable than ChatGPT’s.
Human coders and ChatGPT-4 produced comparable numbers of multi-comment themes. After removing themes with single statements, the mean number of categories was comparable (7.1 for human coders and 6.4 for ChatGPT).
It’s sometimes hard to match themes by names. Sometimes thematic category names are easy to match (e.g., N/A, No Issues, None), but it wasn’t obvious until examination of their overlapping statements that Positive Experiences was effectively a synonym for No Issues.
Agreement for categories matched on overlapping statements was strong. Regardless of whether interrater agreement was assessed among human coders, ChatGPT runs, or their combinations, overall values of kappa ranged from .632 to .704, indicating substantial levels of agreement. The categorization of statements into the No Issues theme was especially strong, with a mean kappa of .949 across all possible pairs of human coders and ChatGPT runs.
Will UX researchers who use tools like ChatGPT replace those who don’t? In the 2019 essay “Rise of Robot Radiologists,” radiologist Curtis Langlotz from Stanford University was quoted as saying, “AI won’t replace radiologists, but radiologists who use AI will replace radiologists who don’t.” When we started this exercise, we thought we would quickly demonstrate that the classification output of ChatGPT-4 would be far inferior to that of human coders. Instead, the kappas associated with pairs of classifications that included a human coder and a ChatGPT run were only slightly lower than the interrater reliabilities for the human coders. ChatGPT seemed to perform roughly as well as human coders in creating categories and grouping similar statements, with three caveats:
- Caveat 1—Management of single-comment themes: ChatGPT rarely creates themes from individual statements, in contrast to the classification behavior of the human coders in our study. Whether this is universally good or bad is hard to determine, but it is important to keep in mind that our classification metrics were computed on themes that had multiple comments associated with them, ignoring the single-comment themes.
- Caveat 2—Difficulty of constructing an effective prompt: As the classic observation “garbage in, garbage out” suggests, it took some time to get the prompts right to get these results, and each prompt we tried greatly influenced the results. With simple prompts, ChatGPT provided summaries that were generally accurate but lacked detail; they were just simplified lists of the data without combining themes. More detailed prompts seemed to improve output (see the Appendix), but ChatGPT still faced limitations regarding accuracy and consistency.
- Caveat 3—Need to run ChatGPT multiple times: One way to deal with the limitations of ChatGPT for this task is to run it multiple times. In this study, we ran the same prompt against the same data three times. We expected to get kappas over .900 for comparisons of these runs, but instead, they ranged from .584 to .784, close to the range we got for three human coders (.683 to .726). Clearly, ChatGPT is not deterministic.
It’s too early to judge the extent to which the work of UX researchers will be affected by tools like ChatGPT. Contrary to some of the more strident claims made in the wake of the release of ChatGPT-4, it seems unlikely that AI will replace UX researchers, but it also seems unlikely that these tools will play no role in UX research. For example, the task in this study was to classify three sets of roughly 50 statements each. What if the task were the same, but there were 5,000 statements in each set, or 50,000? In that case, it seems like you could confidently use a single run of ChatGPT to identify comments that indicated No Issues, followed by multiple runs to see which of the remaining comments were consistently classified together. Ideally, the process would include human review of ChatGPT-generated theme names and at least some spot-checking of the assignment of comments to themes.
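One possible way to check which comments are consistently classified together across runs is sketched below. This is an assumption about how such a check might work, not a procedure used in the study: it counts, for every pair of statements, the number of runs in which the two share at least one theme. The runs, theme names, and memberships are made up.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set

Run = Dict[str, Set[int]]  # theme name -> statement numbers


def co_assignment_counts(runs: List[Run]) -> Counter:
    """Count, for each pair of statements, the runs in which they share a theme."""
    counts: Counter = Counter()
    for run in runs:
        together = set()
        for members in run.values():
            together.update(combinations(sorted(members), 2))
        counts.update(together)  # each pair counted at most once per run
    return counts


# Three hypothetical runs over five statements (not data from the study).
runs = [
    {"Slow": {1, 2, 3}, "Cluttered": {4, 5}},
    {"Speed": {1, 2}, "Layout": {3, 4, 5}},
    {"Loading": {1, 2, 3}, "Busy": {4, 5}},
]

counts = co_assignment_counts(runs)
stable = sorted(pair for pair, n in counts.items() if n == len(runs))
print(stable)  # [(1, 2), (4, 5)]
```

Because the comparison ignores theme names, it sidesteps the name-matching problem; pairs that are grouped together in only some runs (such as statements 1 and 3 here) are natural candidates for human spot-checking.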
The Verbatim Prompt
As a UX researcher, you are tasked with analyzing the dataset provided below, which contains answers to the question, “What are some problems or frustrations you’ve had with the XXXX website?”. Your goal is to classify each numbered statement according to common themes. Create as many categories as necessary to group similar statements together. If a statement fits multiple categories, include it in all relevant categories.
[dataset numbered by participant]
After analyzing the dataset, follow these steps:
- List the categories you have created, along with a brief description for each category.
- For each category, list the numbers of the statements that belong to it.
- If there are any statements that do not fit into any of the categories you have created, list their numbers separately.
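For anyone assembling this prompt programmatically, a minimal sketch follows. It only builds the prompt text and makes no call to any ChatGPT API. The function name, the `{site}` and `{dataset}` placeholders, and the "1. statement" numbering format are assumptions; the second example statement is made up.

```python
from typing import List

# The prompt text above, with XXXX and the dataset replaced by placeholders.
PROMPT_TEMPLATE = """As a UX researcher, you are tasked with analyzing the dataset provided below, which contains answers to the question, “What are some problems or frustrations you’ve had with the {site} website?”. Your goal is to classify each numbered statement according to common themes. Create as many categories as necessary to group similar statements together. If a statement fits multiple categories, include it in all relevant categories.

{dataset}

After analyzing the dataset, follow these steps:
- List the categories you have created, along with a brief description for each category.
- For each category, list the numbers of the statements that belong to it.
- If there are any statements that do not fit into any of the categories you have created, list their numbers separately.
"""


def build_prompt(site: str, statements: List[str]) -> str:
    """Number the statements (assumed format: '1. text') and fill the template."""
    dataset = "\n".join(
        f"{i}. {text}" for i, text in enumerate(statements, start=1)
    )
    return PROMPT_TEMPLATE.format(site=site, dataset=dataset)


# Example usage; the second statement is hypothetical.
prompt = build_prompt("Food Lion", ["It sometimes glitches", "No problems"])
print(prompt)
```

Keeping the statement numbers stable across runs makes it possible to compare the returned theme memberships statement by statement.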