The User Experience of AI-Based Chat Software (2025)

Jim Lewis, PhD • Jeff Sauro, PhD

AI is rapidly changing. By the time we write about the latest features and performance benchmarks, they are replaced by newer features and benchmarks.

But are all those features and benchmarks noticed by users? Perhaps.

The speed of change in AI shouldn’t stop us from taking a snapshot of the user experience. Even with rapidly changing features, such snapshots will help us identify if core parts of the experience remain stable and if new features are underutilized or unnoticed.

We conduct a regular benchmark of consumer software every other year. During those two years, it’s typical for the products we measure to add new features and change the experience. Sometimes that shows up in the UX metrics we collect, and sometimes it doesn’t.

Our most recent software benchmark was the first time we included measures of what we’re calling AI-based chat software—the ChatGPT-like services that are becoming universal.

In this article, we present some key findings from our 2025 investigation of the UX of AI-based chat software. For more details, see the full report for our 2025 survey of consumer software.

AI-Based Chat Software Benchmark Study

In January and February 2025, we conducted a retrospective study that included three AI-based chat products with 153 U.S.-based panel participants. This study included metrics from our standard UX & NPS survey as part of our larger consumer software data collection effort.

There was about a 50–50 gender split (48% female, 52% male). Respondents tended to be younger, with 71% under the age of 45. Participants were asked to reflect on their most recent experiences with the software and answer a number of items, including the NPS, SUS, UX-Lite®, and TAC-10™. The AI-based chat products and sample sizes were:

  • ChatGPT: 54
  • Claude: 49
  • Gemini: 50

The sample sizes are modest but adequate to establish baselines and identify medium-sized differences relative to the other software products we measure.

UX Metrics Results

Ease of use and usefulness affect whether people use and recommend products. We’ll first review how likely people are to talk about their AI chat bot usage and how measures of ease and usefulness may drive that behavior.

Recommendation Intention (NPS)

The Net Promoter Score provides a good gauge of word-of-mouth recommendations, which can portend rapid growth of products as friends tell friends about what they are using. The NPS comes from an eleven-point (0 to 10) likelihood-to-recommend (LTR) question and is computed by subtracting the percentage of detractors (ratings of 0–6) from the percentage of promoters (ratings of 9–10).
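The standard calculation can be sketched in a few lines of Python (the function name and the example ratings are illustrative, not data from our study):

```python
def nps(ratings):
    """Net Promoter Score from 0-10 likelihood-to-recommend ratings.

    Promoters rate 9-10, detractors 0-6; passives (7-8) count toward
    the total but not the net. Returns a value between -100 and 100.
    """
    n = len(ratings)
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / n

# 5 promoters, 3 passives, 2 detractors out of 10 respondents
print(nps([10, 10, 9, 9, 9, 8, 8, 7, 6, 3]))  # 30.0
```

Note that because passives dilute both percentages, two samples with the same net score can have very different rating distributions.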

Figure 1 shows the NPS we obtained for these three AI-based chat products (just a nonsignificant four-point difference between low and high scores; p > .10).

Figure 1: NPS with 95% confidence intervals.


Nominally, Claude had the highest NPS and ChatGPT the lowest. The overlap of the confidence intervals strongly indicates no significant difference in the NPS among the products. General guidelines for the interpretation of the NPS are that anything above 0 is good (more promoters than detractors), above 20 is favorable, and above 50 is excellent. All three products have scores between favorable and excellent.

Perceived Usability (SUS)

We used the popular System Usability Scale (SUS) to compute the perceived usability of the three products (Figure 2). The SUS is a ten-item questionnaire with possible scores ranging from 0 to 100. The average SUS score from over 500 products (including websites and business software) is 68 (a grade of C on the Sauro-Lewis curved grading scale).
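The SUS's 0–100 range comes from its standard scoring procedure, which we can sketch as follows (a minimal implementation of Brooke's published scoring rules; the function name is ours):

```python
def sus_score(responses):
    """Standard SUS scoring for one respondent.

    `responses` is a list of ten ratings on a 1-5 agreement scale, in
    questionnaire order. Odd-numbered items are positively worded
    (contribution = rating - 1); even-numbered items are negatively
    worded (contribution = 5 - rating). The summed contributions
    (0-40) are multiplied by 2.5 to stretch the score to 0-100.
    """
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

# "Agree" with every positive item, "disagree" with every negative item
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```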

All three products had above-average perceived usability (at least a grade of B) with scores at least in the mid-70s. The Sauro-Lewis curved grades for the products were all above average, ranging from B (ChatGPT) to A (Gemini), but none of the mean differences were statistically significant (p > .10).

Figure 2: SUS with 95% confidence intervals (no significant differences).


Perceived Usefulness and Ease (UX-Lite)

Research on the Technology Acceptance Model (TAM) in the mid- to late-1980s revealed the joint importance of measuring both ease of use and usefulness as key drivers of the intention to use a product, which is in turn a significant driver of actual use.

We use the two-item UX-Lite as a mini-TAM because it has one item to rate perceived ease of use (“This product is easy to use”) and one to rate usefulness (“This product’s features meet my needs”). UX researchers can use it in aggregate as a measure of acceptance or, even more broadly, satisfaction or product quality.
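As we have described in our published work on the UX-Lite, each of the two five-point items is rescaled to a 0–100 scale and the overall score is their mean. A minimal sketch (function name ours):

```python
def ux_lite(ease, usefulness):
    """UX-Lite score from the two 5-point agreement items.

    Each item is linearly rescaled from 1-5 to 0-100 via
    (rating - 1) * 25; the overall score is the mean of the two
    rescaled items.
    """
    return ((ease - 1) * 25 + (usefulness - 1) * 25) / 2

# "Strongly agree" on ease, "agree" on usefulness
print(ux_lite(5, 4))  # 87.5
```

Keeping the two rescaled subscales separate, as in Figure 3 below, preserves the ease/usefulness distinction that the overall mean hides.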

The pattern of UX-Lite scores was similar to that of the SUS scores. Using the curved grading scale for the UX-Lite, the scores for the three AI-based chat products were 77.3 (C) for Claude, 81.5 (B) for ChatGPT, and 84.5 (A−) for Gemini. For all three products, ease-of-use scores were significantly higher than usefulness scores (p < .05), and the UX-Lite score for Gemini was significantly higher than Claude's (p = .05).

Figure 3 shows a scatterplot of the ease and usefulness scores. The dashed red lines indicate the overall means for these three products. There were no significant differences in ease-of-use scores among the products (p > .10). Gemini was rated as marginally more useful than Claude (p < .10; no significant difference between Claude and ChatGPT).

Figure 3: Scatterplot of the two UX-Lite subscales for three AI-based chat products (means across products indicated by red dashed lines).


Analysis of Verbatim Comments

To dig into the “why” behind the current numbers, we asked participants to name one thing they disliked about the product they rated. Table 1 shows the top three issues for each product (with sample participant quotes).

ChatGPT
  • Accuracy/Reliability Issues: "It isn't 100% accurate and needs to be proofread every time."
  • Generic/Lacking Depth: "One thing I particularly dislike about using ChatGPT is that sometimes responses can be overly generalized or lack depth when addressing niche or highly specific topics."
  • Subscription/Access Limitations: "You can't access the best version without paying."

Claude
  • Limited Capabilities/Restrictions: "It has no sense of synthesis or being able to put various things together; it just regurgitates basically whatever is at the top of a Google search."
  • Subscription/Rate Limits: "Limitation on number of free interactions (i.e., no true free tier)."
  • Response Quality/Personality: "One thing I dislike about using Claude is that it can sometimes provide responses that feel a bit shallow, especially on more complex topics, and I’d prefer more detailed explanations."

Gemini
  • Inaccuracy/Inconsistent Responses: "Sometimes it gets in a loop where it won't answer correctly but it answers as if it is 100 percent certain it has the right answer, then you question it and it changes its answer."
  • Limited Features/Capabilities: "It doesn't have a lot of the capabilities that other AI chat bots have."
  • Slow Performance/Processing Issues: "What I cannot stand about using Gemini is that it runs slower sometimes when processing particular requests or when it comes to offering the most accurate answers."

Table 1: Top issues for three AI-based chat software products (with sample participant quotes).

Do Users of These Products Differ in Tech Savviness?

We don’t want differences in metrics to just be a result of differences in participant tech-savviness. It could be that less tech-savvy users gravitate toward ChatGPT, and more tech-savvy users prefer Claude. To differentiate between differences in participant ability and differences in the user experience, we looked at the tech-savviness scores using our TAC-10 measure for each of the three products.

The TAC-10 (Technical Activity Checklist with ten items) is a reliable (consistent) and valid (predictive) measure of tech savviness. The TAC-10 score for a person is the number of items selected from its checklist.

Across the three products (Figure 4), the range of mean TAC-10 scores was from 4.7 for ChatGPT to 7.0 for Claude. Using the TAC-10 criteria for assigning individual scores to three levels (Low: 0–4, Medium: 5–7, High: 8–10), all three TAC-10 means round to the medium range. Within that range, however, the mean differences were all statistically significant (p < .03). This suggests that in our sample, ChatGPT users are slightly but significantly less tech savvy than Gemini users, who are, in turn, slightly but significantly less tech savvy than Claude users.
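The banding described above is a simple threshold classification of the checklist count; a sketch (function name ours, bands as defined above):

```python
def tac10_level(items_checked):
    """Classify a TAC-10 score (number of checklist items selected,
    0-10) into the Low/Medium/High bands: 0-4, 5-7, 8-10."""
    if items_checked <= 4:
        return "Low"
    if items_checked <= 7:
        return "Medium"
    return "High"

print(tac10_level(5))  # Medium
```

Note that all three product means (4.7 to 7.0) round into the Medium band even though the means differ significantly from each other.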

Figure 4: TAC-10 scores by product (with 95% confidence intervals).


Summary and Discussion

Results of our AI-based chat benchmarks based on 153 participants revealed:

All products had high and similar Net Promoter Scores. The NPS values for the three products were about the same, with only a four-point difference between the low of 39% for ChatGPT and the high of 43% for Claude.

Perceived usability ratings for the products ranged from above average to very good. Using the Sauro-Lewis curved grading scale for the SUS, Gemini had an A, Claude had a B+, and ChatGPT had a B. For the curved grading scale we developed for the UX-Lite (which is the combination of perceived ease and perceived usefulness), Gemini received an A−, ChatGPT received a B, and Claude received a C.

Gemini was rated as the easiest and most useful. Gemini had the nominally highest SUS and UX-Lite scores (respectively, 81.5 and 84.5), and its UX-Lite score was significantly higher than Claude's (p = .05).

There were no significant differences in perceived ease or usefulness between ChatGPT and Claude. Despite the slightly different assignment of grades to ChatGPT and Claude, there were no significant differences in their SUS or UX-Lite scores. For the UX-Lite, there were no significant differences between their ratings of perceived ease of use and perceived usefulness.

Reported issues included accuracy, generic content, and limited free versions. Respondents noted inaccurate output (often presented with high confidence) by ChatGPT and Gemini, generic output by ChatGPT and Claude, and subscription limitations for ChatGPT and Claude.

There were significant differences in the tech savviness of users of the different products. The mean TAC-10 scores were all in the medium range of tech savviness. However, within that range, there were significant differences among all three products, from 4.7 for ChatGPT to 5.8 for Gemini to 7.0 for Claude. We’ll dig into the covariate of tech-savviness in future analyses of AI chat bots. For now, we recommend UX researchers routinely include measures like the TAC-10 as part of the characterization of their sample.

For more details on all the products we measured, see the full report.
