
Now imagine being able to conduct UX research without the hassle of recruiting the “U.” Enter AI-generated synthetic users, which promise participant input that is:
- Simpler (no need to deal with humans)
- Less costly (no panel/respondent fees)
- Faster (finish in hours or days instead of weeks or months)
- Scalable (get data from thousands of synthetic users instead of a relative handful of participants)
- Broader in reach (access to user groups that are hard to find or very expensive to recruit)
- More secure (no need for nondisclosure agreements and no risk of human disclosure)
At least, that’s the dream of research with synthetic users.
Others view synthetic users as more of a nightmare. They are concerned that research with synthetic users can lead to:
- Plausible-looking data that’s just wrong
- Shallow qualitative responses because synthetic users have no real lived experience
- Reinforced biases driven by large language models (LLMs)
- Artificially low variability (quantitative or qualitative)
We’ve seen these conflicting attitudes about synthetic users play out in online posts and conversations over the past few years, most recently with the promotion of proprietary models of synthetic users by companies like Qualtrics and Aaru, followed by criticism of that promotion by influential UX researchers.
Pro-Synthetic Voices
Qualtrics is the dominant (and debt-loaded) survey platform that’s made a big bet on synthetic users. Their synthetic-respondent model was trained on millions of survey responses, and they reported that it can realistically mimic human survey patterns. They claim the variability and correlations mirror human response patterns better than data from general-purpose LLMs, at least for the attitudinal survey questions they used.
Aaru is another synthetic user simulation platform that has gotten attention. The global consulting firm EY used Aaru’s proprietary multi-agent simulation to replicate a 3,600-person global wealth survey. They reported strong agreement across multiple statistical metrics (high correlation, modest error), suggesting that synthetic data approximated real survey results at scale (done in one day versus six months!).
Anti-Synthetic Voices
First from the anti-synthetic camp is Chris Chapman, a longtime quantitative UX researcher (Amazon, Google, Microsoft) and co-chair of the Quant UX Conference. His most recent presentation makes the case that synthetic users are not users. His blunt conclusion is that synthetic data has no place in survey research.
Another voice is John Mecke, a SaaS and product strategy writer who argues that synthetic users face five core limitations: no lived experience, misleading “too-accurate” results, cultural bias, weak statistical reliability, and narrower real-world usefulness than claimed.
And there’s also Constantine Papas, a UX research strategist and author of The Voice of User. Papas argues that synthetic research is being oversold, largely through cherry-picked favorable results from financially interested parties. Describing the EY evaluation of Aaru’s algorithms, he argues that the correlations are high largely because the LLMs were already trained on that data, which is hardly predicting.
Finally, a recent preprint of a comprehensive literature review of 182 papers also casts strong doubt on the ability of synthetic users to do more than mimic already collected data. We recommend reading the preprint (not yet peer reviewed) and a discussion of the research by Papas.
As interesting as these online conversations are, they have not been formally reviewed for scientific quality. In this article, we briefly review 12 peer-reviewed research papers on the use of synthetic users in UX and UX-adjacent research. For full details on experimental designs and results (e.g., experimental comparisons, models, prompting, settings, metrics), see the links to the papers and articles in the appendix.
Our Inclusion Criteria for Papers on Synthetic Users
We searched the literature for peer-reviewed research published no earlier than 2023 that used LLMs no older than GPT-3.5. This turned up 12 papers that can be broadly categorized as attempts to replicate:
- Psychological experiments (five papers)
- Survey results (three papers)
- Social research (three papers)
- UX research (one paper)
We’ll now review the evidence in each of these four categories.
Psychological Experiments: Sometimes Human-Like, Often Inconsistent
The idea that digital data can replicate people predates LLMs. From the mid-1990s through the 2000s, a popular research program in social psychology was the “computers are social actors” paradigm, which recreated classic psychological experiments with one of the human participants replaced by a computer to investigate how the substitution affected human behavior.
Several researchers have adapted this approach to one in which there are no human participants, exploring the extent to which LLMs mimic humans in psychological experiments.
If synthetic users act like humans in experiments, maybe we can use them instead of humans in some studies. But why would anyone think this would be possible? Well, because modern LLMs are trained on huge amounts of human-generated content, the models may include latent social information. Depending on the quality of this latent information, with appropriate prompts, they might produce human-like outputs.
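To make this concrete, here is a minimal sketch of how a persona-conditioned prompt might elicit a survey-style response from an LLM. It assumes the OpenAI Python client; the persona, survey item, model name, and temperature are illustrative placeholders, not the prompts or settings used in the papers reviewed below.

```python
# Minimal sketch of a persona-conditioned "synthetic user" prompt.
# Assumes the OpenAI Python client; persona, item, model, and settings are
# illustrative only and do not reproduce any reviewed study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

persona = "a 34-year-old nurse in Ohio who commutes by car and rarely shops online"
item = (
    "On a scale from 1 (strongly disagree) to 7 (strongly agree): "
    "'I trust online reviews when choosing a product.'"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in whatever model you are evaluating
    temperature=1.0,      # temperature strongly affects response variability
    messages=[
        {"role": "system", "content": f"Answer survey questions as {persona}. Reply with a single number."},
        {"role": "user", "content": item},
    ],
)
print(response.choices[0].message.content)
```

Every choice in a setup like this (the persona detail, the response format, the temperature) shapes the resulting data, which is one reason the findings below vary so much across studies.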
The results from these experiments were mixed, with the following key findings from the five papers:
- Using GPT-3.5, Dillion et al. (2023) found a strong correlation (r = .95) between synthetic and human moral judgments (encouraging), but there were many points with large differences between human and synthetic mean ratings (discouraging).
- Goli and Singh (2024) used GPT-3.5 and GPT-4 in a replication of experiments in which synthetic users were presented with a choice between a certain number of tokens in a month versus waiting for a larger number of tokens later. GPT-3.5 ignored differences in reward amounts (discouraging), while GPT-4 had some sensitivity to the differences (encouraging), but its discount rates were larger than those observed with humans (discouraging).
- Attempting to replicate 14 classic social science studies using GPT-3.5, Park et al. (2024) reported that six had unanalyzable data (too little variability), five failed replication, and three were successfully replicated. So, 21% of the attempts were successful (encouraging), but 79% were unsuccessful (discouraging).
- Using GPT-4, de Winter et al. (2024) created 2,000 text-based personas that completed a short form of the Big Five Inventory. The synthetic data matched the expected factor structure and had a high correlation with human data (encouraging) but deviated significantly from the humans’ item means (discouraging).
- Almeida et al. (2024) replicated eight psychology studies of legal and moral reasoning using Gemini Pro (1.0), Claude 2.1, GPT-4, and Llama 2 Chat 70b. They found differing levels of alignment with human responses, with the closest match for GPT-4. “Nonetheless, even when LLM-generated responses are highly correlated to human responses [encouraging], there are still systematic differences, with a tendency for models to exaggerate effects that are present among humans, in part by reducing variance” (discouraging).
Surveys: Match on Averages, Fail on Details
Even if synthetic users are inconsistent in how they react to classical psychology experiments, they might be able to match human response patterns in surveys. However:
- Bisbee et al. (2024) used GPT-3.5 Turbo (with some replication by GPT-4 and Falcon-40B-Instruct) to reproduce the 2016–2020 American National Election Survey (ANES). They encountered numerous statistical issues: synthetic respondents somewhat matched high-level means (encouraging) but had inaccurate subgroup means, small standard deviations, inaccurate regression coefficients, and failed to meet even basic requirements for replication (discouraging).
- Using GPT-4, Shrestha et al. (2024) compared synthetic and human responses to 43 policy questions on topics like climate, spending, and labor in the U.S., Saudi Arabia, and the UAE. The means for the 43 questions indicated that the responses of human and synthetic participants were reasonably aligned (encouraging) but not precisely the same, with about 70% significantly different (discouraging).
- Tjuatja et al. (2024) used variants of Llama, GPT-3.5 Turbo, and GPT-3.5 Turbo Instruct to investigate whether synthetic responses to different item formats matched known human response biases. “Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior” (discouraging).
Social Research: Trends Match Humans, but the Details Don’t
The goal of studies in social research is similar to psychological experimentation, though with more focus on interpersonal behaviors and attitudes.
- In experiments with GPT-4 and Llama3, Yu et al. (2025) compared synthetic user and human responses to standardized psychological questionnaires measuring empathy. The expected factor structure of the questionnaires was produced by GPT-4 (encouraging), but the magnitudes of the synthetic scores did not match those of humans (discouraging). Responses from Llama3 synthetic users did not match the expected factor structure (discouraging).
- Wang et al. (2025) showed that the LLMs they investigated (Llama-2-Chat 7B, Wizard Vicuna Uncensored 7B, GPT-3.5 Turbo, GPT-4) may not be able to distinguish between text written about a group of people by outsiders and text written by members of that group, making them unsuitable for creating synthetic users that replace actual users in social research due to inherent bias (discouraging).
- Rafikova and Voronin (2026) used GPT-4 to investigate synthetic responses to complex social issues (e.g., immigration, gender stereotypes). Synthetic users matched the direction and magnitude of human attitudinal trends (encouraging) but had weak correspondence with deeper models of attitudinal variance (discouraging).
UX Interviews: Convincing at First, Limited on Follow-Up
We didn’t turn up studies directly related to quantitative UX research (although that is informed by psychological, survey, and social research). We did, however, find one study of researchers’ experiences interviewing synthetic users in place of humans.
- Kapania et al. (2025) had 19 UX researchers recreate one of their recent projects conducted with human participants with GPT-4-Turbo. “Initially skeptical, researchers were surprised to see similar narratives emerge in the LLM-generated data when using the interview probe. However, over several conversational turns, they went on to identify fundamental limitations, such as how LLMs foreclose participants’ consent and agency, produce responses lacking in palpability and contextual depth, and risk delegitimizing qualitative research methods” (discouraging).
Summary and Discussion
We reviewed 12 papers describing recent research comparing synthetic users and humans in four contexts of interest to UX researchers. In our summaries, we tagged 9 findings as encouraging and 14 as discouraging. So, the results aren’t universally bad, but they definitely aren’t great. We summarized those findings in Table 1.
| Theme | Encouraging Findings | Discouraging Findings |
|---|---|---|
| Matched means/percents | 4 (B, P, R, S) | 7 (A, D, G, P, S, W, Y) |
| Correlated | 4 (A, D, G, W) | 1 (R) |
| Matched expected variance | 0 | 3 (A, B, P) |
| Matched factor structure | 2 (W, Y) | 1 (Y) |
| Matched expected replication | 1 (P) | 2 (B, P) |
| Good qualitative depth | 0 | 3 (K, T, Wa) |
| Unbiased/representative | 0 | 1 (Wa) |
| Matched regression weights | 0 | 1 (B) |
Table 1: Summary of the number of encouraging and discouraging findings. The letters in parentheses indicate the sources for the findings. Letters are the first letter of the last name of the lead author; W = de Winter, Wa = Wang. Some studies produced both encouraging and discouraging findings in the same themes (e.g., Yu found both matching and mismatching factor structures), and some findings matched multiple themes.
Correlation Does Not Mean Equivalence
Some results were promising, but most found discrepancies between synthetic and human results.
For example, Dillion et al. (2023) found strong alignment between synthetic and human moral judgments, but Almeida et al. (2024) reported that even when synthetic and human moral judgments correlated, there were systematic differences, with synthetic data exaggerating effects seen with humans.
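A toy numeric example (fabricated for illustration, not data from any reviewed study) shows why: a synthetic series built by compressing and shifting hypothetical human ratings correlates with them almost perfectly, yet its mean is shifted and its variance is reduced.

```python
# Toy illustration (fabricated data, not from any reviewed study): ratings can
# correlate almost perfectly while the means and variances disagree badly.
import numpy as np

rng = np.random.default_rng(1)
human = rng.normal(loc=4.0, scale=1.5, size=200)          # hypothetical human ratings
synthetic = 0.5 * human + 3.2 + rng.normal(0, 0.1, 200)   # compressed and shifted mimic

print(f"r = {np.corrcoef(human, synthetic)[0, 1]:.2f}")                         # ~0.99
print(f"means: human {human.mean():.1f} vs. synthetic {synthetic.mean():.1f}")  # ~4.0 vs. ~5.2
print(f"SDs:   human {human.std():.1f} vs. synthetic {synthetic.std():.1f}")    # ~1.5 vs. ~0.8
```

A researcher looking only at the correlation would call this an excellent match; a researcher looking at the means or the spread would not.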
Superficial Agreement, Deeper Errors
Issues with synthetic data included reduced variance, misalignment of means/percentages, distorted correlations, inaccurate regression coefficients, and shallow experiential narratives.
Different studies reported different issues. Sometimes high-level means matched but deeper correlational metrics were distorted (e.g., Bisbee et al., 2024); at other times, correlations were high, but there were significant differences among means (e.g., de Winter et al., 2024).
Rapid Model Changes Make Findings Quickly Outdated
Research on synthetic users is complicated by variation in contexts, models, prompting, and settings.
Controlled experimentation relies on being able to control the experimental environment. Different researchers use different models with different prompting and settings. Even the papers published in 2026 used older models than are currently available because research necessarily precedes peer-reviewed publication. Next year’s models will be different from this year’s.
Proprietary Models May Work but Lack Validation
Further complicating the research landscape is the emergence of proprietary models incorporating extensive amounts of survey data. Proprietary models from companies like Qualtrics and Aaru might perform better than general LLM chatbots in the production of synthetic samples that match human attitudes and performance. It’s just too early to tell. To date, we have not seen any peer-reviewed publications using these platforms.
Directional When Answers Are Unknown
The encouraging results regarding occasionally high correlation of human and synthetic data suggest that the results from synthetic users can provide directional signals, but synthetic estimates are often imprecise and inconsistent from study to study. The promise of synthetic users is alluring, but until there is strong evidence of consistently good matching with human data, it seems premature to rely on research with synthetic users for critical decision-making.
Potentially Useful When Answers Are Known and Stable
We’re not done yet with this topic and are planning our own analysis. But right now, it seems the most promising use of synthetic users is deriving insights from already collected data. Why ask a survey question if the answer is already known and stable? Most attitudes aren’t stable and are highly dependent on the audience. But if you have surveyed the same type of population repeatedly and have relatively stable results (possibly like the EY study), then you may know the answer. In that case, an LLM is just an easier way to query your database. Just don’t think it’s predicting something that’s not already known.
Appendix: Links to Papers and Articles
Chapman, C. (2025, Jun 18). Synthetic survey data? It’s not data. Quantitative UX Research Blog.