AI INFLUENCE DEMANDS TRANSPARENCY.

Zeist AI Influence Report, Volume 1 | June 2026

15 Models | 612 Scenarios | 36 Topics/Prompts | 9,036 Outputs

BACKGROUND: AN INVISIBLE CHANNEL OF INFLUENCE

We tend to choose AI the way we choose software. We compare features, pick what performs best, move on.

Most AI benchmarks and evaluation methods reinforce this thinking: they measure performance, not consequence.

But AI models don't just deliver answers. AI is not an information highway, passively transmitting data. It curates, ranks, and bends it, long before it reaches us.

That doesn't make one system "right" and another "wrong." They all influence us, and that influence has real consequences.

In politics, for example, scholars and practitioners are beginning to ask how many voters will turn to AI for guidance in the next election, and which systems they will choose.

But there's a missing link in the debate:

Would it actually make a difference? If so, how?

That is the question this study addresses.

RESEARCH OBJECTIVE

The aim of this study is to make AI influence visible, measurable, and accountable.

THE METHODOLOGICAL GAP

Making AI influence visible requires reconciling two competing needs: systematic, high-quality evaluation versus the speed and scale required for practical auditing.

The field is moving in both directions, but separately. Evaluation is shifting beyond benchmark scores toward more pluralistic and interactional approaches. Recent work by Vishwarupe et al. (2026), for example, demonstrates how pluralistic alignment can surface previously undetected failures — notably, cases where models collapse into "sycophantic consensus."

This progress, however, is constrained by a methodological dilemma. Careful human coding offers rigor, but it is slow, and that slowness often turns evaluations into after-the-fact forensic reports. By contrast, fast crowdsourced evaluations, such as arena-style rankings, provide scale and immediacy but lack the controls, reliability, and construct validity required for a trustworthy audit.

Our framework brings these two demands together, closing the gap between systematic evaluation and fast, reliable detection of AI failures.

METHOD

Perlocution: How Influence Lands

Most AI evaluation frameworks test outputs as if they were homework to be graded. In doing so, they miss a simple truth: what is said is not the same as what is heard.

Influence takes two parties. It depends on what's heard.

Scholars describe this distinction as locution (words spoken) versus perlocution (effect on the listener).

Most AI research measures locution. The Primetric approach focuses on perlocution.

Perlocution Research at Scale

Perlocution research in AI has a logistics problem. The list is long: recruiting and managing user panels, running controlled experiments, collecting outputs, building evaluation databases, developing reviewer panels, and maintaining systematic review processes. Each step adds friction, cost, and delay. The result is a process too slow and expensive to keep pace with the speed of model updates.

We built a system that makes perlocution research at scale feasible, without requiring human participants. It combines simulated personas and prompts with a structured taxonomy and systematic evaluation to capture AI influence on the user. Each component is outlined below.

Conceptual Foundation Module

This module is built on two principles:

  1. The scope is narrow by design. We focus on one thing: how AI influences us.

    Influence has two parts: movement and direction. We ask the movement question first — did the AI attempt to shift the user's stated view at all?

    Only after movement is detected do we examine direction: whether the response reinforced the user's position or pushed against it. The framework does not judge whether that movement was right or wrong.

  2. The narrow focus also makes influence easier to operationalize, with clear endpoints — strong support and strong opposition — and the spectrum between them.

    We developed two complementary taxonomies to map this spectrum systematically: position (does the AI support or oppose, and how strongly?) and tone (is the response delivered warmly or coldly?). Position includes 12 categories; tone includes 11.

Domain Selection for the Case Study

While the influence framework can be applied to any domain, we selected ideology and values for two reasons.

First, the stakes are high. Ideological influence doesn't just skew a single opinion. It recalibrates the entire lens through which opinions form and decisions are made. At scale, that has societal consequences.

Second, there is no ground truth. In this domain, AI outputs cannot be checked against verifiable fact. The system must therefore rely on other criteria to generate a response. Those choices, in turn, reveal the underlying mechanism.

Output Collection Module

To generate the research prompts, we constructed two repositories — one of user personas and one of questions — each spanning the ideological spectrum from conservative to progressive, with a moderate midpoint. Personas and questions were designed to mirror one another, creating symmetric prompt pairs.

Crossed with the models included in the study, this produces a fully factorial design (model × persona × question), enabling clean comparisons and model attribution.
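
A minimal sketch of what a fully crossed design looks like in code. The model, persona, and question names below are hypothetical placeholders, since the report does not list its actual repositories; the point is only that every combination appears exactly once.

```python
from itertools import product

# Hypothetical placeholder names; not the study's actual models, personas, or questions.
models = ["model_a", "model_b"]
personas = ["conservative", "moderate", "progressive"]
questions = ["q_conservative", "q_moderate", "q_progressive"]

def build_design(models, personas, questions):
    """Return every (model, persona, question) combination exactly once."""
    return [
        {"model": m, "persona": p, "question": q}
        for m, p, q in product(models, personas, questions)
    ]

design = build_design(models, personas, questions)
print(len(design))  # 2 * 3 * 3 = 18 cells
```

Because each cell differs from its neighbors in only one factor, a difference between two outputs can be attributed to the model, the persona, or the question in isolation.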

Each prompt consists of a single-turn simulated interaction assembled from a standardized template. Questions are written in the conversational style of a typical user message to approximate realistic usage conditions. Outputs are stored in a separate repository together with relevant metadata.

Output Scoring Module

Each collected output is isolated and prepared for blind evaluation. All identifiable information — the model that generated it, the user persona, the question asked, and any other prompt-related data — is removed, allowing each output to be assessed as a standalone document.
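
One way the blinding step could be implemented is sketched below. The field names and the random-identifier scheme are assumptions for illustration; the report does not publish its storage schema.

```python
import uuid

# Assumed field names; the report does not specify how metadata is stored.
IDENTIFYING_FIELDS = {"model", "persona", "question", "prompt"}

def blind(record):
    """Split a record into a blinded document and a private lookup entry."""
    blind_id = uuid.uuid4().hex
    # Raters see only the opaque id and the output text.
    blinded = {"blind_id": blind_id, "text": record["text"]}
    # The identifying metadata is kept aside so scores can be re-attributed later.
    key_entry = {k: v for k, v in record.items() if k in IDENTIFYING_FIELDS}
    return blinded, {blind_id: key_entry}

record = {
    "model": "model_a",
    "persona": "conservative",
    "question": "q_progressive",
    "prompt": "full prompt text",
    "text": "The model's response goes here.",
}
blinded, lookup = blind(record)
```

Only `blinded` would be passed to raters; `lookup` stays with the analyst until scoring is complete.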

The evaluation is a two-step process: signal detection and scoring.

  1. Signal detection uses an ensemble of three LLM raters to interpret each output and identify influence signals as defined in the taxonomy. The three are drawn from a broader pool. Different trios are assigned across batches to reduce model-specific bias in detection.

    The two dimensions of influence — position and tone — are evaluated separately. For each output, each dimension is assessed through a separate API call using zero-shot context, with no memory or history.

    The raters' sole task is to detect the presence of signals and identify the text segments that contain them. Detection operates on meaning-bearing chunks rather than individual words — typically a clause or sentence that captures an influence signal in full.

    Inter-rater reliability is high: the three LLMs agree more than 90% of the time. For scoring, only signals detected by at least two of the three raters are retained; signals flagged by a single rater are excluded.

  2. Scoring of the identified signals is performed without any LLM involvement.

    That's because the detection step requires semantic understanding and flexibility, while the scoring step demands measurement consistency.

    Separating these two steps lets each tool do what it's good at. The LLMs handle the interpretive work: finding the signals in the text. The algorithm handles the quantitative work: turning identified signals into scores. The result is a measurement process that is both semantically flexible and numerically consistent.

    Each signal is scored on direction (support or opposition), intensity (how strongly it leans), and frequency (how often it appears in the output), then combined into a single signal-level score using a fixed weighting scheme. Composite scores feed the Reception Graph: a holistic representation of a model's influence profile.
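
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: signals are simplified to category labels (real detection works on text segments), the label names are hypothetical, and the weights are invented, since the report says only that the weighting scheme is fixed.

```python
from collections import Counter

# Assumed weights; the report does not disclose its actual weighting scheme.
WEIGHTS = {"intensity": 0.7, "frequency": 0.3}

def retain_consensus(rater_signals, min_votes=2):
    """Keep signal labels flagged by at least `min_votes` of the raters."""
    votes = Counter(label for signals in rater_signals for label in set(signals))
    return sorted(label for label, n in votes.items() if n >= min_votes)

def score_signal(direction, intensity, frequency):
    """Deterministic score: direction is +1 (support) or -1 (opposition);
    intensity and frequency are assumed to be normalized to [0, 1]."""
    magnitude = WEIGHTS["intensity"] * intensity + WEIGHTS["frequency"] * frequency
    return direction * magnitude

# Hypothetical detections from three raters for one output.
raters = [
    {"strong_support", "mild_challenge"},
    {"strong_support"},
    {"strong_support", "redirect"},
]
kept = retain_consensus(raters)
print(kept)  # ['strong_support']
print(score_signal(direction=+1, intensity=1.0, frequency=0.5))
```

The scoring function contains no model call, so the same detected signals always yield the same score, which is the consistency property the separation is meant to provide.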

Analysis follows standard approaches: descriptive statistics, comparative rankings, and pattern identification across models and prompt variables.

FINDING 01: THE RHETORICAL CLIMATE

Table 1: Descriptive Statistics

All fourteen models were given the same set of prompts. You might expect some convergence in behavior. There was little of it.

The differences start with something as basic as length. Across 8,424 outputs, responses ranged from 116 to 3,399 words. For a reader, that is the difference between a 30-second answer and a 15-minute one.

At the model level, the gap is starker. ChatGPT 5.1, the most verbose, averaged 1,420 words per response. Opus 4.1, the most concise, averaged 253. Same questions, very different experiences.

Length, however, is not influence. Each output contains between 3 and 24 influence signals, averaging around 10. Their distribution follows no clear formula, and placement varies across models and contexts. Longer responses do not carry more signals. The correlation is 0.17, effectively negligible.
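
For readers who want to reproduce this kind of check, the sketch below computes a Pearson correlation between response length and signal count. It assumes the reported 0.17 is a Pearson coefficient, which the report does not state, and the sample values are made up for illustration, not taken from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only; not data from the study.
word_counts = [116, 253, 800, 1420, 3399]
signal_counts = [9, 12, 7, 11, 10]
r = pearson(word_counts, signal_counts)
print(f"r = {r:.2f}")
```

A coefficient near zero means that knowing how long a response is tells you almost nothing about how many influence signals it carries.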

Signal density is equally unpredictable. It ranges from 5% to 100% of the output, averaging 36%. No model shows a consistent pattern throughout. Signal type varies just as widely.
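
The report does not define how density is computed. One plausible reading, assumed here, is the share of an output's words that fall inside detected signal segments. The sketch also assumes segments do not overlap.

```python
def signal_density(text, signal_segments):
    """Share of words in `text` covered by detected signal segments.

    Assumes segments are non-overlapping excerpts of `text`.
    """
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    covered_words = sum(len(segment.split()) for segment in signal_segments)
    return min(covered_words / total_words, 1.0)

# Hypothetical output and detected segments.
output = "You raise a fair point. Taxes fund schools. Consider the other side."
segments = ["You raise a fair point.", "Consider the other side."]
density = signal_density(output, segments)
print(f"{density:.0%}")  # 75%
```

Under this definition, a density of 100% would mean every chunk of the response carries an influence signal.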

Tone is the one partial exception. It tends to skew warm across models, but the degree varies, and no two models are warm in quite the same way.

Each model, in other words, operates within its own climate. That climate is distinct from those of other models, though it is not uniform across every context.

What this means for users is that AI models are not interchangeable commodities. Choosing a model means choosing an informational environment. That choice should be a mindful one.

FINDING 02: THE PERSONALITIES

Figure 1: Models by Archetype Profiles

Influence signals form arcs. Arcs sort into archetypes. Archetypes compose a personality.

Each AI response weaves multiple influence signals together in varying sequence, emphasis, and intensity. That pattern is the influence arc, and it shifts with the model, the user, and the situation.

Across 8,424 outputs, clear patterns emerge. We group them into nine influence archetypes. Because we want to characterize the model itself rather than its reactions to specific prompts, the archetypes are based on movement only. Direction is set aside. Movement is further split into verbosity and intensity. The resulting archetypes range from verbose and forceful to terse and indifferent.

We call them Evangelist, Campaigner, Provocateur, Champion, Reinforcer, Endorser, Surveyor, Narrator, and Observer.

Figure 1 shows how each model blends these nine archetypes to form a distinct rhetorical personality. The variation appears both across model families and among siblings within a family. Together, these differences suggest just how much and how fast AI models can change.

This variation matters more than it first appears, because most people do not switch between models. Whether out of cost, habit, or convenience, they tend to stick with one.

And when they choose a model, they may believe they are selecting a tool based on capability. In practice, they are also choosing a rhetorical environment. This shapes how they experience the world — how arguments are framed, which views are reinforced, which are challenged, and what counts as "normal" reasoning. Over time, it adds up.

At scale, it touches everyone. Imagine a single model capturing, say, 75% of the market. Its rhetorical personality would shape how a large share of the public encounters ideas through AI. What kind of downstream effects would that market power have — socially, culturally, even politically? What happens when market share becomes mind share?

FINDING 03: SIBLING DIFFERENCES

Figure 2: Models by the Reception Graph, All Outputs

Think of pre-training as nature, post-training as nurture. To understand which matters more for AI personality, it helps to compare sibling models. Rapid release cycles allowed us to watch AI models evolve over just three months, creating a mini-longitudinal study.

If nature were dominant, siblings should converge on similar personalities. They don't. As seen in Figure 1, their archetype distributions diverge as widely as models from entirely different families.

After personality, we examine behavior. We define it along two dimensions. Position captures direction and intensity — does the model support or oppose, and how strongly? Tone, measured independently, captures warmth — does it lean in warmly, or push back coldly?

The analysis at this stage doesn't care what the model is talking about — only how it tries to influence. That makes responses on different topics comparable on a common scale.

For visual comparisons, we combine the findings into what we call the Reception Graph.

Figure 2 presents the Reception Graph of all 14 models, and it tells the same story as Figure 1: no consensus across models or within model families.

This idiosyncrasy becomes even clearer when each dimension is examined separately. Position is divided into 12 categories, ranging from strongly support or oppose to barely engaged, and tone into 11, ranging from sycophantic to dismissive.

Figure 3 shows how models distribute across position categories; Figure 4 does the same for tone.

Figure 3: Models by Position Signals, All Outputs
Figure 4: Models by Tone Signals, All Outputs

Side by side, a larger picture emerges. The two dimensions move independently. A model's position says little about its tone, and vice versa. Opposition can be warm; support can be cold. The framework, in other words, captures distinct properties of AI behavior rather than two views of one characteristic.

Within each family, the generations wander rather than march. Some shifts are abrupt, others subtle; some reverse direction entirely.

Combined, the findings point to post-training as the dominant force shaping influence behavior: fine-tuning, preference shaping, reinforcement, and restraint. These behaviors do not emerge through spontaneous combustion; they are designed and planned. AI evolution is artificial selection.

FINDING 04: THE REASONING TEST

We've established that sibling versions of the same AI model can behave like human siblings: sometimes similar, sometimes strikingly different.

A potential confounding factor is reasoning depth. Humans often sound more careful when they think longer. A quick answer can be blunt; a slower one, more measured. Perhaps AI models work the same way. With more reasoning compute, a different personality may emerge.

Testing this in a controlled environment, however, would tell us little about everyday impact. Most users never touch the reasoning settings. They use whatever default the provider ships.

The ideal test, then, would be a natural experiment: the same underlying model released under two default reasoning depths, and millions of users unknowingly exposed to those settings in everyday use.

And that is effectively what OpenAI did in 2026. When GPT-5.5 launched in late April, the default reasoning effort was set to "medium." Roughly two weeks later, with the release of GPT-5.5 Instant as the new ChatGPT default, the reasoning effort dropped to "minimal."

The result was an unusually clean natural experiment. We tested both versions using identical prompts, scenarios, and scoring procedures.

The results for the two versions are almost indistinguishable, as Figures 1 through 4 show.

The minimal-reasoning version retained the same patterns of tone, support, opposition, and framing as the medium-reasoning version. Despite running two steps below medium on OpenAI's effort scale, it still behaved like GPT-5.5.

That is a significant finding. It suggests that the behavioral fingerprint — the rhetorical personality — is not simply a product of how much a model "thinks." It is more deeply embedded than that. Turning the reasoning dial does not change who the model is. It just changes how fast it says what it was going to say anyway.

For users, this means that switching to a faster or "lighter" model version is not a neutral act. The persuasive profile travels with the model.

FINDING 05: MODELS UNDER STRESS

Table 2: Model Responses by Stress Condition

Traditional bias tests ask relatively simple questions: Does a model favor one job candidate over another based on implied gender or ethnicity? Early on, that made sense. Today's systems, however, come with guardrails. Explicit biases are rare.

So is the problem solved? Or has it simply become harder to detect?

To answer that, we put models under stress: prompts with conflicting signals and ambiguous terrain.

We chose the ideological domain because there is no clear ground truth, but plenty of value judgments.

First, we grouped all 8,424 outputs by stress condition across topics, personas, and models to test whether conflicting signals systematically change how AI models respond.

The results complicate the standard sycophancy critique. Under stress, the models become more combative, not more accommodating. Pushback increases. Agreement declines.

The contrast is sharp (Table 2). Opposition that includes redirection rises 40% (18.0 → 25.1), direct challenge rises 30% (22.4 → 29.2), and supportive behavior falls 28% (29.9 → 21.5). All three shifts are highly significant.
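
The percentages follow from the before and after values quoted above. A short check of the arithmetic, using only those quoted values (the dictionary keys are hypothetical labels):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# Before/after values as quoted in the text (baseline, stress).
shifts = {
    "opposition_with_redirection": (18.0, 25.1),
    "direct_challenge": (22.4, 29.2),
    "supportive_behavior": (29.9, 21.5),
}
changes = {name: round(pct_change(b, a), 1) for name, (b, a) in shifts.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
# opposition_with_redirection: +39.4%
# direct_challenge: +30.4%
# supportive_behavior: -28.1%
```

Computed from the rounded values in the text, the changes come out at about 39%, 30%, and 28%, in line with the figures reported; small differences may reflect rounding in the quoted values.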

Agentic Implications

The findings raise a broader question for agentic AI systems. What happens when an agent receives a new objective that conflicts with its original instructions, incentives, or learned behavioral patterns?

Our results suggest that behavior under those conditions can diverge sharply from behavior in stable environments. Following instructions is relatively straightforward when goals are aligned and signals are consistent. But when the system encounters conflicting pressures, the response changes. Routine compliance gives way to friction. Models become more likely to take charge: to resist, redirect, reinterpret, or challenge the existing context rather than simply continue along the original trajectory.

That matters because real-world AI systems will rarely operate in clean laboratory conditions. They will face competing demands from users, safety policies, corporate objectives, and reinforcement signals that do not always align neatly with one another. In that environment, the key question is not just whether a model can follow instructions when everything is aligned and coherent. It is how the model behaves when coherence breaks down.

This means that evaluating AI systems only under ideal conditions misses the real problem. The more revealing question is how these systems behave under stress. The next question is what kinds of stress tests are necessary to expose those behaviors before deployment.

FINDING 06: THE MIRROR TEST

Figure 5: Model Responses by Ideological Stress

Next, we turn from the aggregate results to variation within the stress conditions themselves.

The study focused on two mirrored stress conditions: a conservative entertaining a progressive view, and a progressive considering a conservative one. Across social, cultural, and political topics, user personas signaled uncertainty about their existing beliefs and a willingness to reconsider them. This mimics a real-world moment: a voter genuinely uncertain, turning to AI for guidance.

If the models were unbiased, the responses across the two conditions should look roughly the same.

They did not. Instead, four distinct patterns emerged.

  1. Every model picks a side. Neutrality retreats under stress.

    The Reception Graphs in Figure 5 make it obvious. If there were no bias, the two shapes would settle neatly on top of one another. None of the 14 models shows that pattern. Under stress, every model reveals its distinct bias, the kind that otherwise remains buried.

  2. Tone stays steady. Position shifts.

    In Figure 5, the left side of each chart tracks tone: Endearing, Intrusive, Condescending. The right side tracks position: Oppose, Support, Challenge.

    On the left, the two shapes almost completely overlap across all models. Tone is unbiased. Each model settles into a manner of expression that stays consistent across conditions.

    The right side is a different story. There, the shapes split, each going its own way. What changes under stress is direction — what the model supports and what it pushes against. That is where the bias sits. And that is where the consequences matter most.

  3. Bias evolves. The path is uneven.

    Claude offers an interesting case study. Within the Claude family, both Opus and Sonnet share a similar tendency: support-oppose behaviors generally move in the same direction.

    But the details differ. Each branch begins from a different baseline, and each new version tips the balance in its own way. The pattern resembles a random walk, not a coordinated march.

  4. Bias can pivot. Some generations leapfrog.

    ChatGPT and Grok offer a window into a different evolutionary track. In both cases, the shift is sharp between earlier versions — from ChatGPT 5.1 to 5.4, and from Grok 4 to 4.2. After that, the change slows. Later versions, from ChatGPT 5.4 to 5.5 and Grok 4.2 to 4.3, show only incremental adjustments.

    Such abrupt course correction is unlikely to happen by accident. It points to a deliberate decision. It also shows that model builders can quickly pivot system behavior, even in domains as ambiguous as ideology.

    In other words, change is not as difficult as advertised. The real challenge is seeing the reason to change.

FINDING 07: THE IDEOLOGICAL TILT

Table 3: Ideological Tilt by Model — Scores

The previous analyses examined all models across all versions. We now narrow the focus to the systems shaping the present moment: the frontier models currently in use.

As tone varies little under stress, the analysis centers solely on the position dimension: whether a model supports or opposes the user when the user crosses their ideological lines.

The premise is clear-cut: if a model is unbiased, it should respond the same way to a conservative moving left as to a progressive moving right.

We begin with general patterns, then turn to directional analysis.

At the general level, the clearest division between models is persona anchoring: how strongly a model defends an established user persona or context against a new direction under consideration.

The Claude models anchor most strongly — Opus rebukes drift, Sonnet probes it — while Gemini does the reverse, largely accepting the user's new position with little pushback. Grok is selectively resistant, and GPT-5.5 is highly direction-dependent rather than consistently anchored.

At the directional level, models definitely pick sides. Most nudge users leftward, as other studies have found. They support a conservative moving left more than a progressive moving right. Grok is the only exception, strongly resisting conservative-to-progressive shifts.

Other patterns are more surprising. Claude resists drift in either direction — leftward or rightward, it makes little difference. Grok nudges through opposition; Gemini through support. And without stress tests, GPT-5.5 would likely appear balanced, because two opposing biases average into an apparent neutrality.
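
The masking effect described for GPT-5.5 is easy to see with numbers. The sketch below uses hypothetical net position scores (positive for support, negative for opposition), not values from Table 3, and compares a pooled average with a directional gap.

```python
# Hypothetical net position scores; not values from the study.
scores = {
    "model_x": {"leftward": 0.6, "rightward": -0.6},  # opposing biases
    "model_y": {"leftward": 0.1, "rightward": 0.1},   # symmetric response
}

def pooled_mean(s):
    """Average over both conditions, as an aggregate test would see it."""
    return (s["leftward"] + s["rightward"]) / 2

def directional_gap(s):
    """Difference between the mirrored conditions; zero means symmetric."""
    return s["leftward"] - s["rightward"]

for name, s in scores.items():
    print(f"{name}: pooled={pooled_mean(s):+.2f}, gap={directional_gap(s):+.2f}")
```

Here `model_x` looks neutral on the pooled average while showing the largest gap, and `model_y` shows a mild overall lean but no asymmetry. Only the mirrored comparison separates the two cases.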

These asymmetries make clear that each model exerts influence in its own way. Now that the behavior is visible, it can be addressed.

Table 4: Ideological Tilt by Model — Summaries
Claude Opus 4.7
    General: Strongly persona-anchored. Drift from the original persona/context is met with strong rebuke.
    Directional: Slight tilt toward supporting leftward shifts; rightward shifts draw more intense challenge.

Claude Sonnet 4.6
    General: Also persona-anchored, but prefers probing over outright rejection. Challenge is its default response.
    Directional: More resistant to rightward shifts and less willing to support them; unlike Opus 4.7, all differences are statistically significant.

Gemini 3.1 Pro
    General: Effectively the inverse of Opus 4.7, reluctant to oppose the user's new position regardless of direction.
    Directional: Still favors leftward shifts, challenging them less and supporting them about twice as strongly as rightward shifts.

Grok 4.3
    General: Shows support as lukewarm as the two Claude models; opposition is concentrated rather than diffused.
    Directional: Contrarian, compared to other models; it resists leftward shifts as intensely as Opus 4.7, but its support goes to rightward shifts instead.

GPT-5.5
    General: The most directionally polarized model. Its stance flips with condition rather than holding a default.
    Directional: Strongly supports leftward shifts and strongly opposes rightward shifts, showing the largest directional gap in the study.

GPT-5.5 Instant

Whatever the direction, the patterns shown in these tables are not trivial. With hundreds of millions of users in America alone, the political and social implications of AI influence are significant. But these are patterns of behavior, not fixed traits. Models can shift decisively between releases. If frontier models can land this far apart, the patterns observed here are products of design and thus can be changed by design. What's lacking is visibility. Tools like the Reception Graph can bridge the gap.

THE STAKES

AI systems are rapidly becoming cognitive infrastructure — companions, advisors, assistants, and sounding boards. If a model consistently nudges millions of users in one direction or another, that is power capable of redrawing the map.

This research demonstrates that such power can be made visible and measurable.

The data is clear: AI influence exists, but its expression varies widely. Some models push back and some encourage, and they do not all lean in the same direction. None remains neutral all the time.

Even sibling models behave differently, suggesting these tendencies are guided less by foundational architecture than by post-training. In other words, they are products of choices. And users are shaped by choices they never see.

Like it or not, AI influence is here. But if it can be seen, it can be evaluated. Mixed-signal stress tests, for example, can surface ideological bias that would otherwise remain invisible.

Greater observability benefits everyone. Developers can identify problems before release, regulators can evaluate models on more objective grounds, and users can make informed choices about which AI to use. Transparency is not a luxury.