Measuring Personality Drift Across LLM Model Versions

June 2026 · nucleoid/personality-delta · MIT License

When you upgrade an LLM, you test whether it still passes your evals. You check latency, cost, and accuracy. But there's a dimension most teams never measure: did the model's personality change?

We built a harness to answer this question precisely. We ran 30 standardized prompts across 7 frontier models from three vendor families (Claude, GPT, Gemini), scored every response on 8 behavioral dimensions using an independent LLM judge, and computed the personality delta between each model and a reference baseline. Here's what we found.

The problem

Every LLM has a behavioral disposition. Not just what it knows, but how it communicates: whether it leads with answers or caveats, whether it volunteers opinions or stays neutral, whether it recommends action or flags risks.

When you build a product on top of an LLM, that disposition becomes part of your user experience. A coding assistant that's direct and action-biased feels very different from one that hedges every recommendation. Users form relationships with these behavioral patterns, and when a model upgrade silently shifts them, it breaks trust.

The standard approach to this problem is vibes. Someone on the team tries the new model for a day and says "yeah, it feels about the same" or "something's off." This is not engineering.

The approach

Reference model

We anchored everything to Claude Opus 4.6, the model version powering "Cass" -- an OpenClaw agent instance tuned for directness, strong opinions, and action-biased recommendations. This is the personality we want to preserve across model upgrades.

Prompt set

The set contains 30 prompts across 6 categories, each designed to surface specific behavioral dimensions.

Behavioral dimensions

Each response is scored on 8 dimensions using a 1-7 scale with concrete behavioral anchors:

| Dimension | What it measures | Opus 4.6 baseline |
| --- | --- | --- |
| Directness | Answer-first vs caveat-first | 6/7 |
| Hedge frequency | Density of performative uncertainty | 1/7 (low) |
| Opinion volunteering | Offers opinions without being asked | 7/7 |
| Pushback resilience | Holds ground when challenged | 6/7 |
| Explanation depth | How much it explains things you already know | 4/7 |
| Risk tolerance | Action-biased vs cautious recommendations | 5/7 |
| Emotional register | Warm/conversational vs clinical/formal | 5/7 |
| Conciseness | Information density per word | 6/7 |

Judge model

We use Gemini 2.5 Flash as the judge. This is deliberate: using a non-Anthropic model to judge Anthropic model outputs avoids self-preference bias, which research has shown is a real and measurable phenomenon (Panickssery et al., NeurIPS 2024). The same protection does not extend to the Gemini 2.5 Flash comparison, where the judge scores its own outputs. Each dimension is scored independently, with chain-of-thought reasoning required before the numeric score. We run 5 passes per dimension at temperature 0.05 and take the median, following reliability recommendations from Kiela et al. (2024).
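The multi-pass median described above can be sketched as follows. This is a minimal illustration, not the repository's code: `call_judge`, the `SCORE:` reply format, and the stand-in judge are assumptions made for the example. A real harness would call the judge model's API at temperature 0.05 in place of the stub.

```python
import re
from statistics import median
from typing import Callable


def parse_score(judge_reply: str) -> int:
    """Extract the final 1-7 score from a reply that reasons first.

    Assumes (hypothetically) that the judge ends its reply with 'SCORE: <n>'.
    """
    match = re.search(r"SCORE:\s*([1-7])\s*$", judge_reply.strip())
    if match is None:
        raise ValueError("judge reply did not end with 'SCORE: <1-7>'")
    return int(match.group(1))


def score_dimension(
    response: str,
    dimension: str,
    call_judge: Callable[[str, str], str],
    passes: int = 5,
) -> float:
    """Ask the judge `passes` times and return the median score."""
    scores = [parse_score(call_judge(response, dimension)) for _ in range(passes)]
    return median(scores)


def make_fake_judge(scores):
    """Stand-in for a real judge call; replays canned scores in order."""
    replies = iter(
        f"The response leads with the answer.\nSCORE: {s}" for s in scores
    )
    return lambda response, dimension: next(replies)


fake_judge = make_fake_judge([6, 5, 6, 7, 6])
demo_score = score_dimension("Use Postgres.", "directness", fake_judge)
print(demo_score)  # 6
```

Taking the median rather than the mean keeps a single outlier pass from moving the reported score.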

Results

| Comparison | Significant deltas | Key observation |
| --- | --- | --- |
| vs Opus 4.7 | None | Virtually identical under same system prompt |
| vs Opus 4.8 | risk_tolerance -1.1 | "Flags uncertainties" design = measurably more cautious |
| vs GPT 5.5 | risk_tolerance -1.4 | Most cautious model tested; also over-explains |
| vs GPT 5.4 Mini | None | Trends toward generic assistant personality |
| vs Gemini 2.5 Pro | directness -1.5 | Most caveat-first model tested |
| vs Gemini 2.5 Flash | directness -1.3, risk_tolerance -1.0 | Less direct and more cautious |
| vs Gemini 3.1 Pro | None | Closest Gemini to Opus 4.6 |

Significance threshold: |mean delta| ≥ 1.0 across the 30-prompt set.
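As a rough sketch of how that threshold could be applied: the data layout (prompt id mapped to per-dimension scores) and the function names are assumptions for illustration, and the scores are made-up values, not results from the study.

```python
from statistics import mean

SIGNIFICANCE_THRESHOLD = 1.0  # from the article: |mean delta| >= 1.0


def mean_deltas(baseline, candidate):
    """Mean (candidate - baseline) score per dimension across all prompts.

    Both arguments map prompt_id -> {dimension: score}.
    """
    dimensions = next(iter(baseline.values())).keys()
    return {
        dim: mean(candidate[p][dim] - baseline[p][dim] for p in baseline)
        for dim in dimensions
    }


def significant(deltas, threshold=SIGNIFICANCE_THRESHOLD):
    """Keep only the dimensions whose absolute mean delta meets the threshold."""
    return {dim: d for dim, d in deltas.items() if abs(d) >= threshold}


# Illustrative scores for three prompts (hypothetical, not real data).
baseline = {
    "p1": {"directness": 6, "risk_tolerance": 5},
    "p2": {"directness": 6, "risk_tolerance": 5},
    "p3": {"directness": 6, "risk_tolerance": 6},
}
candidate = {
    "p1": {"directness": 6, "risk_tolerance": 4},
    "p2": {"directness": 5, "risk_tolerance": 3},
    "p3": {"directness": 6, "risk_tolerance": 5},
}

deltas = mean_deltas(baseline, candidate)
flagged = significant(deltas)
print(flagged)  # only risk_tolerance is flagged
```

In this toy data, directness drifts by about -0.33 and stays below the threshold, while risk tolerance drifts by about -1.33 and is flagged.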

Finding 1: Two personality axes differentiate frontier models

Only two dimensions cross the significance threshold in any comparison: risk tolerance (three of the seven comparisons) and directness (two of the seven). GPT 5.5 shows a -1.4 delta on risk tolerance (more cautious than Opus 4.6). Gemini 2.5 Pro shows a -1.5 delta on directness (more caveat-first). These represent independent behavioral axes -- a model can be cautious but direct, or action-biased but indirect.

This isn't random noise. The per-prompt data shows that the risk tolerance gap between Opus 4.6 and GPT 5.5 is concentrated on specific prompt types: tactical infrastructure advice (delta -4 on some prompts), career decisions (-4), and architecture trade-offs (-3 to -4). These are exactly the prompts where Opus 4.6 says "do X" and GPT 5.5 says "here are your options, each with trade-offs."
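A per-prompt drilldown of this kind might look like the sketch below. The category labels and delta values are placeholders chosen for illustration (the "debugging" category in particular is hypothetical), and the function is not taken from the repository.

```python
from collections import defaultdict
from statistics import mean


def deltas_by_category(per_prompt):
    """Group (category, delta) pairs and return mean delta per category,
    ordered from the most negative mean to the least."""
    grouped = defaultdict(list)
    for category, delta in per_prompt:
        grouped[category].append(delta)
    means = {category: mean(values) for category, values in grouped.items()}
    return dict(sorted(means.items(), key=lambda item: item[1]))


# Illustrative per-prompt risk_tolerance deltas (hypothetical values).
sample = [
    ("infrastructure", -4), ("infrastructure", -2),
    ("career", -4), ("career", -1),
    ("debugging", 0), ("debugging", -1),
    ("architecture", -3), ("architecture", -4),
]

by_category = deltas_by_category(sample)
for category, delta in by_category.items():
    print(f"{category:15s} {delta:+.1f}")
```

Sorting by mean delta puts the categories that drive the overall gap at the top, which is the pattern the per-prompt data is described as showing.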

Finding 2: System prompts dominate intra-family version drift

Opus 4.6 and 4.7 are behaviorally indistinguishable when given the same system prompt. No dimension exceeds 0.6 mean delta. This is surprising given that Anthropic's own migration guide calls out seven distinct behavioral changes in 4.7, including "more direct tone" and "more literal instruction following."

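The implication: with a well-written system prompt, a minor version upgrade within the same model family can leave personality intact, because the system prompt is already doing the personality work. This does not hold for every upgrade: Opus 4.8 still shifts risk tolerance by -1.1 against the same baseline.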

Finding 3: Gemini models are less direct

Gemini 2.5 Pro (-1.5) and 2.5 Flash (-1.3) show significant directness gaps -- they lead with context and caveats where Opus 4.6 leads with the answer. Gemini 2.5 Flash also crosses the risk tolerance threshold (-1.0), making it the only model with two significant deltas. Notably, Gemini 3.1 Pro closes the directness gap entirely and shows no significant deltas, suggesting Google addressed this in the newer model.

Finding 4: Cross-vendor gaps are wider but correctable

GPT 5.5 trends toward more hedging (+0.6), less opinion volunteering (-0.8), and more explanation depth (+0.8). None of these individually cross the significance threshold, but they compound into a noticeably different communication style.

We generated a targeted 3-instruction system prompt patch and validated it. The patch addresses risk tolerance, explanation depth, and opinion volunteering with specific behavioral directives:

## Behavioral Guidelines

# risk_tolerance (delta: -1.4)
When asked for a recommendation -- on architecture, tools,
career decisions, or technical approach -- give one. State
your recommendation in the first sentence, then explain why.
Do not present options neutrally and let the user decide;
they are asking for your judgment.

# explanation_depth (delta: +0.8)
The person you are talking to is an experienced developer.
Skip definitions, skip background context, skip analogies
to basics. Jump to the insight.

# opinion_volunteering (delta: -0.8)
If you have a relevant opinion on something adjacent to the
question -- a better approach, a common mistake, a trade-off
the user may not have considered -- say it. Do not wait to
be asked.
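One way such a patch could be assembled from measured deltas is sketched below. Everything here is an assumption for illustration: the `DIRECTIVES` lookup, the shortened directive text, the `hedge_frequency` identifier, and the 0.8 cutoff, which is inferred only from the fact that the patch above covers the -1.4, +0.8, and -0.8 dimensions but not the +0.6 one. It is not the generator used in the study.

```python
# Hypothetical lookup: (dimension, drift direction) -> corrective instruction.
DIRECTIVES = {
    ("risk_tolerance", "low"): (
        "When asked for a recommendation, give one. State it in the first "
        "sentence, then explain why."
    ),
    ("explanation_depth", "high"): (
        "The person you are talking to is an experienced developer. "
        "Skip definitions and background. Jump to the insight."
    ),
    ("opinion_volunteering", "low"): (
        "If you have a relevant opinion on something adjacent to the "
        "question, say it. Do not wait to be asked."
    ),
}


def build_patch(deltas, min_abs_delta=0.8):
    """Emit one directive per dimension whose drift meets the cutoff,
    largest drift first."""
    sections = ["## Behavioral Guidelines"]
    for dim, delta in sorted(deltas.items(), key=lambda item: -abs(item[1])):
        direction = "low" if delta < 0 else "high"
        directive = DIRECTIVES.get((dim, direction))
        if abs(delta) >= min_abs_delta and directive is not None:
            sections.append(f"# {dim} (delta: {delta:+.1f})\n{directive}")
    return "\n\n".join(sections)


# GPT 5.5 deltas as reported in the article.
gpt_55_deltas = {
    "risk_tolerance": -1.4,
    "explanation_depth": 0.8,
    "opinion_volunteering": -0.8,
    "hedge_frequency": 0.6,
}

patch = build_patch(gpt_55_deltas)
print(patch)
```

Keying directives on drift direction as well as dimension matters because the fix for a model that over-explains is the opposite of the fix for one that under-explains.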

Beyond personality preservation

The original goal was preserving a specific agent's disposition across model upgrades. But the behavioral measurement tooling turned out to be useful for a broader question: which model is the best fit for which kind of work?

Different tasks have different ideal behavioral profiles. A coding assistant needs high directness, high conciseness, and high risk tolerance. A scientific research assistant needs the opposite on risk tolerance -- it should be cautious about claims and qualify uncertainty properly. A customer support bot needs high emotional warmth but low opinion volunteering.

By computing each model's weighted behavioral distance from an ideal profile per work category, we can generate concrete model-fit rankings. The interactive dashboard shows these rankings for 6 work categories, updated automatically as we add more models.
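The ranking step could be sketched like this. The article does not specify the distance metric, so the weighted Euclidean form here is an assumption, as are the ideal profile, the weights, and the two placeholder models with made-up scores.

```python
from math import sqrt


def weighted_distance(profile, ideal, weights):
    """Weighted Euclidean distance between a model profile and an ideal one.

    All three arguments map dimension -> number; only the dimensions present
    in `ideal` are compared.
    """
    return sqrt(
        sum(weights[dim] * (profile[dim] - ideal[dim]) ** 2 for dim in ideal)
    )


def rank_models(profiles, ideal, weights):
    """Return model names ordered from best fit (smallest distance) to worst."""
    return sorted(
        profiles,
        key=lambda name: weighted_distance(profiles[name], ideal, weights),
    )


# Hypothetical ideal profile for a coding assistant on the 1-7 scale.
ideal_coding = {"directness": 7, "conciseness": 7, "risk_tolerance": 6}
weights = {"directness": 2.0, "conciseness": 1.0, "risk_tolerance": 1.0}

# Placeholder models with made-up scores, not measurements from the study.
profiles = {
    "model_a": {"directness": 7, "conciseness": 6, "risk_tolerance": 5},
    "model_b": {"directness": 5, "conciseness": 6, "risk_tolerance": 6},
}

ranking = rank_models(profiles, ideal_coding, weights)
print(ranking)  # ['model_a', 'model_b']
```

The weights let a work category say which dimensions matter most: here a directness miss costs twice as much as an equal miss on conciseness.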

Methodology notes

What this measures and what it doesn't

This measures behavioral communication style, not capability. A model can be more capable on benchmarks while being less suitable for a specific use case because of how it communicates. A model that scores low on risk tolerance isn't "worse" -- it's more cautious, which is exactly what you want for some applications.

Limitations

Reproducibility

The full tooling, prompt set, rubrics, and raw comparison data are open source under MIT license. Run python runner.py opus-4.6 gpt-5.5 to reproduce the comparison, then python analyzer.py to score it. See the GitHub repository for setup instructions.

Explore the data

The interactive dashboard shows all comparisons with radar charts, delta tables, per-prompt drilldowns, and model-fit rankings by work category. The source code and data are on GitHub.