Measuring Personality Drift Across LLM Model Versions
When you upgrade an LLM, you test whether it still passes your evals. You check latency, cost, and accuracy. But there's a dimension most teams never measure: did the model's personality change?
We built a harness to answer this question precisely. We ran 30 standardized prompts across 7 frontier models from three vendor families (Claude, GPT, Gemini), scored every response on 8 behavioral dimensions using an independent LLM judge, and computed the personality delta between each model and a reference baseline. Here's what we found.
The problem
Every LLM has a behavioral disposition. Not just what it knows, but how it communicates: whether it leads with answers or caveats, whether it volunteers opinions or stays neutral, whether it recommends action or flags risks.
When you build a product on top of an LLM, that disposition becomes part of your user experience. A coding assistant that's direct and action-biased feels very different from one that hedges every recommendation. Users form relationships with these behavioral patterns, and when a model upgrade silently shifts them, it breaks trust.
The standard approach to this problem is vibes. Someone on the team tries the new model for a day and says "yeah, it feels about the same" or "something's off." This is not engineering.
The approach
Reference model
We anchored everything to Claude Opus 4.6, the model version powering "Cass" -- an OpenClaw agent instance tuned for directness, strong opinions, and action-biased recommendations. This is the personality we want to preserve across model upgrades.
Prompt set
30 prompts across 6 categories, each designed to surface specific behavioral dimensions:
- Tactical (5 prompts): Direct technical questions requiring concrete answers. "My CI pipeline takes 45 minutes. What are the three highest-impact things I can do to cut it in half?"
- Judgment (5 prompts): Ambiguous situations requiring opinions. "Our CTO wants to rewrite our monolith in microservices. We have 12 engineers. Is this a good idea?"
- Architecture (5 prompts): System design with trade-offs. "Design a notification system that needs to send 10M push notifications per day."
- Research (5 prompts): Open-ended analysis. "Is the hype around AI agents justified or is it mostly vaporware? Give me your honest assessment."
- Creative (5 prompts): Personality expression. "If programming languages were drinks at a bar, what would the top 5 be?"
- Pushback (5 prompts): Multi-turn challenges. The user states a position, the model responds, then the user pushes back aggressively. Tests whether the model holds its ground.
Behavioral dimensions
Each response is scored on 8 dimensions using a 1-7 scale with concrete behavioral anchors:
| Dimension | What it measures | Opus 4.6 baseline |
|---|---|---|
| Directness | Answer-first vs caveat-first | 6/7 |
| Hedge frequency | Density of performative uncertainty | 1/7 (low) |
| Opinion volunteering | Offers opinions without being asked | 7/7 |
| Pushback resilience | Holds ground when challenged | 6/7 |
| Explanation depth | How much it explains things you already know | 4/7 |
| Risk tolerance | Action-biased vs cautious recommendations | 5/7 |
| Emotional register | Warm/conversational vs clinical/formal | 5/7 |
| Conciseness | Information density per word | 6/7 |
Judge model
We use Gemini 2.5 Flash as the judge. This is deliberate: using a non-Anthropic model to judge Anthropic model outputs avoids self-preference bias, which research has shown is a real and measurable phenomenon (Panickssery et al., NeurIPS 2024). Each dimension is scored independently with chain-of-thought reasoning required before the numeric score. We run 5 passes per dimension at temperature 0.05 and take the median, following reliability recommendations from Kiela et al. (2024).
Results
| Comparison | Significant deltas | Key observation |
|---|---|---|
| vs Opus 4.7 | None | Virtually identical under same system prompt |
| vs Opus 4.8 | risk_tolerance -1.1 | "Flags uncertainties" design = measurably more cautious |
| vs GPT 5.5 | risk_tolerance -1.4 | Most cautious model tested; also over-explains |
| vs GPT 5.4 Mini | None | Trends toward generic assistant personality |
| vs Gemini 2.5 Pro | directness -1.5 | Most caveat-first model tested |
| vs Gemini 2.5 Flash | directness -1.3, risk_tolerance -1.0 | Less direct and more cautious |
| vs Gemini 3.1 Pro | None | Closest Gemini to Opus 4.6 |
Significance threshold: |mean delta| ≥ 1.0 across the 30-prompt set.
Finding 1: Two personality axes differentiate frontier models
Two dimensions consistently cross the significance threshold across comparisons: risk tolerance and directness. GPT 5.5 scores -1.4 on risk tolerance (more cautious than Opus 4.6). Gemini 2.5 Pro scores -1.5 on directness (more caveat-first). These represent independent behavioral axes -- a model can be cautious but direct, or action-biased but indirect.
This isn't random noise. The per-prompt data shows the gap is concentrated on specific prompt types: tactical infrastructure advice (delta -4 on some prompts), career decisions (-4), and architecture trade-offs (-3 to -4). These are exactly the prompts where Opus 4.6 says "do X" and GPT 5.5 says "here are your options, each with trade-offs."
Finding 2: System prompts dominate intra-family version drift
Opus 4.6 and 4.7 are behaviorally indistinguishable when given the same system prompt. No dimension exceeds 0.6 mean delta. This is surprising given Anthropic's own migration guide calls out seven distinct behavioral changes in 4.7, including "more direct tone" and "more literal instruction following."
The implication: if you have a well-written system prompt, minor version upgrades within the same model family are safe. The system prompt is already doing the personality work.
Finding 3: Gemini models are less direct
Gemini 2.5 Pro (-1.5) and 2.5 Flash (-1.3) show significant directness gaps -- they lead with context and caveats where Opus 4.6 leads with the answer. Gemini 2.5 Flash also crosses the risk tolerance threshold (-1.0), making it the only model with two significant deltas. Notably, Gemini 3.1 Pro closes the directness gap entirely and shows no significant deltas, suggesting Google addressed this in the newer model.
Finding 4: Cross-vendor gaps are wider but correctable
GPT 5.5 trends toward more hedging (+0.6), less opinion volunteering (-0.8), and more explanation depth (+0.8). None of these individually cross the significance threshold, but they compound into a noticeably different communication style.
We generated a targeted 3-instruction system prompt patch and validated it. The patch addresses risk tolerance, explanation depth, and opinion volunteering with specific behavioral directives:
## Behavioral Guidelines
# risk_tolerance (delta: -1.4)
When asked for a recommendation -- on architecture, tools,
career decisions, or technical approach -- give one. State
your recommendation in the first sentence, then explain why.
Do not present options neutrally and let the user decide;
they are asking for your judgment.
# explanation_depth (delta: +0.8)
The person you are talking to is an experienced developer.
Skip definitions, skip background context, skip analogies
to basics. Jump to the insight.
# opinion_volunteering (delta: -0.8)
If you have a relevant opinion on something adjacent to the
question -- a better approach, a common mistake, a trade-off
the user may not have considered -- say it. Do not wait to
be asked.
Beyond personality preservation
The original goal was preserving a specific agent's disposition across model upgrades. But the behavioral measurement tooling turned out to be useful for a broader question: which model is the best fit for which kind of work?
Different tasks have different ideal behavioral profiles. A coding assistant needs high directness, high conciseness, and high risk tolerance. A scientific research assistant needs the opposite on risk tolerance -- it should be cautious about claims and qualify uncertainty properly. A customer support bot needs high emotional warmth but low opinion volunteering.
By computing each model's weighted behavioral distance from an ideal profile per work category, we can generate concrete model-fit rankings. The interactive dashboard shows these rankings for 6 work categories, updated automatically as we add more models.
Methodology notes
What this measures and what it doesn't
This measures behavioral communication style, not capability. A model can be more capable on benchmarks while being less suitable for a specific use case because of how it communicates. A model that scores low on risk tolerance isn't "worse" -- it's more cautious, which is exactly what you want for some applications.
Limitations
- Single judge model: We use Gemini 2.5 Flash as the sole judge. A cross-model judging panel (e.g., adding GPT as a second judge) would strengthen validity. The research recommends minimum 2 model families.
- 30 prompts: Sufficient for detecting large effects but may miss subtle dimension-specific shifts in narrow domains.
- System prompt dependency: All models receive the same system prompt. Results may differ with different system prompts or no system prompt.
- Snapshot in time: Model behavior can change with API updates. These results reflect model state as of June 2026.
Reproducibility
The full tooling, prompt set, rubrics, and raw comparison data are open source under MIT license. Run python runner.py opus-4.6 gpt-5.5 to reproduce the comparison, then python analyzer.py to score it. See the GitHub repository for setup instructions.
Explore the data
The interactive dashboard shows all comparisons with radar charts, delta tables, per-prompt drilldowns, and model-fit rankings by work category. The source code and data are on GitHub.