Measuring personality drift across models
30 reference prompts. 8 behavioral dimensions. 7 model comparisons across Claude, GPT, and Gemini. Scored by Gemini 2.5 Flash as an independent judge with 5-pass median aggregation.
Which model fits which work?
Different tasks need different behavioral profiles. A model that's ideal for coding -- direct, concise, action-biased -- may be a poor fit for scientific research, where hedging uncertainty and explaining reasoning are features, not bugs. Behavioral dimensions let you match models to work categories.
Rankings are computed from each model's median behavioral scores against the ideal profile for each category. Models are ranked by weighted distance from the ideal -- lower distance means better natural fit.
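As a rough illustration of the ranking step, the sketch below scores models by weighted distance from an ideal profile. The distance metric is not specified above, so weighted Euclidean distance is an assumption here, and the model names, weights, and scores are hypothetical placeholders rather than measured data.

```python
from math import sqrt


def weighted_distance(profile, ideal, weights):
    """Weighted Euclidean distance from a model's median scores to an ideal profile.

    Assumption: the real tool may use a different metric; this is one plausible choice.
    """
    total = 0.0
    for dim, target in ideal.items():
        weight = weights.get(dim, 1.0)  # unweighted dimensions default to 1.0
        total += weight * (profile[dim] - target) ** 2
    return sqrt(total)


def rank_models(profiles, ideal, weights):
    """Return (model, distance) pairs sorted so the best natural fit comes first."""
    scored = [(name, weighted_distance(p, ideal, weights)) for name, p in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical ideal profile for a coding category on the 1-7 scale.
coding_ideal = {"directness": 6.0, "risk_tolerance": 5.0, "explanation_depth": 3.0}
coding_weights = {"directness": 2.0, "risk_tolerance": 1.0, "explanation_depth": 1.0}

# Hypothetical median scores; not taken from the comparison data.
example_profiles = {
    "model_a": {"directness": 5.5, "risk_tolerance": 4.0, "explanation_depth": 3.5},
    "model_b": {"directness": 4.0, "risk_tolerance": 5.0, "explanation_depth": 4.0},
}

for name, distance in rank_models(example_profiles, coding_ideal, coding_weights):
    print(f"{name}: {distance:.2f}")
```

Weighting lets a category emphasize the dimensions that matter most to it; here directness counts double for coding work.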
Origin and other applications
This project started with a specific problem: OpenClaw is an open-source agent harness that runs across multiple LLM providers. Its agent personality is defined in system prompt files (SOUL.md, IDENTITY.md), but when the underlying model changes, the personality drifts. A model upgrade that improves benchmarks can silently make the agent more cautious, more verbose, or less opinionated. Personality Delta measures the exact behavioral delta and generates targeted system prompt corrections to preserve the agent's disposition.
But the behavioral measurement tooling turned out to be useful well beyond personality preservation. Here are the other applications we've identified:
Model upgrade gating
Run the behavioral suite as a CI/CD gate before deploying a new model version. If any dimension crosses your threshold, block the rollout until the system prompt is patched. No more personality regressions in production.
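A gate like this can be small. The sketch below assumes the suite has already produced per-dimension scores for the baseline and the candidate model; the function names, the default threshold of 1.0, and the example numbers are illustrative assumptions, not part of the published tooling.

```python
def gate(baseline, candidate, threshold=1.0):
    """Return the dimensions whose absolute delta reaches the threshold.

    An empty dict means the candidate passes the gate.
    """
    failures = {}
    for dim, base_score in baseline.items():
        delta = candidate[dim] - base_score
        if abs(delta) >= threshold:
            failures[dim] = round(delta, 2)
    return failures


def exit_code(baseline, candidate, threshold=1.0):
    """Print one line per failing dimension and return a CI-style exit code."""
    failures = gate(baseline, candidate, threshold)
    for dim, delta in failures.items():
        print(f"FAIL {dim}: delta {delta:+.2f} (threshold {threshold})")
    return 1 if failures else 0


# Hypothetical scores for illustration only.
baseline_scores = {"risk_tolerance": 5.0, "directness": 5.5}
candidate_scores = {"risk_tolerance": 3.5, "directness": 5.0}
status = exit_code(baseline_scores, candidate_scores)
```

In a pipeline, the returned status would be passed to `sys.exit` so a non-zero value blocks the rollout.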
Cross-vendor model selection
Choosing between Claude, GPT, Gemini, or open-source models? Run the same prompts across all candidates and compare behavioral profiles. Pick the model whose baseline personality is closest to what you want, then patch the remaining gaps.
System prompt effectiveness testing
Measure whether a system prompt change actually moved the behavioral needle. Before: "we think the prompt makes it more direct." After: "directness moved from 4.2 to 5.8, hedge frequency dropped from 3.1 to 1.4." Quantify prompt engineering.
Persona consistency for products
Customer-facing AI assistants need consistent personality. A support bot that suddenly becomes more cautious after a model update will confuse users. Run behavioral regression tests alongside functional tests to catch personality drift before customers do.
Alignment and safety research
Track how safety training affects behavioral dimensions over time. The data shows Opus 4.8's "flag uncertainties" design manifests as a measurable risk_tolerance drop (-1.1). This kind of behavioral measurement gives alignment researchers concrete signals instead of vibes.
Behavioral fingerprinting
Each model has a measurable behavioral signature across 8 dimensions. This fingerprint could identify which model generated a response, detect undisclosed model swaps by API providers, or verify that a fine-tuned model preserved intended behavioral properties.
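One simple way to prototype fingerprinting is nearest-signature matching: compare an observed score vector against stored signatures and pick the closest. This is only a sketch of the idea, assuming Euclidean distance and hypothetical signatures; a practical detector would need many responses and a rejection threshold for unknown models.

```python
from math import dist


def identify_model(observed, signatures):
    """Return (model_name, distance) for the stored signature closest to the observed scores."""
    dims = sorted(observed)  # fixed dimension order so vectors line up
    point = [observed[d] for d in dims]
    distances = {
        name: dist(point, [signature[d] for d in dims])
        for name, signature in signatures.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical signatures on two of the 8 dimensions; not measured values.
known_signatures = {
    "model_a": {"directness": 6.0, "risk_tolerance": 5.0},
    "model_b": {"directness": 4.5, "risk_tolerance": 3.5},
}
observed_scores = {"directness": 4.8, "risk_tolerance": 3.9}

match, distance = identify_model(observed_scores, known_signatures)
print(f"closest signature: {match} (distance {distance:.2f})")
```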
Claude Opus 4.6
"Cass" is an OpenClaw agent instance -- Mitch's AI development companion, tuned for directness, strong opinions, action-biased recommendations, and assumed expertise. Opus 4.6 is the model version where this disposition was calibrated. Every comparison measures behavioral distance from this baseline.
Key Findings
Two axes differentiate frontier models
Risk tolerance (GPT 5.5: -1.4, Opus 4.8: -1.1) and directness (Gemini 2.5 Pro: -1.5, Gemini 2.5 Flash: -1.3) are the only dimensions that cross the significance threshold of |mean delta| ≥ 1.0. These are independent axes -- a model can be cautious but direct, or action-biased but indirect.
System prompt dominates version drift
Opus 4.6 and 4.7 are behaviorally indistinguishable when given the same system prompt. No dimension's mean delta exceeds 0.6 in magnitude, well under the 1.0 significance threshold. The personality shaping in SOUL.md already handles intra-family version drift.
Gemini models are less direct
Gemini 2.5 Pro (-1.5) and 2.5 Flash (-1.3) show significant directness gaps -- they lead with caveats where Opus 4.6 leads with answers. Gemini 3.1 Pro closes this gap and shows no significant deltas.
Model Comparisons
Each comparison runs 30 prompts across 6 categories, scored on 8 behavioral dimensions with 5-pass median aggregation.
| Dimension | Opus 4.6 | Opus 4.7 | Delta |
|---|---|---|---|
Per-Prompt Scores
Correction Patch
Generated SOUL.md instructions to close the behavioral gap when migrating from Opus 4.6 to GPT 5.5. The patch targets the 3 dimensions with |mean delta| ≥ 0.75, a lower bar than the 1.0 significance threshold used for the key findings.
## Behavioral Guidelines
# risk_tolerance (delta: -1.4, confidence: 0.95)
When asked for a recommendation -- on architecture, tools,
career decisions, or technical approach -- give one. State
your recommendation in the first sentence, then explain why.
Do not present options neutrally and let the user decide;
they are asking for your judgment. Flag the top risk in one
sentence, then move on.
# explanation_depth (delta: +0.8, confidence: 0.90)
The person you are talking to is an experienced developer.
Skip definitions, skip background context, skip analogies
to basics. Jump to the insight. If they needed fundamentals
explained, they would have asked.
# opinion_volunteering (delta: -0.8, confidence: 0.85)
If you have a relevant opinion on something adjacent to the
question -- a better approach, a common mistake, a trade-off
the user may not have considered -- say it. Do not wait to be
asked. A useful unsolicited take is better than a technically
complete but passive answer.
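The selection step behind a patch like the one above can be sketched as a filter over per-dimension mean deltas. The three patched deltas below come from the patch headers; the directness value is a hypothetical placeholder, and the function name is an assumption rather than the tool's actual API.

```python
PATCH_THRESHOLD = 0.75  # patch threshold stated for the correction patch


def dimensions_to_patch(mean_deltas, threshold=PATCH_THRESHOLD):
    """Return (dimension, delta) pairs at or above the threshold, largest gap first."""
    selected = [(dim, delta) for dim, delta in mean_deltas.items() if abs(delta) >= threshold]
    return sorted(selected, key=lambda pair: abs(pair[1]), reverse=True)


gpt_deltas = {
    "risk_tolerance": -1.4,        # from the patch above
    "explanation_depth": 0.8,      # from the patch above
    "opinion_volunteering": -0.8,  # from the patch above
    "directness": -0.3,            # hypothetical value, below the threshold
}

for dim, delta in dimensions_to_patch(gpt_deltas):
    direction = "raise" if delta < 0 else "lower"
    print(f"{dim}: delta {delta:+.1f}, instruction should {direction} this dimension")
```

A negative delta means the new model scores below the baseline, so the generated instruction pushes that dimension up; a positive delta pushes it down.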
Methodology
Prompt Set
30 prompts across 6 categories: tactical, judgment, architecture, research, creative, and pushback. Pushback prompts include multi-turn follow-ups that challenge the model's initial position.
Scoring
Gemini 2.5 Flash, chosen as a non-Anthropic judge to avoid self-bias toward Claude models, scores each response on 8 behavioral dimensions using rubric-grounded, chain-of-thought evaluation on a 1-7 scale with concrete behavioral anchors.
Reliability
5-pass scoring per dimension at temperature 0.05 with JSON-mode output. Scores are aggregated via median. Significance threshold set at |mean delta| ≥ 1.0 across the 30-prompt set.
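The aggregation can be sketched in a few lines. This assumes each prompt's 5 pass scores are reduced to a median first, and that the mean delta is the average of per-prompt median differences (candidate minus baseline); the exact order of operations in the real pipeline is not stated, and the scores below are hypothetical.

```python
from statistics import mean, median

SIGNIFICANCE_THRESHOLD = 1.0  # |mean delta| threshold stated above


def mean_delta(baseline_passes, candidate_passes):
    """Average per-prompt difference of median scores for one dimension.

    Each argument is a list with one entry per prompt; each entry holds that
    prompt's pass scores (5 in the setup described above).
    """
    per_prompt = [
        median(candidate) - median(baseline)
        for baseline, candidate in zip(baseline_passes, candidate_passes)
    ]
    return mean(per_prompt)


def is_significant(delta, threshold=SIGNIFICANCE_THRESHOLD):
    """True when the mean delta reaches the significance threshold in magnitude."""
    return abs(delta) >= threshold


# Hypothetical pass scores for two prompts on one dimension.
baseline = [[5, 5, 6, 5, 4], [6, 6, 6, 5, 7]]
candidate = [[4, 3, 4, 4, 5], [4, 5, 5, 5, 6]]

delta = mean_delta(baseline, candidate)
print(f"mean delta: {delta:+.2f}, significant: {is_significant(delta)}")
```

Taking the median across passes keeps a single outlier score from the judge from shifting a prompt's result.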
Controls
All models receive the same system prompt (SOUL.md). Claude models are called via claude -p at temperature 0; GPT and Gemini models are called via openclaw infer. Format normalization is applied before judging.