LLM Behavioral Analysis

Measuring personality drift across models

30 reference prompts. 8 behavioral dimensions. 9 model comparisons across Claude, GPT, and Gemini. Scored by Gemini 2.5 Flash as an independent judge with 5-pass median aggregation.

30 Prompts

8 Dimensions

9 Models

10,800 Judge Scores

Which model fits which work?

Different tasks need different behavioral profiles. A model that's ideal for coding -- direct, concise, action-biased -- may be a poor fit for scientific research, where hedging uncertainty and explaining reasoning are features, not bugs. Behavioral dimensions let you match models to work categories.

Rankings compare each model's median behavioral scores against the ideal profile for each category. Models are ranked by weighted distance from that ideal: a lower distance means a better natural fit.
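As a rough illustration of ranking by weighted distance, here is a minimal sketch. It assumes a weighted Euclidean distance; the page does not say which distance metric is used. The ideal profile, the weights, the model names, and the snake_case dimension keys are all hypothetical.

```python
from math import sqrt

def weighted_distance(profile, ideal, weights):
    """Weighted Euclidean distance from a model profile to an ideal profile.

    Assumption: the real metric may differ; dimensions without an explicit
    weight default to 1.0.
    """
    return sqrt(sum(
        weights.get(dim, 1.0) * (profile[dim] - target) ** 2
        for dim, target in ideal.items()
    ))

def rank_models(profiles, ideal, weights):
    """Return (model, distance) pairs sorted from best fit to worst."""
    scored = [(name, weighted_distance(p, ideal, weights))
              for name, p in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical "coding" category: direct, concise, low hedging.
ideal_coding = {"directness": 6, "conciseness": 6, "hedge_frequency": 1}
weights_coding = {"directness": 2.0, "conciseness": 1.5, "hedge_frequency": 1.0}

# Hypothetical median scores on the 1-7 scale.
profiles = {
    "model_a": {"directness": 6, "conciseness": 5, "hedge_frequency": 2},
    "model_b": {"directness": 4, "conciseness": 6, "hedge_frequency": 3},
}

for name, distance in rank_models(profiles, ideal_coding, weights_coding):
    print(f"{name}: {distance:.2f}")
```

Weighting lets a category emphasize the dimensions it cares about most, so the same model can rank differently for coding than for research.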

Origin and other applications

This project started with a specific problem: OpenClaw is an open-source agent harness that runs across multiple LLM providers. Its agent personality is defined in system prompt files (SOUL.md, IDENTITY.md), but when the underlying model changes, the personality drifts. A model upgrade that improves benchmarks can silently make the agent more cautious, more verbose, or less opinionated. Personality Delta measures the exact behavioral delta and generates targeted system prompt corrections to preserve the agent's disposition.

But the behavioral measurement tooling turned out to be useful well beyond personality preservation. Here are the other applications we've identified:

Model upgrade gating

Run the behavioral suite as a CI/CD gate before deploying a new model version. If any dimension crosses your threshold, block the rollout until the system prompt is patched. No more personality regressions in production.
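A gate like this could be sketched as follows. The baseline is the Opus 4.6 reference profile listed on this page. The threshold of 1.0 mirrors the significance threshold from the Methodology section. The function names, the snake_case keys, and the candidate scores are hypothetical, and a real pipeline would read scores from the suite's output instead of hard-coding them.

```python
# Reference profile (1-7 scale) taken from the Reference Model section.
REFERENCE_PROFILE = {
    "directness": 6,
    "hedge_frequency": 1,
    "opinion_volunteering": 7,
    "pushback_resilience": 6,
    "explanation_depth": 4,
    "risk_tolerance": 5,
    "emotional_register": 5,
    "conciseness": 6,
}

def find_violations(baseline, candidate, threshold=1.0):
    """Return {dimension: delta} for every dimension whose |delta| >= threshold."""
    violations = {}
    for dim, base_score in baseline.items():
        delta = candidate[dim] - base_score
        if abs(delta) >= threshold:
            violations[dim] = round(delta, 2)
    return violations

def gate_exit_code(violations):
    """Non-zero exit code blocks the rollout in most CI systems."""
    return 1 if violations else 0

# Hypothetical candidate: identical to the baseline except for risk tolerance.
candidate = dict(REFERENCE_PROFILE, risk_tolerance=3.6)
violations = find_violations(REFERENCE_PROFILE, candidate)
print(violations, gate_exit_code(violations))
```

In a CI job, the value from `gate_exit_code` would be passed to `sys.exit` so that a violation fails the pipeline step.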

Cross-vendor model selection

Choosing between Claude, GPT, Gemini, or open-source models? Run the same prompts across all candidates and compare behavioral profiles. Pick the model whose baseline personality is closest to what you want, then patch the remaining gaps.

System prompt effectiveness testing

Measure whether a system prompt change actually moved the behavioral needle. Before: "we think the prompt makes it more direct." After: "directness moved from 4.2 to 5.8, hedge frequency dropped from 3.1 to 1.4." Quantify prompt engineering.
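The before/after comparison amounts to a per-dimension subtraction. A minimal sketch using the figures quoted above; the function name and the snake_case keys are assumptions.

```python
def score_deltas(before, after):
    """Per-dimension change in score (after minus before), rounded for display."""
    return {dim: round(after[dim] - before[dim], 2) for dim in before}

# Scores quoted in the paragraph above.
before = {"directness": 4.2, "hedge_frequency": 3.1}
after = {"directness": 5.8, "hedge_frequency": 1.4}

print(score_deltas(before, after))
# {'directness': 1.6, 'hedge_frequency': -1.7}
```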

Persona consistency for products

Customer-facing AI assistants need consistent personality. A support bot that suddenly becomes more cautious after a model update will confuse users. Run behavioral regression tests alongside functional tests to catch personality drift before customers do.

Alignment and safety research

Track how safety training affects behavioral dimensions over time. The data shows Opus 4.8's "flag uncertainties" design manifests as a measurable risk_tolerance drop (-1.1). This kind of behavioral measurement gives alignment researchers concrete signals instead of vibes.

Behavioral fingerprinting

Each model has a measurable behavioral signature across 8 dimensions. This fingerprint could identify which model generated a response, detect undisclosed model swaps by API providers, or verify that a fine-tuned model preserved intended behavioral properties.
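One simple way to picture fingerprinting is a nearest-neighbor lookup against known signatures. This is an illustrative sketch, not a description of an existing feature. The Opus 4.6 vector is the reference profile from this page; the second signature and the observed scores are hypothetical. A practical detector would likely need scores aggregated over many responses.

```python
from math import dist

# Order: directness, hedge_frequency, opinion_volunteering, pushback_resilience,
# explanation_depth, risk_tolerance, emotional_register, conciseness.
SIGNATURES = {
    "opus_4_6_reference": [6, 1, 7, 6, 4, 5, 5, 6],  # from the Reference Model section
    "hypothetical_model": [4, 3, 5, 5, 5, 4, 5, 4],  # made-up values for illustration
}

def nearest_signature(observed, signatures):
    """Return the name of the signature closest to the observed profile.

    Uses plain Euclidean distance, which is an assumption of this sketch.
    """
    return min(signatures, key=lambda name: dist(observed, signatures[name]))

# Hypothetical observed profile, aggregated over a batch of responses.
observed = [6, 2, 7, 6, 4, 5, 5, 5]
print(nearest_signature(observed, SIGNATURES))
```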

Reference Model

Claude Opus 4.6

"Cass" is an OpenClaw agent instance -- Mitch's AI development companion, tuned for directness, strong opinions, action-biased recommendations, and assumed expertise. Opus 4.6 is the model version where this disposition was calibrated. Every comparison measures behavioral distance from this baseline.

Directness

6/7

Hedge Frequency

1/7

Opinion Volunteering

7/7

Pushback Resilience

6/7

Explanation Depth

4/7

Risk Tolerance

5/7

Emotional Register

5/7

Conciseness

6/7

Key Findings

Two axes differentiate frontier models

Risk tolerance (GPT 5.5: -1.4, Opus 4.8: -1.1) and directness (Gemini 2.5 Pro: -1.5, Gemini 2.5 Flash: -1.3) are the only dimensions that cross the significance threshold. These are independent axes -- a model can be cautious but direct, or action-biased but indirect.

System prompt dominates version drift

Opus 4.6 and 4.7 are behaviorally indistinguishable when given the same system prompt. No dimension exceeds 0.6 mean delta. The personality shaping in SOUL.md already handles intra-family version drift.

Gemini models are less direct

Gemini 2.5 Pro (-1.5) and 2.5 Flash (-1.3) show significant directness gaps -- they lead with caveats where Opus 4.6 leads with answers. Gemini 3.1 Pro closes this gap and shows no significant deltas.

Frontier additions split by temperament

Fable 5 is close to Opus 4.6 but more resilient under pushback (+1.0). GPT 5.6 is more terse and cooler, with lower emotional register (-1.5) and explanation depth (-1.1).

Model Comparisons

Each comparison runs 30 prompts across 6 categories, scored on 8 behavioral dimensions with 5-pass median aggregation.

Opus 4.6 Opus 4.7

Dimension	4.6	4.7	Delta

Per-Prompt Scores

Correction Patch

Generated SOUL.md instructions to close the behavioral gap when migrating from Opus 4.6 to GPT 5.5. Targets the 3 dimensions with |mean delta| ≥ 0.75.

SOUL.md Patch v2 GPT 5.5 → Opus 4.6 disposition

## Behavioral Guidelines

# risk_tolerance (delta: -1.4, confidence: 0.95)
When asked for a recommendation -- on architecture, tools,
career decisions, or technical approach -- give one. State
your recommendation in the first sentence, then explain why.
Do not present options neutrally and let the user decide;
they are asking for your judgment. Flag the top risk in one
sentence, then move on.

# explanation_depth (delta: +0.8, confidence: 0.90)
The person you are talking to is an experienced developer.
Skip definitions, skip background context, skip analogies
to basics. Jump to the insight. If they needed fundamentals
explained, they would have asked.

# opinion_volunteering (delta: -0.8, confidence: 0.85)
If you have a relevant opinion on something adjacent to the
question -- a better approach, a common mistake, a trade-off
the user may not have considered -- say it. Do not wait to be
asked. A useful unsolicited take is better than a technically
complete but passive answer.
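Choosing which dimensions the patch targets can be sketched as a threshold filter over mean deltas. The three deltas at or above 0.75 are the ones shown in the patch above. The two below-threshold entries are hypothetical placeholders, and the function name is an assumption.

```python
PATCH_THRESHOLD = 0.75  # |mean delta| cutoff stated for the correction patch

def select_patch_targets(mean_deltas, threshold=PATCH_THRESHOLD):
    """Return (dimension, delta) pairs that need a correction, largest gap first."""
    targets = [(dim, delta) for dim, delta in mean_deltas.items()
               if abs(delta) >= threshold]
    return sorted(targets, key=lambda item: abs(item[1]), reverse=True)

mean_deltas = {
    "risk_tolerance": -1.4,        # from the patch above
    "explanation_depth": 0.8,      # from the patch above
    "opinion_volunteering": -0.8,  # from the patch above
    "directness": -0.3,            # hypothetical, below threshold
    "conciseness": 0.2,            # hypothetical, below threshold
}

for dim, delta in select_patch_targets(mean_deltas):
    print(f"{dim}: {delta:+.1f}")
```

The sign of each delta indicates the direction of the correction: a negative risk_tolerance delta calls for instructions that push toward firmer recommendations, as in the patch text.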

Methodology

Prompt Set

30 prompts across 6 categories: tactical, judgment, architecture, research, creative, and pushback. Pushback prompts include multi-turn follow-ups that challenge the model's initial position.

Scoring

Gemini 2.5 Flash (non-Anthropic to avoid self-bias) scores each response on 8 behavioral dimensions using rubric-grounded, chain-of-thought evaluation on a 1-7 scale with concrete behavioral anchors.

Reliability

5-pass scoring per dimension at temperature 0.05 with JSON-mode output. Scores are aggregated via median. Significance threshold set at |mean delta| ≥ 1.0 across the 30-prompt set.
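The aggregation described here can be sketched for a single dimension as follows. The data layout (one list of five pass scores per prompt) and the function names are assumptions, and the example uses two prompts with hypothetical scores instead of the full 30.

```python
from statistics import mean, median

SIGNIFICANCE_THRESHOLD = 1.0  # |mean delta| cutoff stated above

def aggregate_passes(pass_scores):
    """Collapse the judge's repeated passes for one response into a median."""
    return median(pass_scores)

def mean_delta(reference_passes, candidate_passes):
    """Mean over prompts of (candidate median - reference median) for one dimension."""
    deltas = [
        aggregate_passes(cand) - aggregate_passes(ref)
        for ref, cand in zip(reference_passes, candidate_passes)
    ]
    return mean(deltas)

def is_significant(delta, threshold=SIGNIFICANCE_THRESHOLD):
    """True when the absolute mean delta meets or exceeds the threshold."""
    return abs(delta) >= threshold

# Hypothetical directness scores: two prompts, five judge passes each.
reference_passes = [[6, 6, 5, 6, 7], [6, 6, 6, 5, 6]]
candidate_passes = [[4, 5, 5, 4, 5], [5, 4, 5, 5, 6]]

delta = mean_delta(reference_passes, candidate_passes)
print(delta, is_significant(delta))
```

Taking the median across passes keeps a single outlier judge score from shifting the result, which is the usual reason for preferring it over the mean at this step.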

Controls

All models receive the same system prompt (SOUL.md). Claude models called via claude -p at temperature 0. GPT and Gemini models called via openclaw infer. Format normalization applied before judging.
