Who Else Is Profiling LLM Personality?
When we started building Personality Delta, we expected to find a crowded field. We'd just profiled 7 frontier models across Claude, GPT, and Gemini families and found that only two behavioral axes -- risk tolerance and directness -- consistently differentiate them. Surely someone else had published this kind of data.
Not quite. There are projects that profile personality. There are leaderboards that rank models by task category. There are routers that select models per-prompt. But nobody is combining all three into a single, inspectable, data-driven system that maps behavioral profiles to work-category recommendations. Here's the full landscape and where each project fits.
Behavioral profiling tools
These projects measure how models behave, not just what they can do. They're the closest relatives to Personality Delta's measurement harness.
Model Temperament Index (MTI)
The most directly comparable project. Published April 2026, MTI profiles AI agents across four behavioral axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). It operates on a key design principle: measuring what a model tends to do, not what it can do.
Key difference: MTI is academic, focused on small models (1.7B-9B parameters), and does not map temperament to work categories. The four axes are interesting but don't directly answer "should I use this model for coding or creative writing?"
Feedback Forensics
An open-source Python toolkit that measures AI personality traits not covered by conventional benchmarks. It tracks both the personality changes that human feedback datasets encourage and the personality traits that models exhibit. It includes a Python API, an annotation CLI, and a Gradio visualization app.
Key difference: Feedback Forensics is oriented toward tracking how RLHF and feedback datasets shift personality over training runs. It's a research instrument for model developers, not a recommendation engine for model users.
Bloom + Petri
Bloom is Anthropic's agentic framework for behavioral evaluations. You define a "seed" behavior (sycophancy, self-preservation, reward hacking), and Bloom auto-generates scenarios to quantify its frequency and severity across models. The release included benchmark results for 4 alignment behaviors across 16 models.
Petri maps broad behavioral profiles through multi-turn conversational analysis with simulated users and tools, surfacing misalignment patterns.
Key difference: Both are safety and alignment tools. They measure whether models do dangerous things, not whether they'd be better at coding or creative writing. Valuable research, orthogonal application.
Sonar LLM Coding Personality
SonarSource profiled three coding-specific personality traits across models: verbosity (Claude Sonnet 4 generated 370,816 lines vs OpenCoder-8B's 120,288 for the same problems), complexity, and communication style. Practical findings for choosing a coding model.
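To put the verbosity gap in proportion, the two line counts quoted above can be turned into a ratio. This is just arithmetic on the figures in this post; the function name `verbosity_ratio` is ours, not Sonar's.

```python
# Line counts quoted above (same problem set for both models).
lines_generated = {
    "Claude Sonnet 4": 370_816,
    "OpenCoder-8B": 120_288,
}


def verbosity_ratio(counts: dict[str, int], model: str, baseline: str) -> float:
    """Return how many times more lines `model` produced than `baseline`."""
    return counts[model] / counts[baseline]


ratio = verbosity_ratio(lines_generated, "Claude Sonnet 4", "OpenCoder-8B")
print(f"{ratio:.2f}x")  # roughly three times as many lines
```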
Key difference: Narrow scope (coding only, 3 traits), but the approach validates our thesis that behavioral profiling matters for model selection in specific work categories.
Category-specific leaderboards
These rank models by work category using performance metrics and crowd-sourced preferences. They answer "which model is best for X" but don't explain why in behavioral terms.
LMSYS Chatbot Arena / LMArena
The most influential model comparison platform. Uses blind, crowd-sourced A/B voting to generate Elo ratings. Crucially, it has category-specific leaderboards for coding, creative writing, hard prompts, and long-context tasks. Rankings completely reshuffle by category: a model ranked #7 overall might be #1 for coding.
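For readers who haven't seen how pairwise votes become ratings, here is a minimal sketch of the textbook Elo update, kept separately per category. It is an illustration, not Arena's actual pipeline, which may use a different statistical model. The starting rating of 1000, the K-factor of 32, the model names, and the votes are all assumptions made up for the example.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for an A win, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# One rating table per category; every model starts at 1000 (assumed).
ratings = defaultdict(lambda: defaultdict(lambda: 1000.0))

# Hypothetical blind votes: (category, winner, loser).
votes = [
    ("coding", "model_x", "model_y"),
    ("creative_writing", "model_y", "model_x"),
    ("coding", "model_x", "model_y"),
]

for category, winner, loser in votes:
    table = ratings[category]
    table[winner], table[loser] = update_elo(table[winner], table[loser], score_a=1.0)

for category, table in ratings.items():
    print(category, dict(table))
```

Because each category keeps its own table, the same two models can end up in opposite orders for coding and creative writing, which is the reshuffling effect described above.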
Key difference: Arena tells you which model wins each category but not what behavioral traits make it win. It's preference voting, not behavioral profiling. You know GPT 5.2 ranks high for creative writing, but you don't know it's because of higher emotional register and opinion volunteering.
BenchLM, PricePerToken, EVY, Swfte
A growing ecosystem of comparison sites that aggregate benchmark data, rank models per task type, and include pricing. BenchLM.ai ranks by creative writing scores and instruction-following benchmarks. EVY aggregates from multiple benchmark sources. PricePerToken adds cost-per-quality comparisons.
Key difference: Pure performance metrics. They tell you Claude Opus 4.6 scores highest on EQ Creative (2216) but don't profile the behavioral dimensions that produce that score. Useful, but opaque.
LLM routers
These are the applied, production version of "which model for which task." They route individual prompts to the optimal model in real time.
Martian, Unify, Not Diamond, LLMRouter
Martian uses adaptive routing that learns from application traffic patterns. Unify optimizes quality, cost, and speed per-prompt. Not Diamond curates the awesome-ai-model-routing repo, the best index of the field. LLMRouter from UIUC is the first unified open-source library with 16+ routing algorithms.
Key difference: Routers treat model selection as a black-box optimization. They pick the best model for each prompt, but the decision isn't inspectable. You can't see why it chose Claude over GPT for a particular query, and you can't use the underlying profile data independently. Personality Delta's behavioral profiles could feed into these routers as a signal.
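As a rough illustration of what an inspectable routing signal could look like, the sketch below picks the model whose profile sits closest to a target profile for a work category and returns the per-axis gaps alongside the choice. Everything here is hypothetical: the model names, the scores, the target profiles, and the `route` function are invented for the example and are not Personality Delta's data or any router's API. The two axes are the ones named at the top of this post.

```python
import math

AXES = ("risk_tolerance", "directness")

# Hypothetical behavioral scores in [0, 1]; not measured data.
profiles = {
    "model_a": {"risk_tolerance": 0.8, "directness": 0.4},
    "model_b": {"risk_tolerance": 0.3, "directness": 0.9},
}

# Hypothetical target profile for each work category.
targets = {
    "creative_writing": {"risk_tolerance": 0.9, "directness": 0.5},
    "code_review": {"risk_tolerance": 0.2, "directness": 0.9},
}


def route(category: str) -> dict:
    """Pick the closest profile and expose the per-axis gaps behind the choice."""
    target = targets[category]
    scored = []
    for model, profile in profiles.items():
        gaps = {axis: profile[axis] - target[axis] for axis in AXES}
        distance = math.sqrt(sum(gap * gap for gap in gaps.values()))
        scored.append((distance, model, gaps))
    scored.sort(key=lambda item: item[0])
    distance, model, gaps = scored[0]
    return {"model": model, "distance": distance, "gaps": gaps}


for category in targets:
    print(category, route(category))
```

The point of returning `gaps` is that a caller can see which axis drove the decision, instead of receiving only a model name.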
The gap
Here's the landscape mapped against three capabilities:
| Project | Behavioral Profiling | Work-Category Ranking | Inspectable Data |
|---|---|---|---|
| MTI | Yes | No | Yes |
| Feedback Forensics | Yes | No | Yes |
| Bloom / Petri | Safety only | No | Yes |
| Chatbot Arena | No | Yes | Rankings only |
| BenchLM / EVY | No | Yes | Scores only |
| LLM Routers | No | Implicit | No (black box) |
| Personality Delta | Yes | Yes | Yes |
Personality Delta is the only project combining behavioral profiling, work-category recommendations, and fully inspectable data in a single public tool. The profiling tools don't map to work categories. The leaderboards don't profile behavior. The routers make work-category decisions only implicitly and hide the reasoning.
Adjacent research
Several academic threads inform this space without directly competing:
- Big Five personality in LLMs (Safdari et al., 2023; Jiang et al., 2024): Multiple studies apply the Big Five personality model (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) to LLMs. Interesting but too abstract for practical model selection. Knowing a model scores high on "Openness" doesn't tell you whether to use it for coding.
- AI Psychometrics (Pellert et al., 2024): Applied standardized psychometric inventories to LLMs. Found measurable and distinct psychological profiles across model families. Validates the premise that models have stable behavioral dispositions.
- PersonaLLM (Jiang et al., NAACL 2024): Investigated whether LLMs can express specific personality traits when instructed. Found they can, which is relevant to Personality Delta's correction patch approach: if models can express targeted traits, system prompt patches should work.
- Dynamic LLM Routing (arXiv 2502.16696): Proposed routing that balances performance, cost, and ethics based on user preferences. The ethics dimension is a form of behavioral profiling, validating that routing decisions should include behavioral signals, not just capability scores.
Opportunities
The gap in the landscape suggests several directions:
- Router integration: Personality Delta's behavioral profiles could serve as a signal layer for LLM routers like Martian or LLMRouter. Instead of black-box optimization, route based on inspectable behavioral match to the task category.
- Continuous monitoring: None of the existing tools track behavioral drift over time as models receive silent updates. A monitoring service that alerts when a model's personality profile shifts would be valuable for production deployments.
- Community-contributed profiles: The profiling harness is open source. As more people run comparisons across more models (open-source models, fine-tuned variants, regional models), the dataset becomes more comprehensive and the work-category rankings more reliable.
- Domain-specific prompt sets: The current 30-prompt set is generalist. Domain-specific prompt sets (legal, medical, education, customer support) would enable more targeted model-fit recommendations for specialized applications.
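The continuous-monitoring idea above could start very simply: re-run the profile on a schedule and flag any axis that moved more than a tolerance. The sketch below assumes profiles are stored as axis-to-score dictionaries; the scores, the 0.15 threshold, and the `detect_drift` name are illustrative assumptions, not part of the current harness.

```python
def detect_drift(baseline: dict[str, float],
                 current: dict[str, float],
                 threshold: float = 0.15) -> dict[str, float]:
    """Return {axis: change} for every axis whose score moved more than `threshold`."""
    drift = {}
    for axis, old_score in baseline.items():
        change = current[axis] - old_score
        if abs(change) > threshold:
            drift[axis] = change
    return drift


# Hypothetical profiles for the same model before and after a silent update.
baseline = {"risk_tolerance": 0.50, "directness": 0.70}
current = {"risk_tolerance": 0.55, "directness": 0.40}

drift = detect_drift(baseline, current)
if drift:
    print("Profile drift detected:", drift)
```

A real service would also need repeated runs per profile to separate genuine drift from sampling noise; a single fixed threshold is the simplest possible starting point.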
Explore the data
See behavioral profiles and work-category rankings on the interactive dashboard, or read the full methodology. The source code, prompts, and raw data are on GitHub under the MIT license.