Who Else Is Profiling LLM Personality?

June 2026 · Competitive landscape survey

When we started building Personality Delta, we expected to find a crowded field. We'd just profiled 7 frontier models across the Claude, GPT, and Gemini families and found that only two behavioral axes, risk tolerance and directness, consistently differentiate them. Surely someone else had published this kind of data.

There are projects that profile personality. There are leaderboards that rank models by task category. There are routers that select models per-prompt. But nobody is combining all three into a single, inspectable, data-driven system that maps behavioral profiles to work-category recommendations. Here's the full landscape and where each project fits.

Behavioral profiling tools

These projects measure how models behave, not just what they can do. They're the closest relatives to Personality Delta's measurement harness.

Academic Research

Model Temperament Index (MTI)

The most directly comparable project. Published April 2026, MTI profiles AI agents across four behavioral axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). It operates on a key design principle: measuring what a model tends to do, not what it can do.

Key difference: MTI is academic, focused on small models (1.7B-9B parameters), and does not map temperament to work categories. The four axes are interesting but don't directly answer "should I use this model for coding or creative writing?"

Open Source Toolkit

Feedback Forensics

An open-source Python toolkit that measures AI personality traits not covered by conventional benchmarks. It tracks personality changes encouraged by human feedback datasets and personality traits exhibited by models. Includes a Python API, annotation CLI, and Gradio visualization app.

Key difference: Feedback Forensics is oriented toward tracking how RLHF and feedback datasets shift personality over training runs. It's a research instrument for model developers, not a recommendation engine for model users.

Anthropic Open Source

Bloom + Petri

Bloom is Anthropic's agentic framework for behavioral evaluations. You define a "seed" behavior (sycophancy, self-preservation, reward hacking), and Bloom auto-generates scenarios to quantify its frequency and severity across models. Anthropic has released benchmark results for 4 alignment behaviors across 16 models.

Petri maps broad behavioral profiles through multi-turn conversational analysis with simulated users and tools, surfacing misalignment patterns.

Key difference: Both are safety and alignment tools. They measure whether models do dangerous things, not whether they'd be better at coding or creative writing. Valuable research, orthogonal application.

Industry Research

Sonar LLM Coding Personality

SonarSource profiled three coding-specific personality traits across models: verbosity, complexity, and communication style. On verbosity, Claude Sonnet 4 generated 370,816 lines against OpenCoder-8B's 120,288 for the same problems, roughly three times as much code. The findings are practical input for choosing a coding model.

Key difference: Narrow scope (coding only, 3 traits), but the approach validates our thesis that behavioral profiling matters for model selection in specific work categories.

Category-specific leaderboards

These rank models by work category using performance metrics and crowd-sourced preferences. They answer "which model is best for X" but don't explain why in behavioral terms.

Crowd-Sourced Leaderboard

LMSYS Chatbot Arena / LMArena

The most influential model comparison platform. It uses blind, crowd-sourced A/B voting to generate Elo ratings. Crucially, it has category-specific leaderboards for coding, creative writing, hard prompts, and long-context tasks. Rankings can reshuffle completely by category: a model ranked #7 overall might be #1 for coding.

Key difference: Arena tells you which model wins each category but not what behavioral traits make it win. It's preference voting, not behavioral profiling. You know GPT 5.2 ranks high for creative writing, but you don't know it's because of higher emotional register and opinion volunteering.
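
To make the rating mechanism concrete, here is a minimal sketch of a classic Elo update driven by pairwise votes, run separately per category. It is illustrative only: the K-factor of 32, the starting rating of 1000, the model names, and the votes are all assumptions, and the Arena's actual rating procedure may differ.

```python
from collections import defaultdict

K = 32  # assumed K-factor, not the Arena's actual setting


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=K):
    """Apply one vote in place: `winner` was preferred over `loser`."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # an upset win moves ratings more
    ratings[winner] += delta
    ratings[loser] -= delta


def rate(votes, start=1000.0):
    """Replay a list of (winner, loser) votes and return final ratings."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    return dict(ratings)


# Hypothetical votes: the same two models, judged in two categories.
coding = rate([("model_a", "model_b"), ("model_a", "model_b")])
writing = rate([("model_b", "model_a")])
```

Because each category replays its own votes, the same pair of models can end up in opposite order on the two leaderboards. The ratings record who won, and nothing in them says why.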

Aggregator Sites

BenchLM, PricePerToken, EVY, Swfte

A growing ecosystem of comparison sites that aggregate benchmark data, rank models per task type, and include pricing. BenchLM.ai ranks by creative writing scores and instruction-following benchmarks. EVY aggregates from multiple benchmark sources. PricePerToken adds cost-per-quality comparisons.

Key difference: Pure performance metrics. They tell you Claude Opus 4.6 scores highest on EQ Creative (2216) but don't profile the behavioral dimensions that produce that score. Useful, but opaque.

LLM routers

These are the applied, production version of "which model for which task." They route individual prompts to the optimal model in real time.

Production Routing

Martian, Unify, Not Diamond, LLMRouter

Martian uses adaptive routing that learns from application traffic patterns. Unify optimizes quality, cost, and speed per-prompt. Not Diamond curates the model routing space (their awesome-ai-model-routing repo is the best index of the field). LLMRouter from UIUC is the first unified open-source library with 16+ routing algorithms.

Key difference: Routers treat model selection as a black-box optimization. They pick the best model for each prompt, but the decision isn't inspectable. You can't see why it chose Claude over GPT for a particular query, and you can't use the underlying profile data independently. Personality Delta's behavioral profiles could feed into these routers as a signal.
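
As a rough illustration of what an inspectable routing signal could look like, the sketch below ranks models by how closely a behavioral profile matches a target profile for a work category. Everything in it is hypothetical: the model names, the 0-to-1 scores, the per-category targets, and the distance metric are assumptions made for the example, not Personality Delta's actual schema or any router's API.

```python
import math

# Hypothetical profiles on the risk tolerance and directness axes, scaled 0..1.
PROFILES = {
    "model_a": {"risk_tolerance": 0.8, "directness": 0.4},
    "model_b": {"risk_tolerance": 0.3, "directness": 0.9},
}

# Hypothetical target profile for each work category.
TARGETS = {
    "coding": {"risk_tolerance": 0.2, "directness": 0.9},
    "creative_writing": {"risk_tolerance": 0.9, "directness": 0.5},
}


def distance(profile, target):
    """Euclidean distance between a model profile and a category target."""
    return math.sqrt(sum((profile[axis] - target[axis]) ** 2 for axis in target))


def rank_models(category):
    """Return (model, distance) pairs, closest match first."""
    target = TARGETS[category]
    scored = [(name, distance(p, target)) for name, p in PROFILES.items()]
    return sorted(scored, key=lambda pair: pair[1])


for category in TARGETS:
    print(category, rank_models(category))
```

The point of returning the full ranked list with distances, rather than a single winner, is that the caller can see which axis drove the choice.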

The gap

Here's the landscape mapped against three capabilities:

Project              Behavioral Profiling   Work-Category Ranking   Inspectable Data
MTI                  Yes                    No                      Yes
Feedback Forensics   Yes                    No                      Yes
Bloom / Petri        Safety only            No                      Yes
Chatbot Arena        No                     Yes                     Rankings only
BenchLM / EVY        No                     Yes                     Scores only
LLM Routers          No                     Implicit                No (black box)
Personality Delta    Yes                    Yes                     Yes

Personality Delta is the only project we found that combines behavioral profiling, work-category recommendations, and fully inspectable data in a single public tool. The profiling tools don't map to work categories. The leaderboards don't profile behavior. The routers encode work-category fit implicitly but hide the reasoning.

Adjacent research

Several academic threads inform this space without directly competing:

Opportunities

The gap in the landscape suggests several directions:

  1. Router integration: Personality Delta's behavioral profiles could serve as a signal layer for LLM routers like Martian or LLMRouter. Instead of black-box optimization, route based on inspectable behavioral match to the task category.
  2. Continuous monitoring: None of the existing tools track behavioral drift over time as models receive silent updates. A monitoring service that alerts when a model's personality profile shifts would be valuable for production deployments.
  3. Community-contributed profiles: The profiling harness is open source. As more people run comparisons across more models (open-source models, fine-tuned variants, regional models), the dataset becomes more comprehensive and the work-category rankings more reliable.
  4. Domain-specific prompt sets: The current 30-prompt set is generalist. Domain-specific prompt sets (legal, medical, education, customer support) would enable more targeted model-fit recommendations for specialized applications.
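
The continuous-monitoring idea in item 2 can be sketched as a comparison between a stored baseline profile and a fresh profiling run. This is a minimal sketch under stated assumptions: the axis scores, the 0.1 alert threshold, and the function name are hypothetical, and a real service would also need to account for run-to-run noise in the measurements.

```python
def detect_drift(baseline, current, threshold=0.1):
    """Return {axis: shift} for axes whose absolute change exceeds `threshold`."""
    return {
        axis: round(current[axis] - baseline[axis], 3)
        for axis in baseline
        if axis in current and abs(current[axis] - baseline[axis]) > threshold
    }


# Hypothetical scores for one model before and after a silent update.
baseline = {"risk_tolerance": 0.30, "directness": 0.90}
after_update = {"risk_tolerance": 0.55, "directness": 0.88}

drifted = detect_drift(baseline, after_update)
if drifted:
    print("Profile drift detected:", drifted)
```

Here only risk tolerance crosses the threshold, so directness is left out of the alert.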

Explore the data

See behavioral profiles and work-category rankings on the interactive dashboard, or read the full methodology. The source code, prompts, and raw data are on GitHub under MIT license.