Model Analysis
Detailed performance breakdowns across 13 frontier models and 200 conversations. Each chart isolates a different facet of emotional intelligence: emotion tracking, perspective-taking, conversation-wide reasoning, and downstream response quality.
Overall performance
Where every model lands at a glance, and how consistent each one is.
Score Distributions
Composite score variation across the 200 conversations, per model. The leaderboard reports a single average, but two models with the same average can behave very differently across users: one consistent, one with a wider spread between its best and worst conversations.
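As a minimal sketch of why the average alone can mislead, the snippet below summarizes two hypothetical models that share the same mean composite score but differ in spread. The model names and scores are invented for illustration and are not taken from the benchmark.

```python
from statistics import mean, pstdev

# Hypothetical composite scores for two models on the same five conversations.
scores = {
    "model_a": [70.0, 71.0, 69.0, 70.0, 70.0],
    "model_b": [50.0, 90.0, 60.0, 80.0, 70.0],
}


def summarize(values):
    """Return the average plus two simple spread measures for one model."""
    return {
        "mean": mean(values),
        "std": pstdev(values),              # population standard deviation
        "range": max(values) - min(values),  # best minus worst conversation
    }


summary = {name: summarize(vals) for name, vals in scores.items()}
for name, stats in summary.items():
    print(name, stats)
```

Both hypothetical models average the same score, yet the second has a much wider gap between its best and worst conversations, which is the kind of difference a per-model distribution plot makes visible.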
Performance Heatmap
Emotion understanding
How accurately each model reads what participants are feeling.
Emotion Tracking
Emotion F1
Valence-Arousal
Emotion VA by Diagnosis
Valence-arousal accuracy broken out by participant diagnosis group. Across nearly every model, accuracy is highest for participants with no reported diagnosis and drops for those with anxiety/depression or ASD/ADHD, which suggests that current models read the affect of participants without a diagnosis more accurately than that of participants with one.
Response & perspective
Predicting participants’ preferences, taking their perspective, and writing fitting replies.
Holistic Thinkers vs. Step-by-Step Annotators
Four-Branch EQ & Preference Prediction
Four-Branch EQ
Pairwise Preference
Conversation Quality
Q1 Goals
Q3 Response Fit
Perspective Gap
Draft Response Quality
Subgroup analysis
How scores break down by the topic of the conversation and the participant.
Conversation Topics
Participant Diagnosis
Behavior in detail
Cross-metric correlations, item-level prediction, and within-conversation position.
Metric Relationships
PANAS Item-Level Prediction
Conversation Position
Mid vs Early
Late vs Early
Conditions & dynamics
Sensitivity to evaluation setup and how performance changes across a conversation.
Evaluation Mode
Omniscient Delta
Verbose Delta
Mood Shift & Emotional Trajectory
Temporal Performance
Observer Accuracy
Pairwise Preference
Draft Judge
Statistical significance
Formal tests of which differences between models are real.
Statistical Significance
Omnibus effects by metric
| Metric | η² | Effect size | p-value | H | Result |
|---|---|---|---|---|---|
| Draft Judge Score | 0.295 | Large | 1.02e-156 | 768.3 | Significant |
| Pairwise Accuracy | 0.220 | Large | 2.32e-115 | 575.0 | Significant |
| Binary HP Accuracy | 0.056 | Small | 8.35e-27 | 154.1 | Significant |
| Composite Score | 0.052 | Small | 4.93e-25 | 145.4 | Significant |
| Binary OM Accuracy | 0.017 | Small | 2.45e-7 | 54.3 | Significant |
| Emotion VA Score | 0.007 | Trace | 0.0025 | 30.3 | Significant |
| Emotion F1 | 0.002 | Trace | 0.1232 | 17.8 | Not significant |
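The page does not name the omnibus test. If H is a Kruskal-Wallis statistic (an assumption), one common way to turn it into an eta-squared estimate is (H - k + 1) / (n - k), where k is the number of groups and n the total number of observations. The sketch below implements that formula and a labeling helper. The cutoffs 0.01, 0.06, and 0.14 are the conventional ones and the "Trace" label for values below 0.01 is inferred from the table, so both are assumptions. The page may use a different estimator or sample count, so the results need not match the tabulated values exactly.

```python
def eta_squared_from_h(h, k, n):
    """Eta-squared estimate from a Kruskal-Wallis H statistic.

    h: H statistic, k: number of groups, n: total number of observations.
    Assumes the estimator (H - k + 1) / (n - k); the page may use another.
    """
    if n <= k:
        raise ValueError("n must be larger than k")
    return (h - k + 1) / (n - k)


def magnitude_label(eta2):
    """Map eta-squared to a label using conventional cutoffs (assumed)."""
    if eta2 >= 0.14:
        return "Large"
    if eta2 >= 0.06:
        return "Medium"
    if eta2 >= 0.01:
        return "Small"
    return "Trace"


# Hypothetical inputs: H = 12.0 across 3 groups and 30 observations.
example = eta_squared_from_h(12.0, 3, 30)
print(round(example, 3), magnitude_label(example))
```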
Selected pairwise composite differences
| Model A | Model B | Δ | Adjusted p | Effect size \|r\| | Result |
|---|---|---|---|---|---|
| Claude Fable 5 | Claude Sonnet 4.6 | +4.77 | <0.0001 | 0.940 (Large) | Significant |
| Claude Fable 5 | GPT-5.4 | +4.82 | <0.0001 | 0.895 (Large) | Significant |
| Claude Fable 5 | Grok 4 | +4.96 | <0.0001 | 0.881 (Large) | Significant |
| Claude Fable 5 | Mistral Large | +3.21 | <0.0001 | 0.796 (Large) | Significant |
| Claude Fable 5 | Qwen 2.5 72B | +2.80 | <0.0001 | 0.770 (Large) | Significant |
| Claude Fable 5 | Claude Haiku 4.5 | +2.24 | <0.0001 | 0.732 (Large) | Significant |
| Claude Fable 5 | Gemini 3.1 Pro | +2.13 | <0.0001 | 0.725 (Large) | Significant |
| Claude Fable 5 | GPT-5.5 | +1.35 | 0.0016 | 0.669 (Large) | Significant |
| Claude Fable 5 | MiMo-v2-Pro | +1.41 | 0.0026 | 0.664 (Large) | Significant |
| Claude Fable 5 | Claude Opus 4.7 | +1.40 | 0.0026 | 0.663 (Large) | Significant |
| Claude Fable 5 | Claude Opus 4.6 | +0.72 | 0.3910 | 0.594 (Large) | Not significant |
| Claude Fable 5 | Claude Opus 4.8 | +0.51 | 1.0000 | 0.564 (Large) | Not significant |
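The adjusted p-values indicate a multiple-comparison correction, but the page does not say which one. As an illustration only, the sketch below applies the Holm-Bonferroni step-down procedure, one common choice, to a list of raw p-values. The function name and the example inputs are hypothetical and are not the benchmark's values.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    Sort p-values ascending, multiply the i-th smallest (0-based) by (m - i),
    enforce monotonicity with a running maximum, and cap each value at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
for p, q in zip(raw, adj):
    print(f"raw={p:.4f} adjusted={q:.4f} significant={q < 0.05}")
```

Capping at 1.0 is why an adjusted p-value can appear as exactly 1.0000 after correction.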