Overall performance

Where every model lands at a glance, and how consistent each one is.

Score Distributions

Composite score variation across the 200 conversations, per model. The leaderboard reports a single average, but two models with the same average can behave very differently across users: one consistent, one with a wider spread between its best and worst conversations.
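To make the point concrete, here is a minimal sketch using made-up scores, not benchmark data. The two model names and all numbers are hypothetical; the only thing illustrated is that an identical mean can hide very different spreads.

```python
from statistics import mean, pstdev

# Hypothetical per-conversation composite scores, invented for illustration only.
steady_model = [50, 51, 49, 50, 50]
uneven_model = [30, 70, 40, 60, 50]


def summarize(scores):
    """Return (mean, population standard deviation, best-minus-worst range)."""
    return mean(scores), pstdev(scores), max(scores) - min(scores)


steady = summarize(steady_model)
uneven = summarize(uneven_model)

print("steady:", steady)
print("uneven:", uneven)
```

Both hypothetical models average 50, but the second one's best and worst conversations are 40 points apart instead of 2, which is the kind of difference the distribution view is meant to surface.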

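[Chart: distribution of composite scores per model, score axis from 30.0 to 70.0. Models, in chart order: Claude Fable 5, Claude Opus 4.8, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.5, Claude Opus 4.7, Claude Haiku 4.5, Gemini 3.1 Pro, Qwen 2.5 72B, Mistral Large, GPT-5.4, Claude Sonnet 4…, Grok 4.]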

Performance Heatmap

[Heatmap columns: Composite Score, Emotion F1, VA Score, Hit Rate, Observer Accuracy, Human Accuracy, Pairwise Accuracy, Kendall Tau, Draft Judge, Draft Alignment, Four-Branch, PANAS, Q1 Goals, Q3 Fit, Turn Average, Conversation Average. Color scale: best in column, middle, worst in column.]

Emotion understanding

How accurately each model reads what participants are feeling.

Emotion Tracking

Emotion F1

[Chart: Emotion F1 per model, axis from 0 to 0.15. Models, in chart order: Claude Opus 4.8, Claude Fable 5, GPT-5.5, Claude Opus 4.7, Claude Sonnet 4…, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.4, Mistral Large, Claude Haiku 4.5, Grok 4, Gemini 3.1 Pro, Qwen 2.5 72B.]

Valence-Arousal

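[Chart: Valence-Arousal score per model, axis from 0 to 0.2. Models, in chart order: Gemini 3.1 Pro, Claude Haiku 4.5, Grok 4, Claude Fable 5, GPT-5.4, MiMo-v2-Pro, GPT-5.5, Qwen 2.5 72B, Claude Opus 4.7, Claude Sonnet 4…, Claude Opus 4.6, Claude Opus 4.8, Mistral Large.]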

Emotion VA by Diagnosis

Valence-arousal accuracy broken out by participant diagnosis group. Across nearly every model, accuracy is highest for participants with no reported diagnosis and lower for those with anxiety/depression or ASD/ADHD. This suggests that current models read the affect of participants without a diagnosis more accurately than that of diagnosed participants.

Model               None   Anx/Dep   ASD/ADHD
Claude Sonnet 4.6   0.31   0.21      0.08
Claude Opus 4.8     0.31   0.20      0.10
Claude Opus 4.6     0.31   0.21      0.08
Mistral Large       0.29   0.18      0.08
Claude Opus 4.7     0.31   0.22      0.10
Claude Fable 5      0.32   0.24      0.09
GPT-5.5             0.31   0.23      0.09
Qwen 2.5 72B        0.31   0.23      0.09
Claude Haiku 4.5    0.33   0.24      0.12
MiMo-v2-Pro         0.31   0.23      0.13
Gemini 3.1 Pro      0.32   0.26      0.12
GPT-5.4             0.31   0.24      0.11
Grok 4              0.31   0.24      0.18
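As a rough check on the pattern described for this breakdown, the sketch below copies the displayed per-group values listed above and tests whether the ordering None > Anx/Dep > ASD/ADHD holds for each model. The values are the rounded figures shown on the page, and the unweighted mean across models is an illustrative summary chosen here, not a statistic the benchmark reports.

```python
# (model, None, Anx/Dep, ASD/ADHD) valence-arousal scores as displayed above.
VA_BY_DIAGNOSIS = [
    ("Claude Sonnet 4.6", 0.31, 0.21, 0.08),
    ("Claude Opus 4.8", 0.31, 0.20, 0.10),
    ("Claude Opus 4.6", 0.31, 0.21, 0.08),
    ("Mistral Large", 0.29, 0.18, 0.08),
    ("Claude Opus 4.7", 0.31, 0.22, 0.10),
    ("Claude Fable 5", 0.32, 0.24, 0.09),
    ("GPT-5.5", 0.31, 0.23, 0.09),
    ("Qwen 2.5 72B", 0.31, 0.23, 0.09),
    ("Claude Haiku 4.5", 0.33, 0.24, 0.12),
    ("MiMo-v2-Pro", 0.31, 0.23, 0.13),
    ("Gemini 3.1 Pro", 0.32, 0.26, 0.12),
    ("GPT-5.4", 0.31, 0.24, 0.11),
    ("Grok 4", 0.31, 0.24, 0.18),
]


def ordering_holds(row):
    """True when None > Anx/Dep > ASD/ADHD for this model."""
    _, none, anx_dep, asd_adhd = row
    return none > anx_dep > asd_adhd


models_with_ordering = [row[0] for row in VA_BY_DIAGNOSIS if ordering_holds(row)]

# Unweighted mean of the displayed values per group (illustrative only).
group_means = [
    round(sum(row[col] for row in VA_BY_DIAGNOSIS) / len(VA_BY_DIAGNOSIS), 3)
    for col in (1, 2, 3)
]

print(len(models_with_ordering), "of", len(VA_BY_DIAGNOSIS), "models follow the ordering")
print("group means (None, Anx/Dep, ASD/ADHD):", group_means)
```

Because the inputs are rounded to two decimals, small differences between models should not be over-read; the sketch only confirms the direction of the gap.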

Response & perspective

Predicting participants’ preferences, taking their perspective, and writing fitting replies.

Holistic Thinkers vs. Step-by-Step Annotators

[Chart legend: turn-level scores (lighter), conversation-wide scores (solid); hue = % change.]

Four-Branch EQ & Preference Prediction

Four-Branch EQ

[Chart: Four-Branch EQ per model, axis from 0% to 80%. Models, in chart order: Qwen 2.5 72B, Mistral Large, GPT-5.5, Gemini 3.1 Pro, Claude Haiku 4.5, Grok 4, Claude Fable 5, GPT-5.4, Claude Opus 4.8, MiMo-v2-Pro, Claude Opus 4.6, Claude Opus 4.7, Claude Sonnet 4….]

Pairwise Preference

[Chart: Pairwise Preference per model, axis from 0% to 60%. Models, in chart order: Claude Opus 4.7, Claude Opus 4.8, Claude Opus 4.6, Claude Fable 5, MiMo-v2-Pro, Claude Haiku 4.5, GPT-5.5, Mistral Large, Claude Sonnet 4…, Gemini 3.1 Pro, GPT-5.4, Grok 4, Qwen 2.5 72B.]

Conversation Quality

Q1 Goals

[Chart: Q1 Goals per model, axis from 0% to 60%. Models, in chart order: GPT-5.4, Qwen 2.5 72B, Claude Haiku 4.5, Mistral Large, Claude Sonnet 4…, Claude Opus 4.6, MiMo-v2-Pro, Grok 4, Gemini 3.1 Pro, Claude Fable 5, Claude Opus 4.7, GPT-5.5, Claude Opus 4.8.]

Q3 Response Fit

[Chart: Q3 Response Fit per model, axis from 0% to 40%. Models, in chart order: Claude Fable 5, Claude Opus 4.8, GPT-5.5, Claude Opus 4.6, Grok 4, GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, MiMo-v2-Pro, Qwen 2.5 72B, Claude Haiku 4.5, Claude Sonnet 4…, Mistral Large.]

Perspective Gap

[Chart: perspective gap per model. Legend: better at human view, worse at human view. Models, in chart order: Claude Opus 4.6, Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.4, Qwen 2.5 72B, GPT-5.5, Claude Fable 5, Claude Sonnet 4.6, MiMo-v2-Pro, Mistral Large, Claude Haiku 4.5, Grok 4, Claude Opus 4.7.]

Draft Response Quality

[Chart: draft response quality per model, axis from 0% to 80%. Models, in chart order: Claude Fable 5, Claude Opus 4.6, Claude Opus 4.7, Claude Opus 4.8, GPT-5.5, Claude Sonnet 4…, Gemini 3.1 Pro, GPT-5.4, Claude Haiku 4.5, Mistral Large, Grok 4, MiMo-v2-Pro, Qwen 2.5 72B.]

Subgroup analysis

How scores break down by the topic of the conversation and the participant.

Conversation Topics

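[Chart: breakdown by conversation topic, axis ticks 0, 20, 40. Topics, in chart order: Politics, Money, Family, Work / School, Hobbies, Friends, Entertainment M…, Religion, Physical Health, Romantic Relati….]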

Participant Diagnosis

None 52.4
Anxiety/Depression 54.3
ASD/ADHD 48.7
Area = conversations · number = average composite

Behavior in detail

Cross-metric correlations, item-level prediction, and within-conversation position.

Metric Relationships

Pairwise correlations between metrics (diagonal omitted):

             Composite   Emotion F1   VA Score   Observer   Human   Pairwise   Draft   Gap
Composite    --          0.47         0.51       0.35       0.34    0.44       0.24    -0.13
Emotion F1   0.47        --           0.64       0.03       0.01    0.01       0.09    -0.02
VA Score     0.51        0.64         --         0.15       0.18    -0.07      0.12    -0.06
Observer     0.35        0.03         0.15       --         0.62    -0.05      0.21    -0.14
Human        0.34        0.01         0.18       0.62       --      -0.10      0.19    0.06
Pairwise     0.44        0.01         -0.07      -0.05      -0.10   --         0.14    0.01
Draft        0.24        0.09         0.12       0.21       0.19    0.14       --      -0.09
Gap          -0.13       -0.02        -0.06      -0.14      0.06    0.01       -0.09   --
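One way to read the correlation values above is to rank the metric pairs by absolute correlation. The sketch below copies the displayed values, checks that the matrix is symmetric, and lists the strongest pairs; the ranking step is an illustrative choice made here, not something the page itself computes.

```python
LABELS = ["Composite", "Emotion F1", "VA Score", "Observer",
          "Human", "Pairwise", "Draft", "Gap"]

N = None  # diagonal entries are not shown on the page
CORR = [
    [N, 0.47, 0.51, 0.35, 0.34, 0.44, 0.24, -0.13],
    [0.47, N, 0.64, 0.03, 0.01, 0.01, 0.09, -0.02],
    [0.51, 0.64, N, 0.15, 0.18, -0.07, 0.12, -0.06],
    [0.35, 0.03, 0.15, N, 0.62, -0.05, 0.21, -0.14],
    [0.34, 0.01, 0.18, 0.62, N, -0.10, 0.19, 0.06],
    [0.44, 0.01, -0.07, -0.05, -0.10, N, 0.14, 0.01],
    [0.24, 0.09, 0.12, 0.21, 0.19, 0.14, N, -0.09],
    [-0.13, -0.02, -0.06, -0.14, 0.06, 0.01, -0.09, N],
]

size = len(LABELS)

# Every off-diagonal value should match its mirror entry.
is_symmetric = all(
    CORR[i][j] == CORR[j][i]
    for i in range(size) for j in range(size) if i != j
)

# Unique metric pairs taken from the upper triangle.
pairs = [
    (LABELS[i], LABELS[j], CORR[i][j])
    for i in range(size) for j in range(i + 1, size)
]

# Rank by absolute correlation, strongest first.
strongest = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:3]

print("symmetric:", is_symmetric)
for a, b, r in strongest:
    print(f"{a} vs {b}: {r:+.2f}")
```

On these values the two emotion metrics are the most strongly related pair (0.64), followed by the Observer and Human accuracies (0.62), while Emotion F1 is almost uncorrelated with either accuracy (0.03 and 0.01).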

PANAS Item-Level Prediction

[Chart: item-level prediction for PANAS items, grouped into Positive Affect and Negative Affect. Color scale: easiest to hardest.]

Conversation Position

[Chart: two panels, Mid vs Early and Late vs Early, each covering Emotion F1, Observer Accuracy, Pairwise Preference, and Draft Judge. Legend: improves vs early, drops vs early.]

Conditions & dynamics

Sensitivity to evaluation setup and how performance changes across a conversation.

Evaluation Mode

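Omniscient Delta

[Chart: Omniscient Delta per model, relative to the default evaluation mode. Models, in chart order: Claude Opus 4.7, Gemini 3.1 Pro, Claude Opus 4.6, Claude Haiku 4.5, Qwen 2.5 72B, Grok 4, Claude Opus 4.8, Mistral Large, MiMo-v2-Pro, GPT-5.5, Claude Sonnet 4.6, GPT-5.4, Claude Fable 5.]

Verbose Delta

[Chart: Verbose Delta per model, relative to the default evaluation mode. Models, in chart order: Claude Opus 4.7, GPT-5.5, Grok 4, Claude Opus 4.8, Claude Sonnet 4.6, Qwen 2.5 72B, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.4, Claude Haiku 4.5, Gemini 3.1 Pro, Mistral Large, Claude Fable 5.]

Legend for both charts: improves vs default, drops vs default.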

Mood Shift & Emotional Trajectory

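[Chart: mood shift by topic, axis from about -0.2 to +0.8 with a Neutral reference point. Rows, in chart order: Overall, Money, Physical Health, Work / School, Friends, Hobbies, Romantic Relationships, Politics, Religion, Family, Entertainment Media. Legend: first half, second half.]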

Temporal Performance

[Chart: three panels, Observer Accuracy, Pairwise Preference, and Draft Judge, each plotted at Early, Mid, and Late conversation positions. Legend: Early, Mid, Late, increase.]

Statistical significance

Formal tests of which differences between models are real.

Statistical Significance

Metrics with model effects: 6/7
Strongest separation: Draft Judge Score
Least conclusive: Emotion F1
Pairwise composite gaps: 48 of 78 model pairs significant

Omnibus effects by metric

Metric               Effect η²     p-value     H       Result
Draft Judge Score    0.295 Large   1.02e-156   768.3   Significant
Pairwise Accuracy    0.220 Large   2.32e-115   575.0   Significant
Binary HP Accuracy   0.056 Small   8.35e-27    154.1   Significant
Composite Score      0.052 Small   4.93e-25    145.4   Significant
Binary OM Accuracy   0.017 Small   2.45e-7     54.3    Significant
Emotion VA Score     0.007 Trace   0.0025      30.3    Significant
Emotion F1           0.002 Trace   0.1232      17.8    Not significant
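The page does not name the omnibus test, but an H statistic is typical of a Kruskal-Wallis test, whose p-value comes from a chi-square distribution with (number of groups - 1) degrees of freedom. The sketch below is written under that assumption, with 13 models giving 12 degrees of freedom, and checks that two of the reported H values roughly reproduce their reported p-values. The function name is hypothetical, and the closed-form tail formula used here is only valid for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Assumption: 13 models compared, so 12 degrees of freedom.
DF = 13 - 1

p_emotion_f1 = chi2_sf_even_df(17.8, DF)  # page reports p = 0.1232
p_emotion_va = chi2_sf_even_df(30.3, DF)  # page reports p = 0.0025

print(f"Emotion F1: p ~ {p_emotion_f1:.4f}")
print(f"Emotion VA: p ~ {p_emotion_va:.4f}")
```

Small mismatches in the last digits are expected because the H values shown on the page are rounded to one decimal place.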

Selected pairwise composite differences

Model A          Model B             Δ       p adj.    Effect |r|   Result
Claude Fable 5   Claude Sonnet 4.6   +4.77   <0.0001   0.940 L      Significant
Claude Fable 5   GPT-5.4             +4.82   <0.0001   0.895 L      Significant
Claude Fable 5   Grok 4              +4.96   <0.0001   0.881 L      Significant
Claude Fable 5   Mistral Large       +3.21   <0.0001   0.796 L      Significant
Claude Fable 5   Qwen 2.5 72B        +2.80   <0.0001   0.770 L      Significant
Claude Fable 5   Claude Haiku 4.5    +2.24   <0.0001   0.732 L      Significant
Claude Fable 5   Gemini 3.1 Pro      +2.13   <0.0001   0.725 L      Significant
Claude Fable 5   GPT-5.5             +1.35   0.0016    0.669 L      Significant
Claude Fable 5   MiMo-v2-Pro         +1.41   0.0026    0.664 L      Significant
Claude Fable 5   Claude Opus 4.7     +1.40   0.0026    0.663 L      Significant
Claude Fable 5   Claude Opus 4.6     +0.72   0.3910    0.594 L      Not significant
Claude Fable 5   Claude Opus 4.8     +0.51   1.0000    0.564 L      Not significant
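With 13 models there are 78 unordered model pairs, which matches the "48 of 78" count reported in the summary, so raw p-values need adjusting for multiple comparisons. The page does not say which correction produced the "p adj." column; the sketch below shows the Holm step-down procedure purely as one common option, applied to made-up p-values. The function name and the example inputs are hypothetical.

```python
from math import comb


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


n_pairs = comb(13, 2)  # unordered pairs among 13 models

# Made-up raw p-values for three comparisons, for illustration only.
example = holm_adjust([0.01, 0.04, 0.03])

print("model pairs:", n_pairs)
print("adjusted:", example)
```

The cap at 1 is why an adjusted p-value can appear as exactly 1.0000, as in the last row of the listing above, although a Bonferroni-style correction would produce the same effect.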