Model Analysis
Performance Heatmap
Columns: Comp · F1 · VA · Hit · OM · HP · PW · Tau · Jdg · DBA · 4B · PAN · Q1 · Q3 · Trn · Cnv
Color scale: best in column → middle → worst in column.
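The per-column coloring described by the legend corresponds to scaling each metric column independently of the others. A minimal sketch of one plausible rule (min-max scaling and the example scores are assumptions, not stated by the report):

```python
import numpy as np

# Illustrative model-by-metric scores; values are made up.
scores = np.array([[0.62, 0.41],
                   [0.55, 0.47],
                   [0.70, 0.39]])

# Scale each column to [0, 1] so the best value in a column maps to 1
# and the worst to 0, independently of the other columns.
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
normed = (scores - col_min) / (col_max - col_min)
print(normed.round(2))
```

Per-column scaling keeps metrics with different ranges comparable within a column, at the cost of hiding absolute differences between columns.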
Emotion Tracking
Emotion F1
Valence-Arousal
Holistic Thinkers vs. Step-by-Step Annotators
Legend: turn-level (lighter) vs. conversation-wide (solid); hue = % change.
Four-Branch EQ & Preference Prediction
Four-Branch EQ
Pairwise Preference
Conversation Quality
Q1 Goals
Q3 Response Fit
Perspective Gap
Models: Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, Qwen 2.5 72B, GPT-5.5, Claude Sonnet 4.6, MiMo-v2-Pro, Mistral Large, Claude Haiku 4.5, Grok 4, Claude Opus 4.7.

Color scale: better at human view → worse at human view.
Draft Response Quality
Conversation Topics
Participant Diagnosis
| Diagnosis | Avg. composite |
|---|---|
| None | 51.9 |
| Anxiety/Depression | 54.0 |
| ASD/ADHD | 48.4 |

Chart area = number of conversations; number = average composite score.
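Per-group averages like these come down to a simple group-by. A sketch over hypothetical per-conversation records (column names and values are illustrative assumptions, not the report's data):

```python
import pandas as pd

# Illustrative per-conversation records; "diagnosis" and "composite"
# are assumed column names, and the values are made up.
df = pd.DataFrame({
    "diagnosis": ["None", "None", "Anxiety/Depression", "ASD/ADHD"],
    "composite": [50.0, 53.8, 54.0, 48.4],
})

# Average composite per diagnosis group, plus the conversation count
# that would set each group's area in a chart like the one above.
summary = df.groupby("diagnosis")["composite"].agg(["mean", "count"])
print(summary)
```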
Metric Relationships
| | Composite | Emotion F1 | VA Score | Observer | Human | Pairwise | Draft | Gap |
|---|---|---|---|---|---|---|---|---|
| Composite | — | 0.47 | 0.51 | 0.36 | 0.34 | 0.43 | 0.21 | -0.13 |
| Emotion F1 | 0.47 | — | 0.63 | 0.04 | 0.01 | 0.02 | 0.09 | 0.00 |
| VA Score | 0.51 | 0.63 | — | 0.15 | 0.17 | -0.06 | 0.11 | -0.06 |
| Observer | 0.36 | 0.04 | 0.15 | — | 0.60 | -0.04 | 0.20 | -0.20 |
| Human | 0.34 | 0.01 | 0.17 | 0.60 | — | -0.12 | 0.17 | 0.03 |
| Pairwise | 0.43 | 0.02 | -0.06 | -0.04 | -0.12 | — | 0.10 | 0.03 |
| Draft | 0.21 | 0.09 | 0.11 | 0.20 | 0.17 | 0.10 | — | -0.08 |
| Gap | -0.13 | 0.00 | -0.06 | -0.20 | 0.03 | 0.03 | -0.08 | — |
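A matrix like the one above can be computed directly from per-conversation metric scores. A minimal pandas sketch with illustrative data (the column names mirror a few of the table's labels, but the data and correlation method are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical per-conversation metric scores; the data is random and
# only illustrates the shape of the computation.
rng = np.random.default_rng(0)
scores = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["Composite", "Emotion F1", "VA Score", "Observer"],
)

# Pairwise correlation between every metric column; the diagonal is 1.0
# (rendered as "—" in the table above).
corr = scores.corr()
print(corr.round(2))
```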
PANAS Item-Level Prediction
Positive Affect: Interested, Excited, Strong, Enthusiastic, Proud, Alert, Inspired, Determined, Attentive, Active.

Negative Affect: Distressed, Upset, Guilty, Scared, Hostile, Irritable, Ashamed, Nervous, Jittery, Afraid.

Color scale: easiest → hardest.
Conversation Position
Mid vs. early: Emotion F1 · Observer Accuracy · Pairwise Preference · Draft Judge

Late vs. early: Emotion F1 · Observer Accuracy · Pairwise Preference · Draft Judge

Legend: improves vs. early / drops vs. early.
Evaluation Mode
Omniscient delta, models in order: Claude Opus 4.7, Gemini 3.1 Pro, Claude Opus 4.6, Claude Haiku 4.5, Qwen 2.5 72B, Grok 4, Mistral Large, MiMo-v2-Pro, GPT-5.5, Claude Sonnet 4.6, GPT-5.4.

Verbose delta, models in order: Claude Opus 4.7, GPT-5.5, Grok 4, Claude Sonnet 4.6, Qwen 2.5 72B, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.4, Claude Haiku 4.5, Gemini 3.1 Pro, Mistral Large.

Legend: improves vs. default / drops vs. default.
Mood Shift & Emotional Trajectory
Legend: first half vs. second half.
Temporal Performance
Metrics: Observer Accuracy · Pairwise Preference · Draft Judge

Positions: early → mid → late; legend marks an increase.
Statistical Significance
- Metrics with model effects: 6/7
- Strongest separation: Draft Judge Score
- Least conclusive: Emotion F1
- Pairwise composite gaps: 25 of 36 model pairs significant
Omnibus effects by metric
| Metric | η² (magnitude) | p-value | H | Result |
|---|---|---|---|---|
| Draft Judge Score | 0.299 (large) | 3.88e-112 | 543.1 | Significant |
| Pairwise Accuracy | 0.170 (large) | 9.77e-63 | 312.3 | Significant |
| Binary HP Accuracy | 0.054 (small) | 5.93e-19 | 104.2 | Significant |
| Composite Score | 0.040 (small) | 5.18e-14 | 79.9 | Significant |
| Binary OM Accuracy | 0.021 (small) | 2.82e-7 | 45.6 | Significant |
| Emotion VA Score | 0.011 (small) | 0.0007 | 27.1 | Significant |
| Emotion F1 | 0.004 (trace) | 0.0707 | 14.4 | Trend |
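The omnibus rows above pair an H statistic with an η² effect size, which is consistent with a Kruskal-Wallis-style test per metric. A minimal sketch of how such numbers can be computed (the grouped data and the H-based η² formula are assumptions, not taken from this report):

```python
import numpy as np
from scipy.stats import kruskal

# Illustrative per-model score samples for one metric; three hypothetical
# groups stand in for the eleven models compared above.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=mu, size=50) for mu in (0.0, 0.3, 0.8)]

# Omnibus test: do the groups differ in their score distributions?
h_stat, p_value = kruskal(*groups)

# A common H-based effect-size estimate (an assumption here):
# eta^2 = (H - k + 1) / (N - k), with k groups and N total observations.
k = len(groups)
n_total = sum(len(g) for g in groups)
eta_sq = (h_stat - k + 1) / (n_total - k)
print(f"H = {h_stat:.1f}, p = {p_value:.2e}, eta^2 = {eta_sq:.3f}")
```

With one score array per model, the same call generalizes to the full set of eleven models.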
Selected pairwise composite differences
| Model A | Model B | Δ | Adj. p | Effect size \|r\| | Result |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Claude Sonnet 4.6 | +4.01 | <0.0001 | 0.899 (large) | Significant |
| MiMo-v2-Pro | Claude Sonnet 4.6 | +3.32 | <0.0001 | 0.831 (large) | Significant |
| Claude Opus 4.6 | Grok 4 | +4.16 | <0.0001 | 0.826 (large) | Significant |
| Claude Opus 4.6 | GPT-5.4 | +4.06 | <0.0001 | 0.822 (large) | Significant |
| MiMo-v2-Pro | Grok 4 | +3.47 | <0.0001 | 0.797 (large) | Significant |
| Claude Haiku 4.5 | Claude Sonnet 4.6 | +2.54 | <0.0001 | 0.776 (large) | Significant |
| MiMo-v2-Pro | GPT-5.4 | +3.37 | <0.0001 | 0.774 (large) | Significant |
| Gemini 3.1 Pro | GPT-5.4 | +2.69 | <0.0001 | 0.748 (large) | Significant |
| Claude Opus 4.6 | Mistral Large | +2.64 | <0.0001 | 0.740 (large) | Significant |
| Claude Haiku 4.5 | GPT-5.4 | +2.59 | <0.0001 | 0.734 (large) | Significant |
| Claude Opus 4.6 | Qwen 2.5 72B | +2.11 | 0.0001 | 0.693 (large) | Significant |
| Claude Opus 4.6 | Claude Haiku 4.5 | +1.47 | 0.0022 | 0.657 (large) | Significant |
| Claude Opus 4.6 | Gemini 3.1 Pro | +1.37 | 0.0036 | 0.635 (large) | Significant |
| Claude Opus 4.6 | MiMo-v2-Pro | +0.69 | 0.1952 | 0.594 (large) | Not significant |
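Each pairwise row combines a score difference, an adjusted p-value, and an |r| effect size. A hedged sketch of one such comparison (the data, the Mann-Whitney test, and the rank-biserial choice of |r| are illustrative assumptions, not confirmed by the report):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Two hypothetical per-conversation composite score samples,
# one per model (values are made up).
rng = np.random.default_rng(2)
model_a = rng.normal(loc=0.7, size=80)
model_b = rng.normal(loc=0.0, size=80)

# Unadjusted two-sided test for a distributional difference.
u_stat, p_raw = mannwhitneyu(model_a, model_b, alternative="two-sided")

# Rank-biserial correlation as the |r| effect size (an assumption):
# r = 1 - 2U / (n1 * n2), in [-1, 1].
n1, n2 = len(model_a), len(model_b)
r = 1.0 - 2.0 * u_stat / (n1 * n2)
print(f"U = {u_stat:.0f}, raw p = {p_raw:.2e}, |r| = {abs(r):.3f}")
```

For a table like the one above, the raw p-values from all 36 model pairs would then be adjusted for multiple comparisons (e.g. Holm or Bonferroni) before declaring significance.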