Model Analysis · AttuneBench

Performance Heatmap

Columns: Comp, F1, VA, Hit, OM, HP, PW, Tau, Jdg, DBA, 4B, PAN, Q1, Q3, Trn, Cnv

Legend: best in column, middle, worst in column

Emotion Tracking

Panels: Emotion F1, Valence-Arousal

Holistic Thinkers vs. Step-by-Step Annotators

Models (chart order): Qwen 2.5 72B, Gemini 3.1 Pro, Grok 4, GPT-5.5, GPT-5.4, Mistral Large, Claude Haiku 4.5, Claude Opus 4.6, MiMo-v2-Pro, Claude Sonnet 4.6, Claude Opus 4.7

Legend: turn-level (lighter), conversation-wide (solid); hue = % change

Four-Branch EQ & Preference Prediction

Panels: Four-Branch EQ, Pairwise Preference

Conversation Quality

Panels: Q1 Goals, Q3 Response Fit

Perspective Gap

Models (chart order): Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, Qwen 2.5 72B, GPT-5.5, Claude Sonnet 4.6, MiMo-v2-Pro, Mistral Large, Claude Haiku 4.5, Grok 4, Claude Opus 4.7

Legend: better at human view, worse at human view

Draft Response Quality

Conversation Topics

Participant Diagnosis

Diagnosis	Average composite
None	51.9
Anxiety/Depression	54.0
ASD/ADHD	48.4

Area = number of conversations; number = average composite score.

Metric Relationships

Metric	Composite	Emotion F1	VA Score	Observer	Human	Pairwise	Draft	Gap
Composite	—	0.47	0.51	0.36	0.34	0.43	0.21	-0.13
Emotion F1	0.47	—	0.63	0.04	0.01	0.02	0.09	0.00
VA Score	0.51	0.63	—	0.15	0.17	-0.06	0.11	-0.06
Observer	0.36	0.04	0.15	—	0.60	-0.04	0.20	-0.20
Human	0.34	0.01	0.17	0.60	—	-0.12	0.17	0.03
Pairwise	0.43	0.02	-0.06	-0.04	-0.12	—	0.10	0.03
Draft	0.21	0.09	0.11	0.20	0.17	0.10	—	-0.08
Gap	-0.13	0.00	-0.06	-0.20	0.03	0.03	-0.08	—
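
A correlation table of this kind can be rebuilt from per-conversation metric scores. The sketch below assumes Pearson correlation, since the page does not say which coefficient it uses. The function names, metric keys, and toy values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of distinct metric names to its correlation."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names if b != a}
        for a in names
    }


# Toy scores invented for illustration only (not AttuneBench data).
scores = {
    "emotion_f1": [0.2, 0.4, 0.6, 0.8],
    "va_score": [0.1, 0.2, 0.3, 0.4],
    "gap": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(scores)
```

The result is symmetric, so `matrix["emotion_f1"]["gap"]` equals `matrix["gap"]["emotion_f1"]`, just as each off-diagonal value in the table appears twice.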

PANAS Item-Level Prediction

Positive Affect: Interested, Excited, Strong, Enthusiastic, Proud, Alert, Inspired, Determined, Attentive, Active

Negative Affect: Distressed, Upset, Guilty, Scared, Hostile, Irritable, Ashamed, Nervous, Jittery, Afraid

Legend: easiest, hardest

Conversation Position

Mid vs Early: Emotion F1, Observer Accuracy, Pairwise Preference, Draft Judge

Late vs Early: Emotion F1, Observer Accuracy, Pairwise Preference, Draft Judge

Legend: improves vs early, drops vs early

Evaluation Mode

Omniscient Delta (chart order): Claude Opus 4.7, Gemini 3.1 Pro, Claude Opus 4.6, Claude Haiku 4.5, Qwen 2.5 72B, Grok 4, Mistral Large, MiMo-v2-Pro, GPT-5.5, Claude Sonnet 4.6, GPT-5.4

Verbose Delta (chart order): Claude Opus 4.7, GPT-5.5, Grok 4, Claude Sonnet 4.6, Qwen 2.5 72B, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.4, Claude Haiku 4.5, Gemini 3.1 Pro, Mistral Large

Legend: improves vs default, drops vs default

Mood Shift & Emotional Trajectory

Legend: first half, second half

Temporal Performance

Panels: Observer Accuracy, Pairwise Preference, Draft Judge (each split into Early, Mid, Late)

Legend: Early, Mid, Late, increase

Statistical Significance

Metrics with model effects: 6/7

Strongest separation: Draft Judge Score

Least conclusive: Emotion F1

Pairwise composite gaps: 25 of 36 model pairs significant

Omnibus effects by metric

Metric	Effect η²	Magnitude	p-value	H	Result
Draft Judge Score	0.299	Large	3.88e-112	543.1	Significant
Pairwise Accuracy	0.170	Large	9.77e-63	312.3	Significant
Binary HP Accuracy	0.054	Small	5.93e-19	104.2	Significant
Composite Score	0.040	Small	5.18e-14	79.9	Significant
Binary OM Accuracy	0.021	Small	2.82e-7	45.6	Significant
Emotion VA Score	0.011	Small	0.0007	27.1	Significant
Emotion F1	0.004	Trace	0.0707	14.4	Trend
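
The H column suggests a Kruskal-Wallis omnibus test, for which one common effect size is η² = (H − k + 1) / (n − k), with k groups and n total observations. The sketch below assumes that formula. The page does not report k or n, so the values used here (9 models, 1800 observations) are illustrative assumptions. The cutoffs 0.01, 0.06, and 0.14 are conventional thresholds and are also assumed, with "Trace" used below 0.01 to match the table's wording.

```python
def eta_squared_from_h(h, k, n):
    """Eta-squared effect size for a Kruskal-Wallis H statistic.

    h: H statistic; k: number of groups; n: total number of observations.
    """
    if n <= k:
        raise ValueError("n must exceed k")
    return (h - k + 1) / (n - k)


def magnitude_label(eta_sq):
    """Label an eta-squared value using conventional (assumed) cutoffs."""
    if eta_sq >= 0.14:
        return "Large"
    if eta_sq >= 0.06:
        return "Medium"
    if eta_sq >= 0.01:
        return "Small"
    return "Trace"


# Assumed for illustration; neither value is stated on the page.
K_MODELS = 9
N_OBSERVATIONS = 1800

draft_eta = eta_squared_from_h(543.1, K_MODELS, N_OBSERVATIONS)
print(round(draft_eta, 3), magnitude_label(draft_eta))
```

Under these assumed values the formula gives 0.299 for H = 543.1, which agrees with the Draft Judge Score row.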

Selected pairwise composite differences

Model A	Model B	Δ	p adj.	Effect |r|	Magnitude	Result
Claude Opus 4.6	Claude Sonnet 4.6	+4.01	<0.0001	0.899	Large	Significant
MiMo-v2-Pro	Claude Sonnet 4.6	+3.32	<0.0001	0.831	Large	Significant
Claude Opus 4.6	Grok 4	+4.16	<0.0001	0.826	Large	Significant
Claude Opus 4.6	GPT-5.4	+4.06	<0.0001	0.822	Large	Significant
MiMo-v2-Pro	Grok 4	+3.47	<0.0001	0.797	Large	Significant
Claude Haiku 4.5	Claude Sonnet 4.6	+2.54	<0.0001	0.776	Large	Significant
MiMo-v2-Pro	GPT-5.4	+3.37	<0.0001	0.774	Large	Significant
Gemini 3.1 Pro	GPT-5.4	+2.69	<0.0001	0.748	Large	Significant
Claude Opus 4.6	Mistral Large	+2.64	<0.0001	0.740	Large	Significant
Claude Haiku 4.5	GPT-5.4	+2.59	<0.0001	0.734	Large	Significant
Claude Opus 4.6	Qwen 2.5 72B	+2.11	0.0001	0.693	Large	Significant
Claude Opus 4.6	Claude Haiku 4.5	+1.47	0.0022	0.657	Large	Significant
Claude Opus 4.6	Gemini 3.1 Pro	+1.37	0.0036	0.635	Large	Significant
Claude Opus 4.6	MiMo-v2-Pro	+0.69	0.1952	0.594	Large	Not significant
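
The "p adj." column indicates that the pairwise p-values were corrected for multiple comparisons, but the page does not name the method. As a sketch of how such a correction works, the function below applies the Holm step-down procedure, chosen here only as one common option. The function name and input values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the raw p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy p-values invented for illustration only (not AttuneBench data).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

A pair would then be reported as significant when its adjusted p-value falls below the chosen threshold, for example 0.05.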