Benchmarks
Human-like thinking, measured against every benchmark that matters
Capability scorecards covering the adopted academic benchmarks (ARC-AGI-2, HLE, GAIA, SimpleBench, GPQA Diamond, MMLU-Pro) plus our own Rationale Integrity, Abstention, and Human-Like Thinking composites. The sortable leaderboard shows the full comparison.
OpenAI
GPT-4.5
Based on published documentation. Full audit in progress (0%).
OpenAI
o3-mini
Based on published documentation. Full audit in progress (0%).
Alibaba
Qwen 3
Based on published documentation. Full audit in progress (0%).
Frequently asked questions
Which benchmarks are run?
The full adopted academic battery: ARC-AGI-2, HLE (Humanity's Last Exam), GAIA, SimpleBench, GPQA Diamond, and MMLU-Pro. On top of these we run our own composites: Rationale Integrity (does the reasoning trace match the answer?), Abstention (does the model refuse when it should?), and the Human-Like Thinking score (an aggregate across the axes most predictive of agentic competence).
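As an illustration of the Abstention idea only, here is a minimal sketch that treats the score as the share of should-refuse probes on which the model actually refused. This scoring rule and the field names `should_abstain` and `abstained` are assumptions made for the example, not the published methodology.

```python
def abstention_rate(probes):
    """Fraction of should-abstain probes on which the model actually abstained.

    Each probe is a dict with two hypothetical boolean fields:
    'should_abstain' (ground truth) and 'abstained' (model behaviour).
    """
    relevant = [p for p in probes if p["should_abstain"]]
    if not relevant:
        raise ValueError("no probes where abstention was expected")
    return sum(1 for p in relevant if p["abstained"]) / len(relevant)


if __name__ == "__main__":
    sample = [
        {"should_abstain": True, "abstained": True},
        {"should_abstain": True, "abstained": False},
        {"should_abstain": False, "abstained": False},
    ]
    print(abstention_rate(sample))
```

Probes where the model should answer are ignored here. A fuller treatment would also need to penalise refusing when an answer was expected.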
How often does the leaderboard refresh?
On every model version bump and weekly otherwise. Each row shows the updatedAt timestamp of its last full benchmark pass.
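The refresh rule can be sketched as a small check, assuming each row stores the model version it was last benchmarked against alongside its updatedAt timestamp. The function name and parameters are hypothetical and only illustrate the "version bump, otherwise weekly" policy.

```python
from datetime import datetime, timedelta, timezone

WEEKLY = timedelta(days=7)


def needs_refresh(updated_at, benchmarked_version, current_version, now=None):
    """Return True if a leaderboard row is due for a full benchmark pass."""
    now = now or datetime.now(timezone.utc)
    # A model version bump always triggers a new pass.
    if benchmarked_version != current_version:
        return True
    # Otherwise refresh once the last pass is at least a week old.
    return now - updated_at >= WEEKLY
```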
Why both academic and proprietary benchmarks?
Academic benchmarks carry known risks of training-set contamination, and top models often hit the ceiling on widely cited tests. Our proprietary composites use unseen probes and behavioural traces that resist contamination, so the leaderboard stays informative for longer.
How does the Human-Like Thinking score work?
It is a weighted aggregate across the reasoning, planning, calibrated-uncertainty, abstention, and rationale-integrity axes. The weights are tuned to correlate with downstream agentic task performance, not just multiple-choice accuracy.
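A weighted aggregate of this kind can be sketched as a weighted mean over the five axes. The weights below are placeholders chosen for illustration, and the assumption that each axis score lies in [0, 1] is ours; the actual weights and scales are not stated here.

```python
# Hypothetical axis weights; the real tuned weights are not given in this section.
AXIS_WEIGHTS = {
    "reasoning": 0.25,
    "planning": 0.25,
    "calibrated_uncertainty": 0.20,
    "abstention": 0.15,
    "rationale_integrity": 0.15,
}


def human_like_thinking_score(axis_scores, weights=AXIS_WEIGHTS):
    """Weighted mean of per-axis scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalise by the total weight so the weights need not sum to exactly 1.
    return sum(weights[a] * axis_scores[a] for a in weights) / total_weight
```

Dividing by the total weight keeps the result on the same scale as the inputs even if the weights are later re-tuned.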
Can I download the raw results?
Yes. From the methodology page, every benchmark row can be exported with its per-task scores and rationale traces.