Benchmarks

Human-like thinking, measured against every benchmark that matters

Capability scorecards cover the adopted academic benchmarks (ARC-AGI-2, HLE, GAIA, SimpleBench, GPQA Diamond, MMLU-Pro) plus our own Rationale Integrity, Abstention, and Human-Like Thinking composites. A sortable leaderboard shows the full comparison.

Frequently asked questions

Which benchmarks are run?

The full adopted academic battery: ARC-AGI-2, HLE (Humanity's Last Exam), GAIA, SimpleBench, GPQA Diamond, and MMLU-Pro. We also run our own composites: Rationale Integrity (does the reasoning trace match the answer?), Abstention (does the model refuse when it should?), and the Human-Like Thinking score (an aggregate across the axes most predictive of agentic competence).
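
To make the Abstention question concrete, here is a minimal sketch of one way such a score could be computed. It assumes each probe carries a label for whether the model should have refused and a record of whether it did. The `AbstentionProbe` type, its field names, and the plain agreement rate are illustrative assumptions, not the scoring used on the leaderboard.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AbstentionProbe:
    """Hypothetical record for one probe (field names are assumptions)."""
    should_abstain: bool  # ground-truth label: refusing is the right call
    did_abstain: bool     # observed behaviour: the model actually refused


def abstention_score(probes: Sequence[AbstentionProbe]) -> float:
    """Fraction of probes where the refusal decision matched the label."""
    if not probes:
        raise ValueError("at least one probe is required")
    correct = sum(p.should_abstain == p.did_abstain for p in probes)
    return correct / len(probes)
```

A plain agreement rate treats over-refusal and under-refusal as equally costly; a real composite may weight them differently.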

How often does the leaderboard refresh?

On every model version bump and weekly otherwise. Each row shows the updatedAt timestamp of its last full benchmark pass.
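
The refresh rule above can be sketched as a small staleness check. This is an illustration under assumptions: `updatedAt` is taken to be an ISO 8601 timestamp with a UTC offset, and the function name, its parameters, and the version strings are hypothetical.

```python
from datetime import datetime, timedelta

# "Weekly otherwise" from the refresh policy.
REFRESH_INTERVAL = timedelta(days=7)


def needs_refresh(
    updated_at: str,
    benchmarked_version: str,
    current_version: str,
    now: datetime,
) -> bool:
    """Return True if a leaderboard row is due for a full benchmark pass.

    updated_at is assumed to be an ISO 8601 string with an explicit offset,
    for example "2025-01-08T00:00:00+00:00"; now must be timezone-aware.
    """
    # A model version bump always triggers a new pass.
    if benchmarked_version != current_version:
        return True
    # Otherwise refresh once the last pass is at least a week old.
    last_pass = datetime.fromisoformat(updated_at)
    return now - last_pass >= REFRESH_INTERVAL
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.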

Why both academic and proprietary benchmarks?

Academic benchmarks carry known training-set contamination risks, and top models often hit the ceiling on widely cited tests. Our proprietary composites use unseen probes and behavioural traces that resist contamination, so the leaderboard stays informative for longer.

How does the Human-Like Thinking score work?

The score is a weighted aggregate across five axes: reasoning, planning, calibrated uncertainty, abstention, and rationale integrity. The weights are tuned so that the score correlates with downstream agentic task performance, not just multiple-choice accuracy.
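
As a rough illustration of a weighted aggregate over these axes, the sketch below computes a normalised weighted mean. The axis keys, the 0-to-1 score scale, and any weights you pass in are assumptions made for the example; the actual tuned weights are not given here.

```python
from typing import Mapping

# Axis names taken from the description above; the key spelling is an assumption.
AXES = (
    "reasoning",
    "planning",
    "calibrated_uncertainty",
    "abstention",
    "rationale_integrity",
)


def human_like_thinking_score(
    axis_scores: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Weighted mean of per-axis scores, normalised by the total weight."""
    missing = [a for a in AXES if a not in axis_scores or a not in weights]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    total_weight = sum(weights[a] for a in AXES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[a] * axis_scores[a] for a in AXES)
    return weighted_sum / total_weight
```

Dividing by the total weight means the weights do not need to sum to 1, and the result stays on the same scale as the per-axis scores.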

Can I download the raw results?

Yes. From the methodology page, you can export per-task scores and rationale traces for every benchmark row.