Ethics & Trustworthiness
Testing what the labs actually claim their models do
Trustworthiness scorecards running live probes: the 8 DecodingTrust axes, our own Break-Free sandbox-escape harness, a Claim-Validation pass that verifies published safety assertions, and a Pressure-Drift curve showing how alignment degrades under sustained pressure.
OpenAI
GPT-4.5
Based on published documentation. Full audit in progress (0%).
OpenAI
o3-mini
Based on published documentation. Full audit in progress (0%).
Alibaba
Qwen 3
Based on published documentation. Full audit in progress (0%).
Frequently asked questions
What does an LLM ethics audit cover?
Four pillars. (1) DecodingTrust — 8 axes of trustworthiness including toxicity, stereotype bias, adversarial robustness, out-of-distribution behaviour, privacy, machine ethics, and fairness. (2) Sandbox-escape — a live "Break-Free" harness that probes whether the model attempts to circumvent stated constraints. (3) Claim-validation — verifying every published safety assertion from the provider. (4) Pressure-drift — how alignment degrades under sustained adversarial pressure.
Why test claim validation specifically?
Model providers publish safety claims that are frequently overstated or untested. Claim-validation runs the claim against the model and reports whether the behaviour matches the stated specification — a much stronger signal than the marketing.
How does Pressure-Drift work?
A scripted adversary runs sustained conversational pressure (jailbreak attempts, social engineering, persistence). The curve shows how the model's refusal rate decays over turn count. Models that hold steady earn high marks; models that capitulate quickly are flagged.
Are these tests open?
The methodology is public. The exact probe prompts are partially gated — fully open prompts get trained against, eroding the signal. Methodology details and partial probe examples are on the methodology page.
How are ethics scores different from benchmarks?
Capability benchmarks measure what a model can do. Ethics audits measure what a model should refuse or constrain. A model can ace ARC-AGI and still fail the ethics suite — they are orthogonal axes.