An automated audit pipeline

Auto Benchmark Audit

We systematically audit LLM evaluation benchmarks across coding, math, science, safety, and multimodal reasoning — surfacing instruction ambiguity, environment conflicts, and evaluation quality flaws that quietly distort the scoreboards.

Audit findings are framed constructively: they identify task-level defects (instruction ambiguity, environment conflicts, and evaluation quality) and are intended to help benchmark authors fix issues, not to penalise individual benchmarks or the models evaluated on them.

Benchmarks
168
across 9 domains
Tasks audited
35,205
task-level audits
Major findings
25.5%
of audited tasks
Minor findings
15.2%
of audited tasks

Featured Benchmarks

Popular benchmarks from frontier model releases and academic work.

Major · Minor · Clean
★ Spotlight
Coding

Terminal-Bench v2

Of 89 Coding tasks audited, 5 carry a major finding (6%) — typically instruction ambiguity, environment conflicts, or evaluation oversights.

6% major · 21% minor · 73% clean
Audited
89 / 89
100% coverage
Findings
24
27% of audited tasks

How we score severity

What counts as Major?
Severity 2

It is almost impossible to know what you are being asked. The prompt omits or obscures key details that no amount of domain expertise can compensate for, or the tests look for something different from what the description asks.

e.g. hidden API contracts · contradictory examples · undisclosed test mechanisms · tests that grade a different output than described
What counts as Minor?
Severity 1

A discoverable gap. A competent developer can reach the expected solution via git history, a known API, or reasoning through standard approaches — but the bridge requires deliberate work, and reasonable alternatives exist that the tests would reject.

e.g. under-specified data formats · multiple plausible solutions · expected approach inferable but not stated
What about a Clean task?
Severity 0

The prompt is clear and sufficient, and the tests fairly evaluate what it asks for. A well-designed but hard task is still Clean: the failure mode is skill or insight, not interpretation.

Also Clean: a vague prompt whose expected approach is domain-standard, because domain knowledge bridges the gap without deliberate effort.
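The three severity levels can be modelled as a small enumeration, with the severity mix computed as each level's share of audited tasks. The following is a minimal sketch, not the pipeline's actual code: the `Severity` class and `severity_mix` function are hypothetical names, and the 5 / 19 / 65 split is an assumption chosen to reproduce the 6% / 21% / 73% mix shown for Terminal-Bench v2 (only the 5 major tasks and 24 total findings are stated on this page).

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of the rubric levels described above."""
    CLEAN = 0  # prompt is clear and the tests grade what it asks for
    MINOR = 1  # discoverable gap, bridged with deliberate work
    MAJOR = 2  # the task cannot reasonably be understood as asked


def severity_mix(labels):
    """Return the percentage of audited tasks at each severity level."""
    total = len(labels)
    if total == 0:
        raise ValueError("no audited tasks")
    counts = Counter(labels)
    return {level.name.lower(): 100 * counts[level] / total for level in Severity}


# Assumed per-task labels for an 89-task benchmark: 5 major, 19 minor, 65 clean.
labels = [Severity.MAJOR] * 5 + [Severity.MINOR] * 19 + [Severity.CLEAN] * 65
mix = severity_mix(labels)
print({name: round(share) for name, share in mix.items()})
# → {'clean': 73, 'minor': 21, 'major': 6}
```

Using an `IntEnum` keeps the levels sortable, which is convenient if benchmarks are later ordered by severity.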

Browse 168 Audited Benchmarks

168 benchmarks in the audit pipeline — 52 with a complete static audit pass, the rest partial. Filter by domain, sort by severity, or open a benchmark to see per-task findings and rubric scores.

Domain · Audited / Tasks · Major % · Minor %
Professional (NeurIPS D&B poster) · 200 / 353, partial · 72% · 13%
Agentic / Tool Use (NeurIPS D&B poster) · 200 / 246, partial · 21% · 31%
Agentic / Tool Use (NeurIPS D&B spotlight) · 200 / 707, partial · 49% · 32%
Agentic / Tool Use (NeurIPS D&B spotlight) · 200 / 1,499, partial · 1% · 0%
Multimodal (NeurIPS D&B poster) · 200 / 772, partial · 65% · 8%
Coding · 225 / 225, audited · 34% · 25%
Math · 60 / 60, audited · 3% · 0%
Agentic / Tool Use (NeurIPS D&B poster) · 40 / 40, audited · 0% · 0%
Coding (NeurIPS D&B poster) · 154 / 154, audited · 14% · 23%
Multimodal (NeurIPS D&B poster) · 200 / 7,916, partial · 32% · 40%
Safety / Alignment (NeurIPS D&B oral) · 500 / 1,350, partial · 52% · 6%
Science (NeurIPS D&B poster) · 200 / 432, partial · 46% · 35%
Science (NeurIPS D&B poster) · 200 / 2,641, partial · 20% · 19%
Safety / Alignment (NeurIPS D&B poster) · 500 / 4,200, partial · 45% · 28%
Showing 14 of 168 · sorted by Benchmark ↑
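The Audited / Tasks column pairs the number of tasks audited with the benchmark's total task count, and each row is marked either audited or partial. A minimal sketch of how that status and a coverage percentage could be derived is shown here; the `coverage` function is a hypothetical helper, and it assumes that "audited" means every task was covered.

```python
def coverage(audited, total):
    """Return (status, percent covered) for one benchmark row.

    Assumption: a row counts as "audited" only when every task was audited;
    anything less is "partial".
    """
    if total <= 0 or not 0 <= audited <= total:
        raise ValueError("expected 0 <= audited <= total and total > 0")
    status = "audited" if audited == total else "partial"
    return status, round(100 * audited / total)


print(coverage(200, 353))  # the Professional row: 200 of 353 tasks
print(coverage(89, 89))    # Terminal-Bench v2: 89 of 89 tasks
```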

How these benchmarks are sourced

The audited portfolio is drawn from two complementary pipelines — a frontier release-report consensus and a sweep of the NeurIPS 2025 Datasets & Benchmarks Track — then filtered to domains where the audit rubric can operate.

Two sources, one audited portfolio

Provenance · v2026.05
Source 01 · Depth

Frontier release-report consensus

We extract every benchmark named in the headline capability tables of five recent frontier model releases — Anthropic Opus 4.7, OpenAI GPT-5.4, Zhipu GLM-5.1, Moonshot Kimi K2.6, and MiniMax M2.7 — normalise aliases, and retain only benchmarks cited by at least two reports. Manual scoping removes off-domain entries.

5
release reports unioned
≥ 2
report intersection threshold
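The consensus step (union the benchmarks named across reports, normalise aliases, keep those cited by at least two reports) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline's implementation: `consensus_benchmarks`, the alias map, and the report and benchmark names in the example are all hypothetical placeholders.

```python
from collections import Counter


def consensus_benchmarks(reports, aliases, min_reports=2):
    """Keep benchmarks named by at least `min_reports` distinct reports.

    reports: mapping of report name -> list of benchmark names as written.
    aliases: mapping of lower-cased variant -> canonical name (assumed format).
    """
    counts = Counter()
    for names in reports.values():
        # Normalise, then de-duplicate so each report counts a benchmark once.
        canonical = set()
        for name in names:
            key = name.strip().lower()
            canonical.add(aliases.get(key, key))
        counts.update(canonical)
    return sorted(name for name, n in counts.items() if n >= min_reports)


# Placeholder data; these are not real report contents.
reports = {
    "report_a": ["Bench-X", "bench x", "Bench-Y"],
    "report_b": ["BenchX"],
    "report_c": ["Bench-Z"],
}
aliases = {"bench x": "bench-x", "benchx": "bench-x"}
print(consensus_benchmarks(reports, aliases))  # → ['bench-x']
```

De-duplicating within a report matters here: a benchmark listed twice under different aliases in one report should not pass the two-report threshold on its own.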
Source 02 · Breadth

NeurIPS 2025 D&B Track sweep

We sweep every accepted NeurIPS 2025 Datasets & Benchmarks Track paper that introduces a benchmark with an evaluation protocol, classify it into one of sixteen candidate domains, and keep the nine that fall within scope — no popularity or citation ranking applied.

9 in: Science · Multimodal · Professional · Agentic / Tool Use · Coding · Medical · Math · Retrieval / RAG · Safety / Alignment
7 out: NLP / text, creative generation, eval methodology, audio / speech, video, embodied 3D, remote sensing.
16 → 9
candidate → in-scope domains
100%
of in-scope accepted papers audited
scope (i): The system under test is a general-purpose frontier LLM or LLM-driven agent; specialised stacks (audio, embodied 3D, remote sensing) are excluded.
scope (ii): Each task carries a verifiable ground truth (test suite, gold answer, or deterministic grader) that the audit rubric can operate on. Subjective-evaluation tasks and meta-benchmarks of evaluation methodology are excluded.
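The domain filter and the two scope criteria can be expressed as a single predicate over candidate papers. The sketch below is hypothetical: the `Paper` record, its boolean fields, and the example entries are assumptions made for illustration, while the domain names are taken from the in-scope and out-of-scope lists above.

```python
from dataclasses import dataclass

IN_SCOPE = frozenset({
    "Science", "Multimodal", "Professional", "Agentic / Tool Use", "Coding",
    "Medical", "Math", "Retrieval / RAG", "Safety / Alignment",
})
OUT_OF_SCOPE = frozenset({
    "NLP / text", "Creative generation", "Eval methodology", "Audio / speech",
    "Video", "Embodied 3D", "Remote sensing",
})


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one accepted benchmark paper."""
    title: str
    domain: str
    general_purpose_llm: bool      # scope (i)
    verifiable_ground_truth: bool  # scope (ii)


def in_scope(paper):
    """True when the paper passes the domain filter and both scope criteria."""
    return (
        paper.domain in IN_SCOPE
        and paper.general_purpose_llm
        and paper.verifiable_ground_truth
    )


# Placeholder papers; not real track entries.
papers = [
    Paper("hypothetical coding benchmark", "Coding", True, True),
    Paper("hypothetical speech benchmark", "Audio / speech", False, True),
    Paper("hypothetical subjective-grading benchmark", "Science", True, False),
]
kept = [p.title for p in papers if in_scope(p)]
print(len(IN_SCOPE | OUT_OF_SCOPE), "candidate domains;", kept)
```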