An automated audit pipeline

Auto Benchmark Audit

We systematically audit LLM evaluation benchmarks across coding, math, science, safety, and multimodal reasoning — surfacing instruction ambiguity, environment conflicts, and evaluation quality flaws that quietly distort the scoreboards.

Audit findings are framed constructively: they identify task-level defects (instruction ambiguity, environment conflicts, and evaluation quality) and are intended to help benchmark authors fix issues, not to penalise individual benchmarks or the models evaluated on them.


Benchmarks

168

across 9 domains

Tasks audited

35,205

task-level audits

Major findings

25.5%

of audited tasks

Minor findings

15.2%

of audited tasks

Featured Benchmarks

Popular benchmarks from frontier model releases and academic work.

Major · Minor · Clean

★ Spotlight

Coding

Terminal-Bench v2

Of 89 coding tasks audited, 5 carry a major finding (6%), typically instruction ambiguity, environment conflicts, or evaluation oversights.

6% major · 21% minor · 73% clean

Audited

89 / 89

100% coverage

Findings

27% of audited tasks
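
The Terminal-Bench v2 shares above follow from simple counts. A minimal sketch of that arithmetic: the audited total (89) and major count (5) come from the page, while the minor count of 19 is an assumption inferred from the rounded 21% figure (it is the only whole number of tasks out of 89 that rounds to 21%). The function name `severity_shares` is hypothetical.

```python
def severity_shares(major: int, minor: int, audited: int) -> dict[str, int]:
    """Return whole-percent shares of audited tasks by severity."""
    if audited <= 0:
        raise ValueError("audited must be positive")
    clean = audited - major - minor

    def pct(count: int) -> int:
        return round(100 * count / audited)

    return {
        "major": pct(major),
        "minor": pct(minor),
        "clean": pct(clean),
        # "Findings" on the card is the share with any finding, major or minor.
        "findings": pct(major + minor),
    }


# 5 major (from the page), 19 minor (assumed, inferred from 21%), 89 audited.
print(severity_shares(5, 19, 89))
# {'major': 6, 'minor': 21, 'clean': 73, 'findings': 27}
```

Each share is rounded independently, so the three severity percentages are not guaranteed to sum to exactly 100 for every benchmark.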

How we score severity

What counts as Major?

Severity 2

It is almost impossible to know what you are being asked. The prompt omits or obscures key details that no amount of domain expertise can compensate for, or the tests look for something different from what the description asks.

e.g. hidden API contracts · contradictory examples · undisclosed test mechanisms · tests that grade a different output than described

What counts as Minor?

Severity 1

A discoverable gap. A competent developer can reach the expected solution via git history, a known API, or reasoning through standard approaches — but the bridge requires deliberate work, and reasonable alternatives exist that the tests would reject.

e.g. under-specified data formats · multiple plausible solutions · expected approach inferable but not stated

What about a Clean task?

Severity 0

The prompt is clear and sufficient. The tests fairly evaluate what the prompt asks for. A well-designed but hard task is still Clean: failing it reflects a lack of skill or insight, not a problem of interpretation.

also clean: a vague prompt whose expected approach is domain-standard — domain knowledge bridges the gap without deliberate effort.
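
The three severity levels above can be read as a small decision rule. The sketch below is a simplification under stated assumptions, not the audit pipeline's actual implementation: the `TaskAssessment` fields and the `classify` function are hypothetical names that condense each rubric paragraph into a boolean.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    CLEAN = 0  # Severity 0
    MINOR = 1  # Severity 1
    MAJOR = 2  # Severity 2


@dataclass(frozen=True)
class TaskAssessment:
    """Hypothetical per-task judgements, one flag per rubric condition."""
    key_details_unrecoverable: bool    # no amount of domain expertise fills the gap
    tests_grade_something_else: bool   # tests check a different output than described
    gap_needs_deliberate_work: bool    # bridgeable via git history, known API, reasoning
    tests_reject_alternatives: bool    # reasonable alternative solutions would fail


def classify(a: TaskAssessment) -> Severity:
    # Major: the ask cannot be known, or the tests disagree with the description.
    if a.key_details_unrecoverable or a.tests_grade_something_else:
        return Severity.MAJOR
    # Minor: a discoverable gap AND reasonable alternatives the tests would reject.
    if a.gap_needs_deliberate_work and a.tests_reject_alternatives:
        return Severity.MINOR
    # Clean: includes hard tasks and vague prompts bridged by domain-standard knowledge.
    return Severity.CLEAN


hard_but_fair = TaskAssessment(False, False, False, False)
print(classify(hard_but_fair).name)  # CLEAN
```

Checking the Major conditions first means a task that meets both Major and Minor criteria is reported at the higher severity.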

Browse 168 Audited Benchmarks

168 benchmarks in the audit pipeline — 52 with a complete static audit pass, the rest partial. Filter by domain, sort by severity, or open a benchmark to see per-task findings and rubric scores.


Benchmark	Domain	Venue	Audited / Tasks	Status	Major %	Minor %
A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection	Professional	NeurIPS D&B poster	200 / 353	partial	72%	13%
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents	Agentic / Tool Use	NeurIPS D&B poster	200 / 246	partial	21%	31%
AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios	Agentic / Tool Use	NeurIPS D&B spotlight	200 / 707	partial	49%	32%
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems	Agentic / Tool Use	NeurIPS D&B spotlight	200 / 1,499	partial	1%	0%
AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark	Multimodal	NeurIPS D&B poster	200 / 772	partial	65%	8%
Aider Polyglot	Coding	(none listed)	225 / 225	audited	34%	25%
AIME 2024 + 2025	Math	(none listed)	60 / 60	audited	3%	0%
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering	Agentic / Tool Use	NeurIPS D&B poster	40 / 40	audited	0%	0%
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?	Coding	NeurIPS D&B poster	154 / 154	audited	14%	23%
AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models	Multimodal	NeurIPS D&B poster	200 / 7,916	partial	32%	40%
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)	Safety / Alignment	NeurIPS D&B oral	500 / 1,350	partial	52%	6%
AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy	Science	NeurIPS D&B poster	200 / 432	partial	46%	35%
AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science	Science	NeurIPS D&B poster	200 / 2,641	partial	20%	19%
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models	Safety / Alignment	NeurIPS D&B poster	500 / 4,200	partial	45%	28%
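
Filtering by domain and sorting by severity, as the browse view describes, can be sketched over a few rows copied from the table above. The `Row` type and the `status` and `browse` helpers are hypothetical names, and the rule that a benchmark counts as "audited" when every task has been audited is an assumption that happens to match the rows shown.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    benchmark: str
    domain: str
    audited: int
    tasks: int
    major_pct: int
    minor_pct: int


# A subset of rows from the table above.
ROWS = [
    Row("Aider Polyglot", "Coding", 225, 225, 34, 25),
    Row("AIME 2024 + 2025", "Math", 60, 60, 3, 0),
    Row("ALE-Bench", "Agentic / Tool Use", 40, 40, 0, 0),
    Row("AlgoTune", "Coding", 154, 154, 14, 23),
    Row("AstroVisBench", "Science", 200, 432, 46, 35),
]


def status(row: Row) -> str:
    # Assumption: full coverage is what the table labels "audited".
    return "audited" if row.audited == row.tasks else "partial"


def browse(rows: list[Row], domain: Optional[str] = None) -> list[Row]:
    """Filter by domain, then sort by major %, minor %, and name."""
    selected = [r for r in rows if domain is None or r.domain == domain]
    return sorted(selected, key=lambda r: (-r.major_pct, -r.minor_pct, r.benchmark))


for r in browse(ROWS, domain="Coding"):
    print(r.benchmark, f"{r.major_pct}% major", status(r))
```

Sorting on major share first and minor share second is one reasonable reading of "sort by severity"; the page does not state which ordering its interface uses.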

Showing 14 of 168, sorted by benchmark name (ascending).

How these benchmarks are sourced

The audited portfolio is drawn from two complementary pipelines — a frontier release-report consensus and a sweep of the NeurIPS 2025 Datasets & Benchmarks Track — then filtered to domains where the audit rubric can operate.

Two sources, one audited portfolio

Provenance · v2026.05

Source 01 · Depth

Frontier release-report consensus

We extract every benchmark named in the headline capability tables of five recent frontier model releases — Anthropic Opus 4.7, OpenAI GPT-5.4, Zhipu GLM-5.1, Moonshot Kimi K2.6, and MiniMax M2.7 — normalise aliases, and retain only benchmarks cited by at least two reports. Manual scoping removes off-domain entries.

5

release reports unioned

≥ 2

report intersection threshold
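
The consensus step described above (union the reports, normalise aliases, keep benchmarks cited by at least two reports) can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the function name, the alias table, and all report and benchmark names in the example are hypothetical placeholders.

```python
from collections import Counter


def consensus_benchmarks(
    reports: dict[str, list[str]],
    aliases: dict[str, str],
    min_reports: int = 2,
) -> list[str]:
    """Return canonical benchmark names cited by at least `min_reports` reports."""
    counts: Counter[str] = Counter()
    for names in reports.values():
        # Normalise case and whitespace, then map known aliases to one canonical name.
        cleaned = (n.strip().lower() for n in names)
        canonical = {aliases.get(n, n) for n in cleaned}
        # A set ensures each report counts a benchmark at most once.
        counts.update(canonical)
    return sorted(name for name, c in counts.items() if c >= min_reports)


# Placeholder data only; these are not real reports or benchmarks.
reports = {
    "report_1": ["Alpha-Bench", "Beta-Bench", "Delta-Bench"],
    "report_2": ["AlphaBench", "Gamma-Bench"],
    "report_3": ["beta-bench ", "Gamma-Bench"],
}
aliases = {"alphabench": "alpha-bench"}

print(consensus_benchmarks(reports, aliases))
# ['alpha-bench', 'beta-bench', 'gamma-bench']
```

The manual scoping pass that removes off-domain entries is not modelled here; it would run on the returned list.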

Source 02 · Breadth

NeurIPS 2025 D&B Track sweep

We sweep every accepted NeurIPS 2025 Datasets & Benchmarks Track paper that introduces a benchmark with an evaluation protocol, classify each paper into one of sixteen candidate domains, and keep the papers that fall within the nine in-scope domains. No popularity or citation ranking is applied.

9 in: Science, Multimodal, Professional, Agentic / Tool Use, Coding, Medical, Math, Retrieval / RAG, Safety / Alignment.

7 out: NLP / text, creative generation, eval methodology, audio / speech, video, embodied 3D, remote sensing.

16 → 9

candidate → in-scope domains

100%

of in-scope accepted papers audited

scope (i): System under test is a general-purpose frontier LLM or LLM-driven agent; excludes specialised stacks (audio, embodied 3D, remote sensing).

scope (ii): Each task carries a verifiable ground truth (test suite, gold answer, or deterministic grader) that the audit rubric can operate on. Excludes subjective-evaluation tasks and meta-benchmarks of evaluation methodology.
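
Taken together, the domain split and the two scope criteria amount to a filter over candidate papers. The sketch below is one way to express it, assuming each paper has already been labelled by hand; the `Paper` fields and the `in_scope` function are hypothetical, and only the domain names come from the page.

```python
from dataclasses import dataclass

IN_SCOPE_DOMAINS = frozenset({
    "Science", "Multimodal", "Professional", "Agentic / Tool Use", "Coding",
    "Medical", "Math", "Retrieval / RAG", "Safety / Alignment",
})
OUT_OF_SCOPE_DOMAINS = frozenset({
    "NLP / text", "creative generation", "eval methodology", "audio / speech",
    "video", "embodied 3D", "remote sensing",
})


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one accepted benchmark paper."""
    title: str
    domain: str
    general_purpose_llm: bool      # scope (i): frontier LLM or LLM-driven agent
    verifiable_ground_truth: bool  # scope (ii): test suite, gold answer, or grader


def in_scope(paper: Paper) -> bool:
    return (
        paper.domain in IN_SCOPE_DOMAINS
        and paper.general_purpose_llm
        and paper.verifiable_ground_truth
    )


# Placeholder papers, not real submissions.
candidates = [
    Paper("Example coding benchmark", "Coding", True, True),
    Paper("Example speech benchmark", "audio / speech", False, True),
    Paper("Example subjective-judgement benchmark", "Science", True, False),
]
print([p.title for p in candidates if in_scope(p)])
# ['Example coding benchmark']
```

Keeping the two domain sets explicit makes the 16 to 9 reduction easy to check: the sets are disjoint and together contain sixteen entries.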