Terminal-Bench v2
Of 89 Coding tasks audited, 5 carry a major finding (6%) — typically instruction ambiguity, environment conflicts, or evaluation oversights.
We systematically audit LLM evaluation benchmarks across coding, math, science, safety, and multimodal reasoning — surfacing instruction ambiguity, environment conflicts, and evaluation quality flaws that quietly distort the scoreboards.
Audit findings are framed constructively: they identify task-level defects (instruction ambiguity, environment conflicts, and evaluation-quality flaws) and are intended to help benchmark authors fix issues, not to penalise individual benchmarks or the models evaluated on them.
Popular benchmarks from frontier model releases and academic work.
It is almost impossible to know what you are being asked. The prompt omits or obscures key details that no amount of domain expertise can compensate for, or the tests look for something different from what the description asks.
A discoverable gap. A competent developer can reach the expected solution via git history, a known API, or reasoning through standard approaches — but the bridge requires deliberate work, and reasonable alternatives exist that the tests would reject.
The prompt is clear and sufficient. The tests fairly evaluate what the prompt asks for. A well-designed but hard task is still Clean — failure mode is skill or insight, not interpretation.
168 benchmarks in the audit pipeline — 52 with a complete static audit pass, the rest partial. Filter by domain, sort by severity, or open a benchmark to see per-task findings and rubric scores.
The audited portfolio is drawn from two complementary pipelines — a frontier release-report consensus and a sweep of the NeurIPS 2025 Datasets & Benchmarks Track — then filtered to domains where the audit rubric can operate.
We extract every benchmark named in the headline capability tables of five recent frontier model releases — Anthropic Opus 4.7, OpenAI GPT-5.4, Zhipu GLM-5.1, Moonshot Kimi K2.6, and MiniMax M2.7 — normalise aliases, and retain only benchmarks cited by at least two reports. Manual scoping removes off-domain entries.
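The consensus step described above (normalise aliases, then keep benchmarks cited by at least two reports) might be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the alias table, the report keys, the benchmark lists, and the function names `normalise` and `consensus` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table: maps spellings seen in reports to one canonical name.
ALIASES = {
    "terminal-bench 2.0": "terminal-bench v2",
    "terminalbench v2": "terminal-bench v2",
    "swe-bench-verified": "swe-bench verified",
}


def normalise(name: str) -> str:
    """Lower-case, collapse whitespace, then apply the alias table."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)


def consensus(reports: dict[str, list[str]], min_reports: int = 2) -> dict[str, set[str]]:
    """Map each canonical benchmark to the reports citing it,
    keeping only benchmarks cited by at least `min_reports` reports."""
    cited_by: dict[str, set[str]] = defaultdict(set)
    for report, names in reports.items():
        for name in names:
            # A set means one report counts once, even if it uses two aliases.
            cited_by[normalise(name)].add(report)
    return {b: r for b, r in cited_by.items() if len(r) >= min_reports}


# Illustrative input only; these lists are invented, not taken from the real reports.
reports = {
    "report_a": ["Terminal-Bench 2.0", "SWE-bench Verified"],
    "report_b": ["TerminalBench v2", "GPQA Diamond"],
    "report_c": ["SWE-Bench-Verified", "Terminal-Bench v2"],
}
kept = consensus(reports)
print(sorted(kept))  # ['swe-bench verified', 'terminal-bench v2']
```

Counting distinct citing reports, rather than raw mentions, keeps a single report that names the same benchmark under two spellings from passing the threshold by itself. The manual scoping pass that removes off-domain entries is not modelled here.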
We sweep every accepted NeurIPS 2025 Datasets & Benchmarks Track paper that introduces a benchmark with an evaluation protocol, classify each into one of sixteen candidate domains, and keep those in the nine domains that fall within scope. No popularity or citation ranking is applied.
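As a rough sketch of the sweep's selection rule, the filter below keeps a paper only if it introduces a benchmark, ships an evaluation protocol, and lands in an in-scope domain. Everything named here is an assumption for illustration: the `Paper` fields, the `select` function, and the `IN_SCOPE` set. That set is a placeholder built from the five audit areas the page mentions, since the full list of nine domains is not given. The domain classifier itself is out of scope; the sketch takes the assigned domain as given.

```python
from dataclasses import dataclass

# Placeholder scope set (assumption): the actual nine in-scope domains are not listed.
IN_SCOPE = {"coding", "math", "science", "safety", "multimodal reasoning"}


@dataclass(frozen=True)
class Paper:
    title: str
    introduces_benchmark: bool
    has_eval_protocol: bool
    domain: str  # assumed to be assigned upstream by a classifier


def select(papers: list[Paper]) -> list[Paper]:
    """Keep in-scope benchmark papers; input order is preserved, no ranking."""
    return [
        p
        for p in papers
        if p.introduces_benchmark and p.has_eval_protocol and p.domain in IN_SCOPE
    ]


# Invented example records, not real accepted papers.
papers = [
    Paper("A", True, True, "coding"),
    Paper("B", True, False, "coding"),    # no evaluation protocol
    Paper("C", True, True, "robotics"),   # domain out of scope
    Paper("D", False, True, "math"),      # does not introduce a benchmark
    Paper("E", True, True, "safety"),
]
selected = select(papers)
print([p.title for p in selected])  # ['A', 'E']
```

The filter is a pure predicate over each paper, so no popularity or citation signal can influence which papers are kept.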