Browse benchmarks
168 of 168 benchmarks
Science
AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Partially audited · Science · NeurIPS D&B poster
Tasks
432
Audited
46%
200 / 432
Major
46%
91
Minor
35%
69
AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science
Partially audited · Science · NeurIPS D&B poster
Tasks
2,641
Audited
8%
200 / 2,641
Major
20%
40
Minor
19%
37
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations
Partially audited · Science · NeurIPS D&B poster
Tasks
1,485
Audited
34%
500 / 1,485
Major
32%
159
Minor
13%
67
CellVerse: Do Large Language Models Really Understand Cell Biology?
Partially audited · Science · NeurIPS D&B poster
Tasks
1,822
Audited
11%
200 / 1,822
Major
37%
74
Minor
27%
53
ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction
Audited · Science · NeurIPS D&B poster
Tasks
10
Audited
100%
10 / 10
Major
60%
6
Minor
30%
3
FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models
Partially audited · Science · NeurIPS D&B poster
Tasks
7,146
Audited
3%
200 / 7,146
Major
9%
18
Minor
5%
10
GPQA Diamond
Audited · Science
Tasks
198
Audited
100%
198 / 198
Major
3%
5
Minor
5%
9
HLE (Humanity's Last Exam)
Partially audited · Science
Tasks
2,500
Audited
20%
500 / 2,500
Major
28%
140
Minor
14%
69
Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab
Partially audited · Science · NeurIPS D&B poster
Tasks
350
Audited
57%
200 / 350
Major
5%
9
Minor
24%
48
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Partially audited · Science · NeurIPS D&B poster
Tasks
500
Audited
40%
200 / 500
Major
19%
38
Minor
18%
36
PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
Audited · Science · NeurIPS D&B poster
Tasks
97
Audited
100%
97 / 97
Major
8%
8
Minor
11%
11
QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design
Audited · Science · NeurIPS D&B poster
Tasks
28
Audited
100%
28 / 28
Major
25%
7
Minor
11%
3
Scaling Physical Reasoning with the PHYSICS Dataset
Partially audited · Science · NeurIPS D&B poster
Tasks
2,000
Audited
10%
200 / 2,000
Major
33%
65
Minor
14%
27
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Partially audited · Science · NeurIPS D&B poster
Tasks
1,660
Audited
12%
200 / 1,660
Major
36%
71
Minor
32%
64
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Partially audited · Science · NeurIPS D&B poster
Tasks
26,529
Audited
2%
500 / 26,529
Major
29%
144
Minor
15%
75
Multimodal
AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
772
Audited
26%
200 / 772
Major
65%
129
Minor
8%
15
AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
7,916
Audited
3%
200 / 7,916
Major
32%
63
Minor
40%
80
BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
899
Audited
22%
200 / 899
Major
36%
71
Minor
18%
35
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
40,916
Audited
<1%
200 / 40,916
Major
25%
49
Minor
16%
31
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
15,708
Audited
1%
200 / 15,708
Major
40%
79
Minor
11%
22
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
11,631
Audited
2%
200 / 11,631
Major
1%
1
Minor
6%
12
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
1,162
Audited
17%
200 / 1,162
Major
2%
3
Minor
7%
14
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
10,507
Audited
2%
200 / 10,507
Major
14%
29
Minor
9%
17
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
5,885
Audited
3%
200 / 5,885
Major
5%
9
Minor
8%
15
CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
Partially audited · Multimodal · NeurIPS D&B oral
Tasks
50,927
Audited
<1%
200 / 50,927
Major
12%
23
Minor
14%
28
DisasterM3
Partially audited · Multimodal
Tasks
30,042
Audited
2%
500 / 30,042
Major
41%
207
Minor
11%
56
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Audited · Multimodal · NeurIPS D&B poster
Tasks
124
Audited
100%
124 / 124
Major
17%
21
Minor
31%
39
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
3,600
Audited
6%
200 / 3,600
Major
0%
0
Minor
3%
5
FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges
Partially audited · Multimodal · NeurIPS D&B spotlight
Tasks
760
Audited
26%
200 / 760
Major
79%
158
Minor
18%
35
Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video
Audited · Multimodal · NeurIPS D&B spotlight
Tasks
5
Audited
100%
5 / 5
Major
100%
5
Minor
0%
0
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
7,211
Audited
7%
500 / 7,211
Major
5%
26
Minor
32%
160
Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
1,200
Audited
17%
200 / 1,200
Major
0%
0
Minor
0%
0
InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
58,857
Audited
<1%
200 / 58,857
Major
30%
60
Minor
10%
20
IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
15,000
Audited
1%
200 / 15,000
Major
88%
176
Minor
10%
19
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
806
Audited
25%
200 / 806
Major
1%
1
Minor
2%
4
MLLM-ISU: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models based Intrusion Scene Understanding
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
3,000
Audited
7%
200 / 3,000
Major
69%
138
Minor
12%
24
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
11,497
Audited
2%
200 / 11,497
Major
19%
38
Minor
26%
52
MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
79,794
Audited
<1%
200 / 79,794
Major
71%
141
Minor
16%
32
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
2,000
Audited
10%
200 / 2,000
Major
11%
21
Minor
5%
10
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Partially audited · Multimodal · NeurIPS D&B spotlight
Tasks
2,374
Audited
8%
200 / 2,374
Major
3%
6
Minor
3%
6
MMLongBench
Partially audited · Multimodal
Tasks
7,801
Audited
6%
500 / 7,801
Major
5%
25
Minor
13%
64
MMMU-Pro
Partially audited · Multimodal
Tasks
1,730
Audited
29%
501 / 1,730
Major
12%
62
Minor
8%
38
MMPB: It’s Time for Multi-Modal Personalization
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
10,017
Audited
2%
200 / 10,017
Major
8%
16
Minor
18%
35
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
5,083
Audited
4%
200 / 5,083
Major
2%
4
Minor
4%
7
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
10,000
Audited
5%
500 / 10,000
Major
28%
142
Minor
14%
69
PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
11,881
Audited
2%
200 / 11,881
Major
37%
74
Minor
46%
91
RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
803
Audited
25%
200 / 803
Major
23%
45
Minor
4%
7
Robo2VLM: Improving Visual Question Answering using Large-Scale Robot Manipulation Data
Partially audited · Multimodal · NeurIPS D&B spotlight
Tasks
6,676
Audited
3%
200 / 6,676
Major
90%
180
Minor
6%
11
SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
2,000
Audited
10%
200 / 2,000
Major
17%
33
Minor
12%
23
SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
4,666
Audited
4%
200 / 4,666
Major
20%
39
Minor
29%
58
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
2,400
Audited
8%
200 / 2,400
Major
1%
1
Minor
6%
12
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Partially audited · Multimodal · NeurIPS D&B poster
Tasks
4,221
Audited
5%
200 / 4,221
Major
6%
11
Minor
6%
12
Professional
A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection
Partially audited · Professional · NeurIPS D&B poster
Tasks
353
Audited
57%
200 / 353
Major
72%
144
Minor
13%
25
DABstep
Audited · Professional
Tasks
450
Audited
100%
450 / 450
Major
27%
120
Minor
24%
107
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
Audited · Professional · NeurIPS D&B poster
Tasks
18
Audited
100%
18 / 18
Major
61%
11
Minor
6%
1
GDPval AA
Audited · Professional
Tasks
220
Audited
100%
220 / 220
Major
37%
81
Minor
43%
95
OfficeQA Pro
Audited · Professional
Tasks
133
Audited
100%
133 / 133
Major
32%
42
Minor
29%
38
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
Audited · Professional · NeurIPS D&B poster
Tasks
76
Audited
100%
76 / 76
Major
36%
27
Minor
11%
8
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking
Audited · Professional · NeurIPS D&B poster
Tasks
13
Audited
100%
13 / 13
Major
100%
13
Minor
0%
0
Vals Finance Agent
Audited · Professional
Tasks
50
Audited
100%
50 / 50
Major
14%
7
Minor
18%
9
Agentic / Tool Use
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
246
Audited
81%
200 / 246
Major
21%
42
Minor
31%
61
AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios
Partially audited · Agentic / Tool Use · NeurIPS D&B spotlight
Tasks
707
Audited
28%
200 / 707
Major
49%
97
Minor
32%
63
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
Partially audited · Agentic / Tool Use · NeurIPS D&B spotlight
Tasks
1,499
Audited
13%
200 / 1,499
Major
1%
1
Minor
0%
0
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
40
Audited
100%
40 / 40
Major
0%
0
Minor
0%
0
Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
15
Audited
100%
15 / 15
Major
27%
4
Minor
47%
7
Establishing Best Practices in Building Rigorous Agentic Benchmarks
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
269
Audited
74%
200 / 269
Major
8%
15
Minor
20%
40
Factorio Learning Environment
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
30
Audited
100%
30 / 30
Major
7%
2
Minor
0%
0
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
2,279
Audited
9%
200 / 2,279
Major
10%
20
Minor
19%
38
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
4,616
Audited
4%
200 / 4,616
Major
36%
71
Minor
28%
55
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
202
Audited
99%
200 / 202
Major
24%
47
Minor
21%
42
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
201
Audited
>99%
200 / 201
Major
5%
9
Minor
20%
40
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
8
Audited
100%
8 / 8
Major
38%
3
Minor
50%
4
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
463
Audited
43%
200 / 463
Major
27%
54
Minor
13%
25
OSWorld Verified
Audited · Agentic / Tool Use
Tasks
369
Audited
100%
369 / 369
Major
16%
60
Minor
20%
75
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
112
Audited
100%
112 / 112
Major
15%
17
Minor
20%
22
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Partially audited · Agentic / Tool Use · NeurIPS D&B spotlight
Tasks
564
Audited
35%
200 / 564
Major
10%
20
Minor
17%
33
Seeking and Updating with Live Visual Knowledge
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
2,384
Audited
8%
200 / 2,384
Major
43%
85
Minor
17%
33
T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning
Partially audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
9,000
Audited
2%
200 / 9,000
Major
27%
54
Minor
27%
53
Tau2-Bench Telecom
Audited · Agentic / Tool Use
Tasks
114
Audited
100%
114 / 114
Major
0%
0
Minor
4%
5
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
20
Audited
100%
20 / 20
Major
10%
2
Minor
25%
5
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
175
Audited
100%
175 / 175
Major
33%
58
Minor
29%
51
Toolathlon
Audited · Agentic / Tool Use
Tasks
108
Audited
100%
108 / 108
Major
42%
45
Minor
44%
47
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Audited · Agentic / Tool Use · NeurIPS D&B poster
Tasks
21
Audited
100%
21 / 21
Major
19%
4
Minor
19%
4
Coding
Aider Polyglot
Audited · Coding
Tasks
225
Audited
100%
225 / 225
Major
34%
77
Minor
25%
56
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Audited · Coding · NeurIPS D&B poster
Tasks
154
Audited
100%
154 / 154
Major
14%
21
Minor
23%
36
CLEVER: A Curated Benchmark for Formally Verified Code Generation
Audited · Coding · NeurIPS D&B poster
Tasks
161
Audited
100%
161 / 161
Major
44%
71
Minor
10%
16
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
Partially audited · Coding · NeurIPS D&B poster
Tasks
274
Audited
73%
200 / 274
Major
27%
53
Minor
23%
46
CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks
Partially audited · Coding · NeurIPS D&B spotlight
Tasks
23,955
Audited
1%
200 / 23,955
Major
2%
4
Minor
15%
30
CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
Partially audited · Coding · NeurIPS D&B poster
Tasks
19,806
Audited
1%
200 / 19,806
Major
1%
1
Minor
2%
4
Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation
Partially audited · Coding · NeurIPS D&B poster
Tasks
9,104
Audited
2%
200 / 9,104
Major
13%
25
Minor
7%
13
EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Partially audited · Coding · NeurIPS D&B poster
Tasks
623
Audited
32%
200 / 623
Major
2%
3
Minor
8%
16
Evaluating Program Semantics Reasoning with Type Inference in System F
Audited · Coding · NeurIPS D&B poster
Tasks
188
Audited
100%
188 / 188
Major
3%
5
Minor
12%
23
Frontier SWE
Audited · Coding
Tasks
17
Audited
100%
17 / 17
Major
24%
4
Minor
24%
4
GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Audited · Coding · NeurIPS D&B poster
Tasks
102
Audited
100%
102 / 102
Major
5%
5
Minor
19%
19
ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
Audited · Coding · NeurIPS D&B poster
Tasks
118
Audited
100%
118 / 118
Major
27%
32
Minor
7%
8
IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer
Partially audited · Coding · NeurIPS D&B poster
Tasks
170,564
Audited
<1%
200 / 170,564
Major
15%
30
Minor
16%
31
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Partially audited · Coding · NeurIPS D&B poster
Tasks
864
Audited
23%
200 / 864
Major
2%
3
Minor
2%
4
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Partially audited · Coding · NeurIPS D&B poster
Tasks
1,632
Audited
12%
200 / 1,632
Major
22%
43
Minor
17%
34
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation
Partially audited · Coding · NeurIPS D&B poster
Tasks
28,003
Audited
1%
200 / 28,003
Major
32%
64
Minor
12%
23
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
Partially audited · Coding · NeurIPS D&B spotlight
Tasks
212
Audited
94%
200 / 212
Major
7%
14
Minor
10%
19
SWE-bench Goes Live!
Partially audited · Coding · NeurIPS D&B poster
Tasks
1,887
Audited
11%
200 / 1,887
Major
35%
70
Minor
19%
37
SWE-Bench Multilingual
Audited · Coding
Tasks
300
Audited
100%
300 / 300
Major
4%
12
Minor
10%
30
SWE-Bench Pro
Audited · Coding
Tasks
731
Audited
100%
731 / 731
Major
12%
85
Minor
19%
142
SWE-Bench Verified
Audited · Coding
Tasks
500
Audited
100%
500 / 500
Major
4%
21
Minor
7%
36
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Partially audited · Coding · NeurIPS D&B poster
Tasks
750
Audited
27%
200 / 750
Major
25%
49
Minor
19%
38
Terminal-Bench v2
Audited · Coding
Tasks
89
Audited
100%
89 / 89
Major
6%
5
Minor
21%
19
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Audited · Coding · NeurIPS D&B oral
Tasks
101
Audited
100%
101 / 101
Major
22%
22
Minor
54%
55
Medical
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
Partially audited · Medical · NeurIPS D&B poster
Tasks
2,514
Audited
8%
200 / 2,514
Major
91%
181
Minor
6%
11
ClinBench: A Standardized Multi-Domain Framework for Evaluating Large Language Models in Clinical Information Extraction
Partially audited · Medical · NeurIPS D&B poster
Tasks
2,876
Audited
7%
200 / 2,876
Major
99%
197
Minor
1%
2
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
Audited · Medical · NeurIPS D&B poster
Tasks
16
Audited
100%
16 / 16
Major
81%
13
Minor
19%
3
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Audited · Medical · NeurIPS D&B spotlight
Tasks
12
Audited
100%
12 / 12
Major
17%
2
Minor
50%
6
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
Partially audited · Medical · NeurIPS D&B poster
Tasks
7,789
Audited
3%
200 / 7,789
Major
17%
34
Minor
31%
62
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Partially audited · Medical · NeurIPS D&B poster
Tasks
6,832
Audited
7%
500 / 6,832
Major
3%
17
Minor
7%
35
LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Audited · Medical · NeurIPS D&B poster
Tasks
183
Audited
100%
183 / 183
Major
10%
19
Minor
15%
27
MedChain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence
Partially audited · Medical · NeurIPS D&B spotlight
Tasks
2,362
Audited
21%
500 / 2,362
Major
83%
415
Minor
11%
53
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
Partially audited · Medical · NeurIPS D&B poster
Tasks
9,467
Audited
2%
200 / 9,467
Major
32%
64
Minor
14%
28
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding
Partially audited · Medical · NeurIPS D&B spotlight
Tasks
9,630
Audited
2%
200 / 9,630
Major
33%
65
Minor
22%
44
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology
Partially audited · Medical · NeurIPS D&B poster
Tasks
573
Audited
87%
500 / 573
Major
22%
109
Minor
15%
76
PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions
Audited · Medical · NeurIPS D&B spotlight
Tasks
3
Audited
100%
3 / 3
Major
0%
0
Minor
0%
0
Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models
Partially audited · Medical · NeurIPS D&B poster
Tasks
990
Audited
20%
200 / 990
Major
56%
113
Minor
15%
30
SMMILE: An expert-driven benchmark for multimodal medical in-context learning
Partially audited · Medical · NeurIPS D&B poster
Tasks
1,149
Audited
17%
200 / 1,149
Major
6%
12
Minor
11%
21
TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
Partially audited · Medical · NeurIPS D&B poster
Tasks
29,262
Audited
1%
200 / 29,262
Major
62%
123
Minor
9%
18
Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations
Partially audited · Medical · NeurIPS D&B spotlight
Tasks
500
Audited
40%
200 / 500
Major
100%
200
Minor
0%
0
Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Partially audited · Medical · NeurIPS D&B poster
Tasks
1,069
Audited
47%
500 / 1,069
Major
20%
100
Minor
24%
118
Math
AIME 2024 + 2025
Audited · Math
Tasks
60
Audited
100%
60 / 60
Major
3%
2
Minor
0%
0
Benchmarking Large Language Models with Integer Sequence Generation Tasks
Partially audited · Math · NeurIPS D&B poster
Tasks
1,000
Audited
20%
200 / 1,000
Major
15%
30
Minor
10%
19
ConnectomeBench: Can LLMs proofread the connectome?
Partially audited · Math · NeurIPS D&B spotlight
Tasks
1,827
Audited
11%
200 / 1,827
Major
3%
5
Minor
9%
17
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class
Partially audited · Math · NeurIPS D&B poster
Tasks
215
Audited
93%
200 / 215
Major
36%
71
Minor
17%
34
IMOAnswerBench
Audited · Math
Tasks
400
Audited
100%
400 / 400
Major
4%
14
Minor
9%
35
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving of Inequalities
Partially audited · Math · NeurIPS D&B poster
Tasks
375
Audited
53%
200 / 375
Major
6%
11
Minor
0%
0
LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language
Partially audited · Math · NeurIPS D&B poster
Tasks
666
Audited
30%
200 / 666
Major
39%
77
Minor
25%
49
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Partially audited · Math · NeurIPS D&B poster
Tasks
951
Audited
53%
500 / 951
Major
1%
6
Minor
2%
9
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Partially audited · Math · NeurIPS D&B poster
Tasks
1,145,824
Audited
<1%
200 / 1,145,824
Major
62%
124
Minor
14%
27
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
Partially audited · Math · NeurIPS D&B poster
Tasks
9,715
Audited
2%
200 / 9,715
Major
19%
37
Minor
6%
12
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics
Partially audited · Math · NeurIPS D&B poster
Tasks
1,286
Audited
16%
200 / 1,286
Major
6%
12
Minor
8%
16
Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Partially audited · Math · NeurIPS D&B spotlight
Tasks
5,000
Audited
4%
200 / 5,000
Major
3%
6
Minor
6%
11
SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
Partially audited · Math · NeurIPS D&B poster
Tasks
3,113
Audited
6%
200 / 3,113
Major
7%
13
Minor
4%
8
Solving Inequality Proofs with Large Language Models
Audited · Math · NeurIPS D&B spotlight
Tasks
300
Audited
100%
300 / 300
Major
7%
21
Minor
8%
24
Retrieval / RAG
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
Partially audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
4,055
Audited
12%
500 / 4,055
Major
28%
139
Minor
33%
166
C-SEO Bench: Does Conversational SEO Work?
Partially audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
1,921
Audited
26%
500 / 1,921
Major
<1%
1
Minor
2%
11
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Partially audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
10,787
Audited
2%
200 / 10,787
Major
10%
20
Minor
23%
45
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
Partially audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
672
Audited
74%
500 / 672
Major
14%
72
Minor
15%
77
HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks
Partially audited · Retrieval / RAG · NeurIPS D&B spotlight
Tasks
1,600
Audited
13%
200 / 1,600
Major
28%
56
Minor
33%
65
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
10
Audited
100%
10 / 10
Major
20%
2
Minor
30%
3
MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study
Partially audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
8,729
Audited
2%
200 / 8,729
Major
21%
42
Minor
44%
88
Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals
Partially audited · Retrieval / RAG · NeurIPS D&B poster
Tasks
2,648
Audited
8%
200 / 2,648
Major
27%
54
Minor
18%
36
Safety / Alignment
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Partially audited · Safety / Alignment · NeurIPS D&B oral
Tasks
1,350
Audited
37%
500 / 1,350
Major
52%
258
Minor
6%
28
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
4,200
Audited
12%
500 / 4,200
Major
45%
223
Minor
28%
141
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
4,721
Audited
4%
200 / 4,721
Major
100%
200
Minor
0%
0
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
18,477
Audited
1%
200 / 18,477
Major
27%
53
Minor
3%
6
CHASM: Unveiling Covert Advertisements on Chinese Social Media
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
1,000
Audited
20%
200 / 1,000
Major
17%
33
Minor
16%
31
Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
116
Audited
100%
116 / 116
Major
100%
116
Minor
0%
0
DataSIR: A Benchmark Dataset for Sensitive Information Recognition
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
1,647,501
Audited
<1%
200 / 1,647,501
Major
46%
92
Minor
36%
71
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
150
Audited
100%
150 / 150
Major
8%
12
Minor
55%
83
GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
128,376
Audited
<1%
200 / 128,376
Major
14%
29
Minor
11%
22
InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
Partially audited · Safety / Alignment · NeurIPS D&B spotlight
Tasks
1,948
Audited
10%
200 / 1,948
Major
70%
139
Minor
24%
47
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
48
Audited
100%
48 / 48
Major
2%
1
Minor
8%
4
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Audited · Safety / Alignment · NeurIPS D&B spotlight
Tasks
110
Audited
100%
110 / 110
Major
7%
8
Minor
11%
12
PUO-Bench: A Panel Understanding and Operation Benchmark with A Privacy-Preserving Framework
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
7,711
Audited
3%
200 / 7,711
Major
22%
44
Minor
21%
42
Risk Management for Mitigating Benchmark Failure Modes: BenchRisk
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
27
Audited
100%
27 / 27
Major
96%
26
Minor
4%
1
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
1
Audited
100%
1 / 1
Major
0%
0
Minor
0%
0
SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
Partially audited · Safety / Alignment · NeurIPS D&B spotlight
Tasks
1,105
Audited
18%
200 / 1,105
Major
3%
6
Minor
8%
15
SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
10,596
Audited
2%
200 / 10,596
Major
96%
192
Minor
3%
5
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
44
Audited
100%
44 / 44
Major
27%
12
Minor
36%
16
Towards Evaluating Proactive Risk Awareness of Multimodal Language Models
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
416
Audited
48%
200 / 416
Major
11%
21
Minor
13%
25
UMU-Bench: Closing the Modality Gap in Multimodal Unlearning Evaluation
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
653
Audited
31%
200 / 653
Major
71%
142
Minor
19%
37
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Audited · Safety / Alignment · NeurIPS D&B poster
Tasks
4
Audited
100%
4 / 4
Major
50%
2
Minor
0%
0
VMDT: Decoding the Trustworthiness of Video Foundation Models
Partially audited · Safety / Alignment · NeurIPS D&B poster
Tasks
13,880
Audited
1%
200 / 13,880
Major
19%
37
Minor
13%
26