Browse benchmarks

168 of 168 benchmarks

Science

15

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

Partially audited
Science · NeurIPS D&B poster
Tasks
432
Audited
46%
200 / 432
Major
46%
91
Minor
35%
69

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Partially audited
Science · NeurIPS D&B poster
Tasks
2,641
Audited
8%
200 / 2,641
Major
20%
40
Minor
19%
37

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Partially audited
Science · NeurIPS D&B poster
Tasks
1,485
Audited
34%
500 / 1,485
Major
32%
159
Minor
13%
67

CellVerse: Do Large Language Models Really Understand Cell Biology?

Partially audited
Science · NeurIPS D&B poster
Tasks
1,822
Audited
11%
200 / 1,822
Major
37%
74
Minor
27%
53

ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

Audited
Science · NeurIPS D&B poster
Tasks
10
Audited
100%
10 / 10
Major
60%
6
Minor
30%
3

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Partially audited
Science · NeurIPS D&B poster
Tasks
7,146
Audited
3%
200 / 7,146
Major
9%
18
Minor
5%
10

GPQA Diamond

Audited
Science
Tasks
198
Audited
100%
198 / 198
Major
3%
5
Minor
5%
9

HLE (Humanity's Last Exam)

Partially audited
Science
Tasks
2,500
Audited
20%
500 / 2,500
Major
28%
140
Minor
14%
69

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Partially audited
Science · NeurIPS D&B poster
Tasks
350
Audited
57%
200 / 350
Major
5%
9
Minor
24%
48

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Partially audited
Science · NeurIPS D&B poster
Tasks
500
Audited
40%
200 / 500
Major
19%
38
Minor
18%
36

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Audited
Science · NeurIPS D&B poster
Tasks
97
Audited
100%
97 / 97
Major
8%
8
Minor
11%
11

QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design

Audited
Science · NeurIPS D&B poster
Tasks
28
Audited
100%
28 / 28
Major
25%
7
Minor
11%
3

Scaling Physical Reasoning with the PHYSICS Dataset

Partially audited
Science · NeurIPS D&B poster
Tasks
2,000
Audited
10%
200 / 2,000
Major
33%
65
Minor
14%
27

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Partially audited
Science · NeurIPS D&B poster
Tasks
1,660
Audited
12%
200 / 1,660
Major
36%
71
Minor
32%
64

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Partially audited
Science · NeurIPS D&B poster
Tasks
26,529
Audited
2%
500 / 26,529
Major
29%
144
Minor
15%
75

Multimodal

37

AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
772
Audited
26%
200 / 772
Major
65%
129
Minor
8%
15

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
7,916
Audited
3%
200 / 7,916
Major
32%
63
Minor
40%
80

BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
899
Audited
22%
200 / 899
Major
36%
71
Minor
18%
35

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
40,916
Audited
<1%
200 / 40,916
Major
25%
49
Minor
16%
31

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
15,708
Audited
1%
200 / 15,708
Major
40%
79
Minor
11%
22

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
11,631
Audited
2%
200 / 11,631
Major
1%
1
Minor
6%
12

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
1,162
Audited
17%
200 / 1,162
Major
2%
3
Minor
7%
14

CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
10,507
Audited
2%
200 / 10,507
Major
14%
29
Minor
9%
17

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
5,885
Audited
3%
200 / 5,885
Major
5%
9
Minor
8%
15

CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

Partially audited
Multimodal · NeurIPS D&B oral
Tasks
50,927
Audited
<1%
200 / 50,927
Major
12%
23
Minor
14%
28

DisasterM3

Partially audited
Multimodal
Tasks
30,042
Audited
2%
500 / 30,042
Major
41%
207
Minor
11%
56

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Audited
Multimodal · NeurIPS D&B poster
Tasks
124
Audited
100%
124 / 124
Major
17%
21
Minor
31%
39

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
3,600
Audited
6%
200 / 3,600
Major
0%
0
Minor
3%
5

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Partially audited
Multimodal · NeurIPS D&B spotlight
Tasks
760
Audited
26%
200 / 760
Major
79%
158
Minor
18%
35

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video

Audited
Multimodal · NeurIPS D&B spotlight
Tasks
5
Audited
100%
5 / 5
Major
100%
5
Minor
0%
0

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
7,211
Audited
7%
500 / 7,211
Major
5%
26
Minor
32%
160

Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
1,200
Audited
17%
200 / 1,200
Major
0%
0
Minor
0%
0

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
58,857
Audited
<1%
200 / 58,857
Major
30%
60
Minor
10%
20

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
15,000
Audited
1%
200 / 15,000
Major
88%
176
Minor
10%
19

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
806
Audited
25%
200 / 806
Major
1%
1
Minor
2%
4

MLLM-ISU: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models based Intrusion Scene Understanding

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
3,000
Audited
7%
200 / 3,000
Major
69%
138
Minor
12%
24

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
11,497
Audited
2%
200 / 11,497
Major
19%
38
Minor
26%
52

MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
79,794
Audited
<1%
200 / 79,794
Major
71%
141
Minor
16%
32

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
2,000
Audited
10%
200 / 2,000
Major
11%
21
Minor
5%
10

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Partially audited
Multimodal · NeurIPS D&B spotlight
Tasks
2,374
Audited
8%
200 / 2,374
Major
3%
6
Minor
3%
6

MMLongBench

Partially audited
Multimodal
Tasks
7,801
Audited
6%
500 / 7,801
Major
5%
25
Minor
13%
64

MMMU-Pro

Partially audited
Multimodal
Tasks
1,730
Audited
29%
501 / 1,730
Major
12%
62
Minor
8%
38

MMPB: It’s Time for Multi-Modal Personalization

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
10,017
Audited
2%
200 / 10,017
Major
8%
16
Minor
18%
35

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
5,083
Audited
4%
200 / 5,083
Major
2%
4
Minor
4%
7

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
10,000
Audited
5%
500 / 10,000
Major
28%
142
Minor
14%
69

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
11,881
Audited
2%
200 / 11,881
Major
37%
74
Minor
46%
91

RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
803
Audited
25%
200 / 803
Major
23%
45
Minor
4%
7

Robo2VLM: Improving Visual Question Answering using Large-Scale Robot Manipulation Data

Partially audited
Multimodal · NeurIPS D&B spotlight
Tasks
6,676
Audited
3%
200 / 6,676
Major
90%
180
Minor
6%
11

SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
2,000
Audited
10%
200 / 2,000
Major
17%
33
Minor
12%
23

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
4,666
Audited
4%
200 / 4,666
Major
20%
39
Minor
29%
58

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
2,400
Audited
8%
200 / 2,400
Major
1%
1
Minor
6%
12

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Partially audited
Multimodal · NeurIPS D&B poster
Tasks
4,221
Audited
5%
200 / 4,221
Major
6%
11
Minor
6%
12

Professional

8

Agentic / Tool Use

23

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
246
Audited
81%
200 / 246
Major
21%
42
Minor
31%
61

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

Partially audited
Agentic / Tool Use · NeurIPS D&B spotlight
Tasks
707
Audited
28%
200 / 707
Major
49%
97
Minor
32%
63

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Partially audited
Agentic / Tool Use · NeurIPS D&B spotlight
Tasks
1,499
Audited
13%
200 / 1,499
Major
1%
1
Minor
0%
0

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
40
Audited
100%
40 / 40
Major
0%
0
Minor
0%
0

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
15
Audited
100%
15 / 15
Major
27%
4
Minor
47%
7

Establishing Best Practices in Building Rigorous Agentic Benchmarks

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
269
Audited
74%
200 / 269
Major
8%
15
Minor
20%
40

Factorio Learning Environment

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
30
Audited
100%
30 / 30
Major
7%
2
Minor
0%
0

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
2,279
Audited
9%
200 / 2,279
Major
10%
20
Minor
19%
38

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
4,616
Audited
4%
200 / 4,616
Major
36%
71
Minor
28%
55

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
202
Audited
99%
200 / 202
Major
24%
47
Minor
21%
42

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
201
Audited
>99%
200 / 201
Major
5%
9
Minor
20%
40

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
8
Audited
100%
8 / 8
Major
38%
3
Minor
50%
4

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
463
Audited
43%
200 / 463
Major
27%
54
Minor
13%
25

OSWorld Verified

Audited
Agentic / Tool Use
Tasks
369
Audited
100%
369 / 369
Major
16%
60
Minor
20%
75

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
112
Audited
100%
112 / 112
Major
15%
17
Minor
20%
22

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Partially audited
Agentic / Tool Use · NeurIPS D&B spotlight
Tasks
564
Audited
35%
200 / 564
Major
10%
20
Minor
17%
33

Seeking and Updating with Live Visual Knowledge

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
2,384
Audited
8%
200 / 2,384
Major
43%
85
Minor
17%
33

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Partially audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
9,000
Audited
2%
200 / 9,000
Major
27%
54
Minor
27%
53

Tau2-Bench Telecom

Audited
Agentic / Tool Use
Tasks
114
Audited
100%
114 / 114
Major
0%
0
Minor
4%
5

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
20
Audited
100%
20 / 20
Major
10%
2
Minor
25%
5

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
175
Audited
100%
175 / 175
Major
33%
58
Minor
29%
51

Toolathlon

Audited
Agentic / Tool Use
Tasks
108
Audited
100%
108 / 108
Major
42%
45
Minor
44%
47

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Audited
Agentic / Tool Use · NeurIPS D&B poster
Tasks
21
Audited
100%
21 / 21
Major
19%
4
Minor
19%
4

Coding

24

Aider Polyglot

Audited
Coding
Tasks
225
Audited
100%
225 / 225
Major
34%
77
Minor
25%
56

AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Audited
Coding · NeurIPS D&B poster
Tasks
154
Audited
100%
154 / 154
Major
14%
21
Minor
23%
36

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Audited
Coding · NeurIPS D&B poster
Tasks
161
Audited
100%
161 / 161
Major
44%
71
Minor
10%
16

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Partially audited
Coding · NeurIPS D&B poster
Tasks
274
Audited
73%
200 / 274
Major
27%
53
Minor
23%
46

CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks

Partially audited
Coding · NeurIPS D&B spotlight
Tasks
23,955
Audited
1%
200 / 23,955
Major
2%
4
Minor
15%
30

CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming

Partially audited
Coding · NeurIPS D&B poster
Tasks
19,806
Audited
1%
200 / 19,806
Major
1%
1
Minor
2%
4

Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

Partially audited
Coding · NeurIPS D&B poster
Tasks
9,104
Audited
2%
200 / 9,104
Major
13%
25
Minor
7%
13

EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Partially audited
Coding · NeurIPS D&B poster
Tasks
623
Audited
32%
200 / 623
Major
2%
3
Minor
8%
16

Evaluating Program Semantics Reasoning with Type Inference in System F

Audited
Coding · NeurIPS D&B poster
Tasks
188
Audited
100%
188 / 188
Major
3%
5
Minor
12%
23

Frontier SWE

Audited
Coding
Tasks
17
Audited
100%
17 / 17
Major
24%
4
Minor
24%
4

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Audited
Coding · NeurIPS D&B poster
Tasks
102
Audited
100%
102 / 102
Major
5%
5
Minor
19%
19

ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Audited
Coding · NeurIPS D&B poster
Tasks
118
Audited
100%
118 / 118
Major
27%
32
Minor
7%
8

IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer

Partially audited
Coding · NeurIPS D&B poster
Tasks
170,564
Audited
<1%
200 / 170,564
Major
15%
30
Minor
16%
31

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Partially audited
Coding · NeurIPS D&B poster
Tasks
864
Audited
23%
200 / 864
Major
2%
3
Minor
2%
4

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Partially audited
Coding · NeurIPS D&B poster
Tasks
1,632
Audited
12%
200 / 1,632
Major
22%
43
Minor
17%
34

PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

Partially audited
Coding · NeurIPS D&B poster
Tasks
28,003
Audited
1%
200 / 28,003
Major
32%
64
Minor
12%
23

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Partially audited
Coding · NeurIPS D&B spotlight
Tasks
212
Audited
94%
200 / 212
Major
7%
14
Minor
10%
19

SWE-bench Goes Live!

Partially audited
Coding · NeurIPS D&B poster
Tasks
1,887
Audited
11%
200 / 1,887
Major
35%
70
Minor
19%
37

SWE-Bench Multilingual

Audited
Coding
Tasks
300
Audited
100%
300 / 300
Major
4%
12
Minor
10%
30

SWE-Bench Pro

Audited
Coding
Tasks
731
Audited
100%
731 / 731
Major
12%
85
Minor
19%
142

SWE-Bench Verified

Audited
Coding
Tasks
500
Audited
100%
500 / 500
Major
4%
21
Minor
7%
36

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Partially audited
Coding · NeurIPS D&B poster
Tasks
750
Audited
27%
200 / 750
Major
25%
49
Minor
19%
38

Terminal-Bench v2

Audited
Coding
Tasks
89
Audited
100%
89 / 89
Major
6%
5
Minor
21%
19

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Audited
Coding · NeurIPS D&B oral
Tasks
101
Audited
100%
101 / 101
Major
22%
22
Minor
54%
55

Medical

17

CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Partially audited
Medical · NeurIPS D&B poster
Tasks
2,514
Audited
8%
200 / 2,514
Major
91%
181
Minor
6%
11

ClinBench: A Standardized Multi-Domain Framework for Evaluating Large Language Models in Clinical Information Extraction

Partially audited
Medical · NeurIPS D&B poster
Tasks
2,876
Audited
7%
200 / 2,876
Major
99%
197
Minor
1%
2

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Audited
Medical · NeurIPS D&B poster
Tasks
16
Audited
100%
16 / 16
Major
81%
13
Minor
19%
3

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Audited
Medical · NeurIPS D&B spotlight
Tasks
12
Audited
100%
12 / 12
Major
17%
2
Minor
50%
6

DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

Partially audited
Medical · NeurIPS D&B poster
Tasks
7,789
Audited
3%
200 / 7,789
Major
17%
34
Minor
31%
62

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Partially audited
Medical · NeurIPS D&B poster
Tasks
6,832
Audited
7%
500 / 6,832
Major
3%
17
Minor
7%
35

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Audited
Medical · NeurIPS D&B poster
Tasks
183
Audited
100%
183 / 183
Major
10%
19
Minor
15%
27

MedChain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence

Partially audited
Medical · NeurIPS D&B spotlight
Tasks
2,362
Audited
21%
500 / 2,362
Major
83%
415
Minor
11%
53

MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants

Partially audited
Medical · NeurIPS D&B poster
Tasks
9,467
Audited
2%
200 / 9,467
Major
32%
64
Minor
14%
28

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Partially audited
Medical · NeurIPS D&B spotlight
Tasks
9,630
Audited
2%
200 / 9,630
Major
33%
65
Minor
22%
44

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Partially audited
Medical · NeurIPS D&B poster
Tasks
573
Audited
87%
500 / 573
Major
22%
109
Minor
15%
76

PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Audited
Medical · NeurIPS D&B spotlight
Tasks
3
Audited
100%
3 / 3
Major
0%
0
Minor
0%
0

Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Partially audited
Medical · NeurIPS D&B poster
Tasks
990
Audited
20%
200 / 990
Major
56%
113
Minor
15%
30

SMMILE: An expert-driven benchmark for multimodal medical in-context learning

Partially audited
Medical · NeurIPS D&B poster
Tasks
1,149
Audited
17%
200 / 1,149
Major
6%
12
Minor
11%
21

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Partially audited
Medical · NeurIPS D&B poster
Tasks
29,262
Audited
1%
200 / 29,262
Major
62%
123
Minor
9%
18

Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

Partially audited
Medical · NeurIPS D&B spotlight
Tasks
500
Audited
40%
200 / 500
Major
100%
200
Minor
0%
0

Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Partially audited
Medical · NeurIPS D&B poster
Tasks
1,069
Audited
47%
500 / 1,069
Major
20%
100
Minor
24%
118

Math

14

AIME 2024 + 2025

Audited
Math
Tasks
60
Audited
100%
60 / 60
Major
3%
2
Minor
0%
0

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Partially audited
Math · NeurIPS D&B poster
Tasks
1,000
Audited
20%
200 / 1,000
Major
15%
30
Minor
10%
19

ConnectomeBench: Can LLMs proofread the connectome?

Partially audited
Math · NeurIPS D&B spotlight
Tasks
1,827
Audited
11%
200 / 1,827
Major
3%
5
Minor
9%
17

HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

Partially audited
Math · NeurIPS D&B poster
Tasks
215
Audited
93%
200 / 215
Major
36%
71
Minor
17%
34

IMOAnswerBench

Audited
Math
Tasks
400
Audited
100%
400 / 400
Major
4%
14
Minor
9%
35

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving of Inequalities

Partially audited
Math · NeurIPS D&B poster
Tasks
375
Audited
53%
200 / 375
Major
6%
11
Minor
0%
0

LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Partially audited
Math · NeurIPS D&B poster
Tasks
666
Audited
30%
200 / 666
Major
39%
77
Minor
25%
49

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Partially audited
Math · NeurIPS D&B poster
Tasks
951
Audited
53%
500 / 951
Major
1%
6
Minor
2%
9

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Partially audited
Math · NeurIPS D&B poster
Tasks
1,145,824
Audited
<1%
200 / 1,145,824
Major
62%
124
Minor
14%
27

OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

Partially audited
Math · NeurIPS D&B poster
Tasks
9,715
Audited
2%
200 / 9,715
Major
19%
37
Minor
6%
12

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Partially audited
Math · NeurIPS D&B poster
Tasks
1,286
Audited
16%
200 / 1,286
Major
6%
12
Minor
8%
16

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Partially audited
Math · NeurIPS D&B spotlight
Tasks
5,000
Audited
4%
200 / 5,000
Major
3%
6
Minor
6%
11

SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry

Partially audited
Math · NeurIPS D&B poster
Tasks
3,113
Audited
6%
200 / 3,113
Major
7%
13
Minor
4%
8

Solving Inequality Proofs with Large Language Models

Audited
Math · NeurIPS D&B spotlight
Tasks
300
Audited
100%
300 / 300
Major
7%
21
Minor
8%
24

Retrieval / RAG

8

Safety / Alignment

22

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Partially audited
Safety / Alignment · NeurIPS D&B oral
Tasks
1,350
Audited
37%
500 / 1,350
Major
52%
258
Minor
6%
28

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
4,200
Audited
12%
500 / 4,200
Major
45%
223
Minor
28%
141

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
4,721
Audited
4%
200 / 4,721
Major
100%
200
Minor
0%
0

CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
18,477
Audited
1%
200 / 18,477
Major
27%
53
Minor
3%
6

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
1,000
Audited
20%
200 / 1,000
Major
17%
33
Minor
16%
31

Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
116
Audited
100%
116 / 116
Major
100%
116
Minor
0%
0

DataSIR: A Benchmark Dataset for Sensitive Information Recognition

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
1,647,501
Audited
<1%
200 / 1,647,501
Major
46%
92
Minor
36%
71

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
150
Audited
100%
150 / 150
Major
8%
12
Minor
55%
83

GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
128,376
Audited
<1%
200 / 128,376
Major
14%
29
Minor
11%
22

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Partially audited
Safety / Alignment · NeurIPS D&B spotlight
Tasks
1,948
Audited
10%
200 / 1,948
Major
70%
139
Minor
24%
47

OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
48
Audited
100%
48 / 48
Major
2%
1
Minor
8%
4

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Audited
Safety / Alignment · NeurIPS D&B spotlight
Tasks
110
Audited
100%
110 / 110
Major
7%
8
Minor
11%
12

PUO-Bench: A Panel Understanding and Operation Benchmark with A Privacy-Preserving Framework

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
7,711
Audited
3%
200 / 7,711
Major
22%
44
Minor
21%
42

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
27
Audited
100%
27 / 27
Major
96%
26
Minor
4%
1

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
1
Audited
100%
1 / 1
Major
0%
0
Minor
0%
0

SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts

Partially audited
Safety / Alignment · NeurIPS D&B spotlight
Tasks
1,105
Audited
18%
200 / 1,105
Major
3%
6
Minor
8%
15

SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
10,596
Audited
2%
200 / 10,596
Major
96%
192
Minor
3%
5

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
44
Audited
100%
44 / 44
Major
27%
12
Minor
36%
16

Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
416
Audited
48%
200 / 416
Major
11%
21
Minor
13%
25

UMU-Bench: Closing the Modality Gap in Multimodal Unlearning Evaluation

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
653
Audited
31%
200 / 653
Major
71%
142
Minor
19%
37

Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

Audited
Safety / Alignment · NeurIPS D&B poster
Tasks
4
Audited
100%
4 / 4
Major
50%
2
Minor
0%
0

VMDT: Decoding the Trustworthiness of Video Foundation Models

Partially audited
Safety / Alignment · NeurIPS D&B poster
Tasks
13,880
Audited
1%
200 / 13,880
Major
19%
37
Minor
13%
26