Browse benchmarks

168 of 168 benchmarks

Science

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

Partially audited

Science · NeurIPS D&B poster

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Partially audited

Science · NeurIPS D&B poster

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Partially audited

Science · NeurIPS D&B poster

CellVerse: Do Large Language Models Really Understand Cell Biology?

Partially audited

Science · NeurIPS D&B poster

ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

Audited

Science · NeurIPS D&B poster

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Partially audited

Science · NeurIPS D&B poster

GPQA Diamond

HLE (Humanity's Last Exam)

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Partially audited

Science · NeurIPS D&B poster

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Partially audited

Science · NeurIPS D&B poster

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Audited

Science · NeurIPS D&B poster

QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design

Audited

Science · NeurIPS D&B poster

Scaling Physical Reasoning with the PHYSICS Dataset

Partially audited

Science · NeurIPS D&B poster

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Partially audited

Science · NeurIPS D&B poster

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Partially audited

Science · NeurIPS D&B poster

Multimodal

AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Partially audited

Multimodal · NeurIPS D&B poster

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

Partially audited

Multimodal · NeurIPS D&B poster

BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

Partially audited

Multimodal · NeurIPS D&B poster

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Partially audited

Multimodal · NeurIPS D&B poster

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Partially audited

Multimodal · NeurIPS D&B poster

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Partially audited

Multimodal · NeurIPS D&B poster

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Partially audited

Multimodal · NeurIPS D&B poster

CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Partially audited

Multimodal · NeurIPS D&B poster

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Partially audited

Multimodal · NeurIPS D&B poster

CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

Partially audited

Multimodal · NeurIPS D&B oral

DisasterM3

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Audited

Multimodal · NeurIPS D&B poster

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

Partially audited

Multimodal · NeurIPS D&B poster

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Partially audited

Multimodal · NeurIPS D&B spotlight

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video

Audited

Multimodal · NeurIPS D&B spotlight

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Partially audited

Multimodal · NeurIPS D&B poster

Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

Partially audited

Multimodal · NeurIPS D&B poster

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Partially audited

Multimodal · NeurIPS D&B poster

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Partially audited

Multimodal · NeurIPS D&B poster

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Partially audited

Multimodal · NeurIPS D&B poster

MLLM-ISU: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models based Intrusion Scene Understanding

Partially audited

Multimodal · NeurIPS D&B poster

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

Partially audited

Multimodal · NeurIPS D&B poster

MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

Partially audited

Multimodal · NeurIPS D&B poster

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Partially audited

Multimodal · NeurIPS D&B poster

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Partially audited

Multimodal · NeurIPS D&B poster

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Partially audited

Multimodal · NeurIPS D&B poster

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Partially audited

Multimodal · NeurIPS D&B poster

RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs

Partially audited

Multimodal · NeurIPS D&B poster

Robo2VLM: Improving Visual Question Answering using Large-Scale Robot Manipulation Data

Partially audited

Multimodal · NeurIPS D&B spotlight

SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning

Partially audited

Multimodal · NeurIPS D&B poster

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models

Partially audited

Multimodal · NeurIPS D&B poster

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Partially audited

Multimodal · NeurIPS D&B poster

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Partially audited

Multimodal · NeurIPS D&B poster

Professional

A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

Partially audited

Professional · NeurIPS D&B poster

DABstep

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Audited

Professional · NeurIPS D&B poster

GDPval AA

OfficeQA Pro

STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Audited

Professional · NeurIPS D&B poster

Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking

Audited

Professional · NeurIPS D&B poster

Vals Finance Agent

Agentic / Tool Use

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

Partially audited

Agentic / Tool Use · NeurIPS D&B spotlight

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

Partially audited

Agentic / Tool Use · NeurIPS D&B spotlight

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Audited

Agentic / Tool Use · NeurIPS D&B poster

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Audited

Agentic / Tool Use · NeurIPS D&B poster

Establishing Best Practices in Building Rigorous Agentic Benchmarks

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

Factorio Learning Environment

Audited

Agentic / Tool Use · NeurIPS D&B poster

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Audited

Agentic / Tool Use · NeurIPS D&B poster

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

OSWorld Verified

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Audited

Agentic / Tool Use · NeurIPS D&B poster

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Partially audited

Agentic / Tool Use · NeurIPS D&B spotlight

Seeking and Updating with Live Visual Knowledge

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Partially audited

Agentic / Tool Use · NeurIPS D&B poster

Tau2-Bench Telecom

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Audited

Agentic / Tool Use · NeurIPS D&B poster

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Audited

Agentic / Tool Use · NeurIPS D&B poster

Toolathlon

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Audited

Agentic / Tool Use · NeurIPS D&B poster

Coding

Aider Polyglot

AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Audited

Coding · NeurIPS D&B poster

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Audited

Coding · NeurIPS D&B poster

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Partially audited

Coding · NeurIPS D&B poster

CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks

Partially audited

Coding · NeurIPS D&B spotlight

CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming

Partially audited

Coding · NeurIPS D&B poster

Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

Partially audited

Coding · NeurIPS D&B poster

EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Partially audited

Coding · NeurIPS D&B poster

Evaluating Program Semantics Reasoning with Type Inference in System $F$

Audited

Coding · NeurIPS D&B poster

Frontier SWE

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Audited

Coding · NeurIPS D&B poster

ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Audited

Coding · NeurIPS D&B poster

IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer

Partially audited

Coding · NeurIPS D&B poster

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Partially audited

Coding · NeurIPS D&B poster

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Partially audited

Coding · NeurIPS D&B poster

PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

Partially audited

Coding · NeurIPS D&B poster

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Partially audited

Coding · NeurIPS D&B spotlight

SWE-bench Goes Live!

Partially audited

Coding · NeurIPS D&B poster

SWE-Bench Multilingual

SWE-Bench Pro

SWE-Bench Verified

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Partially audited

Coding · NeurIPS D&B poster

Terminal-Bench v2

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Audited

Coding · NeurIPS D&B oral

Medical

CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Partially audited

Medical · NeurIPS D&B poster

ClinBench: A Standardized Multi-Domain Framework for Evaluating Large Language Models in Clinical Information Extraction

Partially audited

Medical · NeurIPS D&B poster

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Audited

Medical · NeurIPS D&B poster

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Audited

Medical · NeurIPS D&B spotlight

DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

Partially audited

Medical · NeurIPS D&B poster

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Partially audited

Medical · NeurIPS D&B poster

LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Audited

Medical · NeurIPS D&B poster

MedChain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence

Partially audited

Medical · NeurIPS D&B spotlight

MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants

Partially audited

Medical · NeurIPS D&B poster

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Partially audited

Medical · NeurIPS D&B spotlight

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Partially audited

Medical · NeurIPS D&B poster

PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Audited

Medical · NeurIPS D&B spotlight

Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Partially audited

Medical · NeurIPS D&B poster

SMMILE: An expert-driven benchmark for multimodal medical in-context learning

Partially audited

Medical · NeurIPS D&B poster

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Partially audited

Medical · NeurIPS D&B poster

Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

Partially audited

Medical · NeurIPS D&B spotlight

Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Partially audited

Medical · NeurIPS D&B poster

Math

AIME 2024 + 2025

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Partially audited

Math · NeurIPS D&B poster

ConnectomeBench: Can LLMs proofread the connectome?

Partially audited

Math · NeurIPS D&B spotlight

HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

Partially audited

Math · NeurIPS D&B poster

IMOAnswerBench

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving of Inequalities

Partially audited

Math · NeurIPS D&B poster

LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Partially audited

Math · NeurIPS D&B poster

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Partially audited

Math · NeurIPS D&B poster

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Partially audited

Math · NeurIPS D&B poster

OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

Partially audited

Math · NeurIPS D&B poster

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Partially audited

Math · NeurIPS D&B poster

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Partially audited

Math · NeurIPS D&B spotlight

SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry

Partially audited

Math · NeurIPS D&B poster

Solving Inequality Proofs with Large Language Models

Audited

Math · NeurIPS D&B spotlight

Retrieval / RAG

Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

Partially audited

Retrieval / RAG · NeurIPS D&B poster

C-SEO Bench: Does Conversational SEO Work?

Partially audited

Retrieval / RAG · NeurIPS D&B poster

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Partially audited

Retrieval / RAG · NeurIPS D&B poster

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Partially audited

Retrieval / RAG · NeurIPS D&B poster

HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

Partially audited

Retrieval / RAG · NeurIPS D&B spotlight

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Audited

Retrieval / RAG · NeurIPS D&B poster

MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study

Partially audited

Retrieval / RAG · NeurIPS D&B poster

Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

Partially audited

Retrieval / RAG · NeurIPS D&B poster

Safety / Alignment

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Partially audited

Safety / Alignment · NeurIPS D&B oral

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

Partially audited

Safety / Alignment · NeurIPS D&B poster

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Partially audited

Safety / Alignment · NeurIPS D&B poster

CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Partially audited

Safety / Alignment · NeurIPS D&B poster

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Partially audited

Safety / Alignment · NeurIPS D&B poster

Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Audited

Safety / Alignment · NeurIPS D&B poster

DataSIR: A Benchmark Dataset for Sensitive Information Recognition

Partially audited

Safety / Alignment · NeurIPS D&B poster

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

Audited

Safety / Alignment · NeurIPS D&B poster

GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset

Partially audited

Safety / Alignment · NeurIPS D&B poster

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Partially audited

Safety / Alignment · NeurIPS D&B spotlight

OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Audited

Safety / Alignment · NeurIPS D&B poster

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Audited

Safety / Alignment · NeurIPS D&B spotlight

PUO-Bench: A Panel Understanding and Operation Benchmark with A Privacy-Preserving Framework

Partially audited

Safety / Alignment · NeurIPS D&B poster

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk

Audited

Safety / Alignment · NeurIPS D&B poster

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Audited

Safety / Alignment · NeurIPS D&B poster

SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts

Partially audited

Safety / Alignment · NeurIPS D&B spotlight

SECODEPLT: A Unified Benchmark for Evaluating the Security Risks and Capabilities of Code GenAI

Partially audited

Safety / Alignment · NeurIPS D&B poster

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Audited

Safety / Alignment · NeurIPS D&B poster

Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Partially audited

Safety / Alignment · NeurIPS D&B poster

UMU-Bench: Closing the Modality Gap in Multimodal Unlearning Evaluation

Partially audited

Safety / Alignment · NeurIPS D&B poster

Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

Audited

Safety / Alignment · NeurIPS D&B poster

VMDT: Decoding the Trustworthiness of Video Foundation Models

Partially audited

Safety / Alignment · NeurIPS D&B poster

Science

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

CellVerse: Do Large Language Models Really Understand Cell Biology?

ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

GPQA Diamond

HLE (Humanity's Last Exam)

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

QCircuitBench: A Large-Scale Dataset for Benchmarking Quantum Algorithm Design

Scaling Physical Reasoning with the PHYSICS Dataset

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Multimodal

AgMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

AnomalyCoT: A Multi-Scenario Chain-of-Thought Dataset for Multimodal Large Language Models

BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

DisasterM3

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360° Firefighting Video

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

MLLM-ISU: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models based Intrusion Scene Understanding

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MMLongBench

MMMU-Pro

MMPB: It’s Time for Multi-Modal Personalization

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

RBench-V: A Primary Assessment for Visual Reasoning Models with Multimodal Outputs

Robo2VLM: Improving Visual Question Answering using Large-Scale Robot Manipulation Data

SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Professional

A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

DABstep

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

GDPval AA

OfficeQA Pro

STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking

Vals Finance Agent

Agentic / Tool Use

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Establishing Best Practices in Building Rigorous Agentic Benchmarks

Factorio Learning Environment

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

OSWorld Verified

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis