A curated list of benchmarks, environments, papers, competitions, and open-source platforms for evaluating AI agents in interactive, tool-using, dynamic, and production-like settings.
Agentic evaluation is different from plain LLM evaluation: agents plan, call tools, interact with users, change environment state, recover from failures, and operate across long horizons. This list focuses on interactive and environment-grounded evaluation, with an emphasis on open-source work from research labs, large companies, and startups.
- Key Concepts And Glossary
- Scope
- Foundational Benchmark Lineages
- Task Standards And Interoperability
- Competitions And Benchmark Platforms
- Benchmark Rigor And Methodology
- Benchmarks By Domain
- Holistic And Cross-Cutting Evaluation Frameworks
- Open-Source Eval Tooling, Tracing, And Observability
- Company-Backed Open Source
- State-Of-The-Art Practices For Agentic Evaluation
- Design Patterns For Agentic Evaluation
- Conference Tutorials And Workshops
- Contributing
- License
This section defines core terms used throughout this guide. Whether you come from software engineering, ML research, product management, or policy, these definitions should help you navigate the landscape.
| Term | Definition |
|---|---|
| Agent | An AI system that perceives its environment, makes decisions, and takes actions autonomously over multiple steps to achieve a goal. Unlike a simple chatbot that answers one question at a time, an agent maintains state, plans ahead, and adapts its behavior based on feedback. |
| Agentic Evaluation | The practice of measuring how well an AI agent performs in realistic, interactive settings — not just whether it gets the right answer, but whether it follows the right process, uses tools correctly, recovers from errors, and respects constraints. |
| Benchmark | A standardized test suite with predefined tasks, inputs, expected outputs, and scoring criteria. Benchmarks enable reproducible comparison of different agents or models under controlled conditions. |
| Tool Use / Function Calling | The ability of an agent to invoke external functions, APIs, or services (e.g., search engines, databases, code interpreters) as part of its reasoning process. This is a core capability that separates agents from plain language models. |
| Multi-Turn Interaction | A conversation or task that unfolds over multiple exchanges between the agent and a user, environment, or other agent. Each turn can change the state and influence subsequent decisions. |
| Trajectory | The complete sequence of actions, observations, and decisions an agent makes while attempting a task. Trajectory evaluation judges the quality of the process, not just the final outcome. |
| pass^k | A reliability metric introduced by τ-bench. Instead of measuring whether an agent can succeed (like pass@k which checks if any of k attempts succeeds), pass^k measures whether an agent succeeds on all k independent attempts — testing consistency and reliability. |
| Outcome vs. Process Scoring | Outcome scoring checks only the final result (did the agent produce the correct answer?). Process scoring evaluates intermediate steps (did the agent use the right tools? Did it follow the correct policy? Did it ask clarifying questions when appropriate?). |
| Static vs. Dynamic Environments | A static environment stays the same across evaluations. A dynamic environment changes — new data arrives, APIs break, conditions shift — forcing the agent to adapt. Dynamic environments better reflect production reality. |
| Single-Control vs. Dual-Control | In single-control settings, only the agent acts on the environment. In dual-control settings (like τ²-bench), both the agent and the simulated user can change the shared environment, introducing coordination challenges. |
| Execution-Based Judging | Evaluating the agent by actually running its outputs (e.g., executing generated code, applying a patch, running a command) rather than using pattern matching or LLM-based text comparison. This provides stronger correctness guarantees. |
| LLM-as-Judge | Using another language model to evaluate an agent's outputs. While scalable, this approach can introduce biases (verbosity bias, self-preference) and requires careful calibration. |
| Assessor Agent | An AI agent specifically designed to evaluate other agents (as in the AgentBeats framework). This enables scalable, automated evaluation using agent-to-agent protocols. |
| Sandbox / Gym Environment | An isolated, controlled environment (often containerized) where agents can be safely tested. Inspired by OpenAI Gym, these environments provide standardized observation and action interfaces. |
| Contamination | When a model has seen benchmark data during training, inflating its scores beyond its true capability. A major validity concern for all benchmarks, especially static ones. |
This list prioritizes resources that evaluate one or more of the following:
- Multi-turn interaction — agents that maintain context and act across multiple conversational exchanges
- Tool use and environment state changes — agents that call APIs, execute code, modify files, or interact with external systems
- User coordination, policy following, and safety constraints — agents that must respect rules, ask clarifying questions, and coordinate with humans
- Process quality, not just final answers — evaluating how an agent solves a problem, including intermediate reasoning and decision-making
- Reproducibility, realism, and benchmark validity — benchmarks that resist gaming, reflect real-world conditions, and produce consistent results
- Production feedback loops — using traces, logs, and regression testing to continuously evaluate deployed agents
Generic static LLM benchmarks are out of scope unless they are directly useful for evaluating agents.
These are benchmark families that have shaped how the community thinks about agent evaluation. Understanding their design choices helps contextualize newer work.
The τ (tau) family focuses on the triangle of interactions between a tool, an agent, and a user. These benchmarks model realistic customer-service-style scenarios where an agent must follow business policies while helping a simulated user accomplish tasks via tool calls.
- τ-bench — Foundational benchmark for tool-agent-user interaction in real-world domains (airline, retail). Introduces policy-aware conversational tasks and the pass^k reliability metric that measures consistency across repeated attempts, not just best-case performance. The repository notes that some original tasks are outdated and points readers toward newer τ-lineage work.
- τ-bench paper — Core paper that established the TAU framing for evaluating interactive customer-service-style agents.
- τ²-Bench — Dual-control extension where both the agent and the simulated user can act on the shared environment through tools. This surfaces coordination failures that single-control setups miss entirely — e.g., the user booking a flight while the agent is simultaneously trying to modify the same reservation.
- τ²-Bench paper — Reference for coordination failures, user guidance, and shared-world interaction.
The SWE-bench family evaluates whether agents can resolve real software engineering tasks drawn from actual GitHub issue-PR pairs. It has become the canonical benchmark for coding agents and has spawned multiple extensions.
- SWE-bench — Canonical benchmark: given a repository and a GitHub issue description, generate a patch that resolves the problem. Tasks are drawn from real open-source projects (Django, scikit-learn, sympy, etc.) with real test suites for execution-based verification. Originally developed at Princeton, now maintained across Princeton and Stanford. ICLR 2024 Oral.
- SWE-bench Verified — A curated subset of 500 problems verified by professional software engineers as solvable and unambiguous. Developed in collaboration with OpenAI Preparedness. (OpenAI report)
- SWE-bench Multimodal — Extension to visual software domains (UI bugs, plot rendering issues), testing whether agents can generalize beyond text-only code. ICLR 2025.
- SWE-agent — Reference agent implementation that established strong baselines on SWE-bench and demonstrated the importance of agent-computer interface design.
- SWE-smith — Dedicated toolkit for creating synthetic SWE training data at scale, addressing data scarcity for training coding agents.
- AgentBench — Multi-dimensional benchmark across 8 distinct environments (operating system, database, knowledge graph, card game, lateral thinking, web shopping, web browsing, house-holding) for evaluating LLM-as-Agent reasoning and decision-making. Identified that poor long-term reasoning, decision-making, and instruction following are the main bottlenecks. ICLR 2024. (paper)
Task standards define a common format for writing evaluation tasks so that different organizations can share and reuse each other's work without reimplementing everything from scratch. This is critical because building high-quality tasks is expensive.
- METR Task Standard — A common format for defining AI agent evaluation tasks, developed by METR (formerly ARC Evals), a non-profit focused on evaluating frontier model autonomy. Tasks specify an environment (container/VM), instructions, and a scoring function. Already used to define 200+ task families across AI R&D, cybersecurity, and general autonomy. Includes adaptors for running existing benchmarks (GAIA, AgentBench, SWE-bench, picoCTF, HumanEval, GPQA) in the METR format.
- A2A (Agent-to-Agent Protocol) — Google's open protocol for inter-agent communication. Relevant to evaluation because it enables standardized assessor-agent interactions (as used in the AgentBeats framework).
- MCP (Model Context Protocol) — Anthropic's open protocol for connecting AI models to data sources and tools. Provides a standard interface for tool-use evaluation.
Competitions drive the field forward by providing standardized leaderboards, incentivizing reproducibility, and surfacing novel evaluation methodologies.
- AgentBeats — Open platform for agent evaluation where benchmarks themselves are packaged as assessor agents. This "agents evaluating agents" model scales evaluation beyond manually curated test suites.
- AgentX-AgentBeats Competition — UC Berkeley RDI-hosted competition focused on building benchmark agents and contestant agents for public-good evaluation tasks. Demonstrates the AAA (Agent, Assessor, Arena) paradigm.
- agentify-example-tau-bench — Example of wrapping an existing benchmark (τ-bench) into the AgentBeats ecosystem, illustrating how to convert traditional benchmarks into assessor-agent format.
- AgentBeats Tutorials — Starter material for integrating A2A-compatible agents and assessments.
- AgentBeats Info Session Deck — Background on the AAA framework, assessor agents, A2A and MCP standardization, registries, traces, and leaderboards.
- Berkeley Function Calling Leaderboard (BFCL) — UC Berkeley's live leaderboard for evaluating function-calling capabilities of LLMs. Covers simple, parallel, multiple, and multi-turn function calling across Python, Java, JavaScript, and REST APIs. V4 adds agentic web search, memory management, and format sensitivity tests. (GitHub, paper)
A benchmark is only useful if its results are trustworthy. This section covers work on understanding and improving the quality of agent benchmarks themselves — meta-evaluation, if you will.
- Establishing Best Practices for Building Rigorous Agentic Benchmarks — Introduces the Agentic Benchmark Checklist (ABC) and highlights common task-validity, outcome-validity, and reporting weaknesses in agent benchmarks. Essential reading before publishing or trusting any benchmark.
Questions worth asking before trusting a benchmark:
| Dimension | What to check |
|---|---|
| Task Validity | Are tasks representative of real work? Are instructions unambiguous? Could a competent human solve them? |
| Evaluator Validity | Does the scoring function actually measure what it claims? Are there false positives or false negatives? |
| Gaming Resistance | Can the benchmark be solved by shortcuts (e.g., memorization, pattern matching) that don't reflect genuine capability? |
| Failure Transparency | Are failure modes, edge cases, and caveats reported clearly? |
| Production Relevance | Does benchmark performance predict real-world deployment quality? |
| Contamination Risk | Could models have seen the test data during training? Are there mechanisms to detect or prevent this? |
| Reproducibility | Can results be independently reproduced? Are environments, prompts, and scoring deterministic? |
These benchmarks test an agent's ability to carry on multi-turn conversations, decide when and how to call tools, handle ambiguous requests, and follow policies or constraints.
- APIGen-MT — Multi-turn, tool-calling data and evaluation setup closely related to τ-style interaction. Generates verifiable multi-turn API-calling trajectories.
- FlowBench — Evaluates tool-using agents on tasks with explicit workflow knowledge and planning structure. Tests whether agents can follow prescribed workflows rather than ad-hoc reasoning.
- Gorilla / Berkeley Function Calling — UC Berkeley's Gorilla project includes both a function-calling LLM and the BFCL evaluation suite. The leaderboard tracks LLM ability to generate correct API calls across languages and complexity levels. (GitHub)
- IntellAgent — Synthetic test-suite generation pipeline for policy-rich conversational agents. Automatically generates evaluation scenarios from policy documents, explicitly positioned as a scalable proxy to τ-style evaluation.
- ToolSandbox — Stateful tool environment with fine-grained evaluation of partial progress and agent behavior. Measures not just final success but intermediate tool-use quality.
These benchmarks evaluate agents that interact with real web browsers, operating systems, or graphical interfaces. They test perception (understanding screen content), planning (deciding what to click/type), and execution (performing actions correctly).
- AgentLab — Open-source framework for developing, testing, and benchmarking web agents at scale. Provides reproducible evaluation harnesses across multiple web benchmarks.
- BrowserGym — Gym-like environment ecosystem for browser-agent evaluation. Defines a standardized observation-action space for web agents, enabling apples-to-apples comparison.
- ClawBench — First live-site agent benchmark: 153 everyday tasks across 144 real production websites in 15 categories, evaluated safely via proxy interception so agents drive the live web without touching real accounts or payments. MIT-licensed; 7 reference models reported. (paper)
- OSWorld — Multimodal benchmark for open-ended tasks in full computer environments (Ubuntu desktop). Tests agents on real-world tasks like spreadsheet editing, web browsing, and system administration using screenshot and accessibility-tree observations.
- OSWorld-G — Grounding-oriented extension in the OSWorld family, focusing on fine-grained visual grounding in desktop environments.
- WebArena — Realistic, self-hostable web environment with functional websites (e-shopping, forums, maps, content management) for testing autonomous web agents on complex, multi-step tasks. Carnegie Mellon University. (paper)
- WebArena-Verified — Verified release focused on more reproducible web-agent benchmarking through human-validated task correctness.
- VisualWebArena — Extension of WebArena adding visually grounded tasks that require understanding images, screenshots, and visual layouts on the web. (paper)
These benchmarks test agents on real-world software engineering workflows: fixing bugs, writing tests, debugging CI failures, configuring environments, and conducting ML experiments. They typically use execution-based evaluation (running test suites) for reliable scoring.
- DevOps-Gym — End-to-end benchmark for the DevOps lifecycle, including configuration, monitoring, issue resolution, and test generation. UC Santa Barbara.
- MLE-bench — Benchmark for machine-learning engineering and research workflows based on Kaggle competitions. Tests whether agents can perform data analysis, feature engineering, model training, and submission preparation. OpenAI.
- MLGym — Meta's gym-style environment and benchmark for machine-learning research tasks. Provides standardized environments for testing ML research agents.
- Multi-SWE-bench — Extends SWE-style evaluation to multi-language settings (Java, TypeScript, Rust, Go) and more diverse software issues.
- SWE-bench — The canonical benchmark for resolving real GitHub issues in real repositories. Uses Docker-based evaluation for reproducibility. Princeton NLP / Stanford. See Foundational Benchmark Lineages for full details.
- SWT-bench — Benchmark centered on regression-test generation for software agents. Tests whether agents can write tests that catch bugs, rather than fixing them.
These benchmarks evaluate agents on security-relevant tasks: finding vulnerabilities, exploiting bugs, solving capture-the-flag challenges, and performing security analysis. They require agents to reason about code, systems, and adversarial scenarios.
- AutoAdvExBench — Benchmark family for adversarial exploit-generation settings.
- BountyBench — Cybersecurity benchmark oriented toward bug-bounty-style vulnerability discovery in real software.
- CVE-Bench — Benchmark built around vulnerability analysis and CVE-centric tasks, testing whether agents can analyze and reproduce known vulnerabilities.
- CyberGym — Large-scale execution-based benchmark for reproducing real-world vulnerabilities from real projects. UC Berkeley.
- Cybench — Comprehensive cyber benchmark covering multiple CTF categories, often cited in newer benchmark-comparison work.
- NYU CTF Bench — Capture-the-flag style benchmark for cybersecurity agents. NYU.
These benchmarks assess whether AI agents pose risks through autonomous capabilities like self-replication, deception, resource acquisition, or resisting oversight. Critical for frontier model evaluation and policy decisions.
- Dangerous Capability Evaluations — Google DeepMind's evaluation suite covering in-house CTF challenges, self-proliferation tasks, and self-reasoning tasks. Tests whether frontier models can autonomously replicate, acquire resources, or reason about their own situation. (paper)
- Anthropic Model-Written Evaluations — Datasets testing models for persona consistency, sycophancy, advanced AI risks (power-seeking, self-preservation), and bias. Demonstrates the approach of using language models to generate evaluation data at scale. (paper)
- HELM Safety Leaderboard — Stanford CRFM's aggregated safety benchmark covering 6 risk vectors across popular safety benchmarks. Part of the broader HELM framework.
- AIR-Bench — Safety benchmark based on emerging government regulations and company policies, evaluating model compliance with real-world safety requirements. Stanford CRFM.
Multi-agent evaluation is an emerging area that tests how agents collaborate, coordinate, or compete with each other. This is increasingly important as deployed systems involve multiple cooperating agents.
- AutoGen Bench — Benchmarking suite integrated within Microsoft's AutoGen multi-agent framework. Evaluates multi-agent orchestration performance on standard tasks.
- Magentic-One — Microsoft's generalist multi-agent system built on AutoGen. Handles web browsing, code execution, and file handling tasks, serving as both a reference agent implementation and an evaluation subject. (paper)
- AgentBench — While primarily a single-agent benchmark, its 8 diverse environments serve as testbeds for multi-agent architectures. See Foundational Benchmark Lineages.
These benchmarks aim to measure agent capabilities on tasks that resemble actual human work — information gathering, analysis, multi-step problem solving, and professional expertise.
- GAIA — Benchmark for general AI assistants that require multi-step reasoning, tool use, and real-world problem solving. Tasks are designed so that a human can verify the answer easily but solving requires genuine capability. Meta / HuggingFace.
- GDPval — Benchmark for economically valuable expert work across occupations and sectors. Maps agent capabilities to real economic value.
- GDPval in Inspect Evals — Example of integrating GDPval into the UK AI Safety Institute's broader open evaluation ecosystem.
- How Well Does Agent Development Reflect Real-World Work? — Critical perspective on how current agent development and benchmarks map to actual real-world work practices and requirements.
- GoEX (Gorilla Execution Engine) — UC Berkeley's runtime for executing LLM-generated actions (API calls, code) with "post-facto validation," "undo," and "damage confinement" abstractions. Directly relevant to evaluating agents in production where actions have real consequences. (paper)
Static benchmarks eventually saturate and become gameable. These resources address the fundamental challenge of creating evaluation environments that change over time, preventing memorization and testing true adaptation.
- AUTOENV — Environment-generation direction for scalable and adaptive agent evaluation. Automatically generates new evaluation environments rather than relying on hand-crafted ones.
- The World Won't Stay Still: Programmable Evolution for Agent Benchmarks — Argues that static environments are insufficient and introduces programmable environment evolution, where the benchmark itself can change its tasks and conditions over time.
- ToolQA-D — Dynamic evaluation setup for tool-use agents operating in changing conditions. Tests whether agents can adapt when the data behind their tools changes.
These frameworks aim to evaluate models and agents across multiple dimensions simultaneously — capabilities, safety, fairness, robustness, and efficiency — rather than focusing on a single benchmark.
- HELM (Holistic Evaluation of Language Models) — Stanford CRFM's comprehensive open-source framework for evaluating foundation models across many dimensions: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Maintains multiple leaderboards covering capabilities, safety, vision-language (VHELM), text-to-image (HEIM), medical (MedHELM), financial, multilingual, and domain-specific evaluations. A common reference point for holistic model evaluation. (paper)
- Inspect Evals — Large open collection of benchmark implementations maintained by the UK AI Safety Institute. Provides a unified framework for running many different benchmarks with consistent tooling, including GAIA, GDPval, and others.
These tools help you build, run, and manage evaluations. They provide the infrastructure for defining test cases, running agents against them, and collecting results.
- agentevals — Agent-focused evaluators with emphasis on trajectory and intermediate-step assessment. Supports both reference-based and reference-free evaluation of agent behavior.
- DeepEval — Pytest-like framework for evaluating LLM systems, including multi-step workflows. Provides built-in metrics (faithfulness, answer relevancy, hallucination) and integrates with CI/CD pipelines.
- Inspect Evals — Large open collection of benchmark implementations and adapters maintained by the UK AI Safety Institute. Designed for reproducible evaluations with standardized tooling.
- OpenAI Evals — General-purpose open-source eval framework and registry for LLM systems. Established many of the conventions now used across the ecosystem.
- OpenEvals — Readymade evaluators for LLM applications and custom eval suites. Covers common evaluation patterns (correctness, style, safety) out of the box.
Production agent evaluation requires visibility into what agents are actually doing. These tools provide logging, tracing, and analysis capabilities for understanding agent behavior at scale.
- agenttrace — Local-first TUI and CLI for evaluating AI coding agent session traces with cost, token, latency, and health regression gates.
- Braintrust Python SDK — SDK for logging, tracing, datasets, and evaluations. Supports A/B testing of prompts and model configurations.
- Braintrust TypeScript SDK — JavaScript and TypeScript side of the Braintrust tooling stack.
- Langfuse — Open-source LLM engineering platform for monitoring, evals, prompts, and debugging. Provides production-grade tracing with cost tracking and latency analysis.
- Phoenix — Open-source observability platform for experimentation, evaluation, and troubleshooting. Strong support for embedding visualization and drift detection.
Understanding who backs which projects helps you gauge long-term maintenance commitments, potential biases, and ecosystem compatibility.
| Organization | Key Contributions |
|---|---|
| UC Berkeley RDI | AgentBeats competition/platform, CyberGym security benchmark |
| UC Berkeley (Gorilla/BFCL) | Berkeley Function Calling Leaderboard, Gorilla LLM + tools, GoEX execution engine |
| Stanford CRFM | HELM holistic evaluation framework, Safety / Capabilities / VHELM leaderboards, AIR-Bench |
| Princeton NLP | SWE-bench family (canonical coding-agent benchmark), SWE-agent |
| METR | Task Standard for autonomous capability evaluation, frontier model safety assessments |
| Tsinghua University | AgentBench multi-dimensional agent evaluation |
| Carnegie Mellon | WebArena and VisualWebArena web-agent benchmarks |
| Organization | Key Contributions |
|---|---|
| OpenAI | Evals framework, SWE-bench Verified collaboration, MLE-bench |
| Google DeepMind | Dangerous Capability Evaluations, frontier safety assessment |
| Google | A2A (Agent-to-Agent Protocol) for standardized agent communication |
| Anthropic | Model-Written Evaluations, MCP (Model Context Protocol), safety-focused evaluation research |
| Microsoft | AutoGen multi-agent framework with AutoGen Bench, Magentic-One |
| Meta | MLGym ML research agent benchmark, GAIA co-development |
| ServiceNow | AgentLab, BrowserGym, WebArena-Verified |
| Sierra Research | τ-bench / τ²-bench benchmark family |
| UK AI Safety Institute | Inspect Evals evaluation platform |
| Organization | Key Contributions |
|---|---|
| Arize AI | Phoenix observability and evaluation platform |
| Braintrust | Braintrust SDK for tracing and evaluation workflows |
| Confident AI | DeepEval OSS-first evaluation framework |
| LangChain | agentevals trajectory evaluation, OpenEvals general LLM evals |
| Langfuse | Langfuse open-source observability and eval stack |
Drawing on lessons from the benchmarks, papers, and competitions listed above, the following emerging practices represent the current state of the art for evaluating AI agents.
Why: Text-based comparison (exact match, BLEU, LLM-as-judge) is brittle and unreliable for agentic tasks. An agent might produce a correct database query with different formatting, or fix a bug with a different but equally valid approach.
Best practice: Run the agent's outputs in a real environment — execute the code, apply the patch, run the test suite, check the database state. SWE-bench, CyberGym, and DevOps-Gym all use this approach. If execution-based evaluation isn't possible, combine multiple evaluation signals rather than relying on a single LLM judge.
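The idea of judging by execution rather than by text comparison can be sketched as follows. This is a minimal illustration, assuming the agent's output is a Python source string and the checks are plain assertions. `judge_by_execution` and the sample snippets are hypothetical names, not part of any benchmark listed here, and a real harness would run the candidate inside a container rather than a bare subprocess.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def judge_by_execution(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate passes the tests when actually executed."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        # Candidate first, then the assertions that exercise it.
        script.write_text(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hanging candidate counts as a failure
        # A failed assertion or any uncaught error gives a non-zero exit code.
        return result.returncode == 0


# Hypothetical task: implement add(a, b).
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
correct = "def add(a, b):\n    return a + b"
reformatted = "add = lambda a, b: b + a"  # different text, same behavior
buggy = "def add(a, b):\n    return a - b"

for label, code in [("correct", correct), ("reformatted", reformatted), ("buggy", buggy)]:
    print(label, judge_by_execution(code, tests))
```

The `reformatted` candidate shares almost no text with `correct`, so an exact-match comparison would reject it, while execution accepts it because its behavior is the same.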
Why: An agent that succeeds 1 out of 10 times on a task is useless in production but might rank well on pass@10 leaderboards. Real users need consistency.
Best practice: Use metrics like pass^k (introduced by τ-bench) that penalize inconsistency. Report variance across runs. Test with different seeds and prompt formulations. The BFCL leaderboard reports results across multiple API call formats and complexity levels.
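The gap between the two metrics can be made concrete with a small calculation. The sketch below assumes the usual combinatorial per-task estimators, where a task is attempted `n` times and `c` of those attempts succeed. The function names are illustrative, and the exact formulas should be checked against the τ-bench paper before reporting results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn from n with c successes, succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws of k attempts that are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k attempts, drawn from n with c successes, succeed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the draws of k attempts that are all successes.
    return comb(c, k) / comb(n, k)


# The agent from the text: 1 success in 10 attempts.
print(pass_at_k(10, 1, 10))   # 1.0
print(pass_hat_k(10, 1, 10))  # 0.0
```

A benchmark-level score would average these per-task values over all tasks. For `k = 1` the two metrics coincide; they diverge as `k` grows, which is why reporting a range of `k` values is informative.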
Why: A correct final answer can mask dangerous intermediate behavior — the agent might have hallucinated tool calls that happened to produce the right result, or violated safety policies along the way.
Best practice: Score agent trajectories and intermediate steps. Check policy compliance at each turn (as τ-bench does). Use tools like agentevals that support trajectory-level evaluation. Log every tool call, observation, and decision point for post-hoc analysis.
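A trajectory-level check can be sketched like this. The example is a hypothetical illustration, assuming a trajectory is a list of tool calls and the policy is a set of ordering rules of the form "tool X may only be called after tool Y". It does not use the API of agentevals or τ-bench, and the tool names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def score_trajectory(trajectory, required_before):
    """Check ordering rules over a trajectory.

    required_before maps a tool name to the tool that must have been
    called earlier. Returns (compliant, violations).
    """
    seen = set()
    violations = []
    for step, call in enumerate(trajectory):
        prerequisite = required_before.get(call.name)
        if prerequisite is not None and prerequisite not in seen:
            violations.append(f"step {step}: {call.name} called before {prerequisite}")
        seen.add(call.name)
    return (not violations, violations)


# Hypothetical policy: refunds require a prior identity check.
policy = {"issue_refund": "verify_identity"}

good = [ToolCall("verify_identity"), ToolCall("lookup_order"), ToolCall("issue_refund")]
bad = [ToolCall("lookup_order"), ToolCall("issue_refund")]

print(score_trajectory(good, policy))
print(score_trajectory(bad, policy))
```

Both trajectories end with the same refund, so outcome-only scoring would rate them equally. Only the per-step check flags the second one.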
Why: Static benchmarks inevitably leak into training data. Models that have memorized answers appear capable but aren't.
Best practice: Use dynamic environments (AUTOENV, ToolQA-D) that generate fresh tasks. Maintain hidden test sets with private evaluation (as SWE-bench Multimodal does). Regularly refresh benchmark data. Consider "live" evaluation with continuously updated data (as BFCL V2 does with enterprise-contributed real-world scenarios).
Why: Most real agent deployments involve coordination with humans or other agents who can independently change the environment. Single-agent benchmarks miss coordination failures entirely.
Best practice: Use dual-control benchmarks like τ²-bench where both user and agent can act. Test multi-agent scenarios where agents must share resources or information. Evaluate how agents handle conflicting actions or stale state.
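One way to probe stale-state handling is to have the simulated user change the shared environment between the agent's read and its write. The sketch below is a toy harness under that assumption: `SharedReservation`, `StaleStateError`, and the version-check scheme are hypothetical and are not taken from τ²-bench.

```python
class StaleStateError(Exception):
    """Raised when a write is based on an outdated read."""


class SharedReservation:
    """Toy shared environment that both the user and the agent can modify."""

    def __init__(self, seat: str):
        self.seat = seat
        self.version = 0

    def read(self):
        return self.seat, self.version

    def write(self, seat: str, expected_version: int) -> None:
        # Reject writes that were planned against an older state.
        if expected_version != self.version:
            raise StaleStateError(
                f"expected version {expected_version}, found {self.version}"
            )
        self.seat = seat
        self.version += 1


def agent_acts_on_stale_read(env: SharedReservation, user_interleaves: bool) -> str:
    """Simulate an agent that reads, possibly gets interrupted, then writes."""
    _, version_seen_by_agent = env.read()
    if user_interleaves:
        # The simulated user changes the seat after the agent's read.
        env.write("14C", expected_version=env.read()[1])
    try:
        env.write("2A", expected_version=version_seen_by_agent)
        return "applied"
    except StaleStateError:
        return "conflict_detected"


print(agent_acts_on_stale_read(SharedReservation("10B"), user_interleaves=False))
print(agent_acts_on_stale_read(SharedReservation("10B"), user_interleaves=True))
```

An evaluation built on this pattern could then score what the agent does after the conflict, for example whether it re-reads the state and confirms with the user instead of retrying blindly.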
Why: Agents that execute code, call APIs, or modify files can cause real damage. Evaluation infrastructure must be isolated, and every action must be traceable.
Best practice: Use containerized environments (Docker, as SWE-bench does). Implement "undo" and "damage confinement" abstractions (as GoEX does). Deploy comprehensive tracing (Langfuse, Phoenix, Braintrust) for every evaluation run. The METR Task Standard provides a template for isolated evaluation environments.
Why: Many published benchmarks contain ambiguous tasks, incorrect ground truth, or evaluation functions that don't measure what they claim to measure.
Best practice: Follow the ABC (Agentic Benchmark Checklist). Have human experts verify task solvability (as SWE-bench Verified did with professional engineers). Test your evaluation function against known-correct and known-incorrect solutions. Report inter-annotator agreement. Be transparent about failure modes and limitations.
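Testing an evaluation function against labeled solutions can be as simple as the sketch below. It assumes a scorer is a function from a submission to a boolean. `validate_scorer` and both example scorers are hypothetical and exist only to show the check.

```python
def validate_scorer(scorer, known_correct, known_incorrect):
    """Run a scorer on labeled submissions and collect its mistakes."""
    return {
        # Correct submissions the scorer rejected.
        "false_negatives": [s for s in known_correct if not scorer(s)],
        # Incorrect submissions the scorer accepted.
        "false_positives": [s for s in known_incorrect if scorer(s)],
    }


def substring_scorer(answer: str) -> bool:
    # Deliberately weak: accepts any answer that contains "42".
    return "42" in answer


def exact_scorer(answer: str) -> bool:
    # Stricter: the answer must be exactly "42" after trimming whitespace.
    return answer.strip() == "42"


known_correct = ["42", " 42\n"]
known_incorrect = ["420", "The answer is not 42", "41"]

print(validate_scorer(substring_scorer, known_correct, known_incorrect))
print(validate_scorer(exact_scorer, known_correct, known_incorrect))
```

The substring scorer accepts two of the wrong answers. This is the kind of evaluator weakness that a small set of labeled adversarial submissions can surface before a benchmark is published.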
When designing or choosing a benchmark, understanding these fundamental design axes helps clarify what is being measured and what is being missed. A good benchmark makes its stance explicit on each.
| Design Axis | Options | Trade-offs |
|---|---|---|
| Scoring Target | Outcome vs. Process | Outcome is simpler but misses how the agent got there; process is richer but harder to define and score |
| Environment Dynamics | Static vs. Dynamic | Static is reproducible but gameable; dynamic resists memorization but is harder to maintain |
| Control Model | Single-control vs. Dual/Multi-control | Single is simpler; dual/multi better reflects real coordination challenges |
| Judging Method | Text matching vs. LLM judge vs. Execution-based | Execution is strongest but requires sandboxing; LLM judge scales but introduces bias; text matching is too brittle for open-ended tasks |
| Scoring Granularity | End-state matching vs. Trajectory scoring | End-state is pass/fail; trajectory reveals partial progress and failure patterns |
| Evaluation Mode | Offline benchmarking vs. Production feedback loops | Offline is controlled; production feedback captures real deployment challenges but is harder to standardize |
| Task Source | Hand-crafted vs. Programmatic vs. Real-world collected | Hand-crafted is high quality but expensive; programmatic scales but may lack realism; real-world (like SWE-bench) is authentic but noisy |
Curated conference sessions that are directly about agentic evaluation, or that include evaluation frameworks, benchmark design, safety assessment, and interactive-agent testing as a major component.
- RecSys 2025 — Agentic LLM for Recommender Systems — Tutorial on agentic recommender systems that explicitly includes evaluation frameworks and benchmark design principles for future research and practice.
- RecSys 2025 — Multi Agentic Recommender Systems: Foundations, Design Patterns, and E-Commerce Applications — Tutorial covering multi-agent recommender architectures with explicit attention to evaluation, user-behavior simulation, and production-oriented design patterns.
- RecSys 2025 — Recent Advances in Generative Conversational Recommender Systems — Tutorial on multi-turn conversational recommenders that highlights evaluation issues such as hallucinations, social factors, and ethical considerations.
- CHIIR 2026 — Information Seeking in the Age of Agentic AI — Hands-on ACM tutorial explicitly framed around leveraging and evaluating information-seeking systems in the age of agentic AI.
- FSE 2025 — AIOpsLab in Action: An Open Platform for AIOps Research — Software-engineering tutorial introducing AIOpsLab as an open platform and benchmark suite for evaluating AI agents in operational cloud/IT environments.
- ICML 2025 Workshop on Computer Use Agents — Workshop centered on computer-use agents that explicitly asks how to build robust environments and evaluation metrics for real-world deployment.
- ICML 2025 — OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety — Invited workshop talk on a simulation framework for evaluating agent safety across 350+ multi-turn tasks spanning benign and adversarial scenarios with real tool use.
- ICML 2025 — Building and Evaluating Generalist Agents — Invited workshop talk covering OSWorld and AgentArena as interactive platforms for building and evaluating generalist agents.
- ICML 2025 — AgentSearchBench: Evaluating Agentic Search with Agent-as-a-Judge — Workshop paper on rigorous evaluation of frontier agentic search systems with an explicit agent-as-a-judge setup.
- ICML 2025 — ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents — Workshop benchmark contribution focused on safety and trustworthiness evaluation for enterprise-style web agents.
- ICML 2025 — OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents — Workshop spotlight introducing a benchmark specifically for safety evaluation of computer-use agents.
- ICML 2025 — ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows — Workshop spotlight on evaluating multimodal autonomous agents in realistic scientific workflows rather than toy environments.
Contributions are welcome. Please read CONTRIBUTING.md before opening a pull request.
This work is licensed under CC0-1.0.