Skip to content

Support Multi-step Agent Tasks in Evaluation Framework (e.g., GAIA's Multi-turn Tasks) #55

@RuishanFang

Description

@RuishanFang

Description:
Currently, our evaluation framework primarily supports single-turn or simple interaction tasks. However, many real-world agent scenarios, such as those in the GAIA benchmark , involve multi-turn dialogues and complex, sequential decision-making processes.

To better support these advanced use cases, we need to enhance the framework to:

  • Support multi-turn task evaluations
  • Track agent behavior across multiple steps
  • Provide metrics for task completion, dialogue flow, and agent performance in multi-step scenarios

Proposed Features:

  • Support for logging and analyzing agent responses across turns
  • Integration with existing evaluation metrics (e.g., accuracy, reward, task success)
  • Example implementations for benchmarks like GAIA

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions