Support Multi-step Agent Tasks in Evaluation Framework (e.g., GAIA's Multi-turn Tasks)

**Description:**
Currently, our evaluation framework primarily supports single-turn or simple interaction tasks. However, many real-world agent scenarios, such as those in the [GAIA benchmark](https://huggingface.co/gaia-benchmark), involve multi-turn dialogues and complex, sequential decision-making processes.

To better support these advanced use cases, we need to enhance the framework to:

- Support multi-turn task evaluations
- Track agent behavior across multiple steps
- Provide metrics for task completion, dialogue flow, and agent performance in multi-step scenarios
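
To make the tracking requirement concrete, here is a minimal sketch of a per-task trajectory record. The names `Step`, `Trajectory`, and `log_step` are hypothetical and not part of the current framework; they only illustrate one possible shape for step-level state.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent turn: what the agent saw, what it did, and any reward."""
    turn: int
    observation: str
    action: str
    reward: float = 0.0


@dataclass
class Trajectory:
    """Hypothetical container for all steps of a single task."""
    task_id: str
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None

    def log_step(self, observation: str, action: str, reward: float = 0.0) -> Step:
        # Turn indices are assigned in logging order, starting at 0.
        step = Step(turn=len(self.steps), observation=observation,
                    action=action, reward=reward)
        self.steps.append(step)
        return step

    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)


# Usage: record two turns of a made-up task, then set the final answer.
traj = Trajectory(task_id="demo-task")
traj.log_step("user asks a question", "search the web", reward=0.5)
traj.log_step("search results returned", "answer the user", reward=1.0)
traj.final_answer = "42"
```

Keeping the record as plain dataclasses would make it easy to serialize with `dataclasses.asdict` for later analysis, though the actual storage format is an open design question.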

**Proposed Features:**

- Support for logging and analyzing agent responses across turns
- Integration with existing evaluation metrics (e.g., accuracy, reward, task success)
- Example implementations for benchmarks like GAIA
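
As a rough illustration of how logged multi-turn episodes could feed the existing metrics, the sketch below aggregates task success, step count, and reward. The `summarize` function and the episode dictionary keys are assumptions made for this example, and the case-insensitive exact match is only a stand-in, not GAIA's official scorer.

```python
from typing import Dict, List


def is_success(predicted: str, expected: str) -> bool:
    # Placeholder scorer: exact match after trimming and lowercasing.
    return predicted.strip().lower() == expected.strip().lower()


def summarize(episodes: List[dict]) -> Dict[str, float]:
    """Aggregate hypothetical per-episode logs into summary metrics.

    Each episode is assumed to be a dict with keys:
    'predicted' (str), 'expected' (str), 'rewards' (list of per-turn floats).
    """
    if not episodes:
        raise ValueError("no episodes to summarize")
    n = len(episodes)
    successes = sum(is_success(e["predicted"], e["expected"]) for e in episodes)
    total_steps = sum(len(e["rewards"]) for e in episodes)
    total_reward = sum(sum(e["rewards"]) for e in episodes)
    return {
        "task_success": successes / n,
        "mean_steps": total_steps / n,
        "mean_reward": total_reward / n,
    }


# Usage with two made-up episodes.
episodes = [
    {"predicted": "Paris ", "expected": "paris", "rewards": [0.0, 1.0]},
    {"predicted": "Rome", "expected": "Madrid", "rewards": [0.0, 0.0, 0.0, 0.0]},
]
report = summarize(episodes)
```

Reporting mean steps alongside task success helps distinguish agents that solve a task efficiently from those that reach the same answer after many more turns.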

Support Multi-step Agent Tasks in Evaluation Framework (e.g., GAIA's Multi-turn Tasks) #55
