Add custom trajectory helpers for external agents #45

Open
eldwin-easynet-world wants to merge 1 commit into Future-House:main from eldwin-easynet-world:codex/bixbench-custom-trajectories


Summary

This adds a small custom-trajectory helper module for external agents that already produce BixBench answers/notebooks and need a safe way to validate and write postprocessing-compatible JSON before running a full benchmark.

Changes:

  • add bixbench.custom_trajectories with validate_custom_trajectory, write_custom_trajectory, and minimal_notebook
  • document a no-core-edit custom-agent flow in the README
  • make load_dataframe_from_json_directory tolerate JSON files without a replica_<id> suffix by defaulting to replica 0
  • add focused tests for validation, writer naming, and directory-loader compatibility
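
For reviewers, the replica fallback in the third bullet can be pictured roughly as follows. This is a minimal sketch, not the code in this patch: `replica_from_filename` and the regular expression are hypothetical, and the only behavior taken from the change itself is that a file name without a `replica_<id>` suffix maps to replica 0.

```python
import re
from pathlib import Path

# Hypothetical pattern: a "replica_<digits>" suffix right before the extension.
_REPLICA_SUFFIX = re.compile(r"replica_(\d+)$")


def replica_from_filename(path: str) -> int:
    """Return the replica id encoded in a file name, or 0 if there is none."""
    stem = Path(path).stem  # file name without directory or ".json"
    match = _REPLICA_SUFFIX.search(stem)
    return int(match.group(1)) if match else 0
```

With this shape, a directory that mixes suffixed and unsuffixed JSON files still loads, and unsuffixed files are treated as a single replica.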

Why

The current README tells custom-agent users to edit custom_rollout and produce the same trajectory shape as BixBench trajectories. That is hard to smoke-test before spending credits or running Docker/Hugging Face/full agentic evals. This patch gives external agents a small contract and deterministic checks for the trajectory format used by postprocessing.
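
To make the idea of a small contract concrete, here is a hedged sketch of the kind of deterministic check an external agent could run before a full benchmark. It is not the helper added in this patch: `REQUIRED_FIELDS`, `check_trajectory`, and `dump_trajectory` are hypothetical, and the field names are placeholders rather than the actual BixBench trajectory schema.

```python
import json
from pathlib import Path
from typing import Any

# Placeholder contract: field name -> expected Python type.
# These names are illustrative assumptions, not the real BixBench schema.
REQUIRED_FIELDS: dict[str, type] = {
    "problem_id": str,
    "agent_answer": str,
    "notebook": dict,
}


def check_trajectory(record: dict[str, Any]) -> list[str]:
    """Return a sorted list of problems; an empty list means the record passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return sorted(problems)


def dump_trajectory(record: dict[str, Any], path: Path) -> None:
    """Write the record as JSON only if it passes the check."""
    problems = check_trajectory(record)
    if problems:
        raise ValueError("; ".join(problems))
    # sort_keys keeps the output byte-stable across runs.
    path.write_text(json.dumps(record, sort_keys=True, indent=2), encoding="utf-8")
```

Returning a list of problems instead of raising on the first one lets a smoke test report every missing or mistyped field in a single pass, without needing credits, Docker, or network access.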

Validation

Ran in a local Python 3.12 conda environment:

PYTHONPATH=. conda run -n bixbench-py312 pytest tests/test_custom_trajectories.py -q
PYTHONPATH=. conda run -n bixbench-py312 python -m py_compile bixbench/custom_trajectories.py bixbench/postprocessing_utils.py tests/test_custom_trajectories.py
PYTHONPATH=. conda run -n bixbench-py312 pytest tests -q

Results:

  • tests/test_custom_trajectories.py: 4 passed
  • full local suite: 19 passed, with existing Pydantic v2 deprecation warnings
