
Introduce benchmark CLI, model registry, safe-control components, Docker/Compose updates, and tests#45

Open
vtavakkoli wants to merge 2 commits into main from codex/clean-repository-and-implement-main-benchmark-3xah8h


Conversation

@vtavakkoli (Owner)

Motivation

  • Pivot the repository from scenario-run scripts to a unified benchmarking workflow focused on main vs appendix model scopes and safe agentic control evaluation.
  • Provide a compact, testable suite of model interfaces, lightweight model stubs, and a safe policy layer to enable reproducible offline control and forecasting comparisons.
  • Update container orchestration and entrypoints so Docker-driven workflows run the new benchmark/report pipeline by default.

Description

  • Add a new src package with benchmark.py, ranking.py, and report.py that loads model lists from configs/benchmark_models.yaml, synthesizes mock metrics, produces CSV leaderboards, and renders a simple HTML report. Wire the new CLI in as the Docker/Compose entrypoint via CMD and service commands.
  • Create configs/benchmark_models.yaml to declare main, appendix, and optional_foundation_models groups and add many model stubs/implementations under models/ (e.g. gradient_boosting.py, graph_actor_critic_ran.py, masked_graph_ppo_ran.py, safegraphagent_ran.py, base.py) plus a policies/safe_policy_layer.py implementing a safe fallback enforcement function.
  • Update Dockerfile and docker-compose.yml to copy new folders (src, models, policies, configs) and expose new compose targets benchmark-main, benchmark-appendix, benchmark-all, and report; add PYTHONPATH=/app in the image.
  • Refresh README.md to describe the new benchmark scope, commands (python -m src.benchmark --benchmark-scope main), outputs, and scientific notes about pseudo-labels.
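The scope selection described above can be illustrated with a minimal sketch. The PR text does not show the layout of configs/benchmark_models.yaml or the real load_models signature, so the group keys, the assignment of models to groups, the scope names, and the function name `load_models_sketch` are all assumptions; a plain dict stands in for the parsed YAML.

```python
from typing import Dict, List

# Stand-in for the parsed contents of configs/benchmark_models.yaml.
# The key names and which model sits in which group are assumptions.
CONFIG: Dict[str, List[str]] = {
    "main_models": ["gradient_boosting", "safegraphagent_ran"],
    "appendix_models": ["graph_actor_critic_ran", "masked_graph_ppo_ran"],
    "optional_foundation_models": [],
}


def load_models_sketch(scope: str, config: Dict[str, List[str]] = CONFIG) -> List[str]:
    """Return the model names for a benchmark scope (hypothetical helper)."""
    if scope == "main":
        return list(config["main_models"])
    if scope == "appendix":
        return list(config["appendix_models"])
    if scope == "all":
        # Assumption: "all" means main plus appendix, without optional models.
        return list(config["main_models"]) + list(config["appendix_models"])
    raise ValueError(f"unknown benchmark scope: {scope!r}")


main_models = load_models_sketch("main")
all_models = load_models_sketch("all")
```

Rejecting unknown scopes with an explicit error, rather than returning an empty list, keeps a mistyped --benchmark-scope value from silently producing an empty leaderboard.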

Testing

  • Ran the unit test suite with pytest -q, covering tests/test_benchmark_scope.py, tests/test_gradient_boosting_baseline.py, tests/test_report_sections.py, and tests/test_safe_policy_layer.py; all tests passed.
  • Unit tests validate load_models scoping, GradientBoostingBaseline.fit/predict basic behavior, report HTML sections via src.report.main, and fallback behavior of SafePolicyLayer.enforce.
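For readers unfamiliar with the fallback behavior these tests exercise, one possible shape is sketched below. The actual SafePolicyLayer.enforce signature is not shown in this PR text, so the class name `SafePolicyLayerSketch`, the `allowed` argument, and the default `safe_action` value are hypothetical.

```python
from dataclasses import dataclass
from typing import Collection


@dataclass
class SafePolicyLayerSketch:
    """Hypothetical stand-in; not the interface of the real SafePolicyLayer."""

    safe_action: int = 0  # assumed ID of the fallback action

    def enforce(self, action: int, allowed: Collection[int]) -> int:
        # Pass the proposed action through only if it is allowed;
        # otherwise substitute the fixed safe fallback action.
        return action if action in allowed else self.safe_action


layer = SafePolicyLayerSketch(safe_action=0)
kept = layer.enforce(2, allowed={0, 1, 2})      # allowed, so unchanged
replaced = layer.enforce(5, allowed={0, 1, 2})  # not allowed, so fallback
```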

Codex Task

@chatgpt-codex-connector


💡 Codex Review

self.actor = nn.Linear(hidden, num_actions)
self.critic = nn.Linear(hidden, 1)
self.safe = SafePolicyLayer()

P1: Implement forward pass in SafeGraphAgentRAN

SafeGraphAgentRAN subclasses nn.Module but only defines __init__, so invoking the model (e.g., during training/inference) raises NotImplementedError because no forward method exists. Since this model is included in main_models, any pipeline that tries to execute it will fail at runtime instead of producing control outputs.
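A minimal sketch of the kind of forward method this comment asks for is shown below. It is not the PR's model: the class name `SafeGraphAgentSketch`, the encoder, the layer sizes, and the flat feature-vector input are assumptions, and the SafePolicyLayer call is left out because its interface is not shown here.

```python
import torch
from torch import nn


class SafeGraphAgentSketch(nn.Module):
    """Hypothetical stand-in for SafeGraphAgentRAN; sizes are assumptions."""

    def __init__(self, in_dim: int = 8, hidden: int = 16, num_actions: int = 4):
        super().__init__()
        # Assumed encoder over flat features; the real model presumably
        # uses graph layers here.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, num_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor):
        # Shared representation feeds both heads.
        h = self.encoder(x)
        logits = self.actor(h)              # shape: (batch, num_actions)
        value = self.critic(h).squeeze(-1)  # shape: (batch,)
        return logits, value


model = SafeGraphAgentSketch()
logits, value = model(torch.zeros(2, 8))
```

With forward defined, calling the module no longer falls through to the nn.Module default that raises NotImplementedError.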


"safe_fallback_rate": sum(int(a == (max(actions) if actions else 0)) for a in actions) / n,

P1: Compute safe fallback rate against fixed safe action

offline_policy_eval currently defines safe_fallback_rate as the fraction of actions equal to max(actions), which measures how often the highest-index action in that batch occurs rather than how often the actual safe-fallback action was taken. For traces with no fallback action (e.g., actions [0,1,2,1]), this still reports a nonzero fallback rate, corrupting safety metrics and downstream ranking/report conclusions.
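One way to address this is to compare each action against a fixed, known fallback action ID. The sketch below assumes the fallback action is identified by a constant (`SAFE_ACTION = 3` is a hypothetical value chosen so that the example trace contains no fallback); the function name and signature are illustrative, not taken from the PR.

```python
from typing import Sequence

SAFE_ACTION = 3  # hypothetical ID of the safe-fallback action


def safe_fallback_rate(actions: Sequence[int], safe_action: int = SAFE_ACTION) -> float:
    """Fraction of actions equal to the fixed safe-fallback action."""
    if not actions:
        return 0.0  # avoid division by zero on empty traces
    return sum(1 for a in actions if a == safe_action) / len(actions)


no_fallback = safe_fallback_rate([0, 1, 2, 1])    # no action equals SAFE_ACTION
some_fallback = safe_fallback_rate([3, 3, 1, 0])  # two of four actions are fallbacks
```

For the trace [0, 1, 2, 1], this version returns 0.0, whereas comparing against max(actions) counts the single 2 as a fallback.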
