Motivation
Pivot the repository from scenario-run scripts to a unified benchmarking workflow focused on main vs appendix model scopes and safe agentic control evaluation.
Provide a compact, testable suite of model interfaces, lightweight model stubs, and a safe policy layer to enable reproducible offline control and forecasting comparisons.
Update container orchestration and entrypoints so Docker-driven workflows run the new benchmark/report pipeline by default.
Description
Add a new src package with benchmark.py, ranking.py, and report.py that loads model lists from configs/benchmark_models.yaml, synthesizes mock metrics, produces CSV leaderboards, and renders a simple HTML report; wire the new CLI in as the Docker/Compose entrypoint via CMD and service commands.
Create configs/benchmark_models.yaml to declare main, appendix, and optional_foundation_models groups and add many model stubs/implementations under models/ (e.g. gradient_boosting.py, graph_actor_critic_ran.py, masked_graph_ppo_ran.py, safegraphagent_ran.py, base.py) plus a policies/safe_policy_layer.py implementing a safe fallback enforcement function.
Update Dockerfile and docker-compose.yml to copy new folders (src, models, policies, configs) and expose new compose targets benchmark-main, benchmark-appendix, benchmark-all, and report; add PYTHONPATH=/app in the image.
Refresh README.md to describe the new benchmark scope, commands (python -m src.benchmark --benchmark-scope main), outputs, and scientific notes about pseudo-labels.
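The scope selection described above can be sketched as follows. This is a minimal illustration, not the code in src/benchmark.py: the config is shown as an already-parsed dict rather than read from configs/benchmark_models.yaml, the assignment of models to groups is illustrative, and the mapping from scope value to groups (including an "all" scope) is an assumption based on the compose target names.

```python
# Hypothetical stand-in for the parsed contents of configs/benchmark_models.yaml.
# Group names come from the PR description; which model sits in which group
# is illustrative only.
CONFIG = {
    "main": ["gradient_boosting", "safegraphagent_ran"],
    "appendix": ["graph_actor_critic_ran", "masked_graph_ppo_ran"],
    "optional_foundation_models": [],
}

# Assumed mapping from a --benchmark-scope value to the config groups it selects.
SCOPE_GROUPS = {
    "main": ("main",),
    "appendix": ("appendix",),
    "all": ("main", "appendix", "optional_foundation_models"),
}


def load_models(config, scope):
    """Return the model names selected by `scope`, preserving config order."""
    if scope not in SCOPE_GROUPS:
        raise ValueError(f"unknown benchmark scope: {scope!r}")
    selected = []
    for group in SCOPE_GROUPS[scope]:
        for name in config.get(group, []):
            if name not in selected:  # skip duplicates across groups
                selected.append(name)
    return selected


main_models = load_models(CONFIG, "main")
all_models = load_models(CONFIG, "all")
```

Rejecting unknown scopes with an explicit error keeps a mistyped --benchmark-scope value from silently producing an empty leaderboard.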
Testing
Ran the unit test suite with pytest -q, covering tests/test_benchmark_scope.py, tests/test_gradient_boosting_baseline.py, tests/test_report_sections.py, and tests/test_safe_policy_layer.py; all tests passed.
Unit tests validate load_models scoping, GradientBoostingBaseline.fit/predict basic behavior, report HTML sections via src.report.main, and fallback behavior of SafePolicyLayer.enforce.
SafeGraphAgentRAN subclasses nn.Module but only defines __init__, so invoking the model (e.g., during training/inference) raises NotImplementedError because no forward method exists. Since this model is included in main_models, any pipeline that tries to execute it will fail at runtime instead of producing control outputs.
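A minimal sketch of the failure and one possible fix, assuming PyTorch is installed. The class names, layer sizes, and the single linear head are illustrative and are not taken from safegraphagent_ran.py.

```python
import torch
from torch import nn


class MissingForward(nn.Module):
    """Mirrors the reported problem: only __init__ is defined."""

    def __init__(self, in_dim=4, n_actions=3):
        super().__init__()
        self.head = nn.Linear(in_dim, n_actions)


class WithForward(MissingForward):
    """One possible fix: add a forward method that returns action logits."""

    def forward(self, x):
        return self.head(x)


x = torch.zeros(2, 4)

# Calling a module without forward falls through to nn.Module's default,
# which raises NotImplementedError.
try:
    MissingForward()(x)
    raised = False
except NotImplementedError:
    raised = True

# With forward defined, the call returns one logit vector per input row.
logits = WithForward()(x)
```

A smoke test that instantiates every model listed in the config and runs one forward pass on dummy input would catch this class of error before a benchmark run.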
Compute safe fallback rate against fixed safe action
offline_policy_eval currently defines safe_fallback_rate as the fraction of actions equal to max(actions), which measures how often the highest action index present in that batch occurs rather than how often the actual safe-fallback action is taken. For traces with no fallback action (e.g., actions [0,1,2,1], where max(actions) is 2 and the reported rate is 0.25), this still reports a nonzero fallback rate, corrupting safety metrics and downstream ranking/report conclusions.
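A minimal sketch of the suggested change, assuming the safe-fallback action has a fixed, known index that the caller passes in. The function names and the index 3 used in the example are hypothetical; the real offline_policy_eval signature is not shown in this PR.

```python
from typing import Sequence


def max_based_rate(actions: Sequence[int]) -> float:
    """The behavior described above, reproduced for comparison only."""
    if not actions:
        return 0.0
    top = max(actions)
    return sum(1 for a in actions if a == top) / len(actions)


def safe_fallback_rate(actions: Sequence[int], safe_action: int) -> float:
    """Fraction of actions equal to the fixed safe-fallback action."""
    if not actions:
        return 0.0
    return sum(1 for a in actions if a == safe_action) / len(actions)


trace = [0, 1, 2, 1]      # no fallback taken if the safe action is index 3
SAFE_ACTION = 3           # hypothetical index, for illustration only

buggy = max_based_rate(trace)                    # 0.25: counts the single 2
fixed = safe_fallback_rate(trace, SAFE_ACTION)   # 0.0: no fallback in trace
```

Passing the safe action explicitly keeps the metric independent of which actions happen to appear in a given batch.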