
[2.0] Add DuckDB final verification gate#139

Merged
joyemang33 merged 1 commit into main from codex/duckdb-final-verification on Jun 5, 2026

@joyemang33
Contributor

Summary

  • Add a final-only DuckDB SQLLogicTest smoke gate for the DuckDB E2E query optimization task.
  • Split iterative async feedback from final scoring: async submissions run the public-scale TPC-H gate, while final verification runs the broader scale-factor set plus SQLLogicTest smoke.
  • Route reuse of the best async submission through a verifier-only final rerun instead of falling back to an old async score record.
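
The split described above can be sketched roughly as follows. This is an illustrative sketch only, not the evaluator's actual code: `GatePlan`, `plan_for`, `benchmark_count`, and `score_for_final` are hypothetical names, and the counts (1 scale factor for async, 3 for final, 22 TPC-H queries per scale factor) are taken from the Testing section of this PR.

```python
from dataclasses import dataclass
from typing import Callable

TPCH_QUERY_COUNT = 22  # TPC-H defines 22 queries


@dataclass(frozen=True)
class GatePlan:
    """Hypothetical description of what a verification run executes."""
    role: str
    scale_factor_count: int
    run_sqllogictest: bool


def plan_for(role: str) -> GatePlan:
    """Pick the gate for a submission role (names are assumptions)."""
    if role == "async":
        # Quick iterative feedback: public-scale TPC-H only.
        return GatePlan(role, scale_factor_count=1, run_sqllogictest=False)
    if role == "final":
        # Full scoring: broader scale-factor set plus SQLLogicTest smoke.
        return GatePlan(role, scale_factor_count=3, run_sqllogictest=True)
    raise ValueError(f"unknown submission role: {role!r}")


def benchmark_count(plan: GatePlan) -> int:
    """Number of benchmark runs implied by a plan."""
    return plan.scale_factor_count * TPCH_QUERY_COUNT


def score_for_final(
    best_async_submission: str,
    verify: Callable[[str, GatePlan], float],
) -> float:
    """Re-verify the best async submission under the final plan.

    The stored async score is deliberately ignored; the submission is
    rerun by the verifier with the final gate instead.
    """
    return verify(best_async_submission, plan_for("final"))
```

Under these assumptions, `benchmark_count(plan_for("async"))` gives 22 and `benchmark_count(plan_for("final"))` gives 66, matching the `benchmark_count` values reported in the Harbor trial.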

Please read CONTRIBUTING.md before submitting.

Type of Change

  • New research problem
  • New algorithmic problem
  • New Frontier-CS 2.0 problem
  • Bug fix
  • Documentation update
  • Other:

Testing

  • python3 -m py_compile 2.0/problems/duckdb_e2e_query_optimization/evaluator.py adapters/frontier-cs-2.0/src/frontier_cs_2_0/task-template/environment/judge_server.py adapters/frontier-cs-2.0/src/frontier_cs_2_0/task-template/tests/evaluate.py
  • Measured selected DuckDB SQLLogicTest smoke files in the judge image: 5 files, 114 SQLLogicTest blocks, 438 assertions, ~0.3-0.4s.
  • Harbor trial: frontier-cs-2-0-duckdb-e2e-query__xHLegDz
    • Async submissions used quick feedback: scale_factor_count=1, benchmark_count=22, ~52-55s each.
    • Final verifier used full scoring: submission_role=final, scale_factor_count=3, correctness_case_count=66, benchmark_count=66.
    • Final SQLLogicTest gate ran: final_sqllogictest=1, final_sqllogictest_count=5, final_sqllogictest_seconds=0.323.
    • Hidden query/scale/per-query/stderr leakage grep found no matches in verifier/agent submission logs.
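
A leakage check like the one in the last bullet could look something like the sketch below. The PR does not show the actual grep command or its patterns, so `find_leaks` and the example patterns are hypothetical stand-ins for whatever hidden query, scale, per-query, or stderr markers the real check searched for.

```python
import re
from typing import Iterable, List, Tuple


def find_leaks(
    log_lines: Iterable[str],
    patterns: Iterable[str],
) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that match any forbidden pattern.

    An empty result corresponds to "grep found no matches".
    """
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        if any(rx.search(line) for rx in compiled):
            hits.append((lineno, line.rstrip("\n")))
    return hits


# Hypothetical patterns; the real check's patterns are not given in the PR.
EXAMPLE_PATTERNS = [r"hidden_scale_factor=", r"per_query_seconds", r"stderr:"]

example_log = [
    "submission accepted\n",
    "score=0.81 benchmark_count=22\n",
]
leaks = find_leaks(example_log, EXAMPLE_PATTERNS)
```

Here `leaks` is empty because neither example line matches a forbidden pattern; a log line containing, say, `per_query_seconds` would be reported with its line number.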

Checklist

  • Code follows the project structure and conventions
  • Self-review completed
  • Documentation updated (if applicable)

CI Validation (for new problems)

When adding new problems, CI will automatically validate that your reference solution achieves score > 0.

  • Algorithmic problems: Include reference.cpp in your problem directory
  • Research problems: Include reference.py (or reference.cpp if language: cpp in config.yaml)
  • 2.0 problems: Include reference.py unless the problem config declares another language

@joyemang33 joyemang33 marked this pull request as ready for review June 5, 2026 19:09
@joyemang33 joyemang33 merged commit b98d752 into main Jun 5, 2026
3 of 4 checks passed
