fix(synthbench): substitute runnable model_id from leaderboard rows (sy-i7a)#529

Merged
openclaw-dv merged 1 commit into main from polecat/garnet-mpj27ppb
May 24, 2026

@openclaw-dv
Collaborator

Summary

When a top SynthBench row's model field is a display label (e.g.
"SynthPanel (Gemini Flash Lite)") the prior fix (sy-kh3, #521) refused
it and fell back to the default model — which, with only OpenRouter
credentials present, lands on openrouter/auto. But SynthBench #297 now
publishes a runnable id in the row's model_id (e.g.
"google/gemini-2.5-flash-lite"). Hermes v1.5.4 dogfood re-reported #519:
refuse-and-fallback works, but the runnable model_id is ignored.

Fix

  • _runnable_id_from_row() — prefer entry["model_id"], joined with entry["provider_id"] as "<provider_id>/<model_id>" when model_id is a bare slug. A model_id that is itself a display label is ignored so we never reintroduce the non-runnable-stamping failure.
  • recommend() — when the display model is non-runnable, prefer the runnable model_id (authoritative) over the config_id base-model heuristic; fall back to config_id inference only when no model_id exists; refuse only when neither yields a runnable id. runnable=True now lets the CLI stamp the real upstream id instead of falling through to the default. raw_model preserves the original display label for provenance.

Applies to any row surfacing a display label (product/ensemble or
not), matching the dogfood evidence (is_ensemble=false with a model_id).
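
The precedence described above can be sketched roughly as follows. This is an illustration, not the merged code: `looks_runnable`, `runnable_id_from_row`, `recommend_sketch`, and the `infer_from_config_id` hook are hypothetical names, and the slug regex is an assumed stand-in for whatever display-label check the PR actually uses. Returning a bare `model_id` unchanged when no `provider_id` is present is also an assumption.

```python
import re
from typing import Callable, Optional

# Assumption: a runnable id is one or more slug segments joined by "/",
# with no whitespace or parentheses (e.g. "google/gemini-2.5-flash-lite").
_SLUG_RE = re.compile(r"^[A-Za-z0-9._:-]+(/[A-Za-z0-9._:-]+)*$")


def looks_runnable(value: object) -> bool:
    """Heuristic stand-in: True for slug-like strings, False for display labels."""
    return isinstance(value, str) and bool(_SLUG_RE.match(value))


def runnable_id_from_row(entry: dict) -> Optional[str]:
    """Prefer entry["model_id"]; join with provider_id when it is a bare slug."""
    model_id = entry.get("model_id")
    if not looks_runnable(model_id):
        return None  # missing, or itself a display label: ignore it
    if "/" in model_id:
        return model_id  # already provider-qualified
    provider_id = entry.get("provider_id")
    if looks_runnable(provider_id):
        return f"{provider_id}/{model_id}"
    return model_id  # assumption: pass a bare slug through unchanged


def recommend_sketch(
    entry: dict,
    infer_from_config_id: Callable[[dict], Optional[str]] = lambda e: None,
) -> dict:
    """Order: runnable model -> model_id -> config_id inference -> refuse."""
    raw_model = entry.get("model")
    if looks_runnable(raw_model):
        return {"model": raw_model, "runnable": True, "raw_model": raw_model}
    # Display label: model_id is authoritative, config_id is the fallback.
    candidate = runnable_id_from_row(entry) or infer_from_config_id(entry)
    return {
        "model": candidate,
        "runnable": candidate is not None,
        "raw_model": raw_model,  # original display label, kept for provenance
    }


row = {
    "model": "SynthPanel (Gemini Flash Lite)",
    "model_id": "google/gemini-2.5-flash-lite",
    "is_ensemble": False,
}
rec = recommend_sketch(row)
```

With the dogfood-style row above, `rec["model"]` is the runnable id from `model_id` and `rec["raw_model"]` keeps the display label. A row with neither a usable `model_id` nor an inferable `config_id` comes back with `runnable` set to `False`, which corresponds to the refuse path.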

Test plan

  • 5 new recommend()-level tests covering the model_id substitution paths.
  • 1 end-to-end CLI test exercising the dogfood scenario.
  • GitHub CI runs the full suite on this PR.

Docs

README, recommended-models, and CHANGELOG updated to reflect the new
model_id-aware recommendation behaviour.

References

…sy-i7a)

When a top SynthBench row's `model` field is a display label (e.g.
"SynthPanel (Gemini Flash Lite)") the prior fix (sy-kh3) refused it and
fell back to the default model — which, with only OpenRouter credentials
present, lands on openrouter/auto. But SynthBench #297 now publishes a
runnable id in the row's `model_id` (e.g. "google/gemini-2.5-flash-lite").
Hermes v1.5.4 dogfood re-reported #519: refuse-and-fallback works, but the
runnable model_id is ignored.

- Add _runnable_id_from_row(): prefer entry["model_id"], joined with
  entry["provider_id"] as "<provider_id>/<model_id>" when model_id is a
  bare slug. A model_id that is itself a display label is ignored so we
  never reintroduce the non-runnable-stamping failure.
- recommend(): when the display `model` is non-runnable, prefer the
  runnable model_id (authoritative) over the config_id base-model
  heuristic; fall back to config_id inference only when no model_id exists;
  refuse only when neither yields a runnable id. runnable=True now lets the
  CLI stamp the real upstream id instead of falling through to the default.
  raw_model preserves the original display label for provenance.

Applies to any row surfacing a display label (product/ensemble or not),
matching the dogfood evidence (is_ensemble=false with a model_id). Adds 5
recommend()-level tests + 1 end-to-end CLI test; README, recommended-models,
and CHANGELOG updated.

Closes #519.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@openclaw-dv openclaw-dv merged commit dbfc3b9 into main May 24, 2026
17 of 19 checks passed
@openclaw-dv openclaw-dv deleted the polecat/garnet-mpj27ppb branch May 24, 2026 01:01
@openclaw-dv openclaw-dv restored the polecat/garnet-mpj27ppb branch May 24, 2026 01:14
Development

Successfully merging this pull request may close these issues.

--best-model-for should not pass SynthBench display names as provider model IDs