fix(synthbench): substitute runnable model_id from leaderboard rows (sy-i7a) #529
Merged
…sy-i7a)

When a top SynthBench row's `model` field is a display label (e.g. "SynthPanel (Gemini Flash Lite)") the prior fix (sy-kh3) refused it and fell back to the default model, which, with only OpenRouter credentials present, lands on openrouter/auto. But SynthBench #297 now publishes a runnable id in the row's `model_id` (e.g. "google/gemini-2.5-flash-lite"). Hermes v1.5.4 dogfood re-reported #519: refuse-and-fallback works, but the runnable model_id is ignored.

- Add _runnable_id_from_row(): prefer entry["model_id"], joined with entry["provider_id"] as "<provider_id>/<model_id>" when model_id is a bare slug. A model_id that is itself a display label is ignored so we never reintroduce the non-runnable-stamping failure.
- recommend(): when the display `model` is non-runnable, prefer the runnable model_id (authoritative) over the config_id base-model heuristic; fall back to config_id inference only when no model_id exists; refuse only when neither yields a runnable id. runnable=True now lets the CLI stamp the real upstream id instead of falling through to the default. raw_model preserves the original display label for provenance.

Applies to any row surfacing a display label (product/ensemble or not), matching the dogfood evidence (is_ensemble=false with a model_id).

Adds 5 recommend()-level tests + 1 end-to-end CLI test; README, recommended-models, and CHANGELOG updated.

Closes #519.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
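The precedence described in the commit message above can be illustrated with a minimal sketch. This is not the code merged in this PR: the names `looks_runnable`, `runnable_id_from_row`, `recommend_sketch`, and `infer_from_config_id` are hypothetical, the regex used to tell a runnable id from a display label is an assumption, and the `config_id` heuristic is stubbed out as a callable because its real logic is not shown here. The returned `"model"` key is likewise an assumed shape; only `runnable` and `raw_model` are named in the description.

```python
import re
from typing import Callable, Optional

# Assumption: a runnable id is slash-separated segments with no spaces or
# parentheses, e.g. "google/gemini-2.5-flash-lite". The real check may differ.
_RUNNABLE = re.compile(r"[A-Za-z0-9._:-]+(?:/[A-Za-z0-9._:-]+)*")


def looks_runnable(value: object) -> bool:
    """True when value is a string shaped like an id rather than a label."""
    return isinstance(value, str) and _RUNNABLE.fullmatch(value) is not None


def runnable_id_from_row(entry: dict) -> Optional[str]:
    """Hypothetical stand-in for _runnable_id_from_row()."""
    model_id = entry.get("model_id")
    if not looks_runnable(model_id):
        return None  # missing, or itself a display label: ignore it
    provider_id = entry.get("provider_id")
    if "/" not in model_id and looks_runnable(provider_id):
        # Bare slug: join as "<provider_id>/<model_id>".
        return f"{provider_id}/{model_id}"
    # Already qualified. Assumption: a bare slug with no provider_id
    # is returned unchanged.
    return model_id


def recommend_sketch(
    entry: dict,
    infer_from_config_id: Callable[[dict], Optional[str]] = lambda e: None,
) -> dict:
    """Apply the precedence: model, then model_id, then config_id inference."""
    raw_model = entry.get("model")
    if looks_runnable(raw_model):
        chosen = raw_model
    else:
        # model_id is treated as authoritative; config_id inference is only
        # consulted when model_id yields nothing.
        chosen = runnable_id_from_row(entry) or infer_from_config_id(entry)
    # chosen is None when neither source yields a runnable id (the refuse case).
    return {"model": chosen, "runnable": chosen is not None, "raw_model": raw_model}
```

Under these assumptions, a row with `model` set to "SynthPanel (Gemini Flash Lite)" and `model_id` set to "google/gemini-2.5-flash-lite" resolves to the `model_id`, with the display label kept in `raw_model`. A row whose `model_id` is also a display label resolves to nothing, so the caller would refuse and fall back as before.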
Summary
When a top SynthBench row's `model` field is a display label (e.g. "SynthPanel (Gemini Flash Lite)"), the prior fix (sy-kh3, #521) refused it and fell back to the default model, which, with only OpenRouter credentials present, lands on `openrouter/auto`. But SynthBench #297 now publishes a runnable id in the row's `model_id` (e.g. "google/gemini-2.5-flash-lite"). Hermes v1.5.4 dogfood re-reported #519: refuse-and-fallback works, but the runnable `model_id` is ignored.

Fix
- `_runnable_id_from_row()`: prefer `entry["model_id"]`, joined with `entry["provider_id"]` as `"<provider_id>/<model_id>"` when `model_id` is a bare slug. A `model_id` that is itself a display label is ignored so we never reintroduce the non-runnable-stamping failure.
- `recommend()`: when the display `model` is non-runnable, prefer the runnable `model_id` (authoritative) over the `config_id` base-model heuristic; fall back to `config_id` inference only when no `model_id` exists; refuse only when neither yields a runnable id. `runnable=True` now lets the CLI stamp the real upstream id instead of falling through to the default. `raw_model` preserves the original display label for provenance.

Applies to any row surfacing a display label (product/ensemble or not), matching the dogfood evidence (`is_ensemble=false` with a `model_id`).

Test plan
`recommend()`-level tests covering the `model_id` substitution paths.

Docs

README, `recommended-models`, and `CHANGELOG` updated to reflect the new `model_id`-aware recommendation behaviour.

References