fix(qwen3_vl): require truthy value in untyped image_part fallback#46
Open
eligotts wants to merge 1 commit into
Open
fix(qwen3_vl): require truthy value in untyped image_part fallback#46eligotts wants to merge 1 commit into
eligotts wants to merge 1 commit into
Conversation
`_is_image_part` / `_is_video_part` use a permissive fallback — `return "image_url" in item` — that misfires after `Dataset.from_list` round-trip. Arrow schema unification adds an `image_url: None` zombie key to every text part when the same content list mixes text and image parts, so the fallback returns True, the caller dispatches `emit_image`, and `_load_pil_image` raises `TypeError: 'NoneType'`. Every rollout in an env whose dataset embeds a multimodal user prompt in the initial prompt column (e.g. tic-tac-toe) fails before reaching the model. Make the `type` field authoritative when present (`type="text"` always classifies as not image / video), and require a truthy value — not mere key presence — in the untyped fallback. Same fix applied to `_is_video_part` symmetrically. These helpers live in `qwen3_vl.py` but are imported and used by every multimodal renderer in the package — `qwen35.py` (the Qwen3.5 / Qwen3.6 family) and `kimi_k25.py` (Kimi K2.5) both `from renderers.qwen3_vl import _is_image_part, _is_video_part`. So the fix covers all three renderer families without touching their files. Adds a focused regression test exercising both the schema-unified text case and the untyped fallback's truthy-value requirement. The test synthesizes the schema-unified shape directly rather than depending on `datasets` as a test dep. Fixes #40.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_is_image_part/_is_video_partinrenderers/qwen3_vl.pyuse apermissive fallback (
return "image_url" in item) that misfires aftera
Dataset.from_listround-trip. Arrow schema unification adds animage_url: Nonezombie key to every text part when the same contentlist mixes text and image parts, so the fallback returns True for the
text part, the caller dispatches
emit_image, and_load_pil_imageraises
TypeError: Unsupported image source 'NoneType'.Every rollout in an env that embeds a multimodal user prompt directly
in its dataset column (e.g.
tic-tac-toe, wherevisual_prompt = [ {"role": "user", "content": [text, image_url]}]is part of everydataset row) fails before reaching the model.
Repro (no renderers state needed)
Fix
Make the
typefield authoritative when present:Same fix applied to
_is_video_partsymmetrically.These helpers live in
renderers/qwen3_vl.pybut are imported by everymultimodal renderer in the package —
renderers/qwen35.py(theQwen3.5 / Qwen3.6 family) and
renderers/kimi_k25.py(Kimi K2.5) bothfrom renderers.qwen3_vl import _is_image_part, _is_video_part. So thefix covers all three renderer families without touching their files.
Test plan
test_is_image_part_treats_type_field_as_authoritativeintests/test_multimodal.py. Exercises both the schema-unified textcase (the actual regression) and the untyped fallback's truthy-value
requirement. Synthesizes the schema-unified content shape directly so
the test doesn't depend on
datasetsas a test dep.Qwen/Qwen3.5-2B+ thetic-tac-toeenv viaprime-rl: with the fix, the renderer correctly classifies tic-tac-toe's
text+image initial prompt and the trainer runs through 8 RL steps with
mismatch_kl/max=0.00165.Fixes #40.