Skip to content

fix(qwen3_vl): require truthy value in untyped image_part fallback#46

Open
eligotts wants to merge 1 commit into
mainfrom
fix/is-image-part-truthy
Open

fix(qwen3_vl): require truthy value in untyped image_part fallback#46
eligotts wants to merge 1 commit into
mainfrom
fix/is-image-part-truthy

Conversation

@eligotts
Copy link
Copy Markdown
Contributor

Summary

_is_image_part / _is_video_part in renderers/qwen3_vl.py use a
permissive fallback (return "image_url" in item) that misfires after
a Dataset.from_list round-trip. Arrow schema unification adds an
image_url: None zombie key to every text part when the same content
list mixes text and image parts, so the fallback returns True for the
text part, the caller dispatches emit_image, and _load_pil_image
raises TypeError: Unsupported image source 'NoneType'.

Every rollout in an env that embeds a multimodal user prompt directly
in its dataset column (e.g. tic-tac-toe, where visual_prompt = [ {"role": "user", "content": [text, image_url]}] is part of every
dataset row) fails before reaching the model.

Repro (no renderers state needed)

from datasets import Dataset
row = {"prompt": [{"role": "user", "content": [
    {"type": "text", "text": "hi"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,XXX"}},
]}]}
print(Dataset.from_list([row])[0]["prompt"][0]["content"][0])
# => {'image_url': None, 'text': 'hi', 'type': 'text'}
#               ^^^^^^^^^^^^^ Arrow unified the schema across the list

from renderers.qwen3_vl import _is_image_part
_is_image_part({"type": "text", "text": "hi", "image_url": None})
# => True   (misclassified — should be False)

Fix

Make the type field authoritative when present:

def _is_image_part(item):
    if not isinstance(item, dict): return False
    t = item.get("type")
    if t in ("image", "image_url"): return True
    if t is not None: return False                                    # ← new
    return bool(item.get("image")) or bool(item.get("image_url"))     # ← truthy, not key-presence

Same fix applied to _is_video_part symmetrically.

These helpers live in renderers/qwen3_vl.py but are imported by every
multimodal renderer in the package — renderers/qwen35.py (the
Qwen3.5 / Qwen3.6 family) and renderers/kimi_k25.py (Kimi K2.5) both
from renderers.qwen3_vl import _is_image_part, _is_video_part. So the
fix covers all three renderer families without touching their files.

Test plan

  • Adds test_is_image_part_treats_type_field_as_authoritative in
    tests/test_multimodal.py. Exercises both the schema-unified text
    case (the actual regression) and the untyped fallback's truthy-value
    requirement. Synthesizes the schema-unified content shape directly so
    the test doesn't depend on datasets as a test dep.
  • Test passes locally.
  • Verified end-to-end on Qwen/Qwen3.5-2B + the tic-tac-toe env via
    prime-rl: with the fix, the renderer correctly classifies tic-tac-toe's
    text+image initial prompt and the trainer runs through 8 RL steps with
    mismatch_kl/max=0.00165.

Fixes #40.

`_is_image_part` / `_is_video_part` use a permissive fallback —
`return "image_url" in item` — that misfires after `Dataset.from_list`
round-trip. Arrow schema unification adds an `image_url: None` zombie
key to every text part when the same content list mixes text and image
parts, so the fallback returns True, the caller dispatches `emit_image`,
and `_load_pil_image` raises `TypeError: 'NoneType'`. Every rollout in
an env whose dataset embeds a multimodal user prompt in the initial
prompt column (e.g. tic-tac-toe) fails before reaching the model.

Make the `type` field authoritative when present (`type="text"` always
classifies as not image / video), and require a truthy value — not mere
key presence — in the untyped fallback. Same fix applied to
`_is_video_part` symmetrically.

These helpers live in `qwen3_vl.py` but are imported and used by every
multimodal renderer in the package — `qwen35.py` (the Qwen3.5 / Qwen3.6
family) and `kimi_k25.py` (Kimi K2.5) both `from renderers.qwen3_vl
import _is_image_part, _is_video_part`. So the fix covers all three
renderer families without touching their files.

Adds a focused regression test exercising both the schema-unified text
case and the untyped fallback's truthy-value requirement. The test
synthesizes the schema-unified shape directly rather than depending on
`datasets` as a test dep.

Fixes #40.
@eligotts eligotts marked this pull request as ready for review May 17, 2026 05:55
@eligotts eligotts requested a review from hallerite May 17, 2026 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Qwen3VLRenderer misclassifies text parts as images when content list has been through HF Dataset round-trip

1 participant