Xueqing Wu · Yu-Chi Lin · Kai-Wei Chang · Nanyun Peng
University of California, Los Angeles
Contact: xueqing.wu@cs.ucla.edu
TL;DR Post-training (SFT and RL) improves a VLM's reasoning far more than its perception, leaving perception as the dominant bottleneck for end-to-end visual reasoning. Using a controlled diagnostic framework, we trace this perception–reasoning asymmetry to two distinct mechanisms — token imbalance in SFT and reward coupling in RL — and show that targeted, paradigm-specific interventions (loss reweighting / perception-aware rewards) rebalance the two and improve end-to-end accuracy by up to 18.2 points.
Contents: Roadmap · Data · Inference & Evaluation · SFT Training · RL Training · Citation
This repository is the official home for the code, data generators, and training/evaluation pipelines behind the paper.
- Data: train/test splits and CoT supervision distilled from Qwen3-32B
- Inference & evaluation scripts: disentangled end-to-end ($a$), perception ($a_p$), and counterfactual reasoning ($a_r$) metrics
- SFT training code: standard SFT and mitigations (loss reweighting $\lambda$, NGDiff dynamic balancing)
- RL training code: GRPO and mitigations (perception-augmented reward $\alpha$, surrogate perception rewards)
- Synthetic task generators: Graph Coloring & Sudoku image rendering with oracle perception $p^*$ annotations
The datasets are released on Hugging Face, one per task: xqwu/asymmetric-VLM__gc for Graph Coloring and xqwu/asymmetric-VLM__sudoku for Sudoku.
Each dataset provides three splits — test, train_sft, and train_rl — sharing the same columns:
- `query`: the textual prompt
- `image`: the rendered task image
- `ground_truth`: the reference solution, used to compute accuracy
- `response`: the distilled CoT (perception followed by reasoning) for `train_sft`; left empty for `test` and `train_rl`
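The column layout above can be sanity-checked with a short sketch. It assumes the Hugging Face `datasets` library is installed and that the splits are addressable by the names listed above; `load_split` and `check_row` are hypothetical helpers for illustration, not functions from this repository.

```python
EXPECTED_COLUMNS = ("query", "image", "ground_truth", "response")
DATASETS = {
    "gc": "xqwu/asymmetric-VLM__gc",
    "sudoku": "xqwu/asymmetric-VLM__sudoku",
}


def load_split(task: str, split: str):
    """Load one split of one task from the Hugging Face Hub (needs network access)."""
    # Imported lazily so check_row can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASETS[task], split=split)


def check_row(row: dict, split: str) -> bool:
    """Return True if a row has the documented columns and the expected `response` usage."""
    if any(col not in row for col in EXPECTED_COLUMNS):
        return False
    has_response = bool(row["response"])
    # Only train_sft carries a distilled CoT; test and train_rl leave it empty.
    return has_response if split == "train_sft" else not has_response
```

For example, `check_row(load_split("gc", "train_sft")[0], "train_sft")` should hold if the released data matches the description above.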
The inference/ directory evaluates a VLM on both tasks under the paper's disentangled protocol. For each task (gc/, sudoku/) there are two entry points:
- `evaluate_vl.py` (end-to-end visual reasoning): the model transcribes the image (inside `<caption>...</caption>`) and then solves the task, reporting end-to-end accuracy $a$ and perception accuracy $a_p$.
- `evaluate_reasoning.py` (counterfactual reasoning): the oracle perception $p^*$ is injected into the assistant prefix so only reasoning is tested, reporting reasoning accuracy $a_r$.
Both load the test split directly from the Hugging Face datasets above and run with vLLM. See inference/README.md for setup, arguments, and technical details.
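The disentangled protocol can be illustrated with a small sketch. It assumes perception is scored by exact match between the transcribed caption and $p^*$, and that the counterfactual prefix is simply the oracle perception wrapped in the same caption tags; both are assumptions for illustration, and the actual scoring and prompt format are defined in inference/. All function names here are hypothetical.

```python
import re

CAPTION_RE = re.compile(r"<caption>(.*?)</caption>", re.DOTALL)


def extract_caption(response: str):
    """Return the text inside the first <caption>...</caption> pair, or None if absent."""
    match = CAPTION_RE.search(response)
    return match.group(1).strip() if match else None


def accuracy(flags) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def perception_accuracy(responses, oracle_perceptions) -> float:
    """a_p under the exact-match assumption: the caption must equal p* after stripping."""
    return accuracy(
        extract_caption(r) == p.strip()
        for r, p in zip(responses, oracle_perceptions)
    )


def build_oracle_prefix(oracle_perception: str) -> str:
    """Assumed assistant prefix for the counterfactual run: p* in caption tags."""
    return f"<caption>{oracle_perception.strip()}</caption>"
```

Because the prefix fixes perception to $p^*$, any remaining error in the counterfactual run is attributable to reasoning, which is what $a_r$ measures.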
The sft/ directory fine-tunes a Qwen3-VL backbone with SFT. It covers standard SFT and two interventions: loss reweighting with a fixed perception ratio, and NGDiff.
The training code is built on top of LLaMA-Factory; the per-method configs are under sft/config/. See sft/README.md for setup, configs, and technical details.
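To make the idea of a fixed perception ratio concrete, here is a minimal sketch in plain Python. It assumes the reweighted loss is a convex combination of the mean perception-token loss and the mean reasoning-token loss with weight `lam` (standing in for $\lambda$); the exact formulation used in sft/ may differ, and the function names are hypothetical.

```python
def mean(values) -> float:
    """Arithmetic mean; 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def standard_sft_loss(token_losses) -> float:
    """Uniform token average: each segment's weight is its share of the tokens."""
    return mean(token_losses)


def reweighted_sft_loss(token_losses, is_perception, lam: float) -> float:
    """Fixed-ratio variant: perception gets weight lam, reasoning gets 1 - lam."""
    perception = [l for l, p in zip(token_losses, is_perception) if p]
    reasoning = [l for l, p in zip(token_losses, is_perception) if not p]
    return lam * mean(perception) + (1.0 - lam) * mean(reasoning)
```

Under the uniform average, a segment's influence on the loss scales with its token count, so a short perception segment contributes little when the reasoning trace is long. Fixing the ratio decouples each segment's weight from its length.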
The rl/ directory trains a Qwen3-VL backbone with GRPO. It covers standard RL and two perception-aware interventions: reward augmentation with a perception reward, and a teacher-distilled surrogate perception reward. It also includes the correlation diagnostic that measures how strongly the outcome reward couples with reasoning versus perception.
The training code is built on top of verl; each task lives under rl/gc/ and rl/sudoku/. See rl/README.md for setup, commands, and technical details.
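The reward augmentation and the correlation diagnostic can be sketched as follows. This assumes an additive reward with weight `alpha` (standing in for $\alpha$) and a Pearson correlation over per-rollout scores; both are illustrative assumptions, the function names are hypothetical, and the actual implementations live under rl/.

```python
from math import sqrt


def augmented_reward(outcome_reward: float, perception_reward: float, alpha: float) -> float:
    """Outcome reward plus an alpha-weighted perception reward (assumed additive form)."""
    return outcome_reward + alpha * perception_reward


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either series is constant or empty."""
    n = len(xs)
    if n == 0:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0
```

Comparing `pearson(outcome, reasoning_correct)` with `pearson(outcome, perception_correct)` over a batch of rollouts indicates which capability the outcome reward tracks more closely. A weak correlation with perception means the outcome reward gives perception little learning signal.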
If you find this work useful, please consider citing:
@misc{wu2026asymmetricoptimizationreasoningperception,
  title={On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training},
  author={Xueqing Wu and Yu-Chi Lin and Kai-Wei Chang and Nanyun Peng},
  year={2026},
  eprint={2605.29496},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.29496},
}