On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Xueqing Wu · Yu-Chi Lin · Kai-Wei Chang · Nanyun Peng
University of California, Los Angeles
Contact: xueqing.wu@cs.ucla.edu

TL;DR Post-training (SFT and RL) improves a VLM's reasoning far more than its perception, leaving perception as the dominant bottleneck for end-to-end visual reasoning. Using a controlled diagnostic framework, we trace this perception–reasoning asymmetry to two distinct mechanisms — token imbalance in SFT and reward coupling in RL — and show that targeted, paradigm-specific interventions (loss reweighting / perception-aware rewards) rebalance the two and improve end-to-end accuracy by up to 18.2 points.

Contents: Roadmap · Data · Inference & Evaluation · SFT Training · RL Training · Citation

This repository is the official home for the code, data generators, and training/evaluation pipelines behind the paper.

Roadmap

Data — train/test splits and CoT supervision distilled from Qwen3-32B
Inference & evaluation scripts — disentangled end-to-end ($a$), perception ($a_p$), and counterfactual reasoning ($a_r$) metrics
SFT training code — standard SFT and mitigations (loss reweighting $\lambda$, NGDiff dynamic balancing)
RL training code — GRPO and mitigations (perception-augmented reward $\alpha$, surrogate perception rewards)
Synthetic task generators — Graph Coloring & Sudoku image rendering with oracle perception $p^*$ annotations

Data

The datasets are released on Hugging Face, one per task: xqwu/asymmetric-VLM__gc for Graph Coloring and xqwu/asymmetric-VLM__sudoku for Sudoku.

Each dataset provides three splits — test, train_sft, and train_rl — sharing the same columns:

query — the textual prompt
image — the rendered task image
ground_truth — the reference solution, used to compute accuracy
response — the distilled CoT (perception followed by reasoning) for train_sft; left empty for test and train_rl
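As a quick orientation, the split and column layout above can be sketched in Python. This is a minimal sketch, not code from the repository: `load_split` and `check_row` are hypothetical helper names, the `gc`/`sudoku` keys are illustrative, and it assumes an "empty" `response` is stored as either an empty string or `None`. Only `datasets.load_dataset` is a real library call, and it needs network access.

```python
# Hypothetical helpers for illustration; not part of the released code.
EXPECTED_COLUMNS = ("query", "image", "ground_truth", "response")
SPLITS = ("test", "train_sft", "train_rl")
DATASETS = {
    "gc": "xqwu/asymmetric-VLM__gc",          # Graph Coloring
    "sudoku": "xqwu/asymmetric-VLM__sudoku",  # Sudoku
}


def load_split(task, split):
    """Download one split from the Hugging Face Hub (requires network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    # Imported lazily so the rest of this sketch runs without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASETS[task], split=split)


def check_row(row, split):
    """Return a list of schema problems for one record (empty list means OK)."""
    problems = [f"missing column: {c}" for c in EXPECTED_COLUMNS if c not in row]
    # Assumption: an empty response is "" or None; both are falsy here.
    has_response = bool(row.get("response"))
    if split == "train_sft" and not has_response:
        problems.append("train_sft rows should carry a distilled CoT in `response`")
    if split != "train_sft" and has_response:
        problems.append(f"{split} rows are expected to leave `response` empty")
    return problems
```

For example, `check_row(load_split("gc", "test")[0], "test")` should return an empty list if the record matches the layout described above.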

Inference & Evaluation

The inference/ directory evaluates a VLM on both tasks under the paper's disentangled protocol. For each task (gc/, sudoku/) there are two entry points:

evaluate_vl.py — end-to-end visual reasoning: the model transcribes the image (inside <caption>...</caption>) and then solves the task, reporting end-to-end accuracy $a$ and perception accuracy $a_p$.
evaluate_reasoning.py — counterfactual reasoning: the oracle perception $p^*$ is injected into the assistant prefix so only reasoning is tested, reporting reasoning accuracy $a_r$.

Both load the test split directly from the Hugging Face datasets above and run with vLLM. See inference/README.md for setup, arguments, and technical details.
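The two protocols can be illustrated with a small Python sketch. It is written under stated assumptions rather than taken from the scripts: every function name here is hypothetical, perception is scored by whitespace-normalized exact match against the oracle $p^*$ (the real scripts may parse the transcription into a task-specific structure instead), and the counterfactual prefix simply wraps $p^*$ in the same `<caption>` tags.

```python
import re

# Matches the transcription the model emits inside <caption>...</caption>.
CAPTION_RE = re.compile(r"<caption>(.*?)</caption>", re.DOTALL)


def extract_caption(response):
    """Return the text inside the first <caption> block, or None if absent."""
    match = CAPTION_RE.search(response)
    return match.group(1).strip() if match else None


def normalize(text):
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def perception_correct(response, oracle_perception):
    """Assumed scoring rule for a_p: normalized exact match against p*."""
    caption = extract_caption(response)
    return caption is not None and normalize(caption) == normalize(oracle_perception)


def counterfactual_prefix(oracle_perception):
    """Assistant prefix carrying p*, so that only reasoning is left to generate."""
    return f"<caption>{oracle_perception}</caption>\n"


def accuracy(flags):
    """Fraction of True entries; 0.0 for an empty list."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Under this sketch, $a_p$ is `accuracy` over the `perception_correct` flags, while $a$ and $a_r$ are `accuracy` over final-answer correctness without and with the injected prefix, respectively.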

SFT Training

The sft/ directory fine-tunes a Qwen3-VL backbone with SFT. It covers standard SFT and the two interventions: loss reweighting with a fixed perception ratio ($\lambda$) and NGDiff dynamic balancing.

The training code is built on top of LLaMA-Factory; the per-method configs are under sft/config/. See sft/README.md for setup, configs, and technical details.
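To make the token-imbalance issue concrete, here is a minimal Python sketch of one way a fixed perception ratio could enter the loss. The exact formulation used in sft/ is not stated here, so the form below is an assumption: the loss is a convex combination, $\lambda$ times the mean perception-token loss plus $(1-\lambda)$ times the mean reasoning-token loss. The function names and the per-token mask are hypothetical.

```python
def standard_loss(token_losses):
    """Plain token-averaged loss: every token counts equally."""
    return sum(token_losses) / len(token_losses)


def reweighted_loss(token_losses, is_perception, lam):
    """Assumed form: lam * mean(perception) + (1 - lam) * mean(reasoning).

    token_losses  -- per-token losses for one response
    is_perception -- parallel booleans, True for perception tokens
    lam           -- fixed perception ratio in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    perception = [l for l, p in zip(token_losses, is_perception) if p]
    reasoning = [l for l, p in zip(token_losses, is_perception) if not p]
    return lam * mean(perception) + (1.0 - lam) * mean(reasoning)


# Toy response: 2 perception tokens (loss 2.0) and 8 reasoning tokens (loss 0.5).
losses = [2.0, 2.0] + [0.5] * 8
mask = [True, True] + [False] * 8
```

In this toy case the token-averaged loss lets the eight reasoning tokens dominate, whereas `reweighted_loss(losses, mask, 0.5)` gives the two perception tokens half of the total weight regardless of how few they are.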

RL Training

The rl/ directory trains a Qwen3-VL backbone with GRPO. It covers standard RL and the two perception-aware interventions: reward augmentation with a perception reward (weighted by $\alpha$) and a teacher-distilled surrogate perception reward. It also includes the correlation diagnostic, which measures how strongly the outcome reward couples with reasoning versus perception.

The training code is built on top of verl; each task lives under rl/gc/ and rl/sudoku/. See rl/README.md for setup, commands, and technical details.
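The reward augmentation and the coupling diagnostic can be sketched in a few lines of Python. Both parts are assumptions made for illustration, not the repository's implementation: the augmented reward is taken to be additive (outcome reward plus $\alpha$ times a perception reward), and the diagnostic is taken to be a Pearson correlation between per-sample outcome rewards and per-sample correctness indicators. The function names are hypothetical.

```python
from math import sqrt


def augmented_reward(outcome_reward, perception_reward, alpha):
    """Assumed additive form: outcome reward plus alpha times perception reward."""
    return outcome_reward + alpha * perception_reward


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy rollouts: the outcome reward tracks reasoning correctness exactly
# and is uncorrelated with perception correctness.
outcome = [1, 0, 1, 0]
reasoning_ok = [1, 0, 1, 0]
perception_ok = [1, 1, 0, 0]
```

A large gap between `pearson(outcome, reasoning_ok)` and `pearson(outcome, perception_ok)` is the kind of pattern such a diagnostic would surface: the outcome reward then carries little signal about perception, which is what the perception reward term is meant to compensate for.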

📝 Citation

If you find this work useful, please consider citing:

@misc{wu2026asymmetricoptimizationreasoningperception,
      title={On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training}, 
      author={Xueqing Wu and Yu-Chi Lin and Kai-Wei Chang and Nanyun Peng},
      year={2026},
      eprint={2605.29496},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.29496}, 
}
