PlusLabNLP/asymmetric-VLM

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

arXiv · Project Page · Hugging Face Data · BibTeX · License

Xueqing Wu  ·  Yu-Chi Lin  ·  Kai-Wei Chang  ·  Nanyun Peng
University of California, Los Angeles
Contact: xueqing.wu@cs.ucla.edu


TL;DR   Post-training (SFT and RL) improves a VLM's reasoning far more than its perception, leaving perception as the dominant bottleneck for end-to-end visual reasoning. Using a controlled diagnostic framework, we trace this perception–reasoning asymmetry to two distinct mechanisms — token imbalance in SFT and reward coupling in RL — and show that targeted, paradigm-specific interventions (loss reweighting / perception-aware rewards) rebalance the two and improve end-to-end accuracy by up to 18.2 points.

Contents: Roadmap · Data · Inference & Evaluation · SFT Training · RL Training · Citation


This repository is the official home for the code, data generators, and training/evaluation pipelines behind the paper.

Roadmap

  • Data — train/test splits and CoT supervision distilled from Qwen3-32B
  • Inference & evaluation scripts — disentangled end-to-end ($a$), perception ($a_p$), and counterfactual reasoning ($a_r$) metrics
  • SFT training code — standard SFT and mitigations (loss reweighting $\lambda$, NGDiff dynamic balancing)
  • RL training code — GRPO and mitigations (perception-augmented reward $\alpha$, surrogate perception rewards)
  • Synthetic task generators — Graph Coloring & Sudoku image rendering with oracle perception $p^*$ annotations

Data

The datasets are released on Hugging Face, one per task: xqwu/asymmetric-VLM__gc for Graph Coloring and xqwu/asymmetric-VLM__sudoku for Sudoku.

Each dataset provides three splits — test, train_sft, and train_rl — sharing the same columns:

  • query — the textual prompt
  • image — the rendered task image
  • ground_truth — the reference solution, used to compute accuracy
  • response — the distilled CoT (perception followed by reasoning) for train_sft; left empty for test and train_rl
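As a minimal sketch of how the splits might be consumed, the snippet below loads one split with the Hugging Face `datasets` library and checks a record against the column layout described above. The helper names `load_split` and `check_record` are hypothetical and not part of this repository, and the check assumes that an empty `response` is stored as an empty string or `None`.

```python
EXPECTED_COLUMNS = ("query", "image", "ground_truth", "response")
DATASETS = {
    "gc": "xqwu/asymmetric-VLM__gc",
    "sudoku": "xqwu/asymmetric-VLM__sudoku",
}
SPLITS = ("test", "train_sft", "train_rl")


def load_split(task: str, split: str):
    """Load one split of one task from the Hugging Face Hub (needs network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the schema check below works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASETS[task], split=split)


def check_record(record: dict, split: str) -> bool:
    """Return True if a record has the documented columns and a CoT only in train_sft."""
    if any(col not in record for col in EXPECTED_COLUMNS):
        return False
    has_response = bool(record["response"])  # assumption: "" or None means empty
    return has_response if split == "train_sft" else not has_response
```

For example, `check_record(load_split("gc", "train_sft")[0], "train_sft")` would be expected to return `True` if the released data follows the layout above.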

Inference & Evaluation

The inference/ directory evaluates a VLM on both tasks under the paper's disentangled protocol. For each task (gc/, sudoku/) there are two entry points:

  • evaluate_vl.py (end-to-end visual reasoning): the model transcribes the image (inside <caption>...</caption>) and then solves the task, reporting end-to-end accuracy $a$ and perception accuracy $a_p$.
  • evaluate_reasoning.py (counterfactual reasoning): the oracle perception $p^*$ is injected into the assistant prefix so only reasoning is tested, reporting reasoning accuracy $a_r$.

Both load the test split directly from the Hugging Face datasets above and run with vLLM. See inference/README.md for setup, arguments, and technical details.
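The disentangled protocol can be illustrated with a small, self-contained sketch. It is not the code in inference/; the function names are hypothetical, perception is scored by exact string match against $p^*$ (an assumption, since the real scripts may compare structured outputs), and the exact prefix format used for oracle injection is likewise assumed.

```python
import re

CAPTION_RE = re.compile(r"<caption>(.*?)</caption>", re.DOTALL)


def split_response(response: str):
    """Split a response into (perception, remaining_text).

    Returns (None, response) when no <caption> block is found.
    """
    match = CAPTION_RE.search(response)
    if match is None:
        return None, response
    return match.group(1).strip(), response[match.end():].strip()


def accuracy(flags) -> float:
    """Fraction of truthy values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def score_end_to_end(responses, oracle_perceptions, answer_correct):
    """Return (a, a_p) for end-to-end runs, using exact-match perception scoring."""
    perception_flags = []
    for response, oracle in zip(responses, oracle_perceptions):
        caption, _ = split_response(response)
        perception_flags.append(caption is not None and caption == oracle.strip())
    return accuracy(answer_correct), accuracy(perception_flags)


def oracle_prefix(oracle_perception: str) -> str:
    """Assistant prefix for counterfactual reasoning: p* wrapped in caption tags (assumed format)."""
    return f"<caption>\n{oracle_perception.strip()}\n</caption>\n"
```

Under this sketch, $a_r$ would simply be `accuracy(...)` over the answer-correctness flags of generations that were continued from `oracle_prefix(p_star)`.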

SFT Training

The sft/ directory fine-tunes a Qwen3-VL backbone with SFT. It covers standard SFT and two interventions: loss reweighting with a fixed perception ratio ($\lambda$) and NGDiff dynamic balancing.

The training code is built on top of LLaMA-Factory; the per-method configs are under sft/config/. See sft/README.md for setup, configs, and technical details.
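To make the idea of a fixed perception ratio concrete, here is a minimal sketch in plain Python. It assumes the reweighted objective takes the form $\lambda \cdot \bar{\ell}_{\text{perception}} + (1 - \lambda) \cdot \bar{\ell}_{\text{reasoning}}$, where each $\bar{\ell}$ is a mean over that segment's tokens; the exact formulation in sft/ may differ, and both function names are hypothetical.

```python
def _mean(values) -> float:
    """Arithmetic mean; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def standard_sft_loss(token_losses) -> float:
    """Plain token-averaged loss: each segment's weight follows its token count."""
    return _mean(list(token_losses))


def reweighted_sft_loss(token_losses, is_perception, lam: float) -> float:
    """Assumed form: lam * mean(perception losses) + (1 - lam) * mean(reasoning losses)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    perception = [l for l, p in zip(token_losses, is_perception) if p]
    reasoning = [l for l, p in zip(token_losses, is_perception) if not p]
    return lam * _mean(perception) + (1.0 - lam) * _mean(reasoning)


# Toy example: 2 perception tokens with loss 2.0, 8 reasoning tokens with loss 1.0.
losses = [2.0, 2.0] + [1.0] * 8
mask = [True, True] + [False] * 8
print(standard_sft_loss(losses))               # token-averaged baseline
print(reweighted_sft_loss(losses, mask, 0.5))  # perception weighted equally with reasoning
```

In this toy setup, setting `lam` to the perception token fraction (0.2) recovers the standard token-averaged loss, so any other value of `lam` shifts weight between the two segments regardless of their lengths.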

RL Training

The rl/ directory trains a Qwen3-VL backbone with GRPO. It covers standard RL and two perception-aware interventions: reward augmentation with a perception reward ($\alpha$) and a teacher-distilled surrogate perception reward. It also includes the correlation diagnostic that measures how strongly the outcome reward couples with reasoning versus perception.

The training code is built on top of verl; each task lives under rl/gc/ and rl/sudoku/. See rl/README.md for setup, commands, and technical details.
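The two ideas in this section can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the code in rl/: the augmented reward is assumed to be additive (outcome reward plus $\alpha$ times a binary perception reward), the diagnostic is assumed to be a Pearson correlation over per-rollout binary flags, and both function names are hypothetical.

```python
import math


def augmented_reward(outcome_correct: bool, perception_correct: bool, alpha: float) -> float:
    """Outcome reward plus alpha times a perception reward (assumed additive form)."""
    return float(outcome_correct) + alpha * float(perception_correct)


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy rollouts: the outcome reward tracks reasoning exactly and ignores perception.
outcome = [1, 0, 1, 0]
reasoning_ok = [1, 0, 1, 0]
perception_ok = [1, 1, 0, 0]
print(pearson(outcome, reasoning_ok), pearson(outcome, perception_ok))
```

In the toy rollouts the outcome reward is perfectly correlated with reasoning and uncorrelated with perception, which is the kind of coupling gap a perception term in the reward is meant to close.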

📝 Citation

If you find this work useful, please consider citing:

@misc{wu2026asymmetricoptimizationreasoningperception,
      title={On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training}, 
      author={Xueqing Wu and Yu-Chi Lin and Kai-Wei Chang and Nanyun Peng},
      year={2026},
      eprint={2605.29496},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.29496}, 
}
