Ke Zhao1,*
·
Zixiang Di1,*
·
Hong Qian1,†
·
Xiang Shu2
·
Yaolin Wen1
·
Qitao Shi2
·
Bingdong Li1
Xingyu Lu2
·
Xiangfeng Wang1
·
Jun Zhou2
·
Ke Tang3
·
Yang Yu4
1East China Normal University | 2AntGroup | 3Southern University of Science and Technology | 4Nanjing University
MiniOpt is an end-to-end optimization solving paradigm based on reinforcement learning with verifiable reward (RLVR). It enables small language models (1.5B-14B parameters) to achieve state-of-the-art performance in solving optimization problems from natural language descriptions, significantly reducing computational costs while maintaining competitive accuracy.
MiniOpt achieves remarkable performance across 8 optimization benchmarks.
| Category | Model / Method | Performance | |
|---|---|---|---|
| SA Avg. (%) | ER Avg. (%) | ||
| General Models | Qwen2.5-3B-Instruct | 11.23 | 16.57 |
| Qwen2.5-7B-Instruct | 33.20 | 41.86 | |
| Qwen2.5-14B-Instruct | 47.46 | 60.64 | |
| DeepSeek-V3 | 60.14 | 81.86 | |
| General Models (Thinking) | Qwen3-4B | 11.16 | 14.02 |
| Qwen3-8B | 21.79 | 25.43 | |
| Qwen3-14B | 23.78 | 30.04 | |
| DeepSeek-R1 | 60.85 | 82.24 | |
| Gemini-2.5-Pro | 57.39 | 88.87 | |
| GPT-5 | 57.54 | 84.73 | |
| Prompt-based Methods | Chain-of-Experts | 45.78 | 60.33 |
| OptiMUS | 20.65 | 49.43 | |
| Reflexion | 45.54 | 78.28 | |
| Learning-based Models | Step-OPT-Qwen2.5-3B | 39.76 | 54.65 |
| Step-OPT-Qwen2.5-7B | 52.22 | 69.76 | |
| OptMATH-7B | 54.62 | 83.39 | |
| LLMOPT-14B | 60.10 | 89.75 | |
| Ours | MiniOpt-3B | 59.65 | 87.92 |
| MiniOpt-7B | 64.76 | 91.17 | |
-
Python 3.10+
-
Conda package manager
# Clone the repository
# git clone https://github.com/xxxxx/MiniOpt.git
# cd MiniOpt
# Create a conda environment
conda create -n MiniOpt python=3.10 -y
conda activate MiniOpt
# Install the required packages
bash init.sh.
├── init.sh
├── README.md
├── datasets
│ ├── rl_dataset
│ │ └── example.parquet
│ └── sft_dataset
│ └── example.jsonl
├── inference
│ └── inference.py
├── prompts
│ ├── code_conversion.py
│ ├── question_scenario_labeling.py
│ ├── question_type_labeling.py
│ └── rl_prompt.py
├── rl
│ ├── configs
│ │ ├── rl_example.sh
│ │ ├── rl_phase1.sh
│ │ └── rl_phase2.sh
│ ├── opt_reward.py
│ ├── pyomo_executor.py
│ └── rl.sh
└── sft
├── configs
│ ├── merge_config.yaml
│ └── sft_config.yaml
├── data
│ └── dataset_info.json
└── sft.shdatasets: Examples of SFT/RL training dataset.inference: Example of using the fine-tuned model to infer an optimization problem.prompts: All the prompts used and mentioned in our paper.rl: This folder includes theopt_rewardand the execution method of pyomo code. Theconfigsfolder includes the 2-stage rl training configuration files and a configuration example.rl.shshows how to use these scripts.sft: This folder provides the code for SFT based on LLaMAFacroty, including dataset configuration (./sft/data/dataset_info.json), fine-tuning script (./sft/configs/sft_config.yaml), and post-training model merge script (./sft/configs/merge_config.yaml).sft.shshows how to use these scripts.init.shshows the setup of the environment.
- Prepare the sft training dataset. Here is an example of SFT training dataset format:
./datasets/sft_dataset/example.jsonl. - Config the dataset. Here is an example of LLaMAFactory dataset configuration:
./sft/data/dataset_info.json. - Run SFT and merge Lora model. The hyperparameter setting used for SFT warm-up in our paper is shown in
./sft/configs/sft_config.yaml.
cd sft
bash sft.sh- Prepare your RL training dataset and eval dataset. MiniOpt uses a 2-stage RL training approach, including
Paradigm Acquisition(phase 1) andOptimization Generalization(phase 2). Although the training data used in the two stages are different, the format and attributes are the same. Here is an example of RL training dataset format:./datasets/rl_dataset/example.jsonl. - Run RL training. The training parameters of the 2-stage RL are fully listed in
./rl/configs/rl_phase1.shand./rl/configs/rl_phase2.sh.
cd rl
bash rl.shRun python ./inference/inference.py to perform inference. This script shows the system prompt used for inference and tests the first case of nl4opt_test benchmark.
=======
If you find this repository useful in your research, please cite:
@misc{zhao2026minioptreasoningmodelsolve,
title={MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources},
author={Ke Zhao and Zixiang Di and Hong Qian and Xiang Shu and Yaolin Wen and Qitao Shi and Bingdong Li and Xingyu Lu and Xiangfeng Wang and Jun Zhou and Ke Tang and Yang Yu},
year={2026},
eprint={2606.25832},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.25832},
}