π[2026/06] Prefill-Decode Disaggregation (PDD) version released! Support for Attention-FFN Disaggregation (AFD) will be available soon.
π[2026/06] Initial version released, with support for co-located serving and modern optimizations.
Frontier is a discrete-event simulator for modern LLM serving. It is built for serving systems that combine complex parallelism, runtime optimizations, sparse model architectures (MoE), and stateful workloads (reasoning agents, RL rollouts). It currently simulates vLLM-style serving behavior, and we plan to include other serving engines soon.
Frontier helps researchers and engineers better understand serving system designs and tradeoffs without the time and financial costs of repeatedly deploying on GPU clusters.
- Co-located & Disaggregated Serving: This branch supports monolithic co-location and PDD serving. AFD support is planned for a later public release.
- Modern Runtime Optimizations: Frontier captures production techniques such as CUDA Graph, speculative decoding / MTP, prefix caching, quantization, chunked prefill, and hierarchical caching as part of the scheduler-batch-engine loop. These optimizations change batch shape, memory state, and per-request progress, so Frontier models them as runtime behavior rather than simple speedup factors.
- Fidelity: Frontier combines calibrated operator, communication, transfer, and KV-cache memory models to make simulation results useful for deployment decisions. This helps users compare configurations under SLA constraints, explore large GPU-scale design spaces ex-situ, and avoid conclusions that would be distorted by coarse average-case models.
AFD serving architecture is intentionally not included in this release and will be available in a later public release.
- Simulation: Frontier supports E2E execution entirely on CPU-only machines, leveraging pre-compiled profiling databases.
- Profiling: At least 1 GPU is required exclusively for running the Frontier profiling module to collect operator-level performance metrics on new hardware or software stacks.
Frontier is designed for what-if studies that would be expensive or slow to run directly on large GPU clusters. The current paper draft demonstrates four use cases.
Install the release package and test extras from the repository root:
python -m pip install -e '.[test]'
PYTHONPATH=$PWD PYTHONDONTWRITEBYTECODE=1 pytest tests/unit/test_examples_pdd_scripts.py -q -p no:cacheproviderCurrent release-facing PDD and co-location examples are split by simulation mode and default to the formula-based analytical backend for one-click smoke runs:
examples/architecture/pdd/offline/dense_model_basic.shexamples/architecture/pdd/online/dense_model_basic_online.shexamples/architecture/co-location/online/dense_model_basic_online.shexamples/architecture/co-location/online/moe_model_basic_online.shexamples/architecture/co-location/online/thinking_mode_basic_online.sh
These examples cover most runtime optimizations. Note that dummy analytical smoke runs only validate runtime plumbing, not profiling fidelity. Profiling outputs and reusable compute data are organized under data/profiling/compute.
Currently, only the
a800andrtx_pro_6000datasets contain full-feature format profiles; we highly recommend re-collecting profiling data locally for your specific hardware.
Example fixtures live under:
examples/
βββ architecture/
β βββ pdd/
β β βββ run_all.sh
β β βββ offline/
β β βββ online/
β βββ co-location/
β βββ run_all.sh
β βββ offline/
β βββ online/
βββ fixtures/
βββ profiling/
The release example suites default to the lightweight formula-based backend for one-click smoke runs:
--cc_backend_config_type analyticalThe ASTRA-Sim-inspired topology model remains available for direct CLI experiments:
--cc_backend_config_type astra_sim_analyticalcollective_sim is optional and is used only when you explicitly select --cc_backend_config_type collective_sim.
To enable the optional target-runtime backend:
git submodule update --init --recursive frontier/cc_backend/backends/collective-sim
cd frontier/cc_backend/backends/collective-sim/sim
make -j"$(nproc)"After building, verify that frontier/cc_backend/backends/collective-sim/sim/datacenter/htsim_ndp exists before selecting --cc_backend_config_type collective_sim.
If your host C++ runtime reports a GLIBC or GLIBCXX mismatch, rebuild the optional backend in the target runtime:
make -B -j"$(nproc)"For the standard simulator environment, install from environment.yml or use editable pip installation. Profiling wrappers use a dedicated environment because vllm and flashinfer can be sensitive to CUDA and Python versions:
conda env create -f environment_profiling.yml
conda activate frontier-profilingDo not blindly install profiling dependencies into an existing environment unless you have confirmed version compatibility.
Production Docker users can start from the public image:
docker pull fengyicheng/frontier-env
docker run --rm --gpus all --shm-size 16g \
--tmpfs /workspace/frontier/outputs \
--tmpfs /workspace/frontier/cache \
-v "$PWD":/workspace/frontier \
-w /workspace/frontier \
fengyicheng/frontier-env bashWe will continuously release Frontier's core components to the community. We are actively developing new features and plan to support:
- Serving Engines Integration: Support for SGLang and TensorRT-LLM frameworks.
- Advanced Caching: Support for
tair-kvcacheas an advanced runtime backend for the Hierarchical Caching feature. - Advanced Model Support: Expanded support for state-of-the-art models such as DeepSeek-V4, Kimi 2.5, and more.
- Simulation Acceleration: Introduce more simulation acceleration mechanisms.
Frontier is mainly built on top of Vidur. The following great systems have been referenced or adapted as runtime backends. We sincerely thank all the developers for their contributions to the community!
If you use this repo in your research, please cite our arXiv paper:
@article{feng2026frontier,
title={Frontier: Towards Comprehensive and Accurate LLM Inference Simulation},
author={Feng, Yicheng and Tan, Xin and Deng, Yangtao and Jiang, Yimin and Zhu, Yibo and Xu, Hong},
journal={arXiv preprint arXiv:2605.21312},
year={2026}
}Email Yicheng Feng (yichengfeng@link.cuhk.edu.hk) or Hong Xu (hongxu@cuhk.edu.hk) if you have any questions.
Feedback, issues, and PRs are highly welcome. We welcome your contributions!
This project is licensed under the MIT license - see the LICENSE file for details.





