llama: save more VRAM by reserving n_outputs == n_seqs when possible by am17an · Pull Request #23861 · ggml-org/llama.cpp

am17an · 2026-05-29T08:37:17Z

Overview

Continues #23764. This PR reserves logits space for only n_seqs outputs when possible. With -ub 2048 and MTP, this saves another 1.2 GB of VRAM for me. I've also tested llama-perplexity and it seems to work fine. ~~But maybe there is a better API, putting up as a draft for now~~ In my opinion, an API in llama-context is a good solution for this: by default it reserves space for all tokens, but in server-context specifically we can set it to 1 whenever possible.
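As a rough illustration of why the number of reserved outputs matters: assuming the logits are stored as one float32 per vocabulary entry per output row, the worst-case reservation scales linearly with the number of outputs. The sketch below is back-of-the-envelope arithmetic only, not the llama.cpp implementation; `logits_bytes` is a hypothetical helper, and the vocabulary size of 150,000 is a placeholder rather than any particular model's value.

```python
def logits_bytes(n_outputs: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Size of a dense logits buffer: one value per (output row, vocab entry)."""
    return n_outputs * n_vocab * bytes_per_logit

MIB = 1024 * 1024

n_vocab = 150_000   # placeholder vocabulary size (assumption, not from the PR)
n_ubatch = 2048     # matches the -ub 2048 setting mentioned above
n_seqs = 1          # one output row per sequence

all_tokens = logits_bytes(n_ubatch, n_vocab)  # an output row for every token
per_seq = logits_bytes(n_seqs, n_vocab)       # an output row per sequence only

print(f"outputs for every token: {all_tokens / MIB:.1f} MiB")
print(f"outputs per sequence:    {per_seq / MIB:.3f} MiB")
print(f"difference:              {(all_tokens - per_seq) / MIB:.1f} MiB")
```

Under these assumptions the reservation shrinks by a factor of n_ubatch / n_seqs, which is why the effect grows with larger -ub values.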

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, for review and testing across various configs

drauh · 2026-05-30T17:23:31Z

To motivate this PR further, here is the experiment I ran (copied from issue #23527, which this PR would close). The patch reduces memory even without MTP, leaving headroom to, for instance, put more experts on the GPU.

Setup: Qwen3.6-35B-A3B-UD-Q4_K_XL, -c 150000 -b 4096 -ub 4096 -fa on -ctk q8_0 -ctv q8_0 --kv-unified --cpu-moe -np 1, RTX 3080 10 GB. Peak sampled at 10 Hz over a 145041-token prompt-processing (PP) run.

| run | CUDA0 compute | CUDA_Host compute | idle VRAM | peak VRAM |
| --- | --- | --- | --- | --- |
| baseline | 3976 MiB | 2408 MiB | 7981 MiB | 8275 MiB |
| #23861 only | 3892 MiB | 2408 MiB | 7897 MiB | 8195 MiB |
| #23764 only | 3976 MiB | 1236 MiB | 7979 MiB | 8277 MiB |
| #23861 + #23764 | 1982 MiB | 1236 MiB | 5985 MiB | 6279 MiB |
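
The per-run savings can be read off the table by subtracting each row from the baseline. The snippet below does only that arithmetic on the numbers above; the dictionary layout and the `savings_vs_baseline` helper are illustrative.

```python
# Values in MiB, copied from the table above:
# (CUDA0 compute, CUDA_Host compute, idle VRAM, peak VRAM)
runs = {
    "baseline":        (3976, 2408, 7981, 8275),
    "#23861 only":     (3892, 2408, 7897, 8195),
    "#23764 only":     (3976, 1236, 7979, 8277),
    "#23861 + #23764": (1982, 1236, 5985, 6279),
}
columns = ("CUDA0 compute", "CUDA_Host compute", "idle VRAM", "peak VRAM")

def savings_vs_baseline(name):
    """Return baseline minus the named run for every column (positive = saved)."""
    base = runs["baseline"]
    return {col: b - v for col, b, v in zip(columns, base, runs[name])}

for name in runs:
    if name != "baseline":
        print(name, savings_vs_baseline(name))
```

On these numbers the combined run lowers peak VRAM by 1996 MiB, whereas either PR alone changes peak VRAM by 80 MiB or less.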

joaotbelo · 2026-05-31T09:33:35Z

Corroborating data on Blackwell (SM_120, RTX 5070 Ti 16 GB) for a config not yet in this thread: MTP draft decoding + vision (mmproj) + --parallel 2, real assistant workload.

Built this branch (810aa71, includes #23764) for CUDA_DOCKER_ARCH=120 and compared against b9426 (already carries #23764), so this isolates the #23861 increment.

Config (identical both runs): Qwen3.6-35B-A3B-UD-Q2_K_XL-MTP, -c 131072 --parallel 2, --spec-type draft-mtp, vision mmproj loaded, -ctk q4_0 -ctv q4_0, -fa on, default -b 2048 -ub 512. Workload: 1 vision request + 2× concurrent ~5.6k-token prefill → 256-tok decode; peak sampled at 5 Hz.

| run | idle VRAM | peak VRAM |
| --- | --- | --- |
| b9426 (#23764 only) | 15495 MiB | 15513 MiB |
| + #23861 | 14835 MiB | 14859 MiB |
| saving | 660 MiB | 654 MiB |

Smaller than the -ub 4096 numbers above (expected — the reservation scales with n_ubatch; we run the default -ub 512), but on a 16 GB card it nearly doubles free headroom at peak (790 → 1444 MiB). No OOM, MTP draft acceptance unchanged (~0.5–0.72), and the saving holds at idle and under load — so it's reclaiming a persistently-reserved buffer, not just a prefill spike. 👍
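
The headroom figures can be cross-checked against the peak values in the table. In the sketch below, the total of 16303 MiB is not stated in the comment; it is inferred by adding the quoted 790 MiB headroom to the 15513 MiB peak, so treat it as an assumption.

```python
peak_before = 15513    # MiB, b9426 (#23764 only)
peak_after = 14859     # MiB, with #23861 applied
headroom_before = 790  # MiB, as quoted above

# Inferred, not stated in the comment: total VRAM available to the process.
total_vram = peak_before + headroom_before

headroom_after = total_vram - peak_after
ratio = headroom_after / headroom_before

print(f"peak saving:    {peak_before - peak_after} MiB")
print(f"headroom after: {headroom_after} MiB ({ratio:.2f}x the previous headroom)")
```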

llama: save more VRAM by reserving n_outputs == n_seqs when possible

af9f4af

am17an mentioned this pull request May 29, 2026

Misc. bug: Could save ~3 GB VRAM in graph_reserve when caller doesn't need logits (big-vocab models at large ub) #23527

Open

am17an requested a review from ggerganov May 29, 2026 08:52

ProTekk mentioned this pull request May 29, 2026

Merge upstream commits up to 05/29 spiritbuun/buun-llama-cpp#68

Merged

am17an marked this pull request as ready for review May 31, 2026 06:29

am17an requested review from a team as code owners May 31, 2026 06:29

github-actions bot added the examples and server labels May 31, 2026

add n_outputs_per_seq

810aa71

am17an force-pushed the save-logits-vram branch from abbb1e4 to 810aa71 Compare May 31, 2026 06:37

am17an wants to merge 2 commits into ggml-org:master from am17an:save-logits-vram

3 participants
