llama: save more VRAM by reserving n_outputs == n_seqs when possible #23861

Open
am17an wants to merge 2 commits into ggml-org:master from am17an:save-logits-vram

@am17an
Contributor

@am17an am17an commented May 29, 2026

Overview

Continues #23764. This PR reserves logits space for only n_seqs outputs when possible. With -ub 2048 and MTP, this saves another 1.2 GB of VRAM for me. I've also tested llama-perplexity and it seems to work fine. There may be a better API, though, so I'm putting this up as a draft for now. In my view, an API in llama-context is a good solution: by default it reserves space for all tokens, but in server-context specifically we can set it to 1 whenever possible.
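To make the size of the reservation concrete, here is a rough back-of-the-envelope sketch. It assumes the reserved logits buffer holds one float32 value per vocabulary entry per output row; the helper names and the vocabulary size of 150,000 are hypothetical and chosen for illustration only, not taken from the PR or from llama.cpp.

```python
# Back-of-the-envelope model of a reserved logits buffer.
# Assumption: one float32 logit per vocabulary entry per output row.
BYTES_PER_LOGIT = 4  # float32 (assumed)


def logits_bytes(n_outputs: int, n_vocab: int) -> int:
    """Bytes needed to hold logits for n_outputs rows."""
    return n_outputs * n_vocab * BYTES_PER_LOGIT


def to_mib(n_bytes: int) -> float:
    """Convert bytes to MiB."""
    return n_bytes / (1024 * 1024)


n_vocab = 150_000  # hypothetical vocabulary size, not taken from the PR
n_ubatch = 2048    # the -ub 2048 case mentioned in the description
n_seqs = 1         # a single sequence

worst_case = logits_bytes(n_ubatch, n_vocab)  # one row for every token
per_seq = logits_bytes(n_seqs, n_vocab)       # one row per sequence

print(f"n_outputs == n_ubatch: {to_mib(worst_case):.1f} MiB")
print(f"n_outputs == n_seqs:   {to_mib(per_seq):.1f} MiB")
print(f"difference:            {to_mib(worst_case - per_seq):.1f} MiB")
```

Under these assumptions the difference is on the order of a gigabyte, the same order of magnitude as the saving reported above, and it grows linearly with the micro-batch size.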

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for review and testing across various configs

@drauh

drauh commented May 30, 2026

To motivate this PR further, here is the experiment I ran (copied from issue #23527, which this PR would close). The patch reduces memory use even without MTP, leaving headroom to, for instance, put more experts on the GPU.

Setup: Qwen3.6-35B-A3B-UD-Q4_K_XL, -c 150000 -b 4096 -ub 4096 -fa on -ctk q8_0 -ctv q8_0 --kv-unified --cpu-moe -np 1, RTX 3080 10 GB. Peak sampled at 10 Hz over a 145041-token PP.

| run | CUDA0 compute | CUDA_Host compute | idle VRAM | peak VRAM |
|---|---|---|---|---|
| baseline | 3976 MiB | 2408 MiB | 7981 MiB | 8275 MiB |
| #23861 only | 3892 MiB | 2408 MiB | 7897 MiB | 8195 MiB |
| #23764 only | 3976 MiB | 1236 MiB | 7979 MiB | 8277 MiB |
| #23861 + #23764 | 1982 MiB | 1236 MiB | 5985 MiB | 6279 MiB |
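The savings relative to baseline can be read off more easily as differences. The short script below only redoes the arithmetic on the measurements above; the `runs` dictionary and the `savings` helper are illustrative names, not part of any tool used in the experiment.

```python
# Measurements copied from the table above, all in MiB.
# Columns: (CUDA0 compute, CUDA_Host compute, idle VRAM, peak VRAM)
runs = {
    "baseline":        (3976, 2408, 7981, 8275),
    "#23861 only":     (3892, 2408, 7897, 8195),
    "#23764 only":     (3976, 1236, 7979, 8277),
    "#23861 + #23764": (1982, 1236, 5985, 6279),
}


def savings(name: str) -> tuple:
    """MiB saved versus baseline for each column (negative means more used)."""
    base = runs["baseline"]
    return tuple(b - r for b, r in zip(base, runs[name]))


for name in runs:
    if name != "baseline":
        print(f"{name:16s} {savings(name)}")
```

On these numbers, each patch alone moves peak VRAM by less than 100 MiB, while the two together save about 1996 MiB at both idle and peak, which suggests the changes only pay off on the GPU in combination for this configuration.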

@am17an am17an marked this pull request as ready for review May 31, 2026 06:29
@am17an am17an requested review from a team as code owners May 31, 2026 06:29
@am17an am17an force-pushed the save-logits-vram branch from abbb1e4 to 810aa71 Compare May 31, 2026 06:37
@joaotbelo

Corroborating data on Blackwell (SM_120, RTX 5070 Ti 16 GB) for a config not yet in this thread: MTP draft decoding + vision (mmproj) + --parallel 2, real assistant workload.

Built this branch (810aa71, includes #23764) for CUDA_DOCKER_ARCH=120 and compared against b9426 (already carries #23764), so this isolates the #23861 increment.

Config (identical both runs): Qwen3.6-35B-A3B-UD-Q2_K_XL-MTP, -c 131072 --parallel 2, --spec-type draft-mtp, vision mmproj loaded, -ctk q4_0 -ctv q4_0, -fa on, default -b 2048 -ub 512. Workload: 1 vision request + 2× concurrent ~5.6k-token prefill → 256-tok decode; peak sampled at 5 Hz.

| run | idle VRAM | peak VRAM |
|---|---|---|
| b9426 (#23764 only) | 15495 MiB | 15513 MiB |
| + #23861 | 14835 MiB | 14859 MiB |
| saving | 660 MiB | 654 MiB |

Smaller than the -ub 4096 numbers above (expected — the reservation scales with n_ubatch; we run the default -ub 512), but on a 16 GB card it nearly doubles free headroom at peak (790 → 1444 MiB). No OOM, MTP draft acceptance unchanged (~0.5–0.72), and the saving holds at idle and under load — so it's reclaiming a persistently-reserved buffer, not just a prefill spike. 👍
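For anyone who wants to reproduce this kind of measurement, a minimal polling sketch follows. It is not the script used for the numbers in this thread. It assumes `nvidia-smi` is on the PATH; the function names and the injectable `read_mib` and `sleep` parameters are hypothetical and exist only so the loop can be exercised without a GPU.

```python
import subprocess
import time
from typing import Callable, Tuple


def read_vram_mib_nvidia_smi(gpu_index: int = 0) -> int:
    """Read used VRAM in MiB for one GPU by calling nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def sample_vram(read_mib: Callable[[], int], hz: float, duration_s: float,
                sleep: Callable[[float], None] = time.sleep) -> Tuple[int, int]:
    """Poll read_mib at roughly `hz` and return (lowest, highest) MiB seen."""
    n_samples = max(1, int(hz * duration_s))
    lo = hi = read_mib()
    for _ in range(n_samples - 1):
        sleep(1.0 / hz)
        value = read_mib()
        lo, hi = min(lo, value), max(hi, value)
    return lo, hi


# Example (requires an NVIDIA GPU, so it is left commented out):
# low, peak = sample_vram(read_vram_mib_nvidia_smi, hz=5, duration_s=60)
```

Polling can miss spikes shorter than the sampling interval, so the highest reading is a lower bound on the true peak.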
