llama: save more VRAM by reserving n_outputs == n_seqs when possible #23861
am17an wants to merge 2 commits into
To motivate this PR further, here is the experiment I ran (copied from issue #23527, which this PR would close). This patch reduces memory use even without MTP, leaving some headroom to, for instance, put more experts on the GPU. Setup: Qwen3.6-35B-A3B-UD-Q4_K_XL,
Corroborating data on Blackwell (SM_120, RTX 5070 Ti 16 GB) for a config not yet in this thread: MTP draft decoding + vision (mmproj) + Built this branch ( Config (identical both runs):
Smaller than the
Overview
Continues #23764. This PR reserves logits space for only `n_seqs` outputs when possible. With `-ub 2048` and MTP, this saves another 1.2 GB of VRAM for me. I've also tested `llama-perplexity`, and it seems to work fine.

But maybe there is a better API, so I'm putting this up as a draft for now.

In my view, an API in llama-context is a good solution for this: by default it reserves space for all tokens, but in server-context specifically we can set it to 1 whenever possible.

Additional information
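To give a rough sense of why reserving fewer output rows matters, here is a minimal sketch of the sizing arithmetic. It is not code from this PR: `n_outputs_to_reserve` and `logits_bytes` are hypothetical helpers, and it assumes the logits buffer is a dense F32 matrix of `n_outputs` rows by `n_vocab` columns.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a llama.cpp function): number of output rows to reserve.
// reserve_all == true models the default, where every token in the ubatch may need logits.
// reserve_all == false models the case where one output per sequence is enough.
constexpr uint32_t n_outputs_to_reserve(uint32_t n_ubatch, uint32_t n_seqs, bool reserve_all) {
    return reserve_all ? n_ubatch : std::min(n_ubatch, n_seqs);
}

// Bytes needed for an F32 logits buffer of n_outputs rows by n_vocab columns.
// Assumption: logits are stored as 32-bit floats, one row per output token.
constexpr uint64_t logits_bytes(uint32_t n_outputs, uint32_t n_vocab) {
    return uint64_t(n_outputs) * n_vocab * sizeof(float);
}
```

With an illustrative vocabulary of 150000 entries (an assumed round number, not the model's actual vocabulary size), `logits_bytes(n_outputs_to_reserve(2048, 4, true), 150000)` is 1,228,800,000 bytes, while `logits_bytes(n_outputs_to_reserve(2048, 4, false), 150000)` is 2,400,000 bytes. This only illustrates the scaling with `n_outputs`; it does not reproduce the measurement reported above, since the real reservation also covers other graph tensors.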
Requirements