Upgrade llama.cpp to b9829; add DRY sampling parameters #274
Merged
bernardladenthin merged 7 commits on Jun 28, 2026
…Rs #22393 and #23116 as patches
InferenceParameters gains five immutable withers mirroring the existing
withMinP/withTopNSigma (scalars) and withStopStrings (string array) style:
withDryMultiplier(float) -> "dry_multiplier"
withDryBase(float) -> "dry_base"
withDryAllowedLength(int) -> "dry_allowed_length"
withDryPenaltyLastN(int) -> "dry_penalty_last_n" (rejects < -1)
withDrySequenceBreakers(String...) -> "dry_sequence_breakers" (omitted when unset)
This exposes DRY per request, uniformly with the other samplers, instead of
only at model/launch level (ModelParameters --dry-*). Defaults are unchanged:
without a wither call nothing is emitted and DRY stays disabled. Adds 12 unit tests
covering field/value serialization, the JSON string array, the no-op-when-empty
contract, penalty-last-n validation, and immutable-instance semantics
(InferenceParametersTest: 90 -> 102 tests).
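The wither contract described above can be illustrated with a minimal, self-contained sketch. DryParamsSketch is a hypothetical stand-in written for this note, not the project's InferenceParameters class; only the five JSON keys, the "rejects < -1" rule, and the omit-when-unset behavior come from the commit message. The JSON encoding and the choice to return the same instance for an empty breaker list are assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical illustration only; NOT the project's InferenceParameters. */
final class DryParamsSketch {
    // JSON key -> already-encoded JSON value; never mutated after construction.
    private final Map<String, String> fields;

    DryParamsSketch() {
        this(new LinkedHashMap<>());
    }

    private DryParamsSketch(Map<String, String> fields) {
        this.fields = Collections.unmodifiableMap(fields);
    }

    // Every wither copies the map, so the receiver stays unchanged.
    private DryParamsSketch with(String key, String jsonValue) {
        Map<String, String> copy = new LinkedHashMap<>(fields);
        copy.put(key, jsonValue);
        return new DryParamsSketch(copy);
    }

    DryParamsSketch withDryMultiplier(float v) {
        return with("dry_multiplier", Float.toString(v));
    }

    DryParamsSketch withDryBase(float v) {
        return with("dry_base", Float.toString(v));
    }

    DryParamsSketch withDryAllowedLength(int n) {
        return with("dry_allowed_length", Integer.toString(n));
    }

    DryParamsSketch withDryPenaltyLastN(int n) {
        if (n < -1) { // the commit message states that values below -1 are rejected
            throw new IllegalArgumentException("dry_penalty_last_n must be >= -1, got " + n);
        }
        return with("dry_penalty_last_n", Integer.toString(n));
    }

    DryParamsSketch withDrySequenceBreakers(String... breakers) {
        if (breakers == null || breakers.length == 0) {
            return this; // assumption: an empty list is a no-op, so the key is omitted
        }
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (String b : breakers) {
            array.add(quote(b));
        }
        return with("dry_sequence_breakers", array.toString());
    }

    // Minimal JSON string escaping (quote, backslash, newline); enough for a sketch.
    private static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                default:   out.append(c);
            }
        }
        return out.append('"').toString();
    }

    String toJson() {
        StringJoiner obj = new StringJoiner(",", "{", "}");
        fields.forEach((k, v) -> obj.add(quote(k) + ":" + v));
        return obj.toString();
    }
}
```

In this sketch a fresh instance serializes to an empty object, which mirrors the "DRY stays disabled unless a wither is called" default; each wither returns a new instance and leaves the receiver untouched.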
Also carries two still-open upstream llama.cpp PRs as local patches (named
after the PR number), refreshed against the pinned b9803 source and verified
to apply cleanly + reverse-check idempotently:
patches/0003-pr22393-... server_context slot_prompt_similarity get/set
patches/0004-pr23116-... per-request reasoning_budget_tokens override
(incl. upstream test-chat.cpp additions, verbatim)
Updates CLAUDE.md patches table and CHANGELOG.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
Bumps the pinned llama.cpp tag (CMakeLists GIT_TAG + LLAMA_TAG, README badge,
CLAUDE.md) from b9803 to b9829.
Build-breaking upstream change handled — resumable streaming (PR #23226):
b9829 adds tools/server/server-stream.cpp, which defines g_stream_sessions,
stream_session_attach_pipe(), stream_aware_should_stop(),
stream_conv_id_from_headers() and the stream_pipe_* types. The three server
TUs the project already compiles into libjllama — server-context.cpp,
server-http.cpp, server-models.cpp — now #include "server-stream.h" and
reference those symbols, so server-stream.cpp MUST be compiled in or the link
fails with undefined references. Added it to both the jllama target_sources and
the jllama_test sources. It is platform-neutral (threads + std mutex/condvar,
no subprocess.h/posix_spawn_*), so it stays outside the server-models Android
guard. libjllama wires its own JNI routes and never calls start_gc(), so the
session GC thread stays dormant.
Patch refresh — patches/0001-win32-arg-parse-embed-guard.patch:
- tests/export-graph-ops.cpp was renamed to tests/test-export-graph-ops.cpp;
repointed the call-site-flip hunk (path + index + content unchanged).
- the resumable-stream PR inserted g_stream_sessions.start_gc() after
common_init() in server.cpp, shifting the common_params_parse ->
common_params_parse_main flip context (@@ -82 -> @@ -87); regenerated.
Patches 0002/0003/0004 apply unchanged. All four verified to apply +
reverse-apply cleanly against b9829 via git apply --check over the actual
b9829 sources (FetchContent git-clone is blocked in this sandbox).
New feature now enabled — slot_prompt_similarity:
configureParallelInference now applies slot_prompt_similarity live via
server_context::set_slot_prompt_similarity() (the accessor added by upstream
PR #22393, carried here as patches/0003), replacing the previously
validated-but-discarded TODO block that was explicitly gated on this PR + a
version bump.
Other upstream changes in range (Mamba2 dt_rank generalization, OpenCL
quantized-KV flash attention, CUDA cpy/out-prod fast paths, common/clip
hardening) are internal to upstream-compiled TUs and bind no symbol the
project references — no further source changes required. Recorded the full
upgrade analysis in docs/history/llama-cpp-breaking-changes.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
… integration)
Two layers of coverage for the new InferenceParameters.withDry* feature,
beyond the existing InferenceParametersTest JSON-emission unit tests:
C++ (deterministic, no model; src/test/cpp/test_server.cpp, +5 → 194):
Happy-path ParamsFromJsonCmpl.Dry* tests pin that the exact JSON keys the Java
withers emit (dry_multiplier / dry_base / dry_allowed_length /
dry_penalty_last_n / dry_sequence_breakers) are the keys server-schema.cpp
reads into common_params_sampling. Verified against the b9829 parser; DRY
parsing is vocab-independent, so they run with nullptr vocab like the existing
schema tests. An upstream field rename now fails here instead of silently
disabling the feature. Total C++ suite 454 → 459.
Java (model-gated; LlamaModelTest.testDrySamplingAltersRepetitiveGeneration):
End-to-end proof that the dry_* fields actually reach the native sampler.
Greedy decoding (withTopK(1)) + a fixed seed make two completions of the same
repetition-saturated prompt byte-identical unless the sampler changes; a
strong DRY config (multiplier 4.0, allowed_length 2, penalty_last_n -1) must
diverge from the DRY-disabled baseline. Self-skips via the class @BeforeAll
assumeTrue(model present), so it runs only in CI (codellama-7b.Q2_K), exactly
like the other model tests.
Updated the C++ test counts + test_server.cpp scope note in CLAUDE.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
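For readers unfamiliar with why "multiplier 4.0, allowed_length 2" counts as a strong DRY config, the following sketch shows the penalty formula as DRY is commonly described: a token that would extend an already-seen sequence of length n gets multiplier * base^(n - allowed_length) subtracted from its logit once n reaches allowed_length. This is an illustration written for this note, not code from llama.cpp or this project; the class and method names are hypothetical, and the base of 1.75 used in the usage note is an assumption (the commit message does not state which base the test uses).

```java
/** Hypothetical illustration of the commonly described DRY penalty; not project code. */
final class DryPenaltySketch {
    private DryPenaltySketch() {
    }

    /**
     * Logit penalty for a token that would extend a repeated sequence.
     *
     * @param multiplier    dry_multiplier (0 disables DRY)
     * @param base          dry_base (growth factor per extra repeated token)
     * @param allowedLength dry_allowed_length (repeats shorter than this are free)
     * @param matchLength   length of the already-repeated sequence being extended
     */
    static double penalty(double multiplier, double base, int allowedLength, int matchLength) {
        if (multiplier == 0.0 || matchLength < allowedLength) {
            return 0.0; // DRY disabled, or the repetition is still short enough to be tolerated
        }
        return multiplier * Math.pow(base, matchLength - allowedLength);
    }
}
```

Under these assumptions, penalty(4.0, 1.75, 2, n) grows exponentially with the match length: mathematically 4.0 at n = 2, 7.0 at n = 3, and 12.25 at n = 4. With greedy decoding, a penalty of that size on the top logit is usually enough to change the argmax, which is the divergence the Java test relies on.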
…just Linux)
Previously the nomic-embedding, vision, and TTS integration tests
(LlamaEmbeddingsTest, MultimodalIntegrationTest, TtsIntegrationTest) only
ran on the primary Linux x86_64 job — the macOS (x3) and Windows (x2) test
jobs downloaded just the required 5 + vision and ran without the nomic/tts
properties, so those tests self-skipped there. The validate-models output's
"optional, skipped: not present" lines (plus the Linux job validating before
the TTS download step) made it look like the tests never ran at all.
Now every Java test job downloads the full model set BEFORE validating and
passes all the -Dnet.ladenthin.llama.* properties, so the embedding/vision/TTS
tests run on all platforms:
- publish.yml: add nomic + OuteTTS + WavTokenizer downloads to the 3 macOS
and 2 Windows test jobs; add nomic.path + tts.ttc.model + tts.vocoder.model
to each job's mvn invocation; on the Linux job move the TTS downloads ahead
of the validate step so all downloads precede validation uniformly.
- validate-models.sh / validate-models.bat: nomic + vision + TTS are now
REQUIRED (a missing model hard-fails instead of silently self-skipping);
only the audio-input model (no CI download) remains a self-skip.
Cache: key stays `gguf-models-v1` (not bumped). Every test job now downloads the
full set, so whichever job wins the immutable-key save race caches everything —
but the existing v1 entry was saved without nomic/TTS and actions/cache won't
overwrite a present key, so the old entry must be deleted once for the next run
to rebuild a complete cache. Documented in CLAUDE.md "Java tests".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
…soning budget
The CI failure (ReasoningBudgetTest.testReasoningBudgetZero_parameterAccepted_
thinkingNotSuppressed) was the intended signal, not a regression: that test
pinned the *unfixed* llama.cpp bug (per-request reasoning_budget_tokens dropped
by the server-common.cpp copy loop) and asserted reasoning_content stays
present. patches/0004 (upstream PR #23116), added on this branch, fixes the
bug, so the CI-built native lib now suppresses thinking at budget=0, and the
bug-pinning assertion correctly fails. Its own message said: "If this
assertion fails, the bug has been fixed — remove this test and enable [the
suppression test]."
Done exactly that, leaving one sharp test:
- Removed testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed
(the bug-behavior assertion).
- Enabled + renamed the @Disabled correct-behavior test to
testReasoningBudgetZero_suppressesThinking; it asserts reasoning_content is
empty when reasoning_budget_tokens=0, with temperature=0 for cross-platform
determinism. Dropped the now-unused @Disabled import.
- Updated the class Javadoc / @ClaudeGenerated purpose from "known limitation,
not enforced" to "enforced via patches/0004", and repointed the
positive-budget test's dangling {@link} to the surviving test.
If/when a pinned b<nnnn> includes PR #23116 and patches/0004 is dropped, this
test keeps asserting the correct behavior and would flag any regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
The step was labelled "Download vision model (upstream kherud#103 / #34)", a
cosmetic cross-reference to the upstream issues that originally requested
vision support, not a dependency. It reads like one, so strip it: the 6 step
names become "Download vision model" and the env comment loses the same
parenthetical. No behavior change.
Untouched (different in kind): the SPDX `Konstantin Herud` copyright headers
(MIT-license attribution, legally required for this fork), the README "forked
from / many thanks" credit, the SECURITY/CHANGELOG pre-fork history, and the
docs/history upstream-issue catalog.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
…the vision test
Removes the cosmetic "(upstream kherud#103 / #34)" annotation from the three
label spots (TestConstants Javadoc, the CLAUDE.md model table, and the README
system-properties table) where it read like a dependency.
Keeps the provenance in MultimodalIntegrationTest (it explains why the test
exists), but as full URLs to the pre-fork upstream repo: kherud#103 and
/issues/34, not a bare "#103", which GitHub would resolve against THIS repo
(bernardladenthin/java-llama.cpp) instead of kherud's.
Untouched: SPDX Konstantin Herud copyright headers (MIT-license attribution),
the README fork credit, and the SECURITY/CHANGELOG/docs-history provenance.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
Summary
InferenceParameters: dry_multiplier, dry_base, dry_allowed_length, dry_penalty_last_n, and dry_sequence_breakers.
slot_prompt_similarity mutation in JNI layer now that the upstream setter is available.
Test plan
InferenceParametersTest covers all five new DRY withers and their validation)
Related issues / PRs
Upstream: ggml-org/llama.cpp#22393, ggml-org/llama.cpp#23116
Checklist
CONTRIBUTING.md and CODE_OF_CONDUCT.md
https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY