Upgrade llama.cpp to b9829; add DRY sampling parameters by bernardladenthin · Pull Request #274 · bernardladenthin/java-llama.cpp

bernardladenthin · 2026-06-28T09:15:01Z

Summary

Upgrade the pinned llama.cpp version from b9803 to b9829.
Add five new per-request DRY (Don't Repeat Yourself) sampling parameters to InferenceParameters: dry_multiplier, dry_base, dry_allowed_length, dry_penalty_last_n, and dry_sequence_breakers.
Carry two still-open upstream PRs as local patches until they merge: PR #22393 (slot_prompt_similarity getter/setter) and PR #23116 (per-request reasoning budget overrides in chat completions).
Uncomment and enable live slot_prompt_similarity mutation in the JNI layer, now that the setter from PR #22393 is available through the local patch.

Test plan

Affected unit tests pass locally (InferenceParametersTest covers all five new DRY withers and their validation)
CI is green on this branch
CHANGELOG and CLAUDE.md updated with new llama.cpp version and DRY parameter documentation

Related issues / PRs

Upstream: ggml-org/llama.cpp#22393, ggml-org/llama.cpp#23116

Checklist

I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
My commits follow Conventional Commits
No security-sensitive changes

https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY

…Rs #22393 and #23116 as patches

InferenceParameters gains five immutable withers mirroring the existing withMinP/withTopNSigma (scalars) and withStopStrings (string array) style:

  withDryMultiplier(float)           -> "dry_multiplier"
  withDryBase(float)                 -> "dry_base"
  withDryAllowedLength(int)          -> "dry_allowed_length"
  withDryPenaltyLastN(int)           -> "dry_penalty_last_n" (rejects < -1)
  withDrySequenceBreakers(String...) -> "dry_sequence_breakers" (omitted when unset)

This exposes DRY per request, uniformly with the other samplers, instead of only at model/launch level (ModelParameters --dry-*). Defaults are unchanged: without a wither call nothing is emitted and DRY stays disabled.

Adds 12 unit tests covering field/value serialization, the JSON string array, the no-op-when-empty contract, penalty-last-n validation, and immutable-instance semantics (InferenceParametersTest: 90 -> 102 tests).

Also carries two still-open upstream llama.cpp PRs as local patches (named after the PR number), refreshed against the pinned b9803 source and verified to apply cleanly and to reverse-check idempotently:

  patches/0003-pr22393-... server_context slot_prompt_similarity get/set
  patches/0004-pr23116-... per-request reasoning_budget_tokens override (incl. upstream test-chat.cpp additions, verbatim)

Updates CLAUDE.md patches table and CHANGELOG.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
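The wither contract described in this commit (copy-on-write instances, `dry_penalty_last_n` rejecting values below -1, and `dry_sequence_breakers` omitted when unset) can be illustrated with a minimal sketch. `DryRequestSketch` and its internals are hypothetical names made up for illustration; this is not the project's InferenceParameters class, and the no-op-on-empty behavior shown here is an assumption based on the commit text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for the wither pattern; NOT the project's InferenceParameters. */
final class DryRequestSketch {
    // JSON key -> already-encoded JSON value, kept in insertion order.
    private final Map<String, String> fields;

    DryRequestSketch() {
        this(new LinkedHashMap<String, String>());
    }

    private DryRequestSketch(Map<String, String> fields) {
        this.fields = fields;
    }

    // Copy-on-write: every wither returns a new instance and leaves this one untouched.
    private DryRequestSketch with(String key, String jsonValue) {
        Map<String, String> copy = new LinkedHashMap<String, String>(fields);
        copy.put(key, jsonValue);
        return new DryRequestSketch(copy);
    }

    // The float withers assume finite values; NaN or infinity would not be valid JSON.
    DryRequestSketch withDryMultiplier(float value) {
        return with("dry_multiplier", Float.toString(value));
    }

    DryRequestSketch withDryBase(float value) {
        return with("dry_base", Float.toString(value));
    }

    DryRequestSketch withDryAllowedLength(int value) {
        return with("dry_allowed_length", Integer.toString(value));
    }

    DryRequestSketch withDryPenaltyLastN(int value) {
        if (value < -1) {
            throw new IllegalArgumentException("dry_penalty_last_n must be >= -1, got " + value);
        }
        return with("dry_penalty_last_n", Integer.toString(value));
    }

    DryRequestSketch withDrySequenceBreakers(String... breakers) {
        if (breakers == null || breakers.length == 0) {
            return this; // assumed no-op-when-empty behavior: the key is never emitted
        }
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (String breaker : breakers) {
            array.add(quote(breaker));
        }
        return with("dry_sequence_breakers", array.toString());
    }

    String toJson() {
        StringJoiner object = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            object.add(quote(e.getKey()) + ":" + e.getValue());
        }
        return object.toString();
    }

    // Minimal JSON string encoder: escapes quotes, backslashes and control characters.
    private static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

Because each wither copies the field map, a shared base instance can be reused across requests without one request's DRY settings leaking into another, which is the property the immutable-instance tests in this commit are described as covering.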

Bumps the pinned llama.cpp tag (CMakeLists GIT_TAG + LLAMA_TAG, README badge, CLAUDE.md) from b9803 to b9829.

Build-breaking upstream change handled: resumable streaming (PR #23226)

b9829 adds tools/server/server-stream.cpp, which defines g_stream_sessions, stream_session_attach_pipe(), stream_aware_should_stop(), stream_conv_id_from_headers() and the stream_pipe_* types. The three server TUs the project already compiles into libjllama (server-context.cpp, server-http.cpp, server-models.cpp) now #include "server-stream.h" and reference those symbols, so server-stream.cpp MUST be compiled in or the link fails with undefined references. Added it to both the jllama target_sources and the jllama_test sources. It is platform-neutral (threads + std mutex/condvar, no subprocess.h/posix_spawn_*), so it stays outside the server-models Android guard. libjllama wires its own JNI routes and never calls start_gc(), so the session GC thread stays dormant.

Patch refresh: patches/0001-win32-arg-parse-embed-guard.patch

- tests/export-graph-ops.cpp was renamed to tests/test-export-graph-ops.cpp; repointed the call-site-flip hunk (path + index + content unchanged).
- the resumable-stream PR inserted g_stream_sessions.start_gc() after common_init() in server.cpp, shifting the common_params_parse -> common_params_parse_main flip context (@@ -82 -> @@ -87); regenerated.

Patches 0002/0003/0004 apply unchanged. All four verified to apply + reverse-apply cleanly against b9829 via git apply --check over the actual b9829 sources (FetchContent git-clone is blocked in this sandbox).

New feature now enabled: slot_prompt_similarity

configureParallelInference now applies slot_prompt_similarity live via server_context::set_slot_prompt_similarity() (the accessor added by upstream PR #22393, carried here as patches/0003), replacing the previously validated-but-discarded TODO block that was explicitly gated on this PR + a version bump.

Other upstream changes in range (Mamba2 dt_rank generalization, OpenCL quantized-KV flash attention, CUDA cpy/out-prod fast paths, common/clip hardening) are internal to upstream-compiled TUs and bind no symbol the project references, so no further source changes are required.

Recorded the full upgrade analysis in docs/history/llama-cpp-breaking-changes.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY

… integration)

Two layers of coverage for the new InferenceParameters.withDry* feature, beyond the existing InferenceParametersTest JSON-emission unit tests:

C++ (deterministic, no model; src/test/cpp/test_server.cpp, +5 → 194):
Happy-path ParamsFromJsonCmpl.Dry* tests pin that the exact JSON keys the Java withers emit (dry_multiplier / dry_base / dry_allowed_length / dry_penalty_last_n / dry_sequence_breakers) are the keys server-schema.cpp reads into common_params_sampling. Verified against the b9829 parser; DRY parsing is vocab-independent, so they run with nullptr vocab like the existing schema tests. An upstream field rename now fails here instead of silently disabling the feature. Total C++ suite 454 → 459.

Java (model-gated; LlamaModelTest.testDrySamplingAltersRepetitiveGeneration):
End-to-end proof that the dry_* fields actually reach the native sampler. Greedy decoding (withTopK(1)) + a fixed seed make two completions of the same repetition-saturated prompt byte-identical unless the sampler changes; a strong DRY config (multiplier 4.0, allowed_length 2, penalty_last_n -1) must diverge from the DRY-disabled baseline. Self-skips via the class @BeforeAll assumeTrue(model present), so it runs only in CI (codellama-7b.Q2_K), exactly like the other model tests.

Updated the C++ test counts + test_server.cpp scope note in CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
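The reasoning behind the model-gated test (a deterministic decoder yields identical output unless the sampler changes) can be reproduced without a model. The following is a toy sketch under stated assumptions: the three-token "model", the repeat search, and the class name `DryDivergenceSketch` are all invented for illustration, and the penalty uses the commonly described DRY form `multiplier * base^(matchLength - allowedLength)`. It ignores sequence breakers and the penalty_last_n window, so it is not llama.cpp's sampler.

```java
import java.util.Arrays;

/** Toy illustration only; the model, vocabulary and repeat search are assumptions. */
final class DryDivergenceSketch {
    private DryDivergenceSketch() {}

    // Toy "model" over tokens 0, 1, 2: it strongly prefers to alternate 0 and 1
    // and gives token 2 a weaker, constant score.
    static double[] logits(int[] ctx, int len) {
        int last = ctx[len - 1];
        double[] out = {0.0, 0.0, 3.0};
        out[last == 0 ? 1 : 0] = 5.0;
        return out;
    }

    // Longest suffix of ctx[0..len) that already appeared directly before an
    // earlier occurrence of `candidate`, i.e. how long a repeat it would extend.
    static int matchLength(int[] ctx, int len, int candidate) {
        int best = 0;
        for (int i = 0; i < len; i++) {
            if (ctx[i] != candidate) {
                continue;
            }
            int n = 0;
            while (n < i && ctx[i - 1 - n] == ctx[len - 1 - n]) {
                n++;
            }
            best = Math.max(best, n);
        }
        return best;
    }

    // Greedy decoding (always take the top-scoring token); multiplier 0 disables the penalty.
    // The prompt must contain at least one token.
    static int[] generate(int[] prompt, int steps, double multiplier, double base, int allowedLength) {
        int[] ctx = Arrays.copyOf(prompt, prompt.length + steps);
        for (int len = prompt.length; len < ctx.length; len++) {
            double[] scores = logits(ctx, len);
            int bestToken = 0;
            for (int t = 0; t < scores.length; t++) {
                int n = matchLength(ctx, len, t);
                if (multiplier > 0.0 && n >= allowedLength) {
                    scores[t] -= multiplier * Math.pow(base, n - allowedLength);
                }
                if (scores[t] > scores[bestToken]) {
                    bestToken = t;
                }
            }
            ctx[len] = bestToken;
        }
        return Arrays.copyOfRange(ctx, prompt.length, ctx.length);
    }

    public static void main(String[] args) {
        int[] prompt = {0, 1, 0, 1}; // repetition-saturated prompt
        int[] baseline = generate(prompt, 4, 0.0, 1.75, 2);
        int[] dry = generate(prompt, 4, 4.0, 1.75, 2);
        System.out.println("baseline: " + Arrays.toString(baseline));
        System.out.println("dry:      " + Arrays.toString(dry));
        System.out.println("diverged: " + !Arrays.equals(baseline, dry));
    }
}
```

With the penalty disabled, the toy decoder keeps alternating 0 and 1 on every run. With multiplier 4.0 and allowed length 2, the repeated token's score drops below that of token 2 at the first step, so the two outputs differ. The base of 1.75 is an arbitrary choice for this sketch.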

…just Linux)

Previously the nomic-embedding, vision, and TTS integration tests (LlamaEmbeddingsTest, MultimodalIntegrationTest, TtsIntegrationTest) only ran on the primary Linux x86_64 job. The macOS (x3) and Windows (x2) test jobs downloaded just the required 5 + vision and ran without the nomic/tts properties, so those tests self-skipped there. The validate-models output's "optional, skipped: not present" lines (plus the Linux job validating before the TTS download step) made it look like the tests never ran at all.

Now every Java test job downloads the full model set BEFORE validating and passes all the -Dnet.ladenthin.llama.* properties, so the embedding/vision/TTS tests run on all platforms:

- publish.yml: add nomic + OuteTTS + WavTokenizer downloads to the 3 macOS and 2 Windows test jobs; add nomic.path + tts.ttc.model + tts.vocoder.model to each job's mvn invocation; on the Linux job move the TTS downloads ahead of the validate step so all downloads precede validation uniformly.
- validate-models.sh / validate-models.bat: nomic + vision + TTS are now REQUIRED (a missing model hard-fails instead of silently self-skipping); only the audio-input model (no CI download) remains a self-skip.

Cache: key stays `gguf-models-v1` (not bumped). Every test job now downloads the full set, so whichever job wins the immutable-key save race caches everything. However, the existing v1 entry was saved without nomic/TTS and actions/cache won't overwrite a present key, so the old entry must be deleted once for the next run to rebuild a complete cache. Documented in CLAUDE.md "Java tests".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY


…soning budget

The CI failure (ReasoningBudgetTest.testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed) was the intended signal, not a regression: that test pinned the *unfixed* llama.cpp bug (per-request reasoning_budget_tokens dropped by the server-common.cpp copy loop) and asserted reasoning_content stays present. patches/0004 (upstream PR #23116), added on this branch, fixes the bug, so the CI-built native lib now suppresses thinking at budget=0, and the bug-pinning assertion correctly fails. Its own message said: "If this assertion fails, the bug has been fixed — remove this test and enable [the suppression test]."

Done exactly that, leaving one sharp test:

- Removed testReasoningBudgetZero_parameterAccepted_thinkingNotSuppressed (the bug-behavior assertion).
- Enabled + renamed the @Disabled correct-behavior test to testReasoningBudgetZero_suppressesThinking; it asserts reasoning_content is empty when reasoning_budget_tokens=0, with temperature=0 for cross-platform determinism. Dropped the now-unused @Disabled import.
- Updated the class Javadoc / @ClaudeGenerated purpose from "known limitation, not enforced" to "enforced via patches/0004", and repointed the positive-budget test's dangling {@link} to the surviving test.

If/when a pinned b<nnnn> includes PR #23116 and patches/0004 is dropped, this test keeps asserting the correct behavior and would flag any regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY

The step was labelled "Download vision model (upstream kherud#103 / #34)", a cosmetic cross-reference to the upstream issues that originally requested vision support, not a dependency. It reads like one, so strip it: the 6 step names become "Download vision model" and the env comment loses the same parenthetical. No behavior change.

Untouched (different in kind): the SPDX `Konstantin Herud` copyright headers (MIT-license attribution, legally required for this fork), the README "forked from / many thanks" credit, the SECURITY/CHANGELOG pre-fork history, and the docs/history upstream-issue catalog.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY

…the vision test

Removes the cosmetic "(upstream kherud#103 / #34)" annotation from the three label spots (TestConstants Javadoc, the CLAUDE.md model table, and the README system-properties table), where it read like a dependency. Keeps the provenance in MultimodalIntegrationTest (it explains why the test exists), but as full URLs to the pre-fork upstream repo: kherud#103 and /issues/34, not a bare "#103", which GitHub would resolve against THIS repo (bernardladenthin/java-llama.cpp) instead of kherud's.

Untouched: SPDX Konstantin Herud copyright headers (MIT-license attribution), the README fork credit, and the SECURITY/CHANGELOG/docs-history provenance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY

sonarqubecloud · 2026-06-28T10:20:16Z

Quality Gate passed

Issues
13 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

claude added 2 commits June 28, 2026 08:53

bernardladenthin temporarily deployed to startgate June 28, 2026 09:15 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 28, 2026 09:24 — with GitHub Actions Inactive

bernardladenthin had a problem deploying to startgate June 28, 2026 09:53 — with GitHub Actions Error

bernardladenthin temporarily deployed to startgate June 28, 2026 09:58 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 28, 2026 10:13 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 28, 2026 10:17 — with GitHub Actions Inactive

bernardladenthin merged commit 6b7503d into main Jun 28, 2026
25 of 46 checks passed

bernardladenthin deleted the claude/inference-parameters-dry-sampling-j2tbm0 branch June 28, 2026 10:46

bernardladenthin merged 7 commits into main from claude/inference-parameters-dry-sampling-j2tbm0
