release: v0.9.2 #1259
Merged
Register the RF-DETR keypoint preview pose model with xnnpack, coreml and mlx backends (all fp32). This is a beta preview export and may be re-exported under a different constant once a stable version ships.

- modelUrls/modelRegistry: add the three backend URLs and variant map
- PoseEstimationModule/types: register the model config (single-`forward` export, no inputSize axis) and extend PoseEstimationModelSources
- demo: load it via usePoseEstimation in the pose estimation screen
- docs: list it in the model registry and usePoseEstimation supported models

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
barhanc
approved these changes
Jun 17, 2026
Member
We should definitely include the fix for the vision encoder in this patch. Please check whether there are other applicable additions.
## Description

In any multimodal conversation with more than one image, the model starts describing earlier images as the most recently sent one on later turns.

`VisionEncoder::encode` caches the `EValue` returned by `vision_encoder.execute()` per image path. That tensor aliases the method's reusable output buffer, so the next `execute()` (the second image, or any later encode) overwrites the bytes behind every cached entry. On re-prefilled turns the prefiller then splices the latest image's embeddings into every image slot.

The audio path already snapshots its encoder output for exactly this reason (see the `AudioSlot` comment in `multimodal_prefiller.cpp`); vision never got the same treatment.

The fix copies the encoder output into bytes owned by the cache entry immediately after `execute()` and serves cache hits from a tensor wrapping those owned bytes (`unordered_map` nodes are pointer-stable, so the blob stays valid).

The bug is backend-independent (the cache sits above the delegate), so XNNPACK/Vulkan multimodal models are affected the same way.

### Introduces a breaking change?

- [ ] Yes
- [x] No

### Type of change

- [x] Bug fix (change which fixes an issue)
- [ ] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [ ] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [ ] Android

### Testing instructions

1. Run the example LLM app with a multimodal model (e.g. Gemma 4 E2B multimodal) on the Multimodal LLM screen.
2. Send image A with "What's in this picture?". The answer is correct.
3. Send image B (different content) with the same question. The answer is correct.
4. Ask "What was in the FIRST picture I sent?".

Before this fix, step 4 describes image B's content (both image slots receive B's embeddings on the re-prefilled turn). After the fix, the model correctly recalls image A.
### Screenshots

N/A

### Related issues

N/A

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
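The aliasing bug and the owned-copy fix described above can be reproduced in isolation. The following is a minimal sketch, not the actual `VisionEncoder` code: `FakeEncoder`, `AliasingCache` and `OwningCache` are hypothetical stand-ins, and a plain `float` buffer replaces the `EValue` tensor. It only assumes what the description states, namely that `execute()` reuses one output buffer and that the cache is keyed by image path.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an encoder whose execute() always writes into the
// same reusable output buffer, as the description says the real method does.
struct FakeEncoder {
  std::vector<float> output = std::vector<float>(4, 0.0f);

  // Returns a pointer into the shared buffer; the next call overwrites it.
  const float* execute(float seed) {
    for (std::size_t i = 0; i < output.size(); ++i) {
      output[i] = seed + static_cast<float>(i);
    }
    return output.data();
  }

  std::size_t size() const { return output.size(); }
};

// Buggy variant: caches the pointer that aliases the encoder's buffer, so
// every cached entry silently follows the most recent execute().
struct AliasingCache {
  std::unordered_map<std::string, const float*> entries;

  const float* encode(FakeEncoder& enc, const std::string& path, float seed) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second;
    const float* out = enc.execute(seed);
    entries.emplace(path, out);
    return out;
  }
};

// Fixed variant: copies the output into bytes owned by the cache entry right
// after execute(). unordered_map never relocates its elements on rehash, so
// the pointer handed out for a cache hit stays valid.
struct OwningCache {
  std::unordered_map<std::string, std::vector<float>> entries;

  const float* encode(FakeEncoder& enc, const std::string& path, float seed) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second.data();
    const float* out = enc.execute(seed);
    auto result = entries.emplace(path, std::vector<float>(out, out + enc.size()));
    return result.first->second.data();
  }
};
```

With `AliasingCache`, encoding a second image changes the values read through the first image's cached pointer; with `OwningCache`, the first image's embeddings are unchanged after any number of later encodes.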
## Description

Optimizes token sampling for large-vocabulary models (e.g. Gemma 4 E2B, 262k vocab), where the previous full-vocabulary sort in top-p dominated per-token latency.

Two changes in `sampler.cpp`:

- **`mask_topp`**: replaces the `O(n log n)` sort over all logits with a logit-space histogram (`kBins=2048`) that locates the nucleus threshold in two `O(n)` passes, with no sort and no per-token vocab-sized allocation. Binning in logit space (rather than probability space) keeps uniform resolution for both peaked and flat distributions.
- **`softmax`**: skips `exp()` on logits already masked to `lowest()` by top-k/top-p. The result underflows to zero anyway, and the call is slow on device.

On an iPhone 17 Pro with Gemma 4 E2B (int4), per-token sampling drops from ~45 ms to ~10 ms.

The histogram approximates the exact sort-based nucleus; the resulting sampled distribution is statistically equivalent (verified the kept-mass fraction stays within <1% of the exact nucleus across peaked, flat, and sharp distributions).

### Introduces a breaking change?

- [ ] Yes
- [x] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [ ] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [x] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [ ] Android

### Testing instructions

1. Run an LLM with a large vocabulary and a non-zero temperature with `topP` set (e.g. Gemma 4 E2B with `temperature: 0.8`, `topP: 0.9`).
2. Generate a long response and observe tokens/sec.
3. Confirm output remains coherent and sampling is unchanged in character (still stochastic, not greedy).

Greedy decoding (`temperature: 0`) is unaffected; it bypasses this path entirely.
### Screenshots

<!-- N/A -->

### Related issues

<!-- N/A -->

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes

The histogram is an approximation bounded by bin granularity (`kBins=2048` over a `kRange=40` logit span). This is intentional: exact top-p over a 262k vocab where the nucleus can exceed 100k tokens is inherently expensive, and the sampling outcome is statistically indistinguishable from the exact version.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
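The histogram idea above can be illustrated with a small standalone sketch. This is not the code from `sampler.cpp`: the function names `mask_topp_histogram` and `softmax_skip_masked` are hypothetical, the pass structure (one extra pass to find the maximum logit) is an assumption, and only the constants `kBins=2048` and `kRange=40` come from the notes. Logits more than `kRange` below the maximum are lumped into the last bin.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

constexpr int kBins = 2048;      // from the PR notes
constexpr float kRange = 40.0f;  // logit span covered by the histogram

// Approximate top-p: bin softmax mass by logit value, walk bins from the
// highest logits downward until top_p of the mass is covered, then mask
// everything in lower bins. Returns the number of logits kept.
inline std::size_t mask_topp_histogram(std::vector<float>& logits, float top_p) {
  if (logits.empty()) return 0;
  const float lowest = std::numeric_limits<float>::lowest();
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  const float bin_width = kRange / static_cast<float>(kBins);

  // Bin 0 holds the largest logits; anything below the span goes to the last bin.
  auto bin_of = [&](float x) {
    const float offset = (max_logit - x) / bin_width;
    return offset >= static_cast<float>(kBins) ? kBins - 1 : static_cast<int>(offset);
  };

  // Pass 1: accumulate unnormalized probability mass per bin (fixed-size array).
  std::array<double, kBins> mass{};
  double total = 0.0;
  for (float x : logits) {
    if (x == lowest) continue;  // already masked, contributes no mass
    const double w = std::exp(static_cast<double>(x) - max_logit);
    mass[bin_of(x)] += w;
    total += w;
  }

  // Locate the bin in which the cumulative mass first reaches top_p.
  double acc = 0.0;
  int cutoff = kBins - 1;
  for (int b = 0; b < kBins; ++b) {
    acc += mass[b];
    if (acc >= static_cast<double>(top_p) * total) { cutoff = b; break; }
  }

  // Pass 2: mask logits that fall in bins below the cutoff.
  std::size_t kept = 0;
  for (float& x : logits) {
    if (x == lowest) continue;
    if (bin_of(x) > cutoff) x = lowest; else ++kept;
  }
  return kept;
}

// Softmax that skips exp() for masked logits and writes 0 for them directly.
inline void softmax_skip_masked(std::vector<float>& logits) {
  const float lowest = std::numeric_limits<float>::lowest();
  float max_logit = lowest;
  for (float x : logits) max_logit = std::max(max_logit, x);
  double sum = 0.0;
  for (float& x : logits) {
    if (x == lowest) { x = 0.0f; continue; }
    x = std::exp(x - max_logit);
    sum += x;
  }
  if (sum > 0.0) {
    for (float& x : logits) x = static_cast<float>(x / sum);
  }
}
```

For example, calling `mask_topp_histogram` on the logits `{10, 9, 0, -5, -50}` with `top_p = 0.9` keeps the two largest entries and masks the rest; a following `softmax_skip_masked` then only evaluates `exp()` for those two. Because whole bins are kept or dropped, the kept set can be slightly larger than the exact nucleus, which is the approximation the notes describe.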
benITo47
approved these changes
Jun 17, 2026
Description
Adds a patch with RF-DETR keypoint preview support.