xmtp_mls: MlsGroup::create_external_invite (produce QR-invite payload + encrypted GroupInfo) by tylerhawkes · Pull Request #3673 · xmtp/libxmtp

tylerhawkes · 2026-05-21T08:21:25Z

Summary

Adds the invite-creation half of the QR-invite flow.

MlsGroup::create_external_invite(opts) -> CreateExternalInviteOutput
Computes group_id_hash = sha256(group_id()), generates fresh symmetric key + nonce, exports a current-epoch GroupInfo (with ratchet tree bundled), wraps it via wrap_group_info_with_nonce, and returns serialized ExternalInvitePayload + EncryptedGroupInfoBlob protos.
The envelope's epoch and group_state_hash fields are populated here (the caller has the MlsGroup; the pure-codec helpers in L-5 leave them zero).
Refuses to operate on pre-AppData groups — external-commit semantics depend on the AppData GROUP_MEMBERSHIP component existing.
Refuses to operate on DM groups (same constraint as set_external_commit_policy).

Stack

Base: tyler/qr-invite-l9-openmls-bump (deps: bump openmls fork to feature/external-commit-by-value @ 9cdd5e7 #3672).
Bindings (NAPI) in PR-B1.

Test plan

Integration coverage via PR-T1.
cargo check -p xmtp_mls passes.

Note

Add `MlsGroup::create_external_invite` to produce QR-invite payload and encrypted GroupInfo

Adds MlsGroup::create_external_invite in external_invite.rs that exports the current epoch GroupInfo, encrypts it via wrap_group_info, and builds an ExternalInvitePayload governed by the group's EXTERNAL_COMMIT_POLICY.
Adds MlsGroup::set_external_commit_policy and set_allow_external_commit in mod.rs to write policy into group AppData, generating a symmetric key and external group ID when enabling.
Adds ValidatedCommit::from_external_commit in validated_commit.rs and a corresponding Sender::NewMemberCommit path in mls_sync.rs so external commits authored by non-members are validated against policy and processed.
Replaces wrap_welcome/unwrap_welcome with builder-style wrap_payload_hpke/wrap_payload_symmetric APIs in payload_encryption.rs, adding WELCOME_HPKE_LABEL and XMTP_EXTERNAL_INVITE_LABEL for domain separation.
Removes the separate sync_api_client from ClientBuilder, XmtpMlsLocalContext, and all bindings; replaces dual api_clients() with single api_client() and introduces WorkerConfig for per-worker interval, jitter, and enablement control exposed to WASM, Node, iOS, and Android bindings.
Adds the xmtp_logging crate with a platform-aware XmtpLoggingBuilder, runtime LoggingHandle, OTLP telemetry support, and rolling-file logging for native targets; removes logging from xmtp_common.
Risk: removing sync_api_client is a breaking API change for all bindings and any callers using api_clients(); WorkerKind::Event is removed from the enum.

^{Macroscope summarized 57e7e30.}

claude · 2026-05-21T08:21:42Z

Claude finished @tylerhawkes's task —— View job

Code Review — 2026-05-27 18:35 UTC

Reviewed the implementation of MlsGroup::create_external_invite for producing QR-invite payloads with encrypted GroupInfo.

Summary

Overall solid implementation with good cryptographic practices and defensive validation. The code follows expected patterns for this codebase and properly delegates to well-tested helper functions. Main concerns are around test coverage gaps and a few edge cases in expiry computation.

Code Quality & Best Practices

Positive:

Clean separation of concerns: orchestration logic in external_invite.rs, crypto primitives delegated to xmtp_mls_common
Excellent documentation throughout — the module-level comments clearly explain the flow and lifecycle
Proper use of tracing instrumentation (#[tracing::instrument])
Good error handling with descriptive error messages
Constants are well-defined and documented (SYMMETRIC_KEY_LEN, MIN_EXTERNAL_GROUP_ID_LEN)

Minor issues:

compute_blob_expiry at line 182 uses u64::try_from(now_ns).unwrap_or(0) — this silently converts negative timestamps to 0, which could mask bugs. Consider adding validation or using u64::saturating_from with a comment explaining the edge case.

Potential Bugs / Edge Cases

Integer conversion safety (external_invite.rs:192):
```
let now_u64 = u64::try_from(now_ns).unwrap_or(0);
candidates.push(now_u64.saturating_add(policy.expire_in_ns));
```
If now_ns is negative (shouldn't be, but defensive programming), this becomes 0 and could produce unexpectedly short expiry times. Consider asserting now_ns >= 0 or documenting why negative values are impossible.
Expiry edge case (external_invite.rs:200):
When all expiry sources are 0, this returns 0 (no expiry). This is documented and correct, but worth noting that expired blobs (with expires_at_ns = 0) will be accepted indefinitely. The unwrap logic correctly handles this (see encrypted_group_info.rs:146).
DM group check missing:
The PR description states "Refuses to operate on DM groups (same constraint as set_external_commit_policy)," but I don't see a DM check in create_external_invite. Should there be a check like:
```
if self.is_dm()? {
    return Err(GroupError::DmGroupMetadataForbidden);
}
```
This might be enforced implicitly (DM groups shouldn't have the policy set), but an explicit check would be clearer and safer.
Pre-AppData group check (external_invite.rs:115-116):
The code correctly refuses to operate when EXTERNAL_COMMIT_POLICY is absent, which implicitly enforces the "pre-AppData groups" constraint mentioned in the PR description. This is correct but subtle — consider adding a comment explaining this connection.

Security Concerns

Strong points:

Proper use of ChaCha20Poly1305 AEAD via OpenMLS provider
Fresh nonce generation for each encryption (critical for AEAD security) — correctly delegated to wrap_group_info which calls generate_nonce()
Symmetric key length validation before use
External group ID minimum length enforcement
No key or sensitive data logged (tracing uses skip_all)

Observations:

The symmetric key is read from EXTERNAL_COMMIT_POLICY (set by admin via earlier PR in the stack). This means all members with the policy visible can create invites, which appears to be intentional per the documentation. Worth confirming this is the desired authorization model.
group_state_hash uses epoch_authenticator() which is correct per OpenMLS semantics.

Performance Considerations

export_group_info with with_ratchet_tree = true can be expensive for large groups, but this is documented and expected (joiner needs the tree for external commit). No issues here.
The lock pattern load_mls_group_with_lock_async is appropriate for this operation.
encode_to_vec() allocations are fine for invite-creation path (infrequent operation).

Test Coverage

Codecov reports 59.13% patch coverage with 47 uncovered lines. This is the main concern.

Looking at the tests:

create_external_invite_fails_when_policy_not_set (lines 216-228) covers the basic error case
compute_blob_expiry_picks_tightest_active_value (lines 238-273) has good unit coverage for the expiry logic

Missing coverage likely includes:

Happy path: successful invite creation when policy IS set ⚠️
Malformed policy scenarios (wrong key length, short external_group_id) ⚠️
Edge cases around ensure_not_paused() failure
export_group_info error path
Interaction between blob_expires_at_ns hint and policy expiry

Recommendation: The PR description notes "Integration coverage via PR-T1" (unchecked). Consider adding at least one happy-path test to this PR that:

Calls set_allow_external_commit(true) to enable the policy
Calls create_external_invite successfully
Validates the returned payload and encrypted blob can be decoded

This would significantly improve coverage and document expected usage.

Error Handling Review

The new error variants in error.rs are well-designed:

ExportGroupInfo(#[from] openmls::group::ExportGroupInfoError),          // Line 452
EncryptedGroupInfo(#[from] EncryptedGroupInfoError),                    // Line 460
ExternalCommitNotAllowed,                                               // Line 466
ExternalCommitPolicyMalformed(String),                                  // Line 473

All four are correctly marked as non-retryable in the RetryableError impl (lines 668-671). Good.

Recommendations

High priority: Add happy-path test coverage showing successful invite creation with a valid policy
Medium priority: Add explicit DM group check if intended (or add comment explaining why it's not needed)
Low priority: Consider adding validation for negative now_ns in compute_blob_expiry, or document why it's safe
Low priority: Add test case for malformed policy scenarios (wrong key/ID lengths)

Overall assessment: Well-implemented feature with good cryptographic hygiene. The main gap is test coverage of the happy path, which should be addressed before or in PR-T1.

macroscopeapp · 2026-05-21T08:22:34Z

Approvability

Verdict: Needs human review

Unable to check for correctness in 57e7e30. Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.

^{You can customize Macroscope's approvability policy. Learn more.}

codecov · 2026-05-21T08:33:31Z

Codecov Report

❌ Patch coverage is 59.13043% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.82%. Comparing base (b4b3789) to head (e8978a2).

Files with missing lines	Patch %	Lines
crates/xmtp_mls/src/groups/external_invite.rs	59.13%	47 Missing ⚠️

Additional details and impacted files

@@                         Coverage Diff                         @@
##           tyler/qr-invite-l9-openmls-bump    #3673      +/-   ##
===================================================================
- Coverage                            83.96%   83.82%   -0.14%     
===================================================================
  Files                                  401      402       +1     
  Lines                                59586    59701     +115     
===================================================================
+ Hits                                 50031    50046      +15     
- Misses                                9555     9655     +100

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

### Add `--read-only` healthcheck mode and run reverse read-only pass in cross-talk test - Adds a `--read-only` flag to the healthcheck subcommand that skips all write operations and reuses an existing primary identity; the flag gates write ops via a new `WRITES` condition bit and a new `BOOTSTRAP` condition for empty identity stores. - Extends the cross-talk test healthcheck phase to run a forward write-enabled pass followed by a reverse read-only pass over older versions, annotating result rows with a `leg` field (`fwd`/`rev`). - Refactors `App` into a unit struct with a global `CONF` singleton and `App::network()`/`App::network_hash()` accessors, removing per-instance and per-call network arguments across clients, stores, and subcommands. - Introduces `DmStore` and a `Dm` data model for persisting direct message threads during sync, with metadata tracking and display support. - Adds `BootstrapIdentities` health op that creates and registers three new identities when both `BOOTSTRAP` and `WRITES` conditions are active. - Risk: `STRICT_VERSIONING` bit shifted from `1<<0` to `1<<1`; any serialized or compared condition values will be incorrect unless updated.  <a href="https://app.macroscope.com">Macroscope</a> summarized 2326a91.

### Patch `tracing-oslog` dependency to remove panics Replaces the `tracing-oslog` crate with a forked version pinned to a specific git revision that removes panics from the library.  <a href="https://app.macroscope.com">Macroscope</a> summarized bcccc6e.

…spam (#3693) Two focused changes to cut production log volume. Driven by an audit of a 1000-line VictoriaLogs sample from 5 herald instances, where **53% of lines were TRACE** and a single steady-state message accounted for **~40% of all log lines**. ## Commit 1 — `chore(logs): remove steady-state log noise` Deletes logs that fire on every poll/tick when *nothing happened* (the largest volume sources): - **key_package_cleaner**: drop per-poll `"No expired key packages to delete"` trace (~40% of sampled prod lines). - **device_sync worker**: drop the 20s `"New event: Tick"` heartbeat trace; log only real events, with `installation_id` as a structured field (not a `[..]` message prefix). - **sql_key_store**: drop the bare `"own_leaf_nodes"` trace (fires on every leaf-node read). - **worker metrics**: downgrade per-increment `"firing {metric}"` from `info` → `trace`. ## Commit 2 — `refactor(api): remove ApiDebugWrapper stats-on-error spam` On **every** API error, `ApiDebugWrapper` snapshotted all per-endpoint request counters into `AggregateStats` and embedded its multi-line `Debug` dump into `ApiClientError::{ErrorWithStats, ClientWithEndpointAndStats}` — which then got logged, and re-logged on each retry, as a verbose stats block. This removes the error-dump path so API errors are clean one-liners: - Delete `ApiDebugWrapper` and its module re-export. - Remove `ClientBuilder::enable_api_debug_wrapper` and drop the call from the mobile/node/wasm client builders. - `WrappedXmtpApiClient` is now `TrackedStatsClient<XmtpApiClient>`. - Remove the now-unused `ErrorWithStats` / `ClientWithEndpointAndStats` variants and their match arms. **Kept unchanged:** the `TrackedStatsClient` request counters and the FFI `api_statistics()` / `api_aggregate_statistics()` / `clear_all_statistics()` surface — only the error-message spam is removed. ## Verification - `cargo check` + `cargo clippy` clean on `xmtp_proto`, `xmtp_api`, `xmtp_mls`, `xmtp_db` (the only warning, `identity.rs:842` unused `mut`, is pre-existing and unrelated). - `just node check` and `just wasm check` both pass after the binding edits. ## Follow-ups (not in this PR) - Replace the in-RAM `TrackedStatsClient` counters with OTEL metrics (per-API-call latency/rate at the retry combinator), then retire the stats layer once the FFI is migrated. - Broader log standardization + OTEL traces/metrics rollout (separate audit docs exist locally). 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Remove API debug wrapper and reduce steady-state log noise across MLS workers > - Deletes the `ApiDebugWrapper` and `enable_api_debug_wrapper` builder method; the `WrappedXmtpApiClient` type alias now resolves to `TrackedStatsClient<XmtpApiClient>` without the debug layer. > - Removes `ClientWithEndpointAndStats` and `ErrorWithStats` variants from `ApiClientError`, so errors no longer carry embedded aggregate stats. > - Suppresses high-frequency log output: tick events in `DeviceSyncWorker`, no-op key package cleaner runs, repeated DB connection-outage errors, and per-metric info logs are all silenced or downgraded to trace. > - Behavioral Change: API errors previously wrapped by the debug adapter are now surfaced as plain `ClientWithEndpoint` or `Client` variants with no attached stats. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 480bc19. >  >   --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When the SQLite connection pool cannot hand out a connection (exhausted / `timed out waiting for connection` / `Unable to open the database file`), a **single** root cause was logged dozens of times: each background worker logged the error itself and then propagated it to the supervisor, which logged it again on every poll/restart. In a production sample this turned one pool outage into a flood of duplicate `ERROR` lines. ## Approach: the supervisor is the single logger for propagated worker errors The worker supervisor already receives every error a worker returns and logs it with context (`{kind} worker error: {err}`), and routes pool failures to a quiet reconnect. So workers should **not** log errors they propagate. - **Remove the per-worker error logs** that just logged-then-propagated (`disappearing_messages`, `pending_self_remove` outer fetch, `key_package_cleaner` fetch). The supervisor is now the sole source. Per-group *swallow-and-continue* logs (e.g. one malformed `group_id`) are kept, since those never reach the supervisor. - **Fix a latent bug in `key_package_cleaner`**: it previously *logged and swallowed* a fetch error (returned `Ok`), so a pool outage was logged every 5s and never triggered the supervisor’s reconnect/restart path (silent infinite retry). It now propagates the error like the other workers. - **Classify pool-acquisition failures as needing a reconnect**: `StorageError::db_needs_connection()` now matches `PlatformStorageError::Pool(_)` (the r2d2 acquisition/timeout error) in addition to `PoolNeedsConnection`, so the supervisor takes the quiet reconnect path for a pool outage instead of the noisy error+restart path. This is the one place the predicate is consulted. - **Drop a redundant log**: `"Key package deletion successful"` immediately followed `"Deleted {N} expired key packages (up to ID …)"` — same meaning, removed. ## Tests - `pool_errors_need_connection` — unit test for the `StorageError::db_needs_connection()` classification (`PoolNeedsConnection` directly and wrapped in `Connection`; a non-pool error is not). The `Pool(_)` arm can’t be unit-constructed (`r2d2::PoolError` has no public ctor) so it’s covered end-to-end below. - `pool_failure_needs_connection` — end-to-end: disconnect a persistent store’s pool, run a real query, assert the resulting error reports `db_needs_connection()`. ## Incidental CI fix `xmtp_db`’s `test-utils` feature did not forward `xmtp_proto/test-utils`, so `cargo test -p xmtp_db` only compiled under workspace-wide feature unification (which CI uses). This masked a broken standalone per-crate test build. Added the missing feature forward. ## Verification - `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` ✓ - `cargo fmt --check` ✓ · `cargo hakari generate --diff` ✓ - Both new tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Deduplicate DB pool-failure error spam across MLS workers > - Extends `StorageError::db_needs_connection()` and `ConnectionError::db_needs_connection()` to classify generic `Pool(_)` acquisition failures (not just `PoolNeedsConnection`) as needing a reconnect. > - Implements `NeedsDbReconnect` for `MlsStoreError` so worker restart logic can inspect wrapped DB errors. > - Refactors `DisappearingMessagesCleaner`, `KeyPackagesCleaner`, and `PendingSelfRemoveWorker` to propagate structured errors instead of logging and swallowing them, so pool failures surface once via the reconnect path rather than flooding logs. > - Behavioral Change: workers now abort and return an error on the first DB failure per cycle instead of continuing past errors; this reduces log spam but changes per-cycle recovery behavior. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized ffe1014. >  >   Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Third log-noise PR (after #3693, #3696), driven by a **new herald log dump filtered to INFO-only** (2000 lines, ~15.5 min of healthy traffic — the first sample was TRACE-flooded and couldn't show INFO top-talkers). ## Headline finding from the INFO dump **`"Processing readd requests for N groups"` was 1705 of 2000 lines (85%)** — `commit_log.rs` `handle_incoming_pending_readds`, which runs every ~2s worker tick and logged at info **even when N=0**. Fixed: early-return when empty + downgrade to debug. This single change removes the dominant INFO offender. ## Changes (all info→debug, per-item/steady-state churn) - **commit_log.rs**: gate `"Processing readd requests"` on non-empty (the 85% offender); `"No commit log requests to query"` (nothing-to-do early return); the per-readd cluster (`Sending readd request`, `Sending readd request to`, `Sent readd request`, `Updated requested readd sequence id`); `Updating publish cursor for conversation` (per-conversation loop). - **welcome_sync.rs**: two `"[{}] syncing group"` (per-group, also dropped the duplicate inbox_id prefix); `unstuck previously paused group` (per-group). - **welcomes/xmtp_welcome.rs**: per-welcome routine logs (`attempting to commit welcome`, `updating cursor to`, `storing group with welcome id`, `Created GroupUpdated message`, `updated message cursor from welcome metadata`). Kept the semantic state-change events (re-add → ALLOWED, consent reset) and error logs at info/warn. - **mls_ext/decrypted_welcome.rs**: `Trying to decrypt welcome` / `welcome pointer` (per-welcome). - **welcome_pointer.rs**: `Resolving welcome pointer` (kept the retry-progress log). - **mls_store.rs**: `returning N welcomes` (per query). - **oneshot.rs**: `Processing oneshot welcome`. - **subscriptions/process_welcome.rs**: 3 per-stream-event logs. - **subscriptions/stream_messages.rs**: `encountered newly unprocessed message` (per message). - **xmtp_api_d14n/.../mls.rs**: `got N envelopes`, `querying welcomes` (per query). ## Deliberately out of scope - **mls_sync.rs** — the most sensitive code; its per-message logs are being decided separately to preserve debugging context. - **The structured `log_event!` event system** (the `➣`-prefixed events like `GroupSyncGroupInactive`) — left untouched pending a separate evaluation of whether it's worth keeping. ## Verification - `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` ✓ - `cargo fmt --check` ✓ - Log-level/message-only changes; no behavior change except the readd early-return (skips a no-op log when there is no work). 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Downgrade per-item welcome, sync, and query log statements from info to debug > Reduces log noise by lowering high-frequency, per-item `tracing` statements across welcome processing, group sync, commit log, and MLS query paths from `info` to `debug` level. Also adds an early-return guard in `handle_incoming_pending_readds` to skip logging entirely when there are no groups to process. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 357ed5d. >  >   Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Continues the log-noise cleanup (after #3693). These three sites were among the top remaining volume sources in the production sample, all logging normal-course behavior at warn/info. ## Changes - **`xmtp_common/src/retry.rs`** — the retry combinator (every API call funnels through it): - Per-attempt `"retrying function that failed"` was logged at **warn** on *every* retry. A single retry is normal transient behavior, so a flapping endpoint flooded prod warn logs. → **debug** (with an `attempt` field). - `"retry strategy exceeded max wait time"` stays at **warn** (it signals a call that never succeeded) but now carries context: `attempts`, `elapsed_ms`, and the error — previously it was a bare string with no indication of which call or how long. - **`xmtp_mls/src/groups/mls_sync.rs`** — `"[{inbox_id}] syncing group, epoch = …"` fired at **info** on every group sync, and duplicated `inbox_id` (both a structured field *and* the `[{}]` message prefix). → **debug**, dropped the redundant prefix, `epoch` is now a field. - **`xmtp_mls/src/groups/welcome_sync.rs`** — `"sync_welcomes failed, continuing with group sync"` and `"unstick_paused_groups failed, continuing with sync"` were **warn**, but both are explicitly non-fatal recovered paths. → **debug**. Net effect: prod logs at `info`+ now reflect real problems; normal retry/sync churn drops to `debug`. ## Verification - `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` ✓ - `cargo fmt --check` ✓ - Log-level/message-only changes; no behavior change. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Downgrade retry and sync log levels from warn/info to debug > - Retry attempts in `retry_async` now log at debug instead of warn; exhaustion logs still warn but now include attempt count, elapsed_ms, and the error message. > - Group sync start in `MlsGroup.sync` is downgraded from info to debug and switches to structured fields (inbox_id, installation_id, group_id, epoch). > - Failures in `WelcomeSync.sync_groups` for `unstick_paused_groups` and `sync_welcomes` are downgraded from warn to debug. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 1e96453. >  >   Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

## Summary Workers in `xmtp_mls` run under `WorkerRunner`. Its outer loop (`Worker::spawn`) stops a worker **only** when an error returned from `run_tasks` reports `needs_db_reconnect() == true` — that's the signal that the connection pool was dropped and the worker should stop until reconnect. Several workers violated this contract: they caught per-item errors with `tracing::error!` and continued, so a dropped pool could never reach the supervisor and the worker hot-looped against a gone pool. The signal was also silently lost in wrapper error types that mapped `GroupError(_) => false`. This cleans the workers up to follow the contract, using the recently-cleaned `disappearing_messages` worker as the model. ## Root fix (error-type forwarding) - `ConnectionError` gets a `db_needs_connection()` helper mirroring `StorageError`. - `GroupError`, `MlsStoreError`, and `SubscribeError` now implement `NeedsDbReconnect`, forwarding their storage / connection / client / wrapped variants. - `DeviceSyncError::needs_db_reconnect` now forwards `Db` / `Group` / `MlsStore` / `Subscribe` instead of only `Client`. - `CommitLogError::needs_db_reconnect` forwards `GroupError` and narrows `Connection` to `db_needs_connection()` (was unconditionally `true`); removes the stale TODOs. ## Workers - **key_package_cleaner** — propagate the expired-KP fetch error (was swallowed → `Ok`); bubble a disconnect caught while deleting a single key package. - **pending_self_remove** — capture the group-load (`MlsStoreError`) path (previously never converted into the worker error) and bubble a disconnect from either per-group operation; fix `GroupError(_) => false`. - **commit_log** — bubble a disconnect from the readd-request and readd-validation loops. - **device_sync** — bubble a disconnect from the per-message processing loop. Per-item **non-DB** failures still log and continue, as before — only a dropped pool now stops the worker. ## Other - Removes the dead `WorkerKind::Event` variant (no impl, no construction, not in bindings/FFI/serde). - Adds native-only unit tests pinning the disconnect-forwarding contract for every worker error type (`disconnect_propagation_tests`). ## Test plan - `cargo nextest run -p xmtp_mls disconnect_propagation` — 7/7 pass. - `cargo check -p xmtp_mls --tests` clean. - `just lint-rust` clean (clippy / fmt / hakari). 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Propagate DB-disconnect errors out of workers to the supervisor > - Adds or fixes `needs_db_reconnect()` implementations across multiple error types (`GroupError`, `ClientError`, `IdentityError`, `SubscribeError`, `DeviceSyncError`, `PendingSelfRemoveWorkerError`) so dropped-pool signals are forwarded up the call stack rather than swallowed. > - Errors in `publish_commit_logs_to_remote` and `prepare_publish_commit_log_info` now propagate instead of being logged and ignored, so cursors only advance on success. > - `send_outgoing_readd_requests` now fails the operation when any readd request fails, instead of logging and continuing. > - `handle_incoming_pending_readds` now aborts early on retryable validation errors to avoid hot loops. > - Device-sync and archive import paths abort on DB errors instead of logging warnings and continuing. > - Behavioral Change: workers that previously swallowed DB-disconnect errors will now surface them to the supervisor, triggering reconnect handling. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 4c78768. >  >   Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When an own-intent failed non-retryably, process_message_inner marked the intent row Error, committed, and returned Ok(identifier) — dropping the real GroupMessageProcessingError. process_messages recorded only a success, so sync_until_intent_resolved returned a summary that displayed '0 failed 1 succeeded' yet surfaced as a top-level GroupError::Sync with no cause. Capture the dropped cause via a ProcessedMessageOutcome struct and record it in ProcessSummary.errored, so the summary (and its source()) report the actual failure. No DB/txn-commit or retry-rollback behavior changes.    > [!NOTE] > ### Fix `SyncSummary.extend` to preserve the earliest error cause across sync retries > - `SyncSummary.extend` previously overwrote `self.other` unconditionally, clobbering earlier error causes with later `None` values; it now only sets `self.other` when `self.other` is currently `None`. > - `GroupError::SyncFailedToWait` now includes the `SyncSummary` in its `Display` output so the cause is visible in error messages. > - Per-attempt error logging inside `MlsGroup::sync` is removed; summaries are accumulated silently and logged once by `GroupSyncFinished`. > - `GroupSyncIntentErrored` events no longer redundantly include the summary metadata field. > - Two unit tests are added to [summary.rs](https://github.com/xmtp/libxmtp/pull/3701/files#diff-cc6e0f2ec334b2a161a0e24bf0a280e37287801dcc7c6e40c46b7d5153580240) validating the corrected `extend` accumulation behavior. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 55e3eec. >  >

…3702) ## Problem `SyncSummary` had two predicates that disagreed about what "errored" means: - **`Display`** (success banner) fires only when `publish_errors`, `post_commit_errors`, AND `other` are all empty. - **`is_errored()`** used `&&` for publish/post_commit (`!publish.is_empty() && !post_commit.is_empty()`), so a sync with *only* publish errors (or only post-commit errors) returned `is_errored() == false` — reported as not-errored — while its `Display` printed the `===== Errors Occurred During Sync =====` banner. These two govern whether a summary may sit in the `Err` position (`sync_with_conn`) and which `Display` branch is shown, so the divergence meant an `Ok(...)` summary could render as an error block (and vice versa). Separately, `GroupError::Sync` / `SyncFailedToWait` box a `SyncSummary` whose `source()` returned `None`, so the FFI `Caused by:` chain dead-ended at the summary's flattened string — the underlying error was never reachable. ## Changes (summary.rs only) - **`is_errored()`**: `&&` → `||` for publish/post_commit, and unify on one predicate. `process.errored` (per-message failures) is *deliberately excluded* and documented: a sync that decrypts some messages and fails others is still a successful sync of the group as a whole. - **`Display`**: success branch is now `if !self.is_errored()` — the two can no longer drift. - **`source()`**: returns the first available cause (`other` → first publish → first post_commit → first `process.errored`), so the FFI error chain reaches the real failure. - **6 unit tests**: clean / publish-only / post-commit-only / other / per-message / source-priority. The publish-only test is the explicit regression guard for the `&&` bug. ## Verification - `cargo clippy --locked -p xmtp_mls --all-features --all-targets --no-deps -- -Dwarnings` + `cargo fmt --check` clean - New unit tests pass; sync/welcome/commit_log/process_message integration suites pass (no regression from the `is_errored` behavior change) Pairs with #3701 (which captures the dropped intent-error cause into `process.errored`); this PR makes the type honest, #3701 fills in the cause. Independent-compiling; recommend landing this first. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Fix `SyncSummary` error detection logic and expose error cause via `source()` > - `is_errored()` previously required both `publish_errors` and `post_commit_errors` to be non-empty (logical AND); it now returns `true` when any of `other`, `publish_errors`, or `post_commit_errors` is present. > - `std::error::Error::source()` now returns the first available underlying error, preferring `other`, then `publish_errors`, then `post_commit_errors`, then per-message processing errors; previously it returned `None`. > - `Display` formatting now delegates to `is_errored()` instead of duplicating the error-check logic. > - Behavioral Change: callers relying on `is_errored()` returning `false` when only one of publish or post-commit errors is present will now see `true`. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 9d90b03. >  > >

…st (#3703) ## Problem Loading clients in xdbg was dominated by TLS setup, not DB or MLS work. Phase timing while loading 100 identities showed ~100% of the time inside `network.client_bundle()`, with a tell-tale monotonic +42ms-per-call stair-step — i.e. a global lock serializing ~42ms of work per call, paid **twice per client** (api + sync channel). Root cause: `create_tls_channel` builds a fresh tonic `Endpoint` on every call, and `ClientTlsConfig::with_enabled_roots()` makes tonic call `rustls_native_certs::load_native_certs()` and parse the **entire OS trust store** into a rustls `ClientConfig` each time. On macOS that read is serialized through Security.framework (~40ms). 200 calls × ~40ms ≈ the ~9–19s we were seeing. ## Fix **`xmtp_api_grpc`:** cache the built `Endpoint` per `(host, rate_limit)`. The endpoint owns an `Arc<ClientConfig>` with the already-parsed roots, so `connect_lazy()` on a cheap clone still hands **every client its own connection** — no change to connection topology or trust semantics — while the native-cert load happens **once per host**. The lock is held across the one-time build so concurrent first-callers load certs exactly once. This benefits every platform that creates clients (node/mobile too), not just xdbg. **`xdbg`:** parallelize identity loading (was a sequential `for` loop). Each client open is blocking SQLite/MLS work, so they're handed to Tokio's blocking pool via `tokio::task::spawn_blocking` and awaited together; `load_all_identities` / `load_n_identities` and their callers become `async`. ## Result `cargo xdbg -b dev generate -e message -a 1`, 100 clients: | | load time | |---|---| | before | ~19s | | after | **~185ms** | No new dependencies. No trust-store/behavior change. Each client still gets its own connection. ## Test plan - [x] `cargo fmt --check` - [x] `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` (full workspace, clean) - [x] manual: 100-client xdbg load ~19s → ~185ms, sends still succeed 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ## Problem > > Loading clients in xdbg was dominated by TLS setup, not DB or MLS work. Phase timing while loading 100 identities showed ~100% of the time inside `network.client_bundle()`, with a tell-tale monotonic +42ms-per-call stair-step — i.e. a global lock serializing ~42ms of work per call, paid **twice per client** (api + sync channel). > > Root cause: `create_tls_channel` builds a fresh tonic `Endpoint` on every call, and `ClientTlsConfig::with_enabled_roots()` makes tonic call `rustls_native_certs::load_native_certs()` and parse the **entire OS trust store** into a rustls `ClientConfig` each time. On macOS that read is serialized through Security.framework (~40ms). 200 calls × ~40ms ≈ the ~9–19s we were seeing. > > ## Fix > > **`xmtp_api_grpc`:** cache the built `Endpoint` per `(host, rate_limit)`. The endpoint owns an `Arc<ClientConfig>` with the already-parsed roots, so `connect_lazy()` on a cheap clone still hands **every client its own connection** — no change to connection topology or trust semantics — while the native-cert load happens **once per host**. The lock is held across the one-time build so concurrent first-callers load certs exactly once. This benefits every platform that creates clients (node/mobile too), not just xdbg. > > **`xdbg`:** parallelize identity loading (was a sequential `for` loop). Each client open is blocking SQLite/MLS work, so they're handed to Tokio's blocking pool via `tokio::task::spawn_blocking` and awaited together; `load_all_identities` / `load_n_identities` and their callers become `async`. > > ## Result > > `cargo xdbg -b dev generate -e message -a 1`, 100 clients: > > | | load time | > |---|---| > | before | ~19s | > | after | **~185ms** | > > No new dependencies. No trust-store/behavior change. Each client still gets its own connection. > > ## Test plan > - [x] `cargo fmt --check` > - [x] `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` (full workspace, clean) > - [x] manual: 100-client xdbg load ~19s → ~185ms, sends still succeed > > 🤖 Generated with [Claude Code](https://claude.com/claude-code) > #### Changes since #3703 opened > > - Converted `load_clients_parallel`, `load_all_identities`, and `load_n_identities` utilities to async functions [69570ca] > - Updated `GenerateMessages::new` constructor and `GenerateGroups::create_groups` workflow to async [69570ca] > - Updated `Generate::run` and `Sync::run` command methods to await async operations [69570ca] >  >  --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sable (#3706) ## Summary Makes `xmtp_mls` background-worker scheduling configurable to fix a **thundering-herd / stampede** problem: when many clients (250+) deploy at the same time, their workers tick in lockstep and fire a wall of simultaneous requests. Adds, surfaced through the **Node, WASM, and Mobile** SDKs: - **Configurable intervals** — a global default plus per-worker overrides (nanoseconds, consistent with the existing `ForkRecoveryOpts::worker_interval_ns`). - **Optional jitter** (`worker_interval_jitter_ns`) — a one-time random startup offset (de-syncs clients that booted together) plus per-tick jitter (keeps them from re-converging). Fully optional; `0`/absent = deterministic, byte-for-byte today's behavior. - **Per-worker enable/disable** — a unified `enabled` map. The legacy `device_sync_worker_mode` and `disable_commit_log_worker` toggles fold into the same map (still work unchanged); `disable_workers` remains the global kill-switch. All new config is opt-in: `WorkerConfig::default()` reproduces existing behavior exactly. No DB migration, no protocol change, no change to default cadences. ## Design Mirrors the in-tree `fork_recovery_opts` pattern: `WorkerConfig` is carried on `XmtpMlsLocalContext` and exposed via `XmtpSharedContext::worker_config()`. A new `xmtp_common::time::jittered_interval_stream(base, jitter)` adds the startup offset + per-tick jitter (`jitter == 0` short-circuits to the existing `interval_stream`, pulling no randomness). Each worker resolves its interval through a context helper instead of its hardcoded const, keeping the const as the fallback default. Workers wired: disappearing-messages, key-package-cleaner, pending-self-remove, commit-log, device-sync heartbeat. TaskRunner is event/dynamic-driven and left as-is. For commit-log, the existing fork-recovery `worker_interval_ns` override still wins where set. ## SDK surface (all fields optional) | Concept | Node | WASM | Mobile | |---|---|---|---| | Global interval | `workerConfig.defaultIntervalNs` | `workerConfig.defaultIntervalNs` | `FfiWorkerConfig.default_interval_ns` | | Jitter | `workerConfig.jitterNs` | `workerConfig.jitterNs` | `FfiWorkerConfig.jitter_ns` | | Per-worker interval | `workerConfig.workerIntervalsNs` | `workerConfig.workerIntervalsNs` | `FfiWorkerConfig.worker_intervals_ns` | | Per-worker disable | `workerConfig.disabledWorkers` | `workerConfig.disabledWorkers` | `FfiWorkerConfig.disabled_workers` | New `WorkerKind` / `FfiWorkerKind` enum mirrors the core `WorkerKind`. ## Testing - `xmtp_common`: `jittered_interval_stream` — zero-jitter equals base; jittered period stays within bounds; startup offset bounded. - `xmtp_mls`: `WorkerConfig::interval` precedence (override > default > const), zero-clamp, jitter carry, `worker_enabled` default-on; a disabled worker is not registered. - Existing worker + device-sync suites pass unchanged (defaults preserve behavior). Verified: `just lint-rust`, `just node lint`, `cargo nextest run -p xmtp_common time`, `cargo nextest run -p xmtp_mls worker`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… iOS/Android (#3707) Normalize Node and WASM 1.10.0 release notes to match the structure already used in ios/4.10.0.md and android/4.10.0.md. No content changes, structural and formatting fixes only. Changes: node-bindings/1.10.0.md: Add `## What's Changed` section header before intro text Add `## Breaking Changes` section header `### History sync is now manual` → moved under `## Breaking Changes` `### Upcoming deprecation of preferences.sync()` → moved under `## Breaking Changes` Add `## New Features` section header `### Manual history sync archive management` → moved under `## New Features` `### Archive-based backups` → moved under `## New Features` Add `## Bug Fixes` section header `### Performance improvement` → removed subheading, text moved under `## Bug Fixes` `**THIS IS A BREAKING CHANGE.**` → `**This is a breaking change.**` wasm-bindings/1.10.0.md: (identical structural changes as node-bindings/1.10.0.md) Test plan: Documentation-only change; no code paths affected.    > [!NOTE] > ### Normalize Node and WASM 1.10.0 release notes structure to match iOS/Android > Restructures the 1.10.0 release notes for [Node](https://github.com/xmtp/libxmtp/pull/3707/files#diff-883d74d020e02801e31065330f11fdbd52fd8892ed0e14c97ba4616d470e07df) and [WASM](https://github.com/xmtp/libxmtp/pull/3707/files#diff-1dd324ff90565158b66b4d4de5d306179d8970d17b88a2537449749d67fb51de) bindings to match the section layout used by the iOS and Android notes. Adds explicit `What's Changed`, `Breaking Changes`, `New Features`, and `Bug Fixes` headers, moves content into the appropriate sections, deduplicates the `preferences.sync()` deprecation notice, and normalizes minor formatting. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 83ce8bf. >  >

…3709) ## Summary Follow-up to #3706. `WorkerConfig` shipped with a single **global** `jitter_ns` that applied to *every* background worker. Consumers who only want to jitter one worker — e.g. scheduling the daily CommitLog fork-recovery worker with 6h jitter to avoid a fleet stampede — end up jittering the fast workers too. A 1s-interval worker (DisappearingMessages) inheriting up to 6h of jitter is effectively broken. This makes jitter **per-worker**, mirroring how interval already works via `interval_overrides`. ## Changes - **Core** (`crates/xmtp_mls/src/worker.rs`): remove `WorkerConfig.jitter_ns: Option<u64>`, add `jitter_overrides: HashMap<WorkerKind, u64>`. `interval()` resolves jitter per-kind; absent entry => `0` (deterministic). No global default jitter — that's the foot-gun being removed. - **Bindings** (node / wasm / mobile): drop the scalar `jitterNs` from `WorkerConfigOptions`, add a `workerJittersNs` array of `{ kind, jitterNs }` mirroring `workerIntervalsNs`. Generated TS surface: ```ts workerConfig.workerJittersNs?: Array<{ kind: WorkerKind; jitterNs: bigint }> ``` ## Breaking change Breaks the nightly-only `WorkerConfigOptions.jitterNs` field. No stable consumers; nightly breakage is expected. Downstream consumers set per-worker jitter instead, e.g.: ```ts workerJittersNs: [{ kind: "CommitLog", jitterNs: 21600n * 1_000_000_000n }] ``` ## Notes - `create_client` FFI signatures unchanged (still one `worker_config` param) — only the inside of `WorkerConfigOptions` changes, so no SDK-wrapper or caller churn. - `WorkerConfig::default()` behavior unchanged (empty maps, deterministic). No DB/protocol change. ## Testing - Core: per-kind jitter carried; jitter scoped per worker (jittered worker gets it, un-listed worker stays `0`). - `just lint-rust` ✓ · `just node lint` ✓ · all three bindings compile (`bindings_node`, `bindings_wasm` wasm32, `xmtpv3`) ✓ · `xmtp_mls worker_config` tests ✓ 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Replace global `jitter_ns` with per-worker `jitter_overrides` in `WorkerConfig` > - Removes the global `jitter_ns: Option<u64>` field from `WorkerConfig` and replaces it with `jitter_overrides: HashMap<WorkerKind, u64>`, so jitter is scoped per worker kind with a default of zero. > - Updates `WorkerConfig.interval()` to resolve jitter from the per-kind map instead of a single global value. > - Mirrors the change across all three binding layers: [worker_config.rs](https://github.com/xmtp/libxmtp/pull/3709/files#diff-3b21fab5bc1e30d4eb92b2d0e14461f1c2fd991164b484046e665363bb3fd48d) (mobile/FFI), [options.rs](https://github.com/xmtp/libxmtp/pull/3709/files#diff-5306d98c5fe6413b641310877f4657e782cd87bc1e7497e28c0478154fb062d7) (Node), and [client.rs](https://github.com/xmtp/libxmtp/pull/3709/files#diff-6fa64ca0f35092cf055d31985b6e071f8dda1fcac5f9f3216f813422411af7c8) (WASM), each replacing a single `jitter_ns` field with a `worker_jitters_ns` array of per-worker override objects. > - Behavioral Change: workers with no explicit jitter entry now receive `Duration::ZERO` instead of any previously configured global jitter; callers must migrate to per-worker overrides. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 187d245. >  >   --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

## Summary Adds a `worker_turn` tracing span to each background worker's per-turn unit of work, so telemetry/otel can report **per-worker turn latency and counts, sliced by worker kind**. **Why:** each worker's `run_tasks()`/`run()` is an internal infinite loop (`while intervals.next().await { ... }`), so the existing `worker_run` span is lifetime-long and idle-inclusive — useless for per-turn metrics. The fix attaches `#[tracing::instrument]` to each worker's actual per-turn method (the loop body), not the looping function. ## What changed A `worker_turn` span on all 6 workers' turn methods, all carrying the byte-identical `operation = "worker_turn"` (the spanmetrics aggregation key) and a `worker` dimension: | Worker | turn span site | |---|---| | CommitLog | `tick()` | | DisappearingMessages | `delete_expired_messages()` | | PendingSelfRemove | `remove_pending_remove_users()` | | KeyPackageCleaner | extracted `tick()` | | TaskRunner | `run_and_reschedule_task()` | | DeviceSync | extracted `handle_event()` | - `worker = ?self.kind()` (Debug) on the 5 method sites; `TaskRunner` uses the literal `"TaskRunner"` (its turn method is an associated fn with no `self`). Both formats equal the existing `worker_run` span's value, so the two span series correlate cleanly. - `skip_all` everywhere (workers hold a `Context` with DB/API handles). - Two behavior-preserving extractions to host the macro: `KeyPackageCleaner::tick()` and `DeviceSync::handle_event()`. The DeviceSync 20s `Tick` heartbeat is deliberately bypassed before dispatch so idle heartbeats open no span (and emit no log). ## Deliberately out of scope - Existing `worker_run` / `libxmtp_worker` spans left untouched (don't break existing dashboards). - Deep spans in `mls_sync.rs` (`sync`, `post_commit`, `publish_intents`, ...) not touched — they're shared by foreground user calls too, so there's no static worker to stamp. otel traces already carry the worker→deep-span lineage; deep-span dashboard slicing is a telemetry-repo concern. Design + plan included under `docs/superpowers/`. ## Test Plan - [x] `just check crate xmtp_mls` — compiles clean - [x] `just lint-rust` — clippy / fmt / hakari clean - [x] `just test v3 worker` — 29/29 worker tests pass (incl. all device_sync + disconnect-propagation) - [x] Invariant check: exactly 6 `operation = "worker_turn"` sites, byte-identical 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Add per-worker-turn tracing spans across MLS background workers > Adds `#[tracing::instrument]` spans with `worker` and `operation="worker_turn"` fields to each background worker's main processing method, enabling per-turn telemetry in distributed traces. > > - [`commit_log.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-523943065ed63a511a3a552e44cb0ba0a05fecab9163088d5588cb5c949e6a5d), [`disappearing_messages.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-93ae9a4a70ee163ee3ce898babe5119780fa34ca9d91b823ee0135fda32f1160), [`pending_self_remove.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-2c915110e23b0ac1caa8ae0026e6cf76cff97c528ff81f91eef1a26a5d421f81), and [`tasks.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-b3d01dc512a077cf4b625f516f8c36ea61978f070bcce720ed8d522065a268b6) each get a span on their primary tick/run method with no logic changes. > - [`key_package_cleaner.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-f474b9d0bca387a8ef826d7b34e811f8644323a8ac5584efc6eef034cfaeeb8d) extracts a new `tick` method to serve as the span boundary. > - [`device_sync/worker.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-3d1d107a7d85b7aaf674a2365c789f51bda6243c3a37caab478baa8672ec737e) introduces a `handle_event` method that wraps non-`Tick` events in a span with an additional `event` field; `Tick` events bypass the span to avoid noise. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 7d32c55. >  >

…t gating (phase 1) (#3710) ## Summary Phase 1 of an automated release flow for libxmtp: a **git-cliff–driven** always-open Release PR with auto-generated changelog, a **GitHub Release "hub"** that fans out to per-SDK nightly releases with track-aware versions, and a **cross-test gate** that blocks nightlies unless cross-talk + cross-version are green for the exact release SHA. ### Why git-cliff (not release-plz) This started as "use release-plz." Investigation found release-plz's Release-PR **cannot** generate a changelog without also bumping `Cargo.toml`'s version — and our workspace version is the **runtime MLS protocol identity** (`PROPOSALS_MIN_PROTOCOL_VERSION <= CARGO_PKG_VERSION` is CI-enforced; flipping it backward bricks `enable_proposals`). So the version must never be auto-bumped. **git-cliff** (the engine release-plz uses internally) gives the same UX — always-open changelog PR + a print-only version oracle — while never touching any manifest. No crates.io publish. ### What's in this PR - **git-cliff seed:** `cliff.toml` (changelog config, anchored to stable `v*` tags) + `.github/workflows/release-pr.yml` (always-open changelog-only Release PR) + git-cliff in the Nix devshell. - **SDK registry extension** (`dev/release-tools`): `versionTrack` (`follows-libxmtp` vs `independent`), notes globs, `releaseWorkflow`, `channels` per SDK — the extensibility spine (xmtp-js later = 2 entries, no engine change). - **Version-track engine:** `resolve-sdk-version` (follows-libxmtp takes the 1.x number verbatim; independent mirrors the bump-kind onto its own base, e.g. iOS 4.10 → 4.11), preview-next prerelease versioning (semver-correct nightlies), git-cliff oracle parsing (`pending-version`). - **Cross-test gate:** `cross-test-gate` (fail-closed, SHA-exact) + nightly wiring in `release.yml` (gate → pending → fan-out → hub), with an idempotent hub GitHub Release carrying an SDK-version rollup. ### Not in this PR (future phases) - Stable release-branch flow, backports, RC (phase 2). - xmtp-js (node-sdk 6.x, browser-sdk 7.x) onboarding (phase 3). - Wiring per-SDK nightly *filtered* notes onto the hub (thin follow-up; globs are already data). The runtime/manifest version (`1.11.0-dev`) is intentionally **untouched** — verified across the whole diff. ## Test Plan - [x] `dev/release-tools`: `yarn typecheck` clean, `yarn test` 158 passed - [x] All 5 workflow YAML files validate - [x] git-cliff `--bumped-version` oracle verified print-only against the real repo (resolves to last stable tag, writes nothing) - [x] CLI smoke tests: `pending-version`, `resolve-sdk-version` (ios → 4.11.0), `list-sdks`, `cross-test-gate` (pass + fail cases, exit 0 on skip) - [ ] **Manual (post-merge / dispatch on a test branch):** trigger the Release workflow nightly path and confirm gate-skip behavior, pending version/kind, per-SDK preview-next versions, and the hub release with rollup. (CI workflows can't be fully unit-tested.) 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Add git-cliff nightly release orchestration with cross-test gating and auto-generated changelogs > - Adds a gate job in [release.yml](.github/workflows/release.yml) that blocks nightly releases unless cross-test CI passes for the exact commit SHA, using the new `xmtp-release cross-test-gate` command. > - Adds a pending job that uses `git-cliff --bump --context` to detect unreleased conventional commits; skips the nightly entirely if nothing is pending. > - Nightly SDK release workflows ([iOS](.github/workflows/release-ios.yml), [Android](.github/workflows/release-android.yml), [Node](.github/workflows/release-node.yml), [WASM](.github/workflows/release-wasm.yml)) now require `pending-version` and `pending-kind` inputs and resolve versions via the new `xmtp-release resolve-sdk-version` command. > - Adds a hub job that creates or updates a prerelease GitHub Release summarizing all SDK versions shipped in a nightly run, skipped if no SDK shipped. > - Adds [release-pr.yml](.github/workflows/release-pr.yml) to maintain a persistent open PR that updates `CHANGELOG.md` on every push to main. > - Adds new release-tools CLI commands (`pending-version`, `resolve-sdk-version`, `list-sdks`, `cross-test-gate`) and supporting library modules for gate evaluation, git-cliff context parsing, and track-aware SDK version resolution. > - Risk: nightly releases now fail closed — missing cross-test runs or absent pending commits will skip or abort the release rather than proceeding. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 988e379. >  >   --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

## Summary An already-processed welcome (duplicate delivery, or a resume past our cursor for a group we're already in) is an **expected, gracefully-handled** condition — the error is returned, the cursor is handled, the welcome is skipped. But it was logged at `tracing::error!`. The otel tracing layer (`tracing_opentelemetry`, default config) sets a span's status to `Error` for any `error!` event fired while that span is active — it's event-driven, not return-value-driven. So this benign condition marked the enclosing worker/sync span `status:error` and **inflated the telemetry error rate** even though nothing actually failed. ## Fix Downgrade both emitters of this condition from `error!` to `warn!`: - **`groups/welcomes/xmtp_welcome.rs`** — the upstream `"Skipping already processed welcome"` log inside `.process()`. This is the one that actually fires inside the worker span and poisons its status; it's the dominant source for the duplicate-delivery case. - **`groups/welcome_sync.rs`** — the downstream catch-all `Err(err)` arm, which would otherwise log the returned `WelcomeAlreadyProcessed` at `error!` for the already-in-group path. This mirrors the existing storage-duplicate handling a few lines up, which already logs at `warn!`. **Behavior is otherwise unchanged:** the error is still returned, cursor handling is untouched. This is a telemetry/log-level-only change. ## Test Plan - [x] `just check crate xmtp_mls` — compiles clean - [x] `just lint-rust` — clippy / fmt / hakari clean - [x] `just test v3 welcome` — 60/60 welcome tests pass - [x] Confirmed no `error!` remains for this condition (both sites now `warn!`) 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Log already-processed welcomes at warn instead of error in `process_new_welcome` > Encountering an already-processed welcome is an expected, non-retryable condition, so logging it at `error` level produced noise. Both [`welcome_sync.rs`](https://github.com/xmtp/libxmtp/pull/3714/files#diff-5ad5d6ca4506c4960727c32b22b9168a128f14af4d61497cd168ba5c80fb43f5) and [`xmtp_welcome.rs`](https://github.com/xmtp/libxmtp/pull/3714/files#diff-7ad79af03cc69f17888ed323b42554f52c5c706674e76a981c4fd4663464bb4e) now use `warn` for this case. Error return semantics are unchanged. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 43d7550. >  >

…3713) Two changelog improvements, both stemming from the scruffy first generated Release PR (#3712). ## 1. Tidy the generated changelog (`cliff.toml`) **Body-dump scruff** — the template rendered `commit.message` (full message = subject + body). GitHub squash-merges fill the commit body with the PR description, so each merge dumped its whole PR body into the changelog: Macroscope HTML-comment summary blocks, `Co-authored-by:` trailers, even a pasted Nix build-error log (~316 of #3712's 461 lines). ```diff -{{ commit.message | upper_first }} +{{ commit.message | split(pat="\n") | first | trim | upper_first }} ``` **Fake per-scope groups** — non-conventional commits (e.g. `xmtp_proto: …`, `bindings/mobile: …`) made git-cliff invent junk section headers like `Xmtp_proto` / `Bindings/mobile`. A catch-all collects them into one `Other` group: ```diff + { message = ".*", group = "Other" }, ``` Result (regenerated from `main`): group headers are exactly **Features · Bug Fixes · Performance · Documentation · Other**, every entry a clean subject line, zero scruff. ## 2. Per-nightly changelog in the hub release (`release.yml`) Each nightly hub GitHub Release now appends a **"Changes since the last nightly"** section — the git-cliff delta over `<previous-nightly-tag>..HEAD` (`--strip header`), computed before this run is tagged so the newest existing `v*-nightly.*` tag is genuinely the prior one. First-ever nightly (no prior nightly tag) falls back to since-last-stable. Empty output is dropped. Adds git-cliff to the hub job. > Uses an explicit `<tag>..HEAD` range rather than the config's stable-anchored `tag_pattern`, so it diffs against the last *nightly*, not the last stable. Relies on the §1 fixes for clean output. ## Verification - Regenerated changelog from `main`: clean, 5 well-formed groups, no scruff. - Simulated the per-nightly delta (`<tag>..HEAD --strip header`): clean grouped output, 22 bullets, no header. - All workflow YAML validated. Once merged, the open Release PR (#3712) regenerates clean on the next push to main, and the next real nightly carries a proper since-last-nightly changelog. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Tidy generated changelog to show subject-only commits and group non-conventional commits into 'Other' > - Updates [cliff.toml](https://github.com/xmtp/libxmtp/pull/3713/files#diff-e1372c8b03c40942b5d828a90975054cb8aaed3b38189f434396f922ec41a584) to trim commit messages to their first line only and capitalise the first character, removing multi-line noise from changelogs. > - Adds a catch-all parser rule in [cliff.toml](https://github.com/xmtp/libxmtp/pull/3713/files#diff-e1372c8b03c40942b5d828a90975054cb8aaed3b38189f434396f922ec41a584) that places non-conventional commits into an 'Other' group instead of creating per-scope groups. > - Updates [release.yml](https://github.com/xmtp/libxmtp/pull/3713/files#diff-87db21a973eed4fef5f32b267aa60fcee5cbdf03c67fafdc2a9b553bb0b15f34) to install `git-cliff`, compute a delta changelog from the previous nightly tag to `HEAD`, and append a > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized eecd171. >  >   Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…iber + otel) + node migration (#3708) ## What Introduces a new `crates/xmtp_logging` crate as the **sole owner** of the `tracing-subscriber` and `opentelemetry` dependencies, and migrates the **node** binding onto it. All production logging/tracing pipeline construction, the OTLP trace exporter, and the runtime-control handle now live in one place. ## Why Logging/telemetry setup was scattered across `xmtp_common` (which sat near the bottom of the dep graph, so every crate that touched it pulled in `tracing-subscriber` + the full OpenTelemetry stack) and re-implemented per binding. Consolidating into a dedicated crate gives us: - one builder + one `LoggingConfig` record at the FFI boundary - a `LoggingHandle` for runtime control (set level, enable/disable file logging, enable telemetry, flush) without re-installing the subscriber - `xmtp_common` no longer carries `opentelemetry*` / `tracing-subscriber` as production deps (test/bench-only now) ## What's in the crate - `config.rs` — canonical `Level` / `Rotation` / `ProcessType` enums and `FileConfig` / `TelemetryConfig` / `LoggingConfig` records. - `filter.rs` — `filter_directive` (moved from `xmtp_common`). - `telemetry.rs` — OTLP span exporter + `TelemetryGuard` (native-only, `cfg(not(wasm32))`; always compiled, enabled at runtime when an endpoint is set — no Rust feature flag). - `layers/` — stdout/json fmt, rolling file, per-OS native (logcat/os_log/ server-compact), and browser console/perf layers. - `builder.rs` — `XmtpLoggingBuilder::from_config(cfg).install()` installs the global subscriber. - `handle.rs` — `LoggingHandle` with reloadable level / file / telemetry slots. - test subscriber (the obsolete `tracing-forest` tree logger and `TestLogReplace` / `REPLACE_IDS` are removed — OTEL spans replace them). ## Consumer changes - **node** — fully migrated: `init_logging` builds a `LoggingConfig` from `LogOptions` and installs once; `flush_telemetry` delegates to the handle. Drops the direct `tracing-subscriber` dep. - **mobile / wasm / xmtp_debug** — repointed `filter_directive` to `xmtp_logging` and dropped the `xmtp_common` `logging` feature. These keep their own subscribers for now; fully migrating them to the builder is a follow-up. ## Verification Native + wasm build, clippy, rustfmt, taplo, hakari (regenerated — drops `tracing-log`, the removed transitive), and unit tests (`xmtp_logging` + `xmtp_common`) all pass locally. ## Follow-ups (separate PRs) - Migrate wasm + mobile bindings onto the builder/handle (not just `filter_directive`). - Migrate apps. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Extract logging and telemetry into a dedicated `xmtp_logging` crate as sole owner of tracing-subscriber and OpenTelemetry > - Creates a new `xmtp_logging` crate with a fluent builder API (`XmtpLoggingBuilder`) that installs a global tracing subscriber with reloadable log level, optional rolling-file output, and optional OTLP telemetry export. > - Migrates all logging/telemetry code out of `xmtp_common`, removing the `logging` feature flag from that crate; all consumers now depend on `xmtp_logging` directly. > - Updates the Node binding's `init_logging` to use `xmtp_logging` via a process-global `LOGGING_HANDLE`, and changes `flush_telemetry` to call `handle.flush()` instead of shutting down the exporter. > - Removes `TestLogReplace` and per-instance log-ID replacement from `xmtp_common::test` and `Tester`; test harnesses use `xmtp_logging::test_logging` instead. > - Risk: `TestLogReplace` log-ID substitution is permanently removed from test output; tests that relied on named replacements in log lines will no longer see them. > >  > #### Changes since #3708 opened > > - Replaced `once_cell` dependency with standard library synchronization primitives [6d0ab95] > - Standardized workspace dependency notation across Cargo.toml files [6d0ab95] > - Updated Cargo.lock [6d0ab95] >  > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized b45b18f. >  > >

…ry-run gate (#3717) ## Why On 2026-06-04 the nightly pipeline published a bogus **libxmtp `2.0.0`** and **iOS/Android `5.0.0`**, and the Android `5.0.0-nightly` is now **permanently on Maven Central** (immutable). Root causes: 1. **Non-deterministic git-cliff.** The oracle installed git-cliff via `cargo binstall git-cliff || cargo install git-cliff` — floating latest. At the same commit, CI computed a **major** bump while local computed minor. 2. **Nightly auto-majored.** A `BREAKING CHANGE:` footer drove a major bump; a nightly should never auto-major. 3. **No dry-run / preview gate.** Nothing sat between "compute version" and the irreversible publishes. ## What (three layers of protection) - **COMPUTE** — new pure `capBumpKind` in `xmtp-release` clamps a nightly `major` → `minor` (recomputing the version from the last stable tag). Exposed via `pending-version --max-bump minor`. Unit-tested. - **ASSERT** — the plan job fails if the bump is still `major` after the cap (defense-in-depth; every release-*/hub job `needs:` the plan, so a failure blocks all publishes). Plus a human-readable release plan printed to the step summary. - **PREVIEW** — git-cliff is pinned to **2.13.1**; a `dry-run` dispatch input runs the full plan but skips every publish/tag; and `gate`+`pending` are extracted into a reusable `release-gate-plan.yml` so a new **`nightly-dry-run.yml`** can preview the real nightly version logic on demand (no publish jobs exist in it — structurally cannot publish). ## Verification (against the exact v5 failure mode) Reproduced locally with the pinned git-cliff 2.13.1 against real upstream `main`: | case | result | | --- | --- | | real current main, capped | `1.11.0` / minor ✓ | | v5 mode (git-cliff says major), **no cap** | `2.0.0` / major — reproduces the bug | | v5 mode **with `--max-bump minor`** | **`1.11.0` / minor** — clamped ✓ | | assertion if a major slipped through | fires `::error:: refusing to publish`, exit 1 ✓ | 167 unit tests pass; typecheck clean; all 3 workflow YAMLs valid. The refactor was verified to NOT break non-nightly (dev/rc/final) releases (a skipped `plan` dependency doesn't block them — `!cancelled()` + `release-type != 'nightly'` short-circuit). > **Post-merge step:** `workflow_dispatch` requires the workflow on the default branch, so the `nightly-dry-run.yml` GH-Actions wiring gets its full end-to-end check by dispatching it on `main` after merge (safe — no publish jobs) before the next cron nightly. ## Out of scope - Recovering the already-published `5.0.0-nightly` (impossible — Maven Central is immutable) and the resulting Android-v5 / iOS-sync decision (team call). - The larger release-process simplification (unify version lines / libsignal model / SPM-only) — tracked separately, needs team buy-in. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Make nightly release version deterministic, cap bump at minor, and add dry-run gate > - Extracts release gating and version computation into a reusable [release-gate-plan.yml](.github/workflows/release-gate-plan.yml) workflow that resolves the exact SHA, checks cross-test status, and computes a pending version capped at minor bump. > - Adds `--max-bump` to the `pending-version` CLI command via a new `capBumpKind` utility in [sdk-version.ts](dev/release-tools/src/lib/sdk-version.ts) that clamps the bump kind and recomputes the version from the last-shipped baseline. > - Replaces separate `gate` and `pending` jobs in [release.yml](.github/workflows/release.yml) with a single `plan` job; all release jobs now read `pending-version` and `pending-kind` from plan outputs and pin git-cliff to 2.13.1. > - Adds a new manual-only [nightly-dry-run.yml](.github/workflows/nightly-dry-run.yml) workflow and an optional `dry-run` input to `release.yml` that prints the computed plan without publishing, tagging, or merging. > - Risk: the workflow now fails hard if a major bump slips through after capping, which will block nightly releases until the bump is addressed. > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized c775131. >  >   --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

) ## What The native gRPC HTTP/2 keepalive parameters are now overridable via environment variables, read once per process in `apply_channel_options`. **Defaults are unchanged** (fast detection). ## Why herald-lite #70 (`UNAVAILABLE` / status 14 disconnects on herald dev) was root-caused to libxmtp's own client-side h2 keepalive: ping every 16s, demand a PONG within 10s, even idle. On a cross-cloud Fly→AWS path under load the PONG starves past the 10s deadline, hyper drops the connection, tonic maps it to `UNAVAILABLE`, and the retry layer masks it. A **global** relax is wrong: mobile wants *fast* dead-connection detection, herald wants *forgiving*. (A global 30/30 broke `client::tests::should_reconnect`, which deliberately relies on ~26s keepalive detection — its comment cites "16s interval + 10s timeout".) So the defaults stay fast and a high-latency deployment opts out via env. ## Env vars (all optional; unset = default) | env var | overrides | default | |---|---|---| | `XMTP_GRPC_KEEPALIVE_INTERVAL_SECS` | `http2_keep_alive_interval` | 16 | | `XMTP_GRPC_KEEPALIVE_TIMEOUT_SECS` | `keep_alive_timeout` | 10 | | `XMTP_GRPC_TCP_KEEPALIVE_SECS` | `tcp_keepalive` (0 disables) | 16 | | `XMTP_GRPC_KEEPALIVE_WHILE_IDLE` | `keep_alive_while_idle` | true | Read once via a `LazyLock` (servers set them before start), so every connection after the first is a struct copy, not an env read. Native-only (wasm fetch has no h2 knobs). herald rides `@xmtp/node-sdk`, so env vars need **no** binding/SDK change — it just sets them in its Fly deploy.

macroscopeapp · 2026-06-04T19:53:27Z

+        &self,
+        policy: xmtp_proto::xmtp::mls::message_contents::ExternalCommitPolicyV1,
+    ) -> Result<(), GroupError> {
+        self.ensure_not_paused().await?;


🟡 Medium groups/mod.rs:2139

set_external_commit_policy and set_allow_external_commit do not check ConversationType::Dm before queuing the intent, unlike every other metadata-mutating API in this file (update_group_name, update_app_data, update_group_description, update_group_image_url_square, update_permission_policy, update_admin_list). A migrated DM can therefore have EXTERNAL_COMMIT_POLICY persisted, enabling QR/external-commit state on a conversation type the codebase explicitly forbids elsewhere (e.g., create_external_invite documents DM prohibition). Consider adding the standard DM guard to reject the operation early.

Also found in 1 other location(s)

crates/xmtp_mls/src/groups/external_invite.rs:107

create_external_invite never checks the group's conversation_type before exporting a QR invite. The PR explicitly requires DM groups to be rejected, but this implementation only checks pause state and policy presence. If a DM group carries an EXTERNAL_COMMIT_POLICY entry (for example from malformed or legacy state), this method will still mint an external invite for that DM, violating the intended restriction.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/xmtp_mls/src/groups/mod.rs around line 2139: `set_external_commit_policy` and `set_allow_external_commit` do not check `ConversationType::Dm` before queuing the intent, unlike every other metadata-mutating API in this file (`update_group_name`, `update_app_data`, `update_group_description`, `update_group_image_url_square`, `update_permission_policy`, `update_admin_list`). A migrated DM can therefore have `EXTERNAL_COMMIT_POLICY` persisted, enabling QR/external-commit state on a conversation type the codebase explicitly forbids elsewhere (e.g., `create_external_invite` documents DM prohibition). Consider adding the standard DM guard to reject the operation early. Also found in 1 other location(s): - crates/xmtp_mls/src/groups/external_invite.rs:107 -- `create_external_invite` never checks the group's `conversation_type` before exporting a QR invite. The PR explicitly requires DM groups to be rejected, but this implementation only checks pause state and policy presence. If a DM group carries an `EXTERNAL_COMMIT_POLICY` entry (for example from malformed or legacy state), this method will still mint an external invite for that DM, violating the intended restriction.

macroscopeapp · 2026-06-04T19:53:27Z

+    for mutation in &delta.mutations {
+        let key = match mutation {
+            TlsMapMutation::Insert { key, .. }
+            | TlsMapMutation::Update { key, .. }
+            | TlsMapMutation::Delete { key } => key,
+        };
+        if key != &joiner_id {
+            return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope);
+        }
+    }


🟡 Medium groups/validated_commit.rs:2775

enforce_app_data_update_scope accepts TlsMapMutation::Update and TlsMapMutation::Delete mutations for the joiner's inbox id, so an external commit can pass validation while deleting or mutating the joiner's GROUP_MEMBERSHIP entry instead of inserting it. This violates the function's documented contract that "the joiner is authorized to insert their own entry — but ONLY their own" and leaves the tree membership and AppData membership out of sync. Consider rejecting Update and Delete mutations to ensure the joiner can only insert their own entry.

- for mutation in &delta.mutations { - let key = match mutation { - TlsMapMutation::Insert { key, .. } - | TlsMapMutation::Update { key, .. } - | TlsMapMutation::Delete { key } => key, - }; - if key != &joiner_id { - return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope); - } + for mutation in &delta.mutations { + let key = match mutation { + TlsMapMutation::Insert { key, .. } => key, + _ => { + return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope); + } + }; + if key != &joiner_id { + return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope); + }

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/xmtp_mls/src/groups/validated_commit.rs around lines 2775-2784: `enforce_app_data_update_scope` accepts `TlsMapMutation::Update` and `TlsMapMutation::Delete` mutations for the joiner's inbox id, so an external commit can pass validation while deleting or mutating the joiner's `GROUP_MEMBERSHIP` entry instead of inserting it. This violates the function's documented contract that "the joiner is authorized to insert their own entry — but ONLY their own" and leaves the tree membership and AppData membership out of sync. Consider rejecting `Update` and `Delete` mutations to ensure the joiner can only insert their own entry.

macroscopeapp · 2026-06-04T19:53:27Z

+    /// Sugar wrapper over [`MlsGroup::set_external_commit_policy`] that
+    /// only flips the master switch and leaves the time-window controls
+    /// at their defaults (no automatic expiry / no staleness bound).
+    pub async fn set_allow_external_commit(&self, allowed: bool) -> Result<(), GroupError> {


🟡 Medium groups/mod.rs:2171

set_allow_external_commit(true) unconditionally generates new symmetric_key and external_group_id values via generate_symmetric_key() and generate_external_group_id(). When external commits are already enabled, this call rotates the invite material and invalidates previously issued QR codes, deep links, and uploaded blobs indexed by the old external_group_id. Consider reading the existing policy first and re-using its key/id when allow_external_commit is already true to preserve stability across calls.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/xmtp_mls/src/groups/mod.rs around line 2171: `set_allow_external_commit(true)` unconditionally generates new `symmetric_key` and `external_group_id` values via `generate_symmetric_key()` and `generate_external_group_id()`. When external commits are already enabled, this call rotates the invite material and invalidates previously issued QR codes, deep links, and uploaded blobs indexed by the old `external_group_id`. Consider reading the existing policy first and re-using its key/id when `allow_external_commit` is already `true` to preserve stability across calls.

macroscopeapp · 2026-06-04T19:53:27Z

+/// explicit nonce management is required (e.g. tests). Reusing a `(key,
+/// nonce)` pair across distinct ciphertexts breaks ChaCha20Poly1305's security
+/// argument.
+pub fn wrap_group_info_with_nonce(


🟠 High invite/encrypted_group_info.rs:92

wrap_group_info_with_nonce encrypts plaintext with AEAD but writes epoch, group_state_hash, and expires_at_ns into EncryptedGroupInfoBlobV1 in cleartext without including them as AEAD associated data. An attacker can tamper with expires_at_ns (e.g., changing an expired timestamp to 0 or a far-future value) and unwrap_group_info will accept the modified blob because the AEAD tag only authenticates ciphertext. This allows expired QR invites to remain valid and breaks integrity of the envelope metadata. Consider including these fields as AEAD associated data so tampering is detected during decryption.

Also found in 1 other location(s)

crates/xmtp_mls_common/src/mls_ext/payload_encryption.rs:139

unwrap_payload_hpke silently accepts a missing secondary ciphertext at line 139 by returning vec![] whenever wrapped_secondary_payload is empty. Because the presence of the second ciphertext is not authenticated anywhere in wrapped_payload, an attacker can strip wrapped_secondary_payload in transit and the receiver will still accept the first plaintext as valid while seeing an empty secondary payload instead of a tamper error. Any protocol field carried in the secondary blob loses integrity protection for its presence.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/xmtp_mls_common/src/invite/encrypted_group_info.rs around line 92: `wrap_group_info_with_nonce` encrypts `plaintext` with AEAD but writes `epoch`, `group_state_hash`, and `expires_at_ns` into `EncryptedGroupInfoBlobV1` in cleartext without including them as AEAD associated data. An attacker can tamper with `expires_at_ns` (e.g., changing an expired timestamp to `0` or a far-future value) and `unwrap_group_info` will accept the modified blob because the AEAD tag only authenticates `ciphertext`. This allows expired QR invites to remain valid and breaks integrity of the envelope metadata. Consider including these fields as AEAD associated data so tampering is detected during decryption. Also found in 1 other location(s): - crates/xmtp_mls_common/src/mls_ext/payload_encryption.rs:139 -- `unwrap_payload_hpke` silently accepts a missing secondary ciphertext at line `139` by returning `vec![]` whenever `wrapped_secondary_payload` is empty. Because the presence of the second ciphertext is not authenticated anywhere in `wrapped_payload`, an attacker can strip `wrapped_secondary_payload` in transit and the receiver will still accept the first plaintext as valid while seeing an empty secondary payload instead of a tamper error. Any protocol field carried in the secondary blob loses integrity protection for its presence.

macroscopeapp · 2026-06-04T19:53:27Z

+        let joiner_leaf = staged_commit
+            .update_path_leaf_node()
+            .ok_or(CommitValidationError::ExternalCommitMissingPathLeaf)?;
+        let joiner_inbox_id = inbox_id_from_credential(joiner_leaf.credential())?;


🔴 Critical groups/validated_commit.rs:986

from_external_commit extracts the joiner's inbox id from the credential (line 986) and checks that every Add proposal's credential string matches it (line 1000), but it never verifies that the joiner's installation key (or any added installation key) is actually registered in the association state for that inbox id. In contrast, from_staged_commit performs this check (lines 760-778) by looking up each participant's installation_id in the AssociationState at the relevant sequence_id. Without this lookup, a crafted external commit can claim any inbox id and join (or add installations) with signature keys that have no legitimate association to that inbox, bypassing XMTP's installation↔inbox binding. Consider adding the same association-state verification that from_staged_commit performs — this will require threading in a context (or equivalent identity-lookup capability) and making the function async.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/xmtp_mls/src/groups/validated_commit.rs around line 986: `from_external_commit` extracts the joiner's inbox id from the credential (line 986) and checks that every `Add` proposal's credential string matches it (line 1000), but it never verifies that the joiner's installation key (or any added installation key) is actually registered in the association state for that inbox id. In contrast, `from_staged_commit` performs this check (lines 760-778) by looking up each participant's `installation_id` in the `AssociationState` at the relevant `sequence_id`. Without this lookup, a crafted external commit can claim any inbox id and join (or add installations) with signature keys that have no legitimate association to that inbox, bypassing XMTP's installation↔inbox binding. Consider adding the same association-state verification that `from_staged_commit` performs — this will require threading in a `context` (or equivalent identity-lookup capability) and making the function async.

macroscopeapp · 2026-06-04T19:53:27Z

+
+            let epoch = openmls_group.epoch().as_u64();
+            let group_state_hash = openmls_group.epoch_authenticator().as_slice().to_vec();
+            let blob_expires_at_ns = compute_blob_expiry(


🟢 Low groups/external_invite.rs:149

create_external_invite computes blob_expires_at_ns using xmtp_common::time::now_ns() before returning to the caller, but the caller uploads encrypted_group_info only after this method returns. When policy.expire_in_ns is used, the relative expiry is anchored to invite creation time instead of upload time, so any delay between creation and upload can make a freshly uploaded blob already stale or expired.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/xmtp_mls/src/groups/external_invite.rs around line 149: `create_external_invite` computes `blob_expires_at_ns` using `xmtp_common::time::now_ns()` before returning to the caller, but the caller uploads `encrypted_group_info` only after this method returns. When `policy.expire_in_ns` is used, the relative expiry is anchored to invite creation time instead of upload time, so any delay between creation and upload can make a freshly uploaded blob already stale or expired.

…lpers) (#3670) New module `xmtp_mls_common::invite::payload` providing pure-protocol helpers around the `ExternalInvitePayload` proto: - `generate_key()` / `generate_nonce()` — CSPRNG-backed material generation. - `group_id_hash(&[u8])` — sha256(group_id) for the service lookup key. - `is_expired(payload, now_ns)` — ns-precision expiry check (zero = no expiry). - `validate_version(payload)` — rejects zero and future versions.

…ion (#3669) ## Summary Pure-refactor PR — no behavior change. - Moves `crates/xmtp_mls/src/groups/mls_ext/welcome_wrapper.rs` to `crates/xmtp_mls_common/src/mls_ext/payload_encryption.rs`. - Renames the four helpers from welcome-specific names to generic ones: - `wrap_welcome` -> `wrap_payload_hpke` - `unwrap_welcome` -> `unwrap_payload_hpke` - `wrap_welcome_symmetric` -> `wrap_payload_symmetric` - `unwrap_welcome_symmetric` -> `unwrap_payload_symmetric` - The HPKE label becomes a parameter (was hardcoded to the welcome label). Welcome callers still pass `xmtp_configuration::WELCOME_HPKE_LABEL`; the QR-invite path (downstream PR) will pass `XMTP_EXTERNAL_INVITE_LABEL`. - All call sites in xmtp_mls updated to import from the new path. ## Why this PR Sets up the encryption primitive used by the QR-invite (external commit) flow. The functions are already generic — they don't depend on welcome semantics. Lifting them to `xmtp_mls_common` lets the invite module use them without pulling in `xmtp_mls`. ## Preserved XWING codepoint compat (#3662) The relocated `wrap_payload_hpke` / `unwrap_payload_hpke` still call `wrapper_algorithm.to_hpke_config()`, which lives in `xmtp_id::WrapperAlgorithm` and pins `XWingMLKEM768Draft6` to `KemAlgorithm::XWingDraft06Obsolete` (codepoint `0x004D`) for wire compat with v1.9 / v1.10 clients. Workspace `Cargo.toml`'s `[patch.crates-io]` block pinning the xmtp hpke-rs fork is untouched. Verified post-rebase: `round_trip_xwing_hpke` test passes.

## Why Creating an XMTP client required **two gRPC connections**. They were not a message/payer or primary/secondary split — they were two `tonic::Channel`s to the **same host**, wrapping the **same** client type, built from the **same** backend config. The second ("sync") connection was introduced on the assumption that a dedicated sync channel buys isolation; that isolation was never realized (same endpoint, no stream/concurrency separation), and the second client was used by exactly two unary reads. The redundancy cost every client an extra TCP connection + file descriptor (the iOS SDK even kept a whole second client cache and dialed the backend twice), with no measured benefit. This collapses the design to a single connection. > Note: the d14n payer/message split (`xmtp_api_d14n`) is a separate internal concern and is **not** touched here. ## What changed Commit-by-commit (each compiles + is self-contained): 1. **core** — remove `sync_api_client` field + `sync_api()` accessor from `XmtpMlsLocalContext`/`XmtpSharedContext`; collapse `ClientBuilder::api_clients(api, sync_api)` → `api_client(api)`; the two `.sync_api()` readers (`query_group_messages`, `load_identity_updates`) now use `.api()`. 2. **tests/benches** — every `.api_clients(a, b)` call site reduced to `.api_client(a)`. 3. **apps** — cli + xmtp_debug build one client. 4. **node + wasm** — `create_client_inner` drops its second client param; bindings build the backend once. 5. **mobile FFI** — **breaking**: `create_client` drops its second `sync_api: Arc<XmtpApiClient>` argument (plus its tests/bench call sites). 6. **sdks** — iOS + Android connect once, cache once, pass once. Android `xmtpv3.kt` regenerated. ## Verification - `just check`, `just wasm check`, `just lint` (rust), `just node lint` — all green locally. - `just android check` — green (bindings regenerated). - Pre-existing markdown-lint failures in untouched repo docs (`sdks/ios/README.md`, `VALIDATION.md`, `TEST_SCENARIOS.md`) were left alone — not part of this change. ## ⚠️ Follow-up required before merge - **iOS `xmtpv3.swift` must be regenerated on macOS.** The mobile FFI signature changed, but the iOS xcframework + Swift glue can only be regenerated on macOS/Xcode (`ios-xcframeworks-fast` has no Linux build), so the generated `xmtpv3.swift` in this branch still shows the old 2-arg signature. The hand-written `Client.swift` already targets the new 1-arg signature. Run `./sdks/ios/dev/bindings` on a Mac then `just ios check`. - Minor: `sdks/ios/dev/.setup` resolves the repo root via `git rev-parse` with no jj fallback (Android's already has one), which breaks iOS regen from a jj workspace. 🤖 Generated with [Claude Code](https://claude.com/claude-code)    > [!NOTE] > ### Collapse two gRPC connections into one across all client bindings > - Removes the dedicated sync API client from `ClientBuilder`, `XmtpMlsLocalContext`, and the `XmtpSharedContext` trait in [builder.rs](https://github.com/xmtp/libxmtp/pull/3721/files#diff-112b6dd965ad29739b36ccfb7f1283b1dbc20ec772591a4487a0898c6c5650f6) and [context.rs](https://github.com/xmtp/libxmtp/pull/3721/files#diff-ca0115cb0e326f7fa5c9bc660a4a1c1aff68c42a4c048eb79529bba4575aa909); all network calls now route through a single API client. > - Replaces the `.api_clients(api, sync_api)` builder method with `.api_client(api)` across all call sites: mobile FFI, Node, WASM, Android, iOS, and test utilities. > - Operations previously routed through the sync client (group message queries, identity update fetches) now use the primary API client. > - Behavioral Change: the `create_client_does_not_hit_network` test expectation for `get_identity_updates_v2` increases from 2 to 3 because all identity-update reads are now counted on the single client. > >  > #### Changes since #3721 opened > > - Modified `FfiXmtpClient` to support graceful shutdown and worker configuration [6f64938] > - Updated root resolution logic in setup script to support multiple version control systems [6f64938] >  > >  > > <a href="https://app.macroscope.com">Macroscope</a> summarized 5ab010b. >  > >

…ITE_LABEL

… generic AppDataUpdate intent Defense-in-depth master switch + per-component declarative external committer authorization for the QR-invite flow (MLS External Commits, RFC 9420 §12.4.3.2). - ComponentId::EXTERNAL_COMMIT_POLICY = 0x800C in xmtp_mls_common. - Type-dispatch entry: ComponentType::Bytes for the new id. - crates/xmtp_mls/src/groups/external_commit_policy.rs (new file): - load_external_commit_policy(group) -> Option<ExternalCommitPolicyV1> - is_external_commit_allowed(group) -> bool - external_committer_permissions_for(group, id) -> Option<ComponentPermissions> - MlsGroup::set_external_commit_policy(policy) — granular setter writes the full ExternalCommitPolicyV1 via the generic AppDataUpdate intent (AppDataUpdateOp::Replace, since the component is full-replace Bytes-typed). - MlsGroup::set_allow_external_commit(bool) — sugar wrapper. Depends on L-0 (generic IntentKind::AppDataUpdate) for the intent machinery. No new IntentKind variant added — the generic path covers this and every future full-replace component setter. No bootstrap synth: absent EXTERNAL_COMMIT_POLICY = disabled default. First setter call creates the entry via AppDataUpdateOperation::Update upsert semantics. Zero risk of cross-client serialization drift.

…nal-commit joins

…validator

… + encrypted GroupInfo)

tylerhawkes requested a review from a team as a code owner May 21, 2026 08:21

tylerhawkes mentioned this pull request May 21, 2026

xmtp_mls: Client::join_group_by_external_invite (atomic external-commit join) #3674

Open

2 tasks

tylerhawkes mentioned this pull request May 21, 2026

bindings/node: add Conversation.createExternalInvite NAPI binding #3675

Open

2 tasks

tylerhawkes force-pushed the tyler/qr-invite-l9-openmls-bump branch from 8f3e6d6 to 9ba7051 Compare May 26, 2026 22:49

tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from ffaaf90 to c12079b Compare May 26, 2026 22:49

tylerhawkes force-pushed the tyler/qr-invite-l9-openmls-bump branch from 9ba7051 to e861d43 Compare May 26, 2026 22:59

tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from c12079b to 096e94d Compare May 26, 2026 22:59

tylerhawkes force-pushed the tyler/qr-invite-l9-openmls-bump branch from e861d43 to 6b218e5 Compare May 27, 2026 18:17

tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch 2 times, most recently from 5ebf785 to e8978a2 Compare May 27, 2026 18:33

tylerhawkes force-pushed the tyler/qr-invite-l9-openmls-bump branch from 6b218e5 to b4b3789 Compare May 27, 2026 18:33

insipx and others added 16 commits May 27, 2026 16:56

insipx -> xmtp for tracing-oslog patch (#3699)

736d6d1

feat(observability): OpenTelemetry trace export (node) (#3705)

89c3fc3

insipx and others added 6 commits June 3, 2026 21:32

tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from e8978a2 to e8ceed3 Compare June 4, 2026 19:38

macroscopeapp Bot reviewed Jun 4, 2026

View reviewed changes

tylerhawkes and others added 9 commits June 4, 2026 20:02

xmtp_mls_common: add invite::encrypted_group_info + XMTP_EXTERNAL_INV…

54721d0

…ITE_LABEL

xmtp_mls: ValidatedCommit::from_external_commit — accept atomic exter…

9a9ec22

…nal-commit joins

xmtp_mls: route Sender::NewMemberCommit through from_external_commit …

0a8293b

…validator

deps: bump openmls fork to feature/external-commit-by-value @ 9cdd5e7

f0017b0

xmtp_mls: MlsGroup::create_external_invite (produce QR-invite payload…

57e7e30

… + encrypted GroupInfo)

tylerhawkes force-pushed the tyler/qr-invite-l10-create-invite branch from e8ceed3 to 57e7e30 Compare June 6, 2026 05:13

tylerhawkes requested a review from a team as a code owner June 6, 2026 05:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xmtp_mls: MlsGroup::create_external_invite (produce QR-invite payload + encrypted GroupInfo)#3673

xmtp_mls: MlsGroup::create_external_invite (produce QR-invite payload + encrypted GroupInfo)#3673
tylerhawkes wants to merge 31 commits into
tyler/qr-invite-l9-openmls-bumpfrom
tyler/qr-invite-l10-create-invite

tylerhawkes commented May 21, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

claude Bot commented May 21, 2026 •

edited

Loading

Uh oh!

macroscopeapp Bot commented May 21, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 21, 2026 •

edited

Loading

Uh oh!

macroscopeapp Bot Jun 4, 2026

Uh oh!

macroscopeapp Bot Jun 4, 2026

Uh oh!

macroscopeapp Bot Jun 4, 2026

Uh oh!

macroscopeapp Bot Jun 4, 2026

Uh oh!

macroscopeapp Bot Jun 4, 2026

Uh oh!

macroscopeapp Bot Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tylerhawkes commented May 21, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Stack

Test plan

Add MlsGroup::create_external_invite to produce QR-invite payload and encrypted GroupInfo

Uh oh!

claude Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — 2026-05-27 18:35 UTC

Summary

Code Quality & Best Practices

Potential Bugs / Edge Cases

Security Concerns

Performance Considerations

Test Coverage

Error Handling Review

Recommendations

Uh oh!

macroscopeapp Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Approvability

Uh oh!

codecov Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

macroscopeapp Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tylerhawkes commented May 21, 2026 •

edited by macroscopeapp Bot

Loading

Add `MlsGroup::create_external_invite` to produce QR-invite payload and encrypted GroupInfo

claude Bot commented May 21, 2026 •

edited

Loading

macroscopeapp Bot commented May 21, 2026 •

edited

Loading

codecov Bot commented May 21, 2026 •

edited

Loading