xmtp_mls: MlsGroup::create_external_invite (produce QR-invite payload + encrypted GroupInfo)#3673
Conversation
|
Claude finished @tylerhawkes's task —— View job Code Review — 2026-05-27 18:35 UTCReviewed the implementation of SummaryOverall solid implementation with good cryptographic practices and defensive validation. The code follows expected patterns for this codebase and properly delegates to well-tested helper functions. Main concerns are around test coverage gaps and a few edge cases in expiry computation. Code Quality & Best PracticesPositive:
Minor issues:
Potential Bugs / Edge Cases
Security ConcernsStrong points:
Observations:
Performance Considerations
Test CoverageCodecov reports 59.13% patch coverage with 47 uncovered lines. This is the main concern. Looking at the tests:
Missing coverage likely includes:
Recommendation: The PR description notes "Integration coverage via PR-T1" (unchecked). Consider adding at least one happy-path test to this PR that:
This would significantly improve coverage and document expected usage. Error Handling ReviewThe new error variants in ExportGroupInfo(#[from] openmls::group::ExportGroupInfoError), // Line 452
EncryptedGroupInfo(#[from] EncryptedGroupInfoError), // Line 460
ExternalCommitNotAllowed, // Line 466
ExternalCommitPolicyMalformed(String), // Line 473All four are correctly marked as non-retryable in the Recommendations
Overall assessment: Well-implemented feature with good cryptographic hygiene. The main gap is test coverage of the happy path, which should be addressed before or in PR-T1. |
ApprovabilityVerdict: Needs human review Unable to check for correctness in 57e7e30. Diff is too large for automated approval analysis. A human reviewer should evaluate this PR. You can customize Macroscope's approvability policy. Learn more. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## tyler/qr-invite-l9-openmls-bump #3673 +/- ##
===================================================================
- Coverage 83.96% 83.82% -0.14%
===================================================================
Files 401 402 +1
Lines 59586 59701 +115
===================================================================
+ Hits 50031 50046 +15
- Misses 9555 9655 +100 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
8f3e6d6 to
9ba7051
Compare
ffaaf90 to
c12079b
Compare
9ba7051 to
e861d43
Compare
c12079b to
096e94d
Compare
e861d43 to
6b218e5
Compare
5ebf785 to
e8978a2
Compare
6b218e5 to
b4b3789
Compare
<!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> ### Add `--read-only` healthcheck mode and run reverse read-only pass in cross-talk test - Adds a `--read-only` flag to the healthcheck subcommand that skips all write operations and reuses an existing primary identity; the flag gates write ops via a new `WRITES` condition bit and a new `BOOTSTRAP` condition for empty identity stores. - Extends the cross-talk test healthcheck phase to run a forward write-enabled pass followed by a reverse read-only pass over older versions, annotating result rows with a `leg` field (`fwd`/`rev`). - Refactors `App` into a unit struct with a global `CONF` singleton and `App::network()`/`App::network_hash()` accessors, removing per-instance and per-call network arguments across clients, stores, and subcommands. - Introduces `DmStore` and a `Dm` data model for persisting direct message threads during sync, with metadata tracking and display support. - Adds `BootstrapIdentities` health op that creates and registers three new identities when both `BOOTSTRAP` and `WRITES` conditions are active. - Risk: `STRICT_VERSIONING` bit shifted from `1<<0` to `1<<1`; any serialized or compared condition values will be incorrect unless updated. <!-- Macroscope's review summary starts here --> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 2326a91.</sup> <!-- Macroscope's review summary ends here --> <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
<!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> ### Patch `tracing-oslog` dependency to remove panics Replaces the `tracing-oslog` crate with a forked version pinned to a specific git revision that removes panics from the library. <!-- Macroscope's review summary starts here --> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized bcccc6e.</sup> <!-- Macroscope's review summary ends here --> <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
…spam (#3693) Two focused changes to cut production log volume. Driven by an audit of a 1000-line VictoriaLogs sample from 5 herald instances, where **53% of lines were TRACE** and a single steady-state message accounted for **~40% of all log lines**. ## Commit 1 — `chore(logs): remove steady-state log noise` Deletes logs that fire on every poll/tick when *nothing happened* (the largest volume sources): - **key_package_cleaner**: drop per-poll `"No expired key packages to delete"` trace (~40% of sampled prod lines). - **device_sync worker**: drop the 20s `"New event: Tick"` heartbeat trace; log only real events, with `installation_id` as a structured field (not a `[..]` message prefix). - **sql_key_store**: drop the bare `"own_leaf_nodes"` trace (fires on every leaf-node read). - **worker metrics**: downgrade per-increment `"firing {metric}"` from `info` → `trace`. ## Commit 2 — `refactor(api): remove ApiDebugWrapper stats-on-error spam` On **every** API error, `ApiDebugWrapper` snapshotted all per-endpoint request counters into `AggregateStats` and embedded its multi-line `Debug` dump into `ApiClientError::{ErrorWithStats, ClientWithEndpointAndStats}` — which then got logged, and re-logged on each retry, as a verbose stats block. This removes the error-dump path so API errors are clean one-liners: - Delete `ApiDebugWrapper` and its module re-export. - Remove `ClientBuilder::enable_api_debug_wrapper` and drop the call from the mobile/node/wasm client builders. - `WrappedXmtpApiClient` is now `TrackedStatsClient<XmtpApiClient>`. - Remove the now-unused `ErrorWithStats` / `ClientWithEndpointAndStats` variants and their match arms. **Kept unchanged:** the `TrackedStatsClient` request counters and the FFI `api_statistics()` / `api_aggregate_statistics()` / `clear_all_statistics()` surface — only the error-message spam is removed. ## Verification - `cargo check` + `cargo clippy` clean on `xmtp_proto`, `xmtp_api`, `xmtp_mls`, `xmtp_db` (the only warning, `identity.rs:842` unused `mut`, is pre-existing and unrelated). - `just node check` and `just wasm check` both pass after the binding edits. ## Follow-ups (not in this PR) - Replace the in-RAM `TrackedStatsClient` counters with OTEL metrics (per-API-call latency/rate at the retry combinator), then retire the stats layer once the FFI is migrated. - Broader log standardization + OTEL traces/metrics rollout (separate audit docs exist locally). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Remove API debug wrapper and reduce steady-state log noise across MLS workers > - Deletes the `ApiDebugWrapper` and `enable_api_debug_wrapper` builder method; the `WrappedXmtpApiClient` type alias now resolves to `TrackedStatsClient<XmtpApiClient>` without the debug layer. > - Removes `ClientWithEndpointAndStats` and `ErrorWithStats` variants from `ApiClientError`, so errors no longer carry embedded aggregate stats. > - Suppresses high-frequency log output: tick events in `DeviceSyncWorker`, no-op key package cleaner runs, repeated DB connection-outage errors, and per-metric info logs are all silenced or downgraded to trace. > - Behavioral Change: API errors previously wrapped by the debug adapter are now surfaced as plain `ClientWithEndpoint` or `Client` variants with no attached stats. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 480bc19.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When the SQLite connection pool cannot hand out a connection (exhausted
/ `timed out waiting for connection` / `Unable to open the database
file`), a **single** root cause was logged dozens of times: each
background worker logged the error itself and then propagated it to the
supervisor, which logged it again on every poll/restart. In a production
sample this turned one pool outage into a flood of duplicate `ERROR`
lines.
## Approach: the supervisor is the single logger for propagated worker
errors
The worker supervisor already receives every error a worker returns and
logs it with context (`{kind} worker error: {err}`), and routes pool
failures to a quiet reconnect. So workers should **not** log errors they
propagate.
- **Remove the per-worker error logs** that just logged-then-propagated
(`disappearing_messages`, `pending_self_remove` outer fetch,
`key_package_cleaner` fetch). The supervisor is now the sole source.
Per-group *swallow-and-continue* logs (e.g. one malformed `group_id`)
are kept, since those never reach the supervisor.
- **Fix a latent bug in `key_package_cleaner`**: it previously *logged
and swallowed* a fetch error (returned `Ok`), so a pool outage was
logged every 5s and never triggered the supervisor’s reconnect/restart
path (silent infinite retry). It now propagates the error like the other
workers.
- **Classify pool-acquisition failures as needing a reconnect**:
`StorageError::db_needs_connection()` now matches
`PlatformStorageError::Pool(_)` (the r2d2 acquisition/timeout error) in
addition to `PoolNeedsConnection`, so the supervisor takes the quiet
reconnect path for a pool outage instead of the noisy error+restart
path. This is the one place the predicate is consulted.
- **Drop a redundant log**: `"Key package deletion successful"`
immediately followed `"Deleted {N} expired key packages (up to ID …)"` —
same meaning, removed.
## Tests
- `pool_errors_need_connection` — unit test for the
`StorageError::db_needs_connection()` classification
(`PoolNeedsConnection` directly and wrapped in `Connection`; a non-pool
error is not). The `Pool(_)` arm can’t be unit-constructed
(`r2d2::PoolError` has no public ctor) so it’s covered end-to-end below.
- `pool_failure_needs_connection` — end-to-end: disconnect a persistent
store’s pool, run a real query, assert the resulting error reports
`db_needs_connection()`.
## Incidental CI fix
`xmtp_db`’s `test-utils` feature did not forward
`xmtp_proto/test-utils`, so `cargo test -p xmtp_db` only compiled under
workspace-wide feature unification (which CI uses). This masked a broken
standalone per-crate test build. Added the missing feature forward.
## Verification
- `cargo clippy --locked --all-features --all-targets --no-deps --
-Dwarnings` ✓
- `cargo fmt --check` ✓ · `cargo hakari generate --diff` ✓
- Both new tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will append its summary at the bottom of the
description. -->
> [!NOTE]
> ### Deduplicate DB pool-failure error spam across MLS workers
> - Extends `StorageError::db_needs_connection()` and
`ConnectionError::db_needs_connection()` to classify generic `Pool(_)`
acquisition failures (not just `PoolNeedsConnection`) as needing a
reconnect.
> - Implements `NeedsDbReconnect` for `MlsStoreError` so worker restart
logic can inspect wrapped DB errors.
> - Refactors `DisappearingMessagesCleaner`, `KeyPackagesCleaner`, and
`PendingSelfRemoveWorker` to propagate structured errors instead of
logging and swallowing them, so pool failures surface once via the
reconnect path rather than flooding logs.
> - Behavioral Change: workers now abort and return an error on the
first DB failure per cycle instead of continuing past errors; this
reduces log spam but changes per-cycle recovery behavior.
>
> <!-- Macroscope's review summary starts here -->
>
> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized
ffe1014.</sup>
> <!-- Macroscope's review summary ends here -->
>
<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Third log-noise PR (after #3693, #3696), driven by a **new herald log dump filtered to INFO-only** (2000 lines, ~15.5 min of healthy traffic — the first sample was TRACE-flooded and couldn't show INFO top-talkers). ## Headline finding from the INFO dump **`"Processing readd requests for N groups"` was 1705 of 2000 lines (85%)** — `commit_log.rs` `handle_incoming_pending_readds`, which runs every ~2s worker tick and logged at info **even when N=0**. Fixed: early-return when empty + downgrade to debug. This single change removes the dominant INFO offender. ## Changes (all info→debug, per-item/steady-state churn) - **commit_log.rs**: gate `"Processing readd requests"` on non-empty (the 85% offender); `"No commit log requests to query"` (nothing-to-do early return); the per-readd cluster (`Sending readd request`, `Sending readd request to`, `Sent readd request`, `Updated requested readd sequence id`); `Updating publish cursor for conversation` (per-conversation loop). - **welcome_sync.rs**: two `"[{}] syncing group"` (per-group, also dropped the duplicate inbox_id prefix); `unstuck previously paused group` (per-group). - **welcomes/xmtp_welcome.rs**: per-welcome routine logs (`attempting to commit welcome`, `updating cursor to`, `storing group with welcome id`, `Created GroupUpdated message`, `updated message cursor from welcome metadata`). Kept the semantic state-change events (re-add → ALLOWED, consent reset) and error logs at info/warn. - **mls_ext/decrypted_welcome.rs**: `Trying to decrypt welcome` / `welcome pointer` (per-welcome). - **welcome_pointer.rs**: `Resolving welcome pointer` (kept the retry-progress log). - **mls_store.rs**: `returning N welcomes` (per query). - **oneshot.rs**: `Processing oneshot welcome`. - **subscriptions/process_welcome.rs**: 3 per-stream-event logs. - **subscriptions/stream_messages.rs**: `encountered newly unprocessed message` (per message). - **xmtp_api_d14n/.../mls.rs**: `got N envelopes`, `querying welcomes` (per query). ## Deliberately out of scope - **mls_sync.rs** — the most sensitive code; its per-message logs are being decided separately to preserve debugging context. - **The structured `log_event!` event system** (the `➣`-prefixed events like `GroupSyncGroupInactive`) — left untouched pending a separate evaluation of whether it's worth keeping. ## Verification - `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` ✓ - `cargo fmt --check` ✓ - Log-level/message-only changes; no behavior change except the readd early-return (skips a no-op log when there is no work). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Downgrade per-item welcome, sync, and query log statements from info to debug > Reduces log noise by lowering high-frequency, per-item `tracing` statements across welcome processing, group sync, commit log, and MLS query paths from `info` to `debug` level. Also adds an early-return guard in `handle_incoming_pending_readds` to skip logging entirely when there are no groups to process. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 357ed5d.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continues the log-noise cleanup (after #3693). These three sites were among the top remaining volume sources in the production sample, all logging normal-course behavior at warn/info. ## Changes - **`xmtp_common/src/retry.rs`** — the retry combinator (every API call funnels through it): - Per-attempt `"retrying function that failed"` was logged at **warn** on *every* retry. A single retry is normal transient behavior, so a flapping endpoint flooded prod warn logs. → **debug** (with an `attempt` field). - `"retry strategy exceeded max wait time"` stays at **warn** (it signals a call that never succeeded) but now carries context: `attempts`, `elapsed_ms`, and the error — previously it was a bare string with no indication of which call or how long. - **`xmtp_mls/src/groups/mls_sync.rs`** — `"[{inbox_id}] syncing group, epoch = …"` fired at **info** on every group sync, and duplicated `inbox_id` (both a structured field *and* the `[{}]` message prefix). → **debug**, dropped the redundant prefix, `epoch` is now a field. - **`xmtp_mls/src/groups/welcome_sync.rs`** — `"sync_welcomes failed, continuing with group sync"` and `"unstick_paused_groups failed, continuing with sync"` were **warn**, but both are explicitly non-fatal recovered paths. → **debug**. Net effect: prod logs at `info`+ now reflect real problems; normal retry/sync churn drops to `debug`. ## Verification - `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` ✓ - `cargo fmt --check` ✓ - Log-level/message-only changes; no behavior change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Downgrade retry and sync log levels from warn/info to debug > - Retry attempts in `retry_async` now log at debug instead of warn; exhaustion logs still warn but now include attempt count, elapsed_ms, and the error message. > - Group sync start in `MlsGroup.sync` is downgraded from info to debug and switches to structured fields (inbox_id, installation_id, group_id, epoch). > - Failures in `WelcomeSync.sync_groups` for `unstick_paused_groups` and `sync_welcomes` are downgraded from warn to debug. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 1e96453.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Summary Workers in `xmtp_mls` run under `WorkerRunner`. Its outer loop (`Worker::spawn`) stops a worker **only** when an error returned from `run_tasks` reports `needs_db_reconnect() == true` — that's the signal that the connection pool was dropped and the worker should stop until reconnect. Several workers violated this contract: they caught per-item errors with `tracing::error!` and continued, so a dropped pool could never reach the supervisor and the worker hot-looped against a gone pool. The signal was also silently lost in wrapper error types that mapped `GroupError(_) => false`. This cleans the workers up to follow the contract, using the recently-cleaned `disappearing_messages` worker as the model. ## Root fix (error-type forwarding) - `ConnectionError` gets a `db_needs_connection()` helper mirroring `StorageError`. - `GroupError`, `MlsStoreError`, and `SubscribeError` now implement `NeedsDbReconnect`, forwarding their storage / connection / client / wrapped variants. - `DeviceSyncError::needs_db_reconnect` now forwards `Db` / `Group` / `MlsStore` / `Subscribe` instead of only `Client`. - `CommitLogError::needs_db_reconnect` forwards `GroupError` and narrows `Connection` to `db_needs_connection()` (was unconditionally `true`); removes the stale TODOs. ## Workers - **key_package_cleaner** — propagate the expired-KP fetch error (was swallowed → `Ok`); bubble a disconnect caught while deleting a single key package. - **pending_self_remove** — capture the group-load (`MlsStoreError`) path (previously never converted into the worker error) and bubble a disconnect from either per-group operation; fix `GroupError(_) => false`. - **commit_log** — bubble a disconnect from the readd-request and readd-validation loops. - **device_sync** — bubble a disconnect from the per-message processing loop. Per-item **non-DB** failures still log and continue, as before — only a dropped pool now stops the worker. ## Other - Removes the dead `WorkerKind::Event` variant (no impl, no construction, not in bindings/FFI/serde). - Adds native-only unit tests pinning the disconnect-forwarding contract for every worker error type (`disconnect_propagation_tests`). ## Test plan - `cargo nextest run -p xmtp_mls disconnect_propagation` — 7/7 pass. - `cargo check -p xmtp_mls --tests` clean. - `just lint-rust` clean (clippy / fmt / hakari). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Propagate DB-disconnect errors out of workers to the supervisor > - Adds or fixes `needs_db_reconnect()` implementations across multiple error types (`GroupError`, `ClientError`, `IdentityError`, `SubscribeError`, `DeviceSyncError`, `PendingSelfRemoveWorkerError`) so dropped-pool signals are forwarded up the call stack rather than swallowed. > - Errors in `publish_commit_logs_to_remote` and `prepare_publish_commit_log_info` now propagate instead of being logged and ignored, so cursors only advance on success. > - `send_outgoing_readd_requests` now fails the operation when any readd request fails, instead of logging and continuing. > - `handle_incoming_pending_readds` now aborts early on retryable validation errors to avoid hot loops. > - Device-sync and archive import paths abort on DB errors instead of logging warnings and continuing. > - Behavioral Change: workers that previously swallowed DB-disconnect errors will now surface them to the supervisor, triggering reconnect handling. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 4c78768.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When an own-intent failed non-retryably, process_message_inner marked the intent row Error, committed, and returned Ok(identifier) — dropping the real GroupMessageProcessingError. process_messages recorded only a success, so sync_until_intent_resolved returned a summary that displayed '0 failed 1 succeeded' yet surfaced as a top-level GroupError::Sync with no cause. Capture the dropped cause via a ProcessedMessageOutcome struct and record it in ProcessSummary.errored, so the summary (and its source()) report the actual failure. No DB/txn-commit or retry-rollback behavior changes. <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Fix `SyncSummary.extend` to preserve the earliest error cause across sync retries > - `SyncSummary.extend` previously overwrote `self.other` unconditionally, clobbering earlier error causes with later `None` values; it now only sets `self.other` when `self.other` is currently `None`. > - `GroupError::SyncFailedToWait` now includes the `SyncSummary` in its `Display` output so the cause is visible in error messages. > - Per-attempt error logging inside `MlsGroup::sync` is removed; summaries are accumulated silently and logged once by `GroupSyncFinished`. > - `GroupSyncIntentErrored` events no longer redundantly include the summary metadata field. > - Two unit tests are added to [summary.rs](https://github.com/xmtp/libxmtp/pull/3701/files#diff-cc6e0f2ec334b2a161a0e24bf0a280e37287801dcc7c6e40c46b7d5153580240) validating the corrected `extend` accumulation behavior. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 55e3eec.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
…3702) ## Problem `SyncSummary` had two predicates that disagreed about what "errored" means: - **`Display`** (success banner) fires only when `publish_errors`, `post_commit_errors`, AND `other` are all empty. - **`is_errored()`** used `&&` for publish/post_commit (`!publish.is_empty() && !post_commit.is_empty()`), so a sync with *only* publish errors (or only post-commit errors) returned `is_errored() == false` — reported as not-errored — while its `Display` printed the `===== Errors Occurred During Sync =====` banner. These two govern whether a summary may sit in the `Err` position (`sync_with_conn`) and which `Display` branch is shown, so the divergence meant an `Ok(...)` summary could render as an error block (and vice versa). Separately, `GroupError::Sync` / `SyncFailedToWait` box a `SyncSummary` whose `source()` returned `None`, so the FFI `Caused by:` chain dead-ended at the summary's flattened string — the underlying error was never reachable. ## Changes (summary.rs only) - **`is_errored()`**: `&&` → `||` for publish/post_commit, and unify on one predicate. `process.errored` (per-message failures) is *deliberately excluded* and documented: a sync that decrypts some messages and fails others is still a successful sync of the group as a whole. - **`Display`**: success branch is now `if !self.is_errored()` — the two can no longer drift. - **`source()`**: returns the first available cause (`other` → first publish → first post_commit → first `process.errored`), so the FFI error chain reaches the real failure. - **6 unit tests**: clean / publish-only / post-commit-only / other / per-message / source-priority. The publish-only test is the explicit regression guard for the `&&` bug. ## Verification - `cargo clippy --locked -p xmtp_mls --all-features --all-targets --no-deps -- -Dwarnings` + `cargo fmt --check` clean - New unit tests pass; sync/welcome/commit_log/process_message integration suites pass (no regression from the `is_errored` behavior change) Pairs with #3701 (which captures the dropped intent-error cause into `process.errored`); this PR makes the type honest, #3701 fills in the cause. Independent-compiling; recommend landing this first. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Fix `SyncSummary` error detection logic and expose error cause via `source()` > - `is_errored()` previously required both `publish_errors` and `post_commit_errors` to be non-empty (logical AND); it now returns `true` when any of `other`, `publish_errors`, or `post_commit_errors` is present. > - `std::error::Error::source()` now returns the first available underlying error, preferring `other`, then `publish_errors`, then `post_commit_errors`, then per-message processing errors; previously it returned `None`. > - `Display` formatting now delegates to `is_errored()` instead of duplicating the error-check logic. > - Behavioral Change: callers relying on `is_errored()` returning `false` when only one of publish or post-commit errors is present will now see `true`. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 9d90b03.</sup> > <!-- Macroscope's review summary ends here --> > > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
…st (#3703) ## Problem Loading clients in xdbg was dominated by TLS setup, not DB or MLS work. Phase timing while loading 100 identities showed ~100% of the time inside `network.client_bundle()`, with a tell-tale monotonic +42ms-per-call stair-step — i.e. a global lock serializing ~42ms of work per call, paid **twice per client** (api + sync channel). Root cause: `create_tls_channel` builds a fresh tonic `Endpoint` on every call, and `ClientTlsConfig::with_enabled_roots()` makes tonic call `rustls_native_certs::load_native_certs()` and parse the **entire OS trust store** into a rustls `ClientConfig` each time. On macOS that read is serialized through Security.framework (~40ms). 200 calls × ~40ms ≈ the ~9–19s we were seeing. ## Fix **`xmtp_api_grpc`:** cache the built `Endpoint` per `(host, rate_limit)`. The endpoint owns an `Arc<ClientConfig>` with the already-parsed roots, so `connect_lazy()` on a cheap clone still hands **every client its own connection** — no change to connection topology or trust semantics — while the native-cert load happens **once per host**. The lock is held across the one-time build so concurrent first-callers load certs exactly once. This benefits every platform that creates clients (node/mobile too), not just xdbg. **`xdbg`:** parallelize identity loading (was a sequential `for` loop). Each client open is blocking SQLite/MLS work, so they're handed to Tokio's blocking pool via `tokio::task::spawn_blocking` and awaited together; `load_all_identities` / `load_n_identities` and their callers become `async`. ## Result `cargo xdbg -b dev generate -e message -a 1`, 100 clients: | | load time | |---|---| | before | ~19s | | after | **~185ms** | No new dependencies. No trust-store/behavior change. Each client still gets its own connection. ## Test plan - [x] `cargo fmt --check` - [x] `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` (full workspace, clean) - [x] manual: 100-client xdbg load ~19s → ~185ms, sends still succeed 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ## Problem > > Loading clients in xdbg was dominated by TLS setup, not DB or MLS work. Phase timing while loading 100 identities showed ~100% of the time inside `network.client_bundle()`, with a tell-tale monotonic +42ms-per-call stair-step — i.e. a global lock serializing ~42ms of work per call, paid **twice per client** (api + sync channel). > > Root cause: `create_tls_channel` builds a fresh tonic `Endpoint` on every call, and `ClientTlsConfig::with_enabled_roots()` makes tonic call `rustls_native_certs::load_native_certs()` and parse the **entire OS trust store** into a rustls `ClientConfig` each time. On macOS that read is serialized through Security.framework (~40ms). 200 calls × ~40ms ≈ the ~9–19s we were seeing. > > ## Fix > > **`xmtp_api_grpc`:** cache the built `Endpoint` per `(host, rate_limit)`. The endpoint owns an `Arc<ClientConfig>` with the already-parsed roots, so `connect_lazy()` on a cheap clone still hands **every client its own connection** — no change to connection topology or trust semantics — while the native-cert load happens **once per host**. The lock is held across the one-time build so concurrent first-callers load certs exactly once. This benefits every platform that creates clients (node/mobile too), not just xdbg. > > **`xdbg`:** parallelize identity loading (was a sequential `for` loop). Each client open is blocking SQLite/MLS work, so they're handed to Tokio's blocking pool via `tokio::task::spawn_blocking` and awaited together; `load_all_identities` / `load_n_identities` and their callers become `async`. > > ## Result > > `cargo xdbg -b dev generate -e message -a 1`, 100 clients: > > | | load time | > |---|---| > | before | ~19s | > | after | **~185ms** | > > No new dependencies. No trust-store/behavior change. Each client still gets its own connection. > > ## Test plan > - [x] `cargo fmt --check` > - [x] `cargo clippy --locked --all-features --all-targets --no-deps -- -Dwarnings` (full workspace, clean) > - [x] manual: 100-client xdbg load ~19s → ~185ms, sends still succeed > > 🤖 Generated with [Claude Code](https://claude.com/claude-code)<!-- Macroscope's changelog starts here --> > #### Changes since #3703 opened > > - Converted `load_clients_parallel`, `load_all_identities`, and `load_n_identities` utilities to async functions [69570ca] > - Updated `GenerateMessages::new` constructor and `GenerateGroups::create_groups` workflow to async [69570ca] > - Updated `Generate::run` and `Sync::run` command methods to await async operations [69570ca] > <!-- Macroscope's changelog ends here --> > <!-- Macroscope's pull request summary ends here --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sable (#3706) ## Summary Makes `xmtp_mls` background-worker scheduling configurable to fix a **thundering-herd / stampede** problem: when many clients (250+) deploy at the same time, their workers tick in lockstep and fire a wall of simultaneous requests. Adds, surfaced through the **Node, WASM, and Mobile** SDKs: - **Configurable intervals** — a global default plus per-worker overrides (nanoseconds, consistent with the existing `ForkRecoveryOpts::worker_interval_ns`). - **Optional jitter** (`worker_interval_jitter_ns`) — a one-time random startup offset (de-syncs clients that booted together) plus per-tick jitter (keeps them from re-converging). Fully optional; `0`/absent = deterministic, byte-for-byte today's behavior. - **Per-worker enable/disable** — a unified `enabled` map. The legacy `device_sync_worker_mode` and `disable_commit_log_worker` toggles fold into the same map (still work unchanged); `disable_workers` remains the global kill-switch. All new config is opt-in: `WorkerConfig::default()` reproduces existing behavior exactly. No DB migration, no protocol change, no change to default cadences. ## Design Mirrors the in-tree `fork_recovery_opts` pattern: `WorkerConfig` is carried on `XmtpMlsLocalContext` and exposed via `XmtpSharedContext::worker_config()`. A new `xmtp_common::time::jittered_interval_stream(base, jitter)` adds the startup offset + per-tick jitter (`jitter == 0` short-circuits to the existing `interval_stream`, pulling no randomness). Each worker resolves its interval through a context helper instead of its hardcoded const, keeping the const as the fallback default. Workers wired: disappearing-messages, key-package-cleaner, pending-self-remove, commit-log, device-sync heartbeat. TaskRunner is event/dynamic-driven and left as-is. For commit-log, the existing fork-recovery `worker_interval_ns` override still wins where set. ## SDK surface (all fields optional) | Concept | Node | WASM | Mobile | |---|---|---|---| | Global interval | `workerConfig.defaultIntervalNs` | `workerConfig.defaultIntervalNs` | `FfiWorkerConfig.default_interval_ns` | | Jitter | `workerConfig.jitterNs` | `workerConfig.jitterNs` | `FfiWorkerConfig.jitter_ns` | | Per-worker interval | `workerConfig.workerIntervalsNs` | `workerConfig.workerIntervalsNs` | `FfiWorkerConfig.worker_intervals_ns` | | Per-worker disable | `workerConfig.disabledWorkers` | `workerConfig.disabledWorkers` | `FfiWorkerConfig.disabled_workers` | New `WorkerKind` / `FfiWorkerKind` enum mirrors the core `WorkerKind`. ## Testing - `xmtp_common`: `jittered_interval_stream` — zero-jitter equals base; jittered period stays within bounds; startup offset bounded. - `xmtp_mls`: `WorkerConfig::interval` precedence (override > default > const), zero-clamp, jitter carry, `worker_enabled` default-on; a disabled worker is not registered. - Existing worker + device-sync suites pass unchanged (defaults preserve behavior). Verified: `just lint-rust`, `just node lint`, `cargo nextest run -p xmtp_common time`, `cargo nextest run -p xmtp_mls worker`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… iOS/Android (#3707) Normalize Node and WASM 1.10.0 release notes to match the structure already used in ios/4.10.0.md and android/4.10.0.md. No content changes, structural and formatting fixes only. Changes: node-bindings/1.10.0.md: Add `## What's Changed` section header before intro text Add `## Breaking Changes` section header `### History sync is now manual` → moved under `## Breaking Changes` `### Upcoming deprecation of preferences.sync()` → moved under `## Breaking Changes` Add `## New Features` section header `### Manual history sync archive management` → moved under `## New Features` `### Archive-based backups` → moved under `## New Features` Add `## Bug Fixes` section header `### Performance improvement` → removed subheading, text moved under `## Bug Fixes` `**THIS IS A BREAKING CHANGE.**` → `**This is a breaking change.**` wasm-bindings/1.10.0.md: (identical structural changes as node-bindings/1.10.0.md) Test plan: Documentation-only change; no code paths affected. <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Normalize Node and WASM 1.10.0 release notes structure to match iOS/Android > Restructures the 1.10.0 release notes for [Node](https://github.com/xmtp/libxmtp/pull/3707/files#diff-883d74d020e02801e31065330f11fdbd52fd8892ed0e14c97ba4616d470e07df) and [WASM](https://github.com/xmtp/libxmtp/pull/3707/files#diff-1dd324ff90565158b66b4d4de5d306179d8970d17b88a2537449749d67fb51de) bindings to match the section layout used by the iOS and Android notes. Adds explicit `What's Changed`, `Breaking Changes`, `New Features`, and `Bug Fixes` headers, moves content into the appropriate sections, deduplicates the `preferences.sync()` deprecation notice, and normalizes minor formatting. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 83ce8bf.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
…3709) ## Summary Follow-up to #3706. `WorkerConfig` shipped with a single **global** `jitter_ns` that applied to *every* background worker. Consumers who only want to jitter one worker — e.g. scheduling the daily CommitLog fork-recovery worker with 6h jitter to avoid a fleet stampede — end up jittering the fast workers too. A 1s-interval worker (DisappearingMessages) inheriting up to 6h of jitter is effectively broken. This makes jitter **per-worker**, mirroring how interval already works via `interval_overrides`. ## Changes - **Core** (`crates/xmtp_mls/src/worker.rs`): remove `WorkerConfig.jitter_ns: Option<u64>`, add `jitter_overrides: HashMap<WorkerKind, u64>`. `interval()` resolves jitter per-kind; absent entry => `0` (deterministic). No global default jitter — that's the foot-gun being removed. - **Bindings** (node / wasm / mobile): drop the scalar `jitterNs` from `WorkerConfigOptions`, add a `workerJittersNs` array of `{ kind, jitterNs }` mirroring `workerIntervalsNs`. Generated TS surface: ```ts workerConfig.workerJittersNs?: Array<{ kind: WorkerKind; jitterNs: bigint }> ``` ## Breaking change Breaks the nightly-only `WorkerConfigOptions.jitterNs` field. No stable consumers; nightly breakage is expected. Downstream consumers set per-worker jitter instead, e.g.: ```ts workerJittersNs: [{ kind: "CommitLog", jitterNs: 21600n * 1_000_000_000n }] ``` ## Notes - `create_client` FFI signatures unchanged (still one `worker_config` param) — only the inside of `WorkerConfigOptions` changes, so no SDK-wrapper or caller churn. - `WorkerConfig::default()` behavior unchanged (empty maps, deterministic). No DB/protocol change. ## Testing - Core: per-kind jitter carried; jitter scoped per worker (jittered worker gets it, un-listed worker stays `0`). - `just lint-rust` ✓ · `just node lint` ✓ · all three bindings compile (`bindings_node`, `bindings_wasm` wasm32, `xmtpv3`) ✓ · `xmtp_mls worker_config` tests ✓ 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Replace global `jitter_ns` with per-worker `jitter_overrides` in `WorkerConfig` > - Removes the global `jitter_ns: Option<u64>` field from `WorkerConfig` and replaces it with `jitter_overrides: HashMap<WorkerKind, u64>`, so jitter is scoped per worker kind with a default of zero. > - Updates `WorkerConfig.interval()` to resolve jitter from the per-kind map instead of a single global value. > - Mirrors the change across all three binding layers: [worker_config.rs](https://github.com/xmtp/libxmtp/pull/3709/files#diff-3b21fab5bc1e30d4eb92b2d0e14461f1c2fd991164b484046e665363bb3fd48d) (mobile/FFI), [options.rs](https://github.com/xmtp/libxmtp/pull/3709/files#diff-5306d98c5fe6413b641310877f4657e782cd87bc1e7497e28c0478154fb062d7) (Node), and [client.rs](https://github.com/xmtp/libxmtp/pull/3709/files#diff-6fa64ca0f35092cf055d31985b6e071f8dda1fcac5f9f3216f813422411af7c8) (WASM), each replacing a single `jitter_ns` field with a `worker_jitters_ns` array of per-worker override objects. > - Behavioral Change: workers with no explicit jitter entry now receive `Duration::ZERO` instead of any previously configured global jitter; callers must migrate to per-worker overrides. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 187d245.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Summary
Adds a `worker_turn` tracing span to each background worker's per-turn
unit of work, so telemetry/otel can report **per-worker turn latency and
counts, sliced by worker kind**.
**Why:** each worker's `run_tasks()`/`run()` is an internal infinite
loop (`while intervals.next().await { ... }`), so the existing
`worker_run` span is lifetime-long and idle-inclusive — useless for
per-turn metrics. The fix attaches `#[tracing::instrument]` to each
worker's actual per-turn method (the loop body), not the looping
function.
## What changed
A `worker_turn` span on all 6 workers' turn methods, all carrying the
byte-identical `operation = "worker_turn"` (the spanmetrics aggregation
key) and a `worker` dimension:
| Worker | turn span site |
|---|---|
| CommitLog | `tick()` |
| DisappearingMessages | `delete_expired_messages()` |
| PendingSelfRemove | `remove_pending_remove_users()` |
| KeyPackageCleaner | extracted `tick()` |
| TaskRunner | `run_and_reschedule_task()` |
| DeviceSync | extracted `handle_event()` |
- `worker = ?self.kind()` (Debug) on the 5 method sites; `TaskRunner`
uses the literal `"TaskRunner"` (its turn method is an associated fn
with no `self`). Both formats equal the existing `worker_run` span's
value, so the two span series correlate cleanly.
- `skip_all` everywhere (workers hold a `Context` with DB/API handles).
- Two behavior-preserving extractions to host the macro:
`KeyPackageCleaner::tick()` and `DeviceSync::handle_event()`. The
DeviceSync 20s `Tick` heartbeat is deliberately bypassed before dispatch
so idle heartbeats open no span (and emit no log).
## Deliberately out of scope
- Existing `worker_run` / `libxmtp_worker` spans left untouched (don't
break existing dashboards).
- Deep spans in `mls_sync.rs` (`sync`, `post_commit`, `publish_intents`,
...) not touched — they're shared by foreground user calls too, so
there's no static worker to stamp. otel traces already carry the
worker→deep-span lineage; deep-span dashboard slicing is a
telemetry-repo concern.
Design + plan included under `docs/superpowers/`.
## Test Plan
- [x] `just check crate xmtp_mls` — compiles clean
- [x] `just lint-rust` — clippy / fmt / hakari clean
- [x] `just test v3 worker` — 29/29 worker tests pass (incl. all
device_sync + disconnect-propagation)
- [x] Invariant check: exactly 6 `operation = "worker_turn"` sites,
byte-identical
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will append its summary at the bottom of the
description. -->
> [!NOTE]
> ### Add per-worker-turn tracing spans across MLS background workers
> Adds `#[tracing::instrument]` spans with `worker` and
`operation="worker_turn"` fields to each background worker's main
processing method, enabling per-turn telemetry in distributed traces.
>
> -
[`commit_log.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-523943065ed63a511a3a552e44cb0ba0a05fecab9163088d5588cb5c949e6a5d),
[`disappearing_messages.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-93ae9a4a70ee163ee3ce898babe5119780fa34ca9d91b823ee0135fda32f1160),
[`pending_self_remove.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-2c915110e23b0ac1caa8ae0026e6cf76cff97c528ff81f91eef1a26a5d421f81),
and
[`tasks.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-b3d01dc512a077cf4b625f516f8c36ea61978f070bcce720ed8d522065a268b6)
each get a span on their primary tick/run method with no logic changes.
> -
[`key_package_cleaner.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-f474b9d0bca387a8ef826d7b34e811f8644323a8ac5584efc6eef034cfaeeb8d)
extracts a new `tick` method to serve as the span boundary.
> -
[`device_sync/worker.rs`](https://github.com/xmtp/libxmtp/pull/3711/files#diff-3d1d107a7d85b7aaf674a2365c789f51bda6243c3a37caab478baa8672ec737e)
introduces a `handle_event` method that wraps non-`Tick` events in a
span with an additional `event` field; `Tick` events bypass the span to
avoid noise.
>
> <!-- Macroscope's review summary starts here -->
>
> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized
7d32c55.</sup>
> <!-- Macroscope's review summary ends here -->
>
<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->
…t gating (phase 1) (#3710) ## Summary Phase 1 of an automated release flow for libxmtp: a **git-cliff–driven** always-open Release PR with auto-generated changelog, a **GitHub Release "hub"** that fans out to per-SDK nightly releases with track-aware versions, and a **cross-test gate** that blocks nightlies unless cross-talk + cross-version are green for the exact release SHA. ### Why git-cliff (not release-plz) This started as "use release-plz." Investigation found release-plz's Release-PR **cannot** generate a changelog without also bumping `Cargo.toml`'s version — and our workspace version is the **runtime MLS protocol identity** (`PROPOSALS_MIN_PROTOCOL_VERSION <= CARGO_PKG_VERSION` is CI-enforced; flipping it backward bricks `enable_proposals`). So the version must never be auto-bumped. **git-cliff** (the engine release-plz uses internally) gives the same UX — always-open changelog PR + a print-only version oracle — while never touching any manifest. No crates.io publish. ### What's in this PR - **git-cliff seed:** `cliff.toml` (changelog config, anchored to stable `v*` tags) + `.github/workflows/release-pr.yml` (always-open changelog-only Release PR) + git-cliff in the Nix devshell. - **SDK registry extension** (`dev/release-tools`): `versionTrack` (`follows-libxmtp` vs `independent`), notes globs, `releaseWorkflow`, `channels` per SDK — the extensibility spine (xmtp-js later = 2 entries, no engine change). - **Version-track engine:** `resolve-sdk-version` (follows-libxmtp takes the 1.x number verbatim; independent mirrors the bump-kind onto its own base, e.g. iOS 4.10 → 4.11), preview-next prerelease versioning (semver-correct nightlies), git-cliff oracle parsing (`pending-version`). - **Cross-test gate:** `cross-test-gate` (fail-closed, SHA-exact) + nightly wiring in `release.yml` (gate → pending → fan-out → hub), with an idempotent hub GitHub Release carrying an SDK-version rollup. ### Not in this PR (future phases) - Stable release-branch flow, backports, RC (phase 2). - xmtp-js (node-sdk 6.x, browser-sdk 7.x) onboarding (phase 3). - Wiring per-SDK nightly *filtered* notes onto the hub (thin follow-up; globs are already data). The runtime/manifest version (`1.11.0-dev`) is intentionally **untouched** — verified across the whole diff. ## Test Plan - [x] `dev/release-tools`: `yarn typecheck` clean, `yarn test` 158 passed - [x] All 5 workflow YAML files validate - [x] git-cliff `--bumped-version` oracle verified print-only against the real repo (resolves to last stable tag, writes nothing) - [x] CLI smoke tests: `pending-version`, `resolve-sdk-version` (ios → 4.11.0), `list-sdks`, `cross-test-gate` (pass + fail cases, exit 0 on skip) - [ ] **Manual (post-merge / dispatch on a test branch):** trigger the Release workflow nightly path and confirm gate-skip behavior, pending version/kind, per-SDK preview-next versions, and the hub release with rollup. (CI workflows can't be fully unit-tested.) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Add git-cliff nightly release orchestration with cross-test gating and auto-generated changelogs > - Adds a gate job in [release.yml](.github/workflows/release.yml) that blocks nightly releases unless cross-test CI passes for the exact commit SHA, using the new `xmtp-release cross-test-gate` command. > - Adds a pending job that uses `git-cliff --bump --context` to detect unreleased conventional commits; skips the nightly entirely if nothing is pending. > - Nightly SDK release workflows ([iOS](.github/workflows/release-ios.yml), [Android](.github/workflows/release-android.yml), [Node](.github/workflows/release-node.yml), [WASM](.github/workflows/release-wasm.yml)) now require `pending-version` and `pending-kind` inputs and resolve versions via the new `xmtp-release resolve-sdk-version` command. > - Adds a hub job that creates or updates a prerelease GitHub Release summarizing all SDK versions shipped in a nightly run, skipped if no SDK shipped. > - Adds [release-pr.yml](.github/workflows/release-pr.yml) to maintain a persistent open PR that updates `CHANGELOG.md` on every push to main. > - Adds new release-tools CLI commands (`pending-version`, `resolve-sdk-version`, `list-sdks`, `cross-test-gate`) and supporting library modules for gate evaluation, git-cliff context parsing, and track-aware SDK version resolution. > - Risk: nightly releases now fail closed — missing cross-test runs or absent pending commits will skip or abort the release rather than proceeding. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 988e379.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Summary An already-processed welcome (duplicate delivery, or a resume past our cursor for a group we're already in) is an **expected, gracefully-handled** condition — the error is returned, the cursor is handled, the welcome is skipped. But it was logged at `tracing::error!`. The otel tracing layer (`tracing_opentelemetry`, default config) sets a span's status to `Error` for any `error!` event fired while that span is active — it's event-driven, not return-value-driven. So this benign condition marked the enclosing worker/sync span `status:error` and **inflated the telemetry error rate** even though nothing actually failed. ## Fix Downgrade both emitters of this condition from `error!` to `warn!`: - **`groups/welcomes/xmtp_welcome.rs`** — the upstream `"Skipping already processed welcome"` log inside `.process()`. This is the one that actually fires inside the worker span and poisons its status; it's the dominant source for the duplicate-delivery case. - **`groups/welcome_sync.rs`** — the downstream catch-all `Err(err)` arm, which would otherwise log the returned `WelcomeAlreadyProcessed` at `error!` for the already-in-group path. This mirrors the existing storage-duplicate handling a few lines up, which already logs at `warn!`. **Behavior is otherwise unchanged:** the error is still returned, cursor handling is untouched. This is a telemetry/log-level-only change. ## Test Plan - [x] `just check crate xmtp_mls` — compiles clean - [x] `just lint-rust` — clippy / fmt / hakari clean - [x] `just test v3 welcome` — 60/60 welcome tests pass - [x] Confirmed no `error!` remains for this condition (both sites now `warn!`) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Log already-processed welcomes at warn instead of error in `process_new_welcome` > Encountering an already-processed welcome is an expected, non-retryable condition, so logging it at `error` level produced noise. Both [`welcome_sync.rs`](https://github.com/xmtp/libxmtp/pull/3714/files#diff-5ad5d6ca4506c4960727c32b22b9168a128f14af4d61497cd168ba5c80fb43f5) and [`xmtp_welcome.rs`](https://github.com/xmtp/libxmtp/pull/3714/files#diff-7ad79af03cc69f17888ed323b42554f52c5c706674e76a981c4fd4663464bb4e) now use `warn` for this case. Error return semantics are unchanged. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 43d7550.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
…3713) Two changelog improvements, both stemming from the scruffy first generated Release PR (#3712). ## 1. Tidy the generated changelog (`cliff.toml`) **Body-dump scruff** — the template rendered `commit.message` (full message = subject + body). GitHub squash-merges fill the commit body with the PR description, so each merge dumped its whole PR body into the changelog: Macroscope HTML-comment summary blocks, `Co-authored-by:` trailers, even a pasted Nix build-error log (~316 of #3712's 461 lines). ```diff -{{ commit.message | upper_first }} +{{ commit.message | split(pat="\n") | first | trim | upper_first }} ``` **Fake per-scope groups** — non-conventional commits (e.g. `xmtp_proto: …`, `bindings/mobile: …`) made git-cliff invent junk section headers like `Xmtp_proto` / `Bindings/mobile`. A catch-all collects them into one `Other` group: ```diff + { message = ".*", group = "<!-- 8 -->Other" }, ``` Result (regenerated from `main`): group headers are exactly **Features · Bug Fixes · Performance · Documentation · Other**, every entry a clean subject line, zero scruff. ## 2. Per-nightly changelog in the hub release (`release.yml`) Each nightly hub GitHub Release now appends a **"Changes since the last nightly"** section — the git-cliff delta over `<previous-nightly-tag>..HEAD` (`--strip header`), computed before this run is tagged so the newest existing `v*-nightly.*` tag is genuinely the prior one. First-ever nightly (no prior nightly tag) falls back to since-last-stable. Empty output is dropped. Adds git-cliff to the hub job. > Uses an explicit `<tag>..HEAD` range rather than the config's stable-anchored `tag_pattern`, so it diffs against the last *nightly*, not the last stable. Relies on the §1 fixes for clean output. ## Verification - Regenerated changelog from `main`: clean, 5 well-formed groups, no scruff. - Simulated the per-nightly delta (`<tag>..HEAD --strip header`): clean grouped output, 22 bullets, no header. - All workflow YAML validated. Once merged, the open Release PR (#3712) regenerates clean on the next push to main, and the next real nightly carries a proper since-last-nightly changelog. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Tidy generated changelog to show subject-only commits and group non-conventional commits into 'Other' > - Updates [cliff.toml](https://github.com/xmtp/libxmtp/pull/3713/files#diff-e1372c8b03c40942b5d828a90975054cb8aaed3b38189f434396f922ec41a584) to trim commit messages to their first line only and capitalise the first character, removing multi-line noise from changelogs. > - Adds a catch-all parser rule in [cliff.toml](https://github.com/xmtp/libxmtp/pull/3713/files#diff-e1372c8b03c40942b5d828a90975054cb8aaed3b38189f434396f922ec41a584) that places non-conventional commits into an 'Other' group instead of creating per-scope groups. > - Updates [release.yml](https://github.com/xmtp/libxmtp/pull/3713/files#diff-87db21a973eed4fef5f32b267aa60fcee5cbdf03c67fafdc2a9b553bb0b15f34) to install `git-cliff`, compute a delta changelog from the previous nightly tag to `HEAD`, and append a > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized eecd171.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iber + otel) + node migration (#3708) ## What Introduces a new `crates/xmtp_logging` crate as the **sole owner** of the `tracing-subscriber` and `opentelemetry` dependencies, and migrates the **node** binding onto it. All production logging/tracing pipeline construction, the OTLP trace exporter, and the runtime-control handle now live in one place. ## Why Logging/telemetry setup was scattered across `xmtp_common` (which sat near the bottom of the dep graph, so every crate that touched it pulled in `tracing-subscriber` + the full OpenTelemetry stack) and re-implemented per binding. Consolidating into a dedicated crate gives us: - one builder + one `LoggingConfig` record at the FFI boundary - a `LoggingHandle` for runtime control (set level, enable/disable file logging, enable telemetry, flush) without re-installing the subscriber - `xmtp_common` no longer carries `opentelemetry*` / `tracing-subscriber` as production deps (test/bench-only now) ## What's in the crate - `config.rs` — canonical `Level` / `Rotation` / `ProcessType` enums and `FileConfig` / `TelemetryConfig` / `LoggingConfig` records. - `filter.rs` — `filter_directive` (moved from `xmtp_common`). - `telemetry.rs` — OTLP span exporter + `TelemetryGuard` (native-only, `cfg(not(wasm32))`; always compiled, enabled at runtime when an endpoint is set — no Rust feature flag). - `layers/` — stdout/json fmt, rolling file, per-OS native (logcat/os_log/ server-compact), and browser console/perf layers. - `builder.rs` — `XmtpLoggingBuilder::from_config(cfg).install()` installs the global subscriber. - `handle.rs` — `LoggingHandle` with reloadable level / file / telemetry slots. - test subscriber (the obsolete `tracing-forest` tree logger and `TestLogReplace` / `REPLACE_IDS` are removed — OTEL spans replace them). ## Consumer changes - **node** — fully migrated: `init_logging` builds a `LoggingConfig` from `LogOptions` and installs once; `flush_telemetry` delegates to the handle. Drops the direct `tracing-subscriber` dep. - **mobile / wasm / xmtp_debug** — repointed `filter_directive` to `xmtp_logging` and dropped the `xmtp_common` `logging` feature. These keep their own subscribers for now; fully migrating them to the builder is a follow-up. ## Verification Native + wasm build, clippy, rustfmt, taplo, hakari (regenerated — drops `tracing-log`, the removed transitive), and unit tests (`xmtp_logging` + `xmtp_common`) all pass locally. ## Follow-ups (separate PRs) - Migrate wasm + mobile bindings onto the builder/handle (not just `filter_directive`). - Migrate apps. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Extract logging and telemetry into a dedicated `xmtp_logging` crate as sole owner of tracing-subscriber and OpenTelemetry > - Creates a new `xmtp_logging` crate with a fluent builder API (`XmtpLoggingBuilder`) that installs a global tracing subscriber with reloadable log level, optional rolling-file output, and optional OTLP telemetry export. > - Migrates all logging/telemetry code out of `xmtp_common`, removing the `logging` feature flag from that crate; all consumers now depend on `xmtp_logging` directly. > - Updates the Node binding's `init_logging` to use `xmtp_logging` via a process-global `LOGGING_HANDLE`, and changes `flush_telemetry` to call `handle.flush()` instead of shutting down the exporter. > - Removes `TestLogReplace` and per-instance log-ID replacement from `xmtp_common::test` and `Tester`; test harnesses use `xmtp_logging::test_logging` instead. > - Risk: `TestLogReplace` log-ID substitution is permanently removed from test output; tests that relied on named replacements in log lines will no longer see them. > > <!-- Macroscope's changelog starts here --> > #### Changes since #3708 opened > > - Replaced `once_cell` dependency with standard library synchronization primitives [6d0ab95] > - Standardized workspace dependency notation across Cargo.toml files [6d0ab95] > - Updated Cargo.lock [6d0ab95] > <!-- Macroscope's changelog ends here --> > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized b45b18f.</sup> > <!-- Macroscope's review summary ends here --> > > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
…ry-run gate (#3717) ## Why On 2026-06-04 the nightly pipeline published a bogus **libxmtp `2.0.0`** and **iOS/Android `5.0.0`**, and the Android `5.0.0-nightly` is now **permanently on Maven Central** (immutable). Root causes: 1. **Non-deterministic git-cliff.** The oracle installed git-cliff via `cargo binstall git-cliff || cargo install git-cliff` — floating latest. At the same commit, CI computed a **major** bump while local computed minor. 2. **Nightly auto-majored.** A `BREAKING CHANGE:` footer drove a major bump; a nightly should never auto-major. 3. **No dry-run / preview gate.** Nothing sat between "compute version" and the irreversible publishes. ## What (three layers of protection) - **COMPUTE** — new pure `capBumpKind` in `xmtp-release` clamps a nightly `major` → `minor` (recomputing the version from the last stable tag). Exposed via `pending-version --max-bump minor`. Unit-tested. - **ASSERT** — the plan job fails if the bump is still `major` after the cap (defense-in-depth; every release-*/hub job `needs:` the plan, so a failure blocks all publishes). Plus a human-readable release plan printed to the step summary. - **PREVIEW** — git-cliff is pinned to **2.13.1**; a `dry-run` dispatch input runs the full plan but skips every publish/tag; and `gate`+`pending` are extracted into a reusable `release-gate-plan.yml` so a new **`nightly-dry-run.yml`** can preview the real nightly version logic on demand (no publish jobs exist in it — structurally cannot publish). ## Verification (against the exact v5 failure mode) Reproduced locally with the pinned git-cliff 2.13.1 against real upstream `main`: | case | result | | --- | --- | | real current main, capped | `1.11.0` / minor ✓ | | v5 mode (git-cliff says major), **no cap** | `2.0.0` / major — reproduces the bug | | v5 mode **with `--max-bump minor`** | **`1.11.0` / minor** — clamped ✓ | | assertion if a major slipped through | fires `::error:: refusing to publish`, exit 1 ✓ | 167 unit tests pass; typecheck clean; all 3 workflow YAMLs valid. The refactor was verified to NOT break non-nightly (dev/rc/final) releases (a skipped `plan` dependency doesn't block them — `!cancelled()` + `release-type != 'nightly'` short-circuit). > **Post-merge step:** `workflow_dispatch` requires the workflow on the default branch, so the `nightly-dry-run.yml` GH-Actions wiring gets its full end-to-end check by dispatching it on `main` after merge (safe — no publish jobs) before the next cron nightly. ## Out of scope - Recovering the already-published `5.0.0-nightly` (impossible — Maven Central is immutable) and the resulting Android-v5 / iOS-sync decision (team call). - The larger release-process simplification (unify version lines / libsignal model / SPM-only) — tracked separately, needs team buy-in. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Make nightly release version deterministic, cap bump at minor, and add dry-run gate > - Extracts release gating and version computation into a reusable [release-gate-plan.yml](.github/workflows/release-gate-plan.yml) workflow that resolves the exact SHA, checks cross-test status, and computes a pending version capped at minor bump. > - Adds `--max-bump` to the `pending-version` CLI command via a new `capBumpKind` utility in [sdk-version.ts](dev/release-tools/src/lib/sdk-version.ts) that clamps the bump kind and recomputes the version from the last-shipped baseline. > - Replaces separate `gate` and `pending` jobs in [release.yml](.github/workflows/release.yml) with a single `plan` job; all release jobs now read `pending-version` and `pending-kind` from plan outputs and pin git-cliff to 2.13.1. > - Adds a new manual-only [nightly-dry-run.yml](.github/workflows/nightly-dry-run.yml) workflow and an optional `dry-run` input to `release.yml` that prints the computed plan without publishing, tagging, or merging. > - Risk: the workflow now fails hard if a major bump slips through after capping, which will block nightly releases until the bump is addressed. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized c775131.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) ## What The native gRPC HTTP/2 keepalive parameters are now overridable via environment variables, read once per process in `apply_channel_options`. **Defaults are unchanged** (fast detection). ## Why herald-lite #70 (`UNAVAILABLE` / status 14 disconnects on herald dev) was root-caused to libxmtp's own client-side h2 keepalive: ping every 16s, demand a PONG within 10s, even idle. On a cross-cloud Fly→AWS path under load the PONG starves past the 10s deadline, hyper drops the connection, tonic maps it to `UNAVAILABLE`, and the retry layer masks it. A **global** relax is wrong: mobile wants *fast* dead-connection detection, herald wants *forgiving*. (A global 30/30 broke `client::tests::should_reconnect`, which deliberately relies on ~26s keepalive detection — its comment cites "16s interval + 10s timeout".) So the defaults stay fast and a high-latency deployment opts out via env. ## Env vars (all optional; unset = default) | env var | overrides | default | |---|---|---| | `XMTP_GRPC_KEEPALIVE_INTERVAL_SECS` | `http2_keep_alive_interval` | 16 | | `XMTP_GRPC_KEEPALIVE_TIMEOUT_SECS` | `keep_alive_timeout` | 10 | | `XMTP_GRPC_TCP_KEEPALIVE_SECS` | `tcp_keepalive` (0 disables) | 16 | | `XMTP_GRPC_KEEPALIVE_WHILE_IDLE` | `keep_alive_while_idle` | true | Read once via a `LazyLock` (servers set them before start), so every connection after the first is a struct copy, not an env read. Native-only (wasm fetch has no h2 knobs). herald rides `@xmtp/node-sdk`, so env vars need **no** binding/SDK change — it just sets them in its Fly deploy.
e8978a2 to
e8ceed3
Compare
| &self, | ||
| policy: xmtp_proto::xmtp::mls::message_contents::ExternalCommitPolicyV1, | ||
| ) -> Result<(), GroupError> { | ||
| self.ensure_not_paused().await?; |
There was a problem hiding this comment.
🟡 Medium groups/mod.rs:2139
set_external_commit_policy and set_allow_external_commit do not check ConversationType::Dm before queuing the intent, unlike every other metadata-mutating API in this file (update_group_name, update_app_data, update_group_description, update_group_image_url_square, update_permission_policy, update_admin_list). A migrated DM can therefore have EXTERNAL_COMMIT_POLICY persisted, enabling QR/external-commit state on a conversation type the codebase explicitly forbids elsewhere (e.g., create_external_invite documents DM prohibition). Consider adding the standard DM guard to reject the operation early.
Also found in 1 other location(s)
crates/xmtp_mls/src/groups/external_invite.rs:107
create_external_invitenever checks the group'sconversation_typebefore exporting a QR invite. The PR explicitly requires DM groups to be rejected, but this implementation only checks pause state and policy presence. If a DM group carries anEXTERNAL_COMMIT_POLICYentry (for example from malformed or legacy state), this method will still mint an external invite for that DM, violating the intended restriction.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls/src/groups/mod.rs around line 2139:
`set_external_commit_policy` and `set_allow_external_commit` do not check `ConversationType::Dm` before queuing the intent, unlike every other metadata-mutating API in this file (`update_group_name`, `update_app_data`, `update_group_description`, `update_group_image_url_square`, `update_permission_policy`, `update_admin_list`). A migrated DM can therefore have `EXTERNAL_COMMIT_POLICY` persisted, enabling QR/external-commit state on a conversation type the codebase explicitly forbids elsewhere (e.g., `create_external_invite` documents DM prohibition). Consider adding the standard DM guard to reject the operation early.
Also found in 1 other location(s):
- crates/xmtp_mls/src/groups/external_invite.rs:107 -- `create_external_invite` never checks the group's `conversation_type` before exporting a QR invite. The PR explicitly requires DM groups to be rejected, but this implementation only checks pause state and policy presence. If a DM group carries an `EXTERNAL_COMMIT_POLICY` entry (for example from malformed or legacy state), this method will still mint an external invite for that DM, violating the intended restriction.
| for mutation in &delta.mutations { | ||
| let key = match mutation { | ||
| TlsMapMutation::Insert { key, .. } | ||
| | TlsMapMutation::Update { key, .. } | ||
| | TlsMapMutation::Delete { key } => key, | ||
| }; | ||
| if key != &joiner_id { | ||
| return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope); | ||
| } | ||
| } |
There was a problem hiding this comment.
🟡 Medium groups/validated_commit.rs:2775
enforce_app_data_update_scope accepts TlsMapMutation::Update and TlsMapMutation::Delete mutations for the joiner's inbox id, so an external commit can pass validation while deleting or mutating the joiner's GROUP_MEMBERSHIP entry instead of inserting it. This violates the function's documented contract that "the joiner is authorized to insert their own entry — but ONLY their own" and leaves the tree membership and AppData membership out of sync. Consider rejecting Update and Delete mutations to ensure the joiner can only insert their own entry.
- for mutation in &delta.mutations {
- let key = match mutation {
- TlsMapMutation::Insert { key, .. }
- | TlsMapMutation::Update { key, .. }
- | TlsMapMutation::Delete { key } => key,
- };
- if key != &joiner_id {
- return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope);
- }
+ for mutation in &delta.mutations {
+ let key = match mutation {
+ TlsMapMutation::Insert { key, .. } => key,
+ _ => {
+ return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope);
+ }
+ };
+ if key != &joiner_id {
+ return Err(CommitValidationError::ExternalCommitAppDataUpdateOutOfScope);
+ }🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls/src/groups/validated_commit.rs around lines 2775-2784:
`enforce_app_data_update_scope` accepts `TlsMapMutation::Update` and `TlsMapMutation::Delete` mutations for the joiner's inbox id, so an external commit can pass validation while deleting or mutating the joiner's `GROUP_MEMBERSHIP` entry instead of inserting it. This violates the function's documented contract that "the joiner is authorized to insert their own entry — but ONLY their own" and leaves the tree membership and AppData membership out of sync. Consider rejecting `Update` and `Delete` mutations to ensure the joiner can only insert their own entry.
| /// Sugar wrapper over [`MlsGroup::set_external_commit_policy`] that | ||
| /// only flips the master switch and leaves the time-window controls | ||
| /// at their defaults (no automatic expiry / no staleness bound). | ||
| pub async fn set_allow_external_commit(&self, allowed: bool) -> Result<(), GroupError> { |
There was a problem hiding this comment.
🟡 Medium groups/mod.rs:2171
set_allow_external_commit(true) unconditionally generates new symmetric_key and external_group_id values via generate_symmetric_key() and generate_external_group_id(). When external commits are already enabled, this call rotates the invite material and invalidates previously issued QR codes, deep links, and uploaded blobs indexed by the old external_group_id. Consider reading the existing policy first and re-using its key/id when allow_external_commit is already true to preserve stability across calls.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls/src/groups/mod.rs around line 2171:
`set_allow_external_commit(true)` unconditionally generates new `symmetric_key` and `external_group_id` values via `generate_symmetric_key()` and `generate_external_group_id()`. When external commits are already enabled, this call rotates the invite material and invalidates previously issued QR codes, deep links, and uploaded blobs indexed by the old `external_group_id`. Consider reading the existing policy first and re-using its key/id when `allow_external_commit` is already `true` to preserve stability across calls.
| /// explicit nonce management is required (e.g. tests). Reusing a `(key, | ||
| /// nonce)` pair across distinct ciphertexts breaks ChaCha20Poly1305's security | ||
| /// argument. | ||
| pub fn wrap_group_info_with_nonce( |
There was a problem hiding this comment.
🟠 High invite/encrypted_group_info.rs:92
wrap_group_info_with_nonce encrypts plaintext with AEAD but writes epoch, group_state_hash, and expires_at_ns into EncryptedGroupInfoBlobV1 in cleartext without including them as AEAD associated data. An attacker can tamper with expires_at_ns (e.g., changing an expired timestamp to 0 or a far-future value) and unwrap_group_info will accept the modified blob because the AEAD tag only authenticates ciphertext. This allows expired QR invites to remain valid and breaks integrity of the envelope metadata. Consider including these fields as AEAD associated data so tampering is detected during decryption.
Also found in 1 other location(s)
crates/xmtp_mls_common/src/mls_ext/payload_encryption.rs:139
unwrap_payload_hpkesilently accepts a missing secondary ciphertext at line139by returningvec![]wheneverwrapped_secondary_payloadis empty. Because the presence of the second ciphertext is not authenticated anywhere inwrapped_payload, an attacker can stripwrapped_secondary_payloadin transit and the receiver will still accept the first plaintext as valid while seeing an empty secondary payload instead of a tamper error. Any protocol field carried in the secondary blob loses integrity protection for its presence.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls_common/src/invite/encrypted_group_info.rs around line 92:
`wrap_group_info_with_nonce` encrypts `plaintext` with AEAD but writes `epoch`, `group_state_hash`, and `expires_at_ns` into `EncryptedGroupInfoBlobV1` in cleartext without including them as AEAD associated data. An attacker can tamper with `expires_at_ns` (e.g., changing an expired timestamp to `0` or a far-future value) and `unwrap_group_info` will accept the modified blob because the AEAD tag only authenticates `ciphertext`. This allows expired QR invites to remain valid and breaks integrity of the envelope metadata. Consider including these fields as AEAD associated data so tampering is detected during decryption.
Also found in 1 other location(s):
- crates/xmtp_mls_common/src/mls_ext/payload_encryption.rs:139 -- `unwrap_payload_hpke` silently accepts a missing secondary ciphertext at line `139` by returning `vec![]` whenever `wrapped_secondary_payload` is empty. Because the presence of the second ciphertext is not authenticated anywhere in `wrapped_payload`, an attacker can strip `wrapped_secondary_payload` in transit and the receiver will still accept the first plaintext as valid while seeing an empty secondary payload instead of a tamper error. Any protocol field carried in the secondary blob loses integrity protection for its presence.
| let joiner_leaf = staged_commit | ||
| .update_path_leaf_node() | ||
| .ok_or(CommitValidationError::ExternalCommitMissingPathLeaf)?; | ||
| let joiner_inbox_id = inbox_id_from_credential(joiner_leaf.credential())?; |
There was a problem hiding this comment.
🔴 Critical groups/validated_commit.rs:986
from_external_commit extracts the joiner's inbox id from the credential (line 986) and checks that every Add proposal's credential string matches it (line 1000), but it never verifies that the joiner's installation key (or any added installation key) is actually registered in the association state for that inbox id. In contrast, from_staged_commit performs this check (lines 760-778) by looking up each participant's installation_id in the AssociationState at the relevant sequence_id. Without this lookup, a crafted external commit can claim any inbox id and join (or add installations) with signature keys that have no legitimate association to that inbox, bypassing XMTP's installation↔inbox binding. Consider adding the same association-state verification that from_staged_commit performs — this will require threading in a context (or equivalent identity-lookup capability) and making the function async.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls/src/groups/validated_commit.rs around line 986:
`from_external_commit` extracts the joiner's inbox id from the credential (line 986) and checks that every `Add` proposal's credential string matches it (line 1000), but it never verifies that the joiner's installation key (or any added installation key) is actually registered in the association state for that inbox id. In contrast, `from_staged_commit` performs this check (lines 760-778) by looking up each participant's `installation_id` in the `AssociationState` at the relevant `sequence_id`. Without this lookup, a crafted external commit can claim any inbox id and join (or add installations) with signature keys that have no legitimate association to that inbox, bypassing XMTP's installation↔inbox binding. Consider adding the same association-state verification that `from_staged_commit` performs — this will require threading in a `context` (or equivalent identity-lookup capability) and making the function async.
|
|
||
| let epoch = openmls_group.epoch().as_u64(); | ||
| let group_state_hash = openmls_group.epoch_authenticator().as_slice().to_vec(); | ||
| let blob_expires_at_ns = compute_blob_expiry( |
There was a problem hiding this comment.
🟢 Low groups/external_invite.rs:149
create_external_invite computes blob_expires_at_ns using xmtp_common::time::now_ns() before returning to the caller, but the caller uploads encrypted_group_info only after this method returns. When policy.expire_in_ns is used, the relative expiry is anchored to invite creation time instead of upload time, so any delay between creation and upload can make a freshly uploaded blob already stale or expired.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/xmtp_mls/src/groups/external_invite.rs around line 149:
`create_external_invite` computes `blob_expires_at_ns` using `xmtp_common::time::now_ns()` before returning to the caller, but the caller uploads `encrypted_group_info` only after this method returns. When `policy.expire_in_ns` is used, the relative expiry is anchored to invite creation time instead of upload time, so any delay between creation and upload can make a freshly uploaded blob already stale or expired.
…lpers) (#3670) New module `xmtp_mls_common::invite::payload` providing pure-protocol helpers around the `ExternalInvitePayload` proto: - `generate_key()` / `generate_nonce()` — CSPRNG-backed material generation. - `group_id_hash(&[u8])` — sha256(group_id) for the service lookup key. - `is_expired(payload, now_ns)` — ns-precision expiry check (zero = no expiry). - `validate_version(payload)` — rejects zero and future versions.
…ion (#3669) ## Summary Pure-refactor PR — no behavior change. - Moves `crates/xmtp_mls/src/groups/mls_ext/welcome_wrapper.rs` to `crates/xmtp_mls_common/src/mls_ext/payload_encryption.rs`. - Renames the four helpers from welcome-specific names to generic ones: - `wrap_welcome` -> `wrap_payload_hpke` - `unwrap_welcome` -> `unwrap_payload_hpke` - `wrap_welcome_symmetric` -> `wrap_payload_symmetric` - `unwrap_welcome_symmetric` -> `unwrap_payload_symmetric` - The HPKE label becomes a parameter (was hardcoded to the welcome label). Welcome callers still pass `xmtp_configuration::WELCOME_HPKE_LABEL`; the QR-invite path (downstream PR) will pass `XMTP_EXTERNAL_INVITE_LABEL`. - All call sites in xmtp_mls updated to import from the new path. ## Why this PR Sets up the encryption primitive used by the QR-invite (external commit) flow. The functions are already generic — they don't depend on welcome semantics. Lifting them to `xmtp_mls_common` lets the invite module use them without pulling in `xmtp_mls`. ## Preserved XWING codepoint compat (#3662) The relocated `wrap_payload_hpke` / `unwrap_payload_hpke` still call `wrapper_algorithm.to_hpke_config()`, which lives in `xmtp_id::WrapperAlgorithm` and pins `XWingMLKEM768Draft6` to `KemAlgorithm::XWingDraft06Obsolete` (codepoint `0x004D`) for wire compat with v1.9 / v1.10 clients. Workspace `Cargo.toml`'s `[patch.crates-io]` block pinning the xmtp hpke-rs fork is untouched. Verified post-rebase: `round_trip_xwing_hpke` test passes.
## Why
Creating an XMTP client required **two gRPC connections**. They were not
a message/payer or primary/secondary split — they were two
`tonic::Channel`s to the **same host**, wrapping the **same** client
type, built from the **same** backend config. The second ("sync")
connection was introduced on the assumption that a dedicated sync
channel buys isolation; that isolation was never realized (same
endpoint, no stream/concurrency separation), and the second client was
used by exactly two unary reads.
The redundancy cost every client an extra TCP connection + file
descriptor (the iOS SDK even kept a whole second client cache and dialed
the backend twice), with no measured benefit.
This collapses the design to a single connection.
> Note: the d14n payer/message split (`xmtp_api_d14n`) is a separate
internal concern and is **not** touched here.
## What changed
Commit-by-commit (each compiles + is self-contained):
1. **core** — remove `sync_api_client` field + `sync_api()` accessor
from `XmtpMlsLocalContext`/`XmtpSharedContext`; collapse
`ClientBuilder::api_clients(api, sync_api)` → `api_client(api)`; the two
`.sync_api()` readers (`query_group_messages`, `load_identity_updates`)
now use `.api()`.
2. **tests/benches** — every `.api_clients(a, b)` call site reduced to
`.api_client(a)`.
3. **apps** — cli + xmtp_debug build one client.
4. **node + wasm** — `create_client_inner` drops its second client
param; bindings build the backend once.
5. **mobile FFI** — **breaking**: `create_client` drops its second
`sync_api: Arc<XmtpApiClient>` argument (plus its tests/bench call
sites).
6. **sdks** — iOS + Android connect once, cache once, pass once. Android
`xmtpv3.kt` regenerated.
## Verification
- `just check`, `just wasm check`, `just lint` (rust), `just node lint`
— all green locally.
- `just android check` — green (bindings regenerated).
- Pre-existing markdown-lint failures in untouched repo docs
(`sdks/ios/README.md`, `VALIDATION.md`, `TEST_SCENARIOS.md`) were left
alone — not part of this change.
## ⚠️ Follow-up required before merge
- **iOS `xmtpv3.swift` must be regenerated on macOS.** The mobile FFI
signature changed, but the iOS xcframework + Swift glue can only be
regenerated on macOS/Xcode (`ios-xcframeworks-fast` has no Linux build),
so the generated `xmtpv3.swift` in this branch still shows the old 2-arg
signature. The hand-written `Client.swift` already targets the new 1-arg
signature. Run `./sdks/ios/dev/bindings` on a Mac then `just ios check`.
- Minor: `sdks/ios/dev/.setup` resolves the repo root via `git
rev-parse` with no jj fallback (Android's already has one), which breaks
iOS regen from a jj workspace.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will append its summary at the bottom of the
description. -->
> [!NOTE]
> ### Collapse two gRPC connections into one across all client bindings
> - Removes the dedicated sync API client from `ClientBuilder`,
`XmtpMlsLocalContext`, and the `XmtpSharedContext` trait in
[builder.rs](https://github.com/xmtp/libxmtp/pull/3721/files#diff-112b6dd965ad29739b36ccfb7f1283b1dbc20ec772591a4487a0898c6c5650f6)
and
[context.rs](https://github.com/xmtp/libxmtp/pull/3721/files#diff-ca0115cb0e326f7fa5c9bc660a4a1c1aff68c42a4c048eb79529bba4575aa909);
all network calls now route through a single API client.
> - Replaces the `.api_clients(api, sync_api)` builder method with
`.api_client(api)` across all call sites: mobile FFI, Node, WASM,
Android, iOS, and test utilities.
> - Operations previously routed through the sync client (group message
queries, identity update fetches) now use the primary API client.
> - Behavioral Change: the `create_client_does_not_hit_network` test
expectation for `get_identity_updates_v2` increases from 2 to 3 because
all identity-update reads are now counted on the single client.
>
> <!-- Macroscope's changelog starts here -->
> #### Changes since #3721 opened
>
> - Modified `FfiXmtpClient` to support graceful shutdown and worker
configuration [6f64938]
> - Updated root resolution logic in setup script to support multiple
version control systems [6f64938]
> <!-- Macroscope's changelog ends here -->
>
> <!-- Macroscope's review summary starts here -->
>
> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized
5ab010b.</sup>
> <!-- Macroscope's review summary ends here -->
>
> <!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->
… generic AppDataUpdate intent Defense-in-depth master switch + per-component declarative external committer authorization for the QR-invite flow (MLS External Commits, RFC 9420 §12.4.3.2). - ComponentId::EXTERNAL_COMMIT_POLICY = 0x800C in xmtp_mls_common. - Type-dispatch entry: ComponentType::Bytes for the new id. - crates/xmtp_mls/src/groups/external_commit_policy.rs (new file): - load_external_commit_policy(group) -> Option<ExternalCommitPolicyV1> - is_external_commit_allowed(group) -> bool - external_committer_permissions_for(group, id) -> Option<ComponentPermissions> - MlsGroup::set_external_commit_policy(policy) — granular setter writes the full ExternalCommitPolicyV1 via the generic AppDataUpdate intent (AppDataUpdateOp::Replace, since the component is full-replace Bytes-typed). - MlsGroup::set_allow_external_commit(bool) — sugar wrapper. Depends on L-0 (generic IntentKind::AppDataUpdate) for the intent machinery. No new IntentKind variant added — the generic path covers this and every future full-replace component setter. No bootstrap synth: absent EXTERNAL_COMMIT_POLICY = disabled default. First setter call creates the entry via AppDataUpdateOperation::Update upsert semantics. Zero risk of cross-client serialization drift.
… + encrypted GroupInfo)
e8ceed3 to
57e7e30
Compare
Summary
Adds the invite-creation half of the QR-invite flow.
MlsGroup::create_external_invite(opts) -> CreateExternalInviteOutputgroup_id_hash = sha256(group_id()), generates fresh symmetric key + nonce, exports a current-epoch GroupInfo (with ratchet tree bundled), wraps it viawrap_group_info_with_nonce, and returns serializedExternalInvitePayload+EncryptedGroupInfoBlobprotos.epochandgroup_state_hashfields are populated here (the caller has the MlsGroup; the pure-codec helpers in L-5 leave them zero).set_external_commit_policy).Stack
Test plan
Note
Add
MlsGroup::create_external_inviteto produce QR-invite payload and encrypted GroupInfoMlsGroup::create_external_invitein external_invite.rs that exports the current epochGroupInfo, encrypts it viawrap_group_info, and builds anExternalInvitePayloadgoverned by the group'sEXTERNAL_COMMIT_POLICY.MlsGroup::set_external_commit_policyandset_allow_external_commitin mod.rs to write policy into groupAppData, generating a symmetric key and external group ID when enabling.ValidatedCommit::from_external_commitin validated_commit.rs and a correspondingSender::NewMemberCommitpath in mls_sync.rs so external commits authored by non-members are validated against policy and processed.wrap_welcome/unwrap_welcomewith builder-stylewrap_payload_hpke/wrap_payload_symmetricAPIs in payload_encryption.rs, addingWELCOME_HPKE_LABELandXMTP_EXTERNAL_INVITE_LABELfor domain separation.sync_api_clientfromClientBuilder,XmtpMlsLocalContext, and all bindings; replaces dualapi_clients()with singleapi_client()and introducesWorkerConfigfor per-worker interval, jitter, and enablement control exposed to WASM, Node, iOS, and Android bindings.xmtp_loggingcrate with a platform-awareXmtpLoggingBuilder, runtimeLoggingHandle, OTLP telemetry support, and rolling-file logging for native targets; removes logging fromxmtp_common.sync_api_clientis a breaking API change for all bindings and any callers usingapi_clients();WorkerKind::Eventis removed from the enum.Macroscope summarized 57e7e30.