Skip to content

fix(xmtp_mls): forward db-disconnect signal through SyncSummary and sibling errors#3716

Open
insipx wants to merge 1 commit into
mainfrom
insipx/disconnect-classification
Open

fix(xmtp_mls): forward db-disconnect signal through SyncSummary and sibling errors#3716
insipx wants to merge 1 commit into
mainfrom
insipx/disconnect-classification

Conversation

@insipx
Copy link
Copy Markdown
Contributor

@insipx insipx commented Jun 3, 2026

Summary

A dropped DB pool (PoolNeedsConnection) that surfaced during a sync was being misclassified as not-a-disconnect, so the worker supervisor never dropped the pool to reconnect — it retried forever on a dead pool. The worst path (GroupError::SyncFailedToWait) is also is_retryable == true, making it a true hot-loop.

Root cause: SyncSummary — the aggregate sync error — had no NeedsDbReconnect impl at all. Any PoolNeedsConnection landing in its publish_errors / post_commit_errors / other / per-message process.errored got wrapped into GroupError::Sync / SyncFailedToWait / DeviceSyncError::Sync / CommitLogError::SyncError, all of which hit _ => false in needs_db_reconnect. The disconnect signal was permanently buried.

This is the same bug class the existing disconnect_propagation_tests were written to prevent — the SyncSummary path just wasn't covered.

Fix — forward the disconnect signal everywhere it was being dropped

  • SyncSummary: NeedsDbReconnect (new) — scans all four error sources, including the per-message process.errored failures (which is_retryable deliberately skips, but which can still carry a pool drop).
  • GroupMessageProcessingError: NeedsDbReconnect (new) — forwards the directly-wrapped DB variants (Storage/Db/Identity/Client/ClearPendingCommit) and the OpenMLS state-I/O wrappers (OpenMlsProcessMessage, OpenMlsProcessMessageWithAppData, MergeStagedCommit), which carry a drop as StorageError(SqlKeyStoreError::Connection(_)) — these are the most likely per-message disconnect carriers.
  • GroupError::needs_db_reconnectSync / SyncFailedToWait now forward to SyncSummary; SqlKeyStore(Connection(_)) forwarded.
  • TaskWorkerError::needs_db_reconnect — replaced the structural Group(GroupError::Storage)-only match (which missed Db/MlsStore/Sync) with proper Group(e) => e.needs_db_reconnect() / DeviceSync(e) => … forwarding.
  • CommitLogErrorSyncError / KeystoreError(Connection(_)) forwarded.
  • DeviceSyncErrorSync forwarded.

Safety: no false positives

Every new arm bottoms out at db_needs_connection(), which matches only PoolNeedsConnection (and the pre-existing Pool(_)). A non-pool-drop connection error like DisconnectInTransaction stays classified false, so a healthy pool is never torn down needlessly. Each new test includes a benign-error negative case to pin this.

Test Plan

  • just check crate xmtp_mls — compiles clean
  • just lint-rust — clippy / fmt / hakari clean
  • just test v3 disconnect — 11/11 disconnect-propagation tests pass (3 new: sync_summary_forwards_disconnect, group_message_processing_error_forwards_openmls_disconnect, task_worker_error_forwards_disconnect)
  • just test v3 worker — 31/31 worker tests pass (no regressions)

Reviewed by an adversarial agent pass focused on false-positives, completeness, and is_retryable/needs_db_reconnect consistency; the one residual gap it found (OpenMLS wrapper variants) is fixed in this PR.

🤖 Generated with Claude Code

Note

Forward db-disconnect signals through SyncSummary and sibling error types

  • SyncSummary::needs_db_reconnect() now scans all aggregated error sources (publish, post-commit, process, other) and returns true if any carry a dropped-pool signal.
  • GroupError, CommitLogError, and DeviceSyncError delegate to SyncSummary::needs_db_reconnect() for their Sync/SyncFailedToWait variants.
  • GroupMessageProcessingError gains a new NeedsDbReconnect impl that checks OpenMLS storage errors, merge errors, and other storage-bearing variants.
  • TaskWorkerError now delegates directly to e.needs_db_reconnect() for Group and DeviceSync variants instead of only checking the inner Storage sub-variant.
  • New and updated tests in worker.rs assert propagation through all affected error wrappers.

Macroscope summarized fae72a0.

@insipx insipx requested a review from a team as a code owner June 3, 2026 23:06
@claude
Copy link
Copy Markdown

claude Bot commented Jun 3, 2026

Claude finished @insipx's task —— View job


Code Review - June 3, 2026 23:07 UTC

Summary

This PR fixes a critical hot-loop bug where DB pool disconnects during sync operations were being misclassified, preventing the worker supervisor from reconnecting. The fix is well-designed, comprehensive, and properly tested.

Positive Observations

1. Complete Coverage of Disconnect Paths

  • SyncSummary::needs_db_reconnect (summary.rs:33-58) correctly scans all four error sources: publish_errors, post_commit_errors, other, and critically process.errored — the last was the missing piece that existing tests didn't cover
  • GroupMessageProcessingError::needs_db_reconnect (mls_sync.rs:270-303) now forwards both direct DB variants AND the OpenMLS wrapper variants (OpenMlsProcessMessage, OpenMlsProcessMessageWithAppData, MergeStagedCommit) which carry disconnects as StorageError(SqlKeyStoreError::Connection(_))
  • The OpenMLS wrapper forwarding is crucial — these are the most likely per-message disconnect carriers during state I/O

2. No False Positives

  • Every new needs_db_reconnect arm bottoms out at db_needs_connection(), which only matches PoolNeedsConnection (and pre-existing Pool(_))
  • Helper function sql_disconnect (mls_sync.rs:284-286) ensures only connection errors that represent a dropped pool trigger reconnection
  • Non-pool connection errors like DisconnectInTransaction correctly stay classified as false

3. Proper Forwarding Chain

  • GroupError::needs_db_reconnect (error.rs:645-667): Now forwards Sync/SyncFailedToWait to SyncSummary and handles SqlKeyStore(Connection(_))
  • TaskWorkerError::needs_db_reconnect (tasks.rs:42-58): Replaced structural matching with proper delegation to Group(e) => e.needs_db_reconnect() and DeviceSync(e) => e.needs_db_reconnect(), covering all nested disconnect paths
  • CommitLogError::needs_db_reconnect (commit_log.rs:139-168): Forwards SyncError, GroupError, nested FailedToSendReadd/FailedReadds, and KeystoreError(Connection(_))
  • DeviceSyncError::needs_db_reconnect (device_sync/mod.rs:204-220): Forwards all wrapped variants including Sync

4. Excellent Test Coverage (worker.rs:365-630)

  • 11 total tests, 3 new: sync_summary_forwards_disconnect, group_message_processing_error_forwards_openmls_disconnect, task_worker_error_forwards_disconnect
  • Each test includes negative cases (benign errors) to prevent false positives
  • Tests cover the exact bug scenario: disconnect in SyncSummaryGroupError::Sync → worker loop

5. Consistency with is_retryable

  • The PR author's claim that is_retryable and needs_db_reconnect are orthogonal is correct — a pool drop should trigger reconnection regardless of retry classification
  • However, the combination of SyncFailedToWait being is_retryable == true AND needs_db_reconnect == false (pre-fix) was indeed a hot-loop scenario — now fixed

Minor Observations

  1. Documentation Quality: The inline comments explaining the forwarding logic are excellent (e.g., summary.rs:34-40, mls_sync.rs:271-283). These will help future maintainers understand why each variant is matched.

  2. Conservative Approach: The fix only adds forwarding where a disconnect can actually occur, rather than blindly forwarding all nested errors. This prevents unnecessary pool teardowns.

  3. Test Organization: The disconnect propagation tests are well-organized in a dedicated module (worker.rs:365-630) and use helper functions (disconnect_storage(), benign_storage()) for clarity.

Potential Improvements (Optional)

  1. Test Assertion Messages: Some assertions could benefit from more descriptive failure messages (e.g., worker.rs:516, 520, 526), though the existing messages are adequate.

  2. Code Duplication: The sql_disconnect helper in GroupMessageProcessingError::needs_db_reconnect (mls_sync.rs:284-286) could potentially be extracted to a shared utility, but this is minor given it's only 3 lines.

Verdict

✅ Approved - This is a high-quality fix for a critical production issue. The implementation is:

  • Correct: Properly identifies all disconnect paths without false positives
  • Complete: Covers all four SyncSummary error sources and all OpenMLS wrapper variants
  • Well-tested: 11/11 tests passing, including 3 new tests for the exact bug scenario
  • Maintainable: Excellent inline documentation and clear forwarding logic

No blocking issues found. The PR is ready to merge.


@macroscopeapp
Copy link
Copy Markdown
Contributor

macroscopeapp Bot commented Jun 3, 2026

Approvability

Verdict: Approved

Straightforward bug fix that propagates database disconnect signals through error wrapper types to prevent worker hot-loops. Changes are limited to error handling trait implementations with extensive test coverage, and the author owns all modified files.

You can customize Macroscope's approvability policy. Learn more.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 90.43478% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.12%. Comparing base (2934d52) to head (fae72a0).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
crates/xmtp_mls/src/groups/mls_sync.rs 64.70% 6 Missing ⚠️
crates/xmtp_mls/src/groups/commit_log.rs 33.33% 2 Missing ⚠️
crates/xmtp_mls/src/groups/error.rs 33.33% 2 Missing ⚠️
crates/xmtp_mls/src/worker/tasks.rs 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3716      +/-   ##
==========================================
+ Coverage   84.03%   84.12%   +0.08%     
==========================================
  Files         399      406       +7     
  Lines       59150    59407     +257     
==========================================
+ Hits        49707    49976     +269     
+ Misses       9443     9431      -12     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant