fix(xmtp_mls): forward db-disconnect signal through SyncSummary and sibling errors#3716
fix(xmtp_mls): forward db-disconnect signal through SyncSummary and sibling errors#3716insipx wants to merge 1 commit into
Conversation
…ibling error variants
|
Claude finished @insipx's task —— View job Code Review - June 3, 2026 23:07 UTCSummaryThis PR fixes a critical hot-loop bug where DB pool disconnects during sync operations were being misclassified, preventing the worker supervisor from reconnecting. The fix is well-designed, comprehensive, and properly tested. Positive Observations1. Complete Coverage of Disconnect Paths
2. No False Positives
3. Proper Forwarding Chain
4. Excellent Test Coverage (worker.rs:365-630)
5. Consistency with
Minor Observations
Potential Improvements (Optional)
Verdict✅ Approved - This is a high-quality fix for a critical production issue. The implementation is:
No blocking issues found. The PR is ready to merge. |
ApprovabilityVerdict: Approved Straightforward bug fix that propagates database disconnect signals through error wrapper types to prevent worker hot-loops. Changes are limited to error handling trait implementations with extensive test coverage, and the author owns all modified files. You can customize Macroscope's approvability policy. Learn more. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3716 +/- ##
==========================================
+ Coverage 84.03% 84.12% +0.08%
==========================================
Files 399 406 +7
Lines 59150 59407 +257
==========================================
+ Hits 49707 49976 +269
+ Misses 9443 9431 -12 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Summary
A dropped DB pool (
PoolNeedsConnection) that surfaced during a sync was being misclassified as not-a-disconnect, so the worker supervisor never dropped the pool to reconnect — it retried forever on a dead pool. The worst path (GroupError::SyncFailedToWait) is alsois_retryable == true, making it a true hot-loop.Root cause:
SyncSummary— the aggregate sync error — had noNeedsDbReconnectimpl at all. AnyPoolNeedsConnectionlanding in itspublish_errors/post_commit_errors/other/ per-messageprocess.erroredgot wrapped intoGroupError::Sync/SyncFailedToWait/DeviceSyncError::Sync/CommitLogError::SyncError, all of which hit_ => falseinneeds_db_reconnect. The disconnect signal was permanently buried.This is the same bug class the existing
disconnect_propagation_testswere written to prevent — the SyncSummary path just wasn't covered.Fix — forward the disconnect signal everywhere it was being dropped
SyncSummary: NeedsDbReconnect(new) — scans all four error sources, including the per-messageprocess.erroredfailures (whichis_retryabledeliberately skips, but which can still carry a pool drop).GroupMessageProcessingError: NeedsDbReconnect(new) — forwards the directly-wrapped DB variants (Storage/Db/Identity/Client/ClearPendingCommit) and the OpenMLS state-I/O wrappers (OpenMlsProcessMessage,OpenMlsProcessMessageWithAppData,MergeStagedCommit), which carry a drop asStorageError(SqlKeyStoreError::Connection(_))— these are the most likely per-message disconnect carriers.GroupError::needs_db_reconnect—Sync/SyncFailedToWaitnow forward toSyncSummary;SqlKeyStore(Connection(_))forwarded.TaskWorkerError::needs_db_reconnect— replaced the structuralGroup(GroupError::Storage)-only match (which missedDb/MlsStore/Sync) with properGroup(e) => e.needs_db_reconnect()/DeviceSync(e) => …forwarding.CommitLogError—SyncError/KeystoreError(Connection(_))forwarded.DeviceSyncError—Syncforwarded.Safety: no false positives
Every new arm bottoms out at
db_needs_connection(), which matches onlyPoolNeedsConnection(and the pre-existingPool(_)). A non-pool-drop connection error likeDisconnectInTransactionstays classifiedfalse, so a healthy pool is never torn down needlessly. Each new test includes a benign-error negative case to pin this.Test Plan
just check crate xmtp_mls— compiles cleanjust lint-rust— clippy / fmt / hakari cleanjust test v3 disconnect— 11/11 disconnect-propagation tests pass (3 new:sync_summary_forwards_disconnect,group_message_processing_error_forwards_openmls_disconnect,task_worker_error_forwards_disconnect)just test v3 worker— 31/31 worker tests pass (no regressions)Reviewed by an adversarial agent pass focused on false-positives, completeness, and
is_retryable/needs_db_reconnectconsistency; the one residual gap it found (OpenMLS wrapper variants) is fixed in this PR.🤖 Generated with Claude Code
Note
Forward db-disconnect signals through
SyncSummaryand sibling error typesSyncSummary::needs_db_reconnect()now scans all aggregated error sources (publish, post-commit, process, other) and returns true if any carry a dropped-pool signal.GroupError,CommitLogError, andDeviceSyncErrordelegate toSyncSummary::needs_db_reconnect()for theirSync/SyncFailedToWaitvariants.GroupMessageProcessingErrorgains a newNeedsDbReconnectimpl that checks OpenMLS storage errors, merge errors, and other storage-bearing variants.TaskWorkerErrornow delegates directly toe.needs_db_reconnect()forGroupandDeviceSyncvariants instead of only checking the innerStoragesub-variant.Macroscope summarized fae72a0.