Skip to content

fix: retry failed program subscriptions#1268

Merged
thlorenz merged 10 commits into
masterfrom
thlorenz/failed-prog-subs-poison-future
Jun 5, 2026
Merged

fix: retry failed program subscriptions#1268
thlorenz merged 10 commits into
masterfrom
thlorenz/failed-prog-subs-poison-future

Conversation

@thlorenz
Copy link
Copy Markdown
Collaborator

@thlorenz thlorenz commented Jun 1, 2026

Summary

Fix program subscription bookkeeping so transient subscription failures remain retryable instead of being recorded as successful locally.

Details

magicblock-chainlink

  • Record SubMux program subscriptions only after the underlying fan-out subscription succeeds.
  • Keep the Laser program stream manager's previous subscribed-program set when an incremental stream update fails, allowing the same program subscription to be retried.
  • Add regression coverage for retrying failed program subscriptions in both SubMux and Laser stream-manager paths.
  • Extend existing test mocks with targeted failure hooks needed to validate retry behavior.

Summary by CodeRabbit

  • Bug Fixes

    • Corrected subscription state rollback so failed subscription updates no longer leave stale program IDs.
  • Tests

    • Added tests verifying retries and that subsequent attempts write updated subscription requests.
    • Enhanced test mocks to simulate and observe forced failures and count subscription/write attempts.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR improves subscription retry logic across program subscription paths by ensuring in-memory state remains consistent when subscription operations fail. Mock factories (MockStreamFactory and ChainPubsubClientMock) now support injecting transient failures on specific write and subscribe calls. StreamManager::add_program_subscription rolls back the subscribed programs set on update failure. SubMuxClient::subscribe_program delays recording the program ID until after the subscription task succeeds. Tests verify that retries after failures succeed with correct state and updated attempt counters.

Suggested reviewers

  • GabrielePicco
  • snawaz
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thlorenz/failed-prog-subs-poison-future

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@thlorenz
Copy link
Copy Markdown
Collaborator Author

thlorenz commented Jun 1, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@thlorenz
Copy link
Copy Markdown
Collaborator Author

thlorenz commented Jun 1, 2026

@coderabbitai resume

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-chainlink/src/submux/mod.rs`:
- Around line 1074-1078: The code currently calls .lock().expect("program_subs
lock poisoned") on program_subs which can panic; change this to handle
PoisonError and propagate it to the caller instead of panicking—acquire the
mutex with self.program_subs.lock() and map any PoisonError into your function's
Result (or a dedicated error variant) before calling .insert(program_id), e.g.,
use match or .map_err(|e| /* convert e */)? to convert the poison into an Err
return so insert is only called on a successful lock; update the function
signature to return Result if needed and follow the same error conversion
pattern used elsewhere for mutex poisoning.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7863f2f3-aa69-4f4c-98aa-83890882405b

📥 Commits

Reviewing files that changed from the base of the PR and between b2ab11b and bb32b7f.

📒 Files selected for processing (4)
  • magicblock-chainlink/src/remote_account_provider/chain_laser_actor/mock.rs
  • magicblock-chainlink/src/remote_account_provider/chain_laser_actor/stream_manager.rs
  • magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs
  • magicblock-chainlink/src/submux/mod.rs

Comment thread magicblock-chainlink/src/submux/mod.rs Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

✅ Actions performed

Reviews resumed.

@thlorenz thlorenz marked this pull request as ready for review June 1, 2026 09:57
@thlorenz thlorenz requested a review from GabrielePicco June 1, 2026 09:58
* master:
  chore: remove per-program fetch metric (#1261)
  Allow same-slot delegated clone override (#1269)
Copy link
Copy Markdown
Collaborator

@GabrielePicco GabrielePicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, but I believe this introduce a reconnect race that can leave a reconnected client missing the program subscription. Should be fixed before merging

Comment thread magicblock-chainlink/src/submux/mod.rs Outdated
thlorenz and others added 2 commits June 4, 2026 13:47
* master:
  chore: simplify, rename out_of_order_slot and add a comment (#1284)
  fix: execute post-delegation actions after clone (#1278)
  Handle oversized single-stage committor transactions (#1277)
  Reduce committor RPC confirmation calls (#1271)
  fix: preserve streams on optimize failure (#1273)
  perf: reduce TableMania ALT polling (#1276)
  feat: bundle session-key program in mb-test-validator (#1275)
  fix: limit pending intent recovery window (#1274)
@thlorenz thlorenz force-pushed the thlorenz/failed-prog-subs-poison-future branch from 2fb5dd2 to fd10fed Compare June 4, 2026 05:56
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-chainlink/src/remote_account_provider/chain_laser_actor/mock.rs (1)

205-208: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid panic in mock subscribe when write-failure injection is enabled.

Line 207 unwraps handle.write(request). With fail_next_handle_writes(n), this path can now panic instead of returning an error, which breaks intended retry/failure-path testing.

Suggested fix
-        handle.write(request).await.unwrap();
+        handle.write(request).await.map_err(|err| {
+            RemoteAccountProviderError::GrpcSubscriptionUpdateFailed(
+                "mock subscribe".to_string(),
+                0,
+                err.to_string(),
+            )
+        })?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/src/remote_account_provider/chain_laser_actor/mock.rs`
around lines 205 - 208, The mock subscribe currently uses
handle.write(request).await.unwrap(), which can panic when
fail_next_handle_writes(n) injects write failures; change this to propagate the
write error instead of panicking by handling the Result from
handle.write(request).await and returning Err(...) (or forwarding with ? ) from
the subscribe implementation so tests exercise retry/failure paths; update the
subscribe/mock function around the handle.write(request).await call (and any
surrounding return type) to return a proper Result and propagate the write error
rather than calling unwrap.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@magicblock-chainlink/src/remote_account_provider/chain_laser_actor/mock.rs`:
- Around line 205-208: The mock subscribe currently uses
handle.write(request).await.unwrap(), which can panic when
fail_next_handle_writes(n) injects write failures; change this to propagate the
write error instead of panicking by handling the Result from
handle.write(request).await and returning Err(...) (or forwarding with ? ) from
the subscribe implementation so tests exercise retry/failure paths; update the
subscribe/mock function around the handle.write(request).await call (and any
surrounding return type) to return a proper Result and propagate the write error
rather than calling unwrap.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f90637b7-d9b4-4533-868f-72996fd709eb

📥 Commits

Reviewing files that changed from the base of the PR and between bb32b7f and 6792bb0.

📒 Files selected for processing (3)
  • magicblock-chainlink/src/remote_account_provider/chain_laser_actor/mock.rs
  • magicblock-chainlink/src/remote_account_provider/chain_laser_actor/stream_manager.rs
  • magicblock-chainlink/src/submux/mod.rs

Copy link
Copy Markdown
Collaborator

@GabrielePicco GabrielePicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@thlorenz thlorenz merged commit 5fbbbb6 into master Jun 5, 2026
37 checks passed
@thlorenz thlorenz deleted the thlorenz/failed-prog-subs-poison-future branch June 5, 2026 02:57
thlorenz added a commit that referenced this pull request Jun 5, 2026
* master:
  fix: reject short account responses (#1290)
  fix: wait for pubsub listeners before reconnect (#1253)
  fix: retry failed program subscriptions (#1268)
  release: 0.12.0 (#1299)
  Ignore compute unit price in processor fees (#1298)
  Recover recent pending intents on restart (#1296)
  chore: adjust log level (#1297)
  release: v0.11.4 (#1292)
  fix(scheduler): remove block subscription in the scheduler (#1293)
  fix(committor): race-condition on cleanup (#1291)
  Handle ALT extend invalid instruction data (#1287)
  fix: use provided compute limits instead of defaults (#1289)
  feat: snapshot accountsdb even in the replica mode (#1282)
  feat: added vrf ephemeral test queue, delegation record and metadata for mb-test-validator (#1281)
  fix: use wire size (1232), not encoded size (1644), for tx fit checks! (#1285)
  chore: simplify, rename out_of_order_slot and add a comment (#1284)
  fix: execute post-delegation actions after clone (#1278)
  Handle oversized single-stage committor transactions (#1277)
  Reduce committor RPC confirmation calls (#1271)
  fix: preserve streams on optimize failure (#1273)
thlorenz added a commit that referenced this pull request Jun 5, 2026
* master:
  fix: reject short account responses (#1290)
  fix: wait for pubsub listeners before reconnect (#1253)
  fix: retry failed program subscriptions (#1268)
  release: 0.12.0 (#1299)
  Ignore compute unit price in processor fees (#1298)
  Recover recent pending intents on restart (#1296)
  chore: adjust log level (#1297)
  release: v0.11.4 (#1292)
  fix(scheduler): remove block subscription in the scheduler (#1293)
  fix(committor): race-condition on cleanup (#1291)
  Handle ALT extend invalid instruction data (#1287)
  fix: use provided compute limits instead of defaults (#1289)
  feat: snapshot accountsdb even in the replica mode (#1282)
  feat: added vrf ephemeral test queue, delegation record and metadata for mb-test-validator (#1281)
  fix: use wire size (1232), not encoded size (1644), for tx fit checks! (#1285)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants