Skip to content

fix(data): preserve merkle resume after preflight#94

Open
mickvandijke wants to merge 4 commits into
mainfrom
fix/merkle-preflight-resume
Open

fix(data): preserve merkle resume after preflight#94
mickvandijke wants to merge 4 commits into
mainfrom
fix/merkle-preflight-resume

Conversation

@mickvandijke
Copy link
Copy Markdown
Contributor

@mickvandijke mickvandijke commented May 21, 2026

Summary

This PR changes merkle uploads so payment is based only on chunks that still need to be stored.

Code changes:

  • Adds a merkle upload preflight planner that probes each chunk through the existing quote path. AlreadyStored responses are treated as stored chunks, while normal quote responses become the to_upload set for merkle payment.
  • If every chunk is already stored, upload returns successfully without fetching merkle quotes, submitting payment, or PUTing chunks again.
  • If only some chunks are already stored, merkle payment/proofs are built only for the remaining chunks, and upload progress/final counts include the skipped stored chunks.
  • Re-runs the existing payment-mode threshold after preflight. In Auto mode, if the remaining chunks drop below the merkle threshold, the upload switches to single-node payment for only those remaining chunks. Forced Merkle still follows the existing Merkle-mode minimum.
  • Preserves merkle resume behavior by loading cached merkle receipts before preflight fallback decisions. If preflight fails but cached proofs cover the file, the upload resumes with cached proofs instead of paying again through the single-node path.
  • Updates external-signer file preparation to use the same preflight behavior and to fall back to wave-batch preparation when merkle candidate collection has insufficient peers.
  • Replaces address HashSet filtering with an order-preserving content/address selector and adds a length check before merkle upload so chunk contents cannot silently desync from proof addresses.

Verification

  • cargo fmt --all -- --check
  • cargo check --all-targets --all-features
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test -p ant-core --lib

Preflight merkle uploads with the existing AlreadyStored quote signal, skip chunks already on the network, and fall back to single-node payment when the remaining chunk count is below the merkle threshold.

Carry already-stored chunk counts through external-signer finalization so upload totals remain accurate.
Prefer valid cached merkle proofs when preflight quote checks fail so retries do not repay through the single-node path.

Fall back external-signer merkle preparation to wave-batch on insufficient peers, and keep chunk contents aligned with merkle addresses before upload.
Copy link
Copy Markdown

@dirvine dirvine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix(data): preserve merkle resume after preflight

I have reviewed the full diff (+851/−216 across data.rs, file.rs, merkle.rs). All CI checks pass (Format, Clippy, Unit Tests ×2, E2E ×2, Merkle E2E ×2, Docs, Build ×2, Security Audit).

Strengths

Architecture: The preflight planner (plan_merkle_upload) slots correctly into the existing quote pipeline — reusing get_store_quotes and the AlreadyStored error path means zero new network protocol surface. The MerkleUploadPlan struct cleanly separates concerns: planning, filtering, and execution.

Correctness of the race window: Between preflight and payment, a chunks status can change, but the design handles both directions safely:

  • Stored → not stored: payment proceeds normally, PUT succeeds.
  • Not stored → stored: payment succeeds, network handles AlreadyExists gracefully on PUT.
    Worst case is a small overpayment, never an underpayment.

Edge cases covered well:

  • Empty to_upload → returns success immediately with correct chunks_stored: chunk_count
  • Threshold re-check after preflight → switches to single-node if remaining chunks drop below merkle threshold
  • Cache invalidation → cached proofs validated against the actual remaining chunk set before reuse
  • Length check in merkle_upload_chunks → prevents silent address/content desync

Order preservation: chunk_contents_for_upload_addresses uses VecDeque to handle duplicate-content chunks correctly while preserving address order. Good.

Progress reporting: The stored_offset parameter threads already-stored counts through to progress events, so UI progress bars stay accurate for the total file, not just the payment batch.

Minor suggestions (non-blocking)

  1. chunk_contents_for_upload_addresses — HashMap overhead with many duplicates: If the same chunk content appears hundreds of times (e.g. all-zero padding), building a HashMap of VecDeques is O(n) memory. Fine for typical file chunks, but worth noting if someone uploads a sparse file with many identical chunks.

  2. plan_merkle_upload logs at info! for every chunk in debug mode: The per-chunk debug! logs are fine, but the info!("Checking {total_chunks} merkle chunks...") at the start and completion summary at the end are well-placed. No change needed.

  3. Public API surface: PreparedUpload now carries already_stored_addresses: Vec<[u8; 32]> and total_chunks: usize. These are needed for correct finalization, but make sure any bindings/wasm wrappers propagate them through to the final FileUploadResult. The existing field docs in the struct are clear enough.

Verdict

This is a well-structured, correctly implemented optimization that meaningfully reduces user costs for re-upload scenarios. The resume-preservation logic is handled carefully (cache validity check before discard). I am happy to approve.

Approved ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants