fix(data): preserve merkle resume after preflight#94
Conversation
Preflight merkle uploads with the existing AlreadyStored quote signal, skip chunks already on the network, and fall back to single-node payment when the remaining chunk count is below the merkle threshold. Carry already-stored chunk counts through external-signer finalization so upload totals remain accurate.
Prefer valid cached merkle proofs when preflight quote checks fail so retries do not repay through the single-node path. Fall back external-signer merkle preparation to wave-batch on insufficient peers, and keep chunk contents aligned with merkle addresses before upload.
dirvine
left a comment
There was a problem hiding this comment.
Review: fix(data): preserve merkle resume after preflight
I have reviewed the full diff (+851/−216 across data.rs, file.rs, merkle.rs). All CI checks pass (Format, Clippy, Unit Tests ×2, E2E ×2, Merkle E2E ×2, Docs, Build ×2, Security Audit).
Strengths
Architecture: The preflight planner (plan_merkle_upload) slots correctly into the existing quote pipeline — reusing get_store_quotes and the AlreadyStored error path means zero new network protocol surface. The MerkleUploadPlan struct cleanly separates concerns: planning, filtering, and execution.
Correctness of the race window: Between preflight and payment, a chunks status can change, but the design handles both directions safely:
- Stored → not stored: payment proceeds normally, PUT succeeds.
- Not stored → stored: payment succeeds, network handles
AlreadyExistsgracefully on PUT.
Worst case is a small overpayment, never an underpayment.
Edge cases covered well:
- Empty
to_upload→ returns success immediately with correctchunks_stored: chunk_count - Threshold re-check after preflight → switches to single-node if remaining chunks drop below merkle threshold
- Cache invalidation → cached proofs validated against the actual remaining chunk set before reuse
- Length check in
merkle_upload_chunks→ prevents silent address/content desync
Order preservation: chunk_contents_for_upload_addresses uses VecDeque to handle duplicate-content chunks correctly while preserving address order. Good.
Progress reporting: The stored_offset parameter threads already-stored counts through to progress events, so UI progress bars stay accurate for the total file, not just the payment batch.
Minor suggestions (non-blocking)
-
chunk_contents_for_upload_addresses— HashMap overhead with many duplicates: If the same chunk content appears hundreds of times (e.g. all-zero padding), building a HashMap ofVecDeques is O(n) memory. Fine for typical file chunks, but worth noting if someone uploads a sparse file with many identical chunks. -
plan_merkle_uploadlogs atinfo!for every chunk in debug mode: The per-chunkdebug!logs are fine, but theinfo!("Checking {total_chunks} merkle chunks...")at the start and completion summary at the end are well-placed. No change needed. -
Public API surface:
PreparedUploadnow carriesalready_stored_addresses: Vec<[u8; 32]>andtotal_chunks: usize. These are needed for correct finalization, but make sure any bindings/wasm wrappers propagate them through to the finalFileUploadResult. The existing field docs in the struct are clear enough.
Verdict
This is a well-structured, correctly implemented optimization that meaningfully reduces user costs for re-upload scenarios. The resume-preservation logic is handled carefully (cache validity check before discard). I am happy to approve.
Approved ✅
Summary
This PR changes merkle uploads so payment is based only on chunks that still need to be stored.
Code changes:
AlreadyStoredresponses are treated as stored chunks, while normal quote responses become theto_uploadset for merkle payment.HashSetfiltering with an order-preserving content/address selector and adds a length check before merkle upload so chunk contents cannot silently desync from proof addresses.Verification
cargo fmt --all -- --checkcargo check --all-targets --all-featurescargo clippy --all-targets --all-features -- -D warningscargo test -p ant-core --lib