Skip to content

refactor: migrate to asap_sketchlib module-restructure layout#335

Closed
GordonYuanyc wants to merge 13 commits into
mainfrom
refactor/sketchlib-upgrade
Closed

refactor: migrate to asap_sketchlib module-restructure layout#335
GordonYuanyc wants to merge 13 commits into
mainfrom
refactor/sketchlib-upgrade

Conversation

@GordonYuanyc

Copy link
Copy Markdown
Contributor

Summary

  • Bump asap_sketchlib pin to the refactor/module-restructure branch and adapt to the new module layout.
  • Accumulators now hold sketches::* types directly instead of going through the wire-format facade; serialization is centralized in a new sketchlib_runtime module.
  • Wire byte shape is preserved (CountMinSketch / KLL DTOs unchanged on the wire).

Status

Draft — pinned to a branch in asap_sketchlib, not a tagged release. Holding as draft until the upstream branch lands on its target.

Verification

  • cargo build clean
  • cargo test -p query_engine_rust sketch → 41 passed, 0 failed
  • cargo clippy -p query_engine_rust --lib clean apart from 2 pre-existing needless_range_loop style warnings in sketchlib_runtime.rs

Test plan

  • CI green
  • Re-pin asap_sketchlib to a merged commit on its target branch before marking ready
  • (Optional) clear the two clippy style warnings in sketchlib_runtime.rs

🤖 Generated with Claude Code

GordonYuanyc and others added 4 commits May 11, 2026 20:52
asap_sketchlib reorganized in the refactor/module-restructure branch:
- Wire-format-aligned types (CountMinSketch, CountSketch, KllSketch,
  HydraKllSketch, SetAggregator, DeltaResult, CmsHeapItem, etc.) moved
  out of `sketches::*` submodules and now live in `wrapper::*`,
  re-exported at the crate root.
- `sketches::delta_set_aggregator::{serialize,deserialize}_msgpack` shims
  were dropped in favor of the `MessagePackCodec` trait on `DeltaResult`.

Changes:
- Pin asap_sketchlib to the refactor/module-restructure branch.
- Switch all `use asap_sketchlib::sketches::*` imports to the
  crate-root re-exports (`use asap_sketchlib::CountMinSketch`, etc.).
- `delta_set_aggregator_accumulator` uses
  `DeltaResult::{from_msgpack,to_msgpack}` directly through the
  `MessagePackCodec` trait instead of the removed module-level shims.

No behavior change — the wire format and runtime semantics of the
underlying sketches are identical (locked in by sketchlib-go parity
goldens in asap_sketchlib's tests/sketches_go_parity_probe.rs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the asap_sketchlib `bfe8f37` cleanup that deletes
`mod wrapper` and moves all wire-format-aligned sketch types
(`CountMinSketch`, `CountSketch`, `KllSketch`, `HllSketch`,
`DdSketch`, `HydraKllSketch`, `CountMinSketchWithHeap`,
`SetAggregator`, `DeltaResult`) into
`asap_sketchlib::message_pack_format::portable::*`. The crate-root
re-exports that this crate imports are preserved unchanged, so no
source changes are needed on this side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CountMinSketchAccumulator and DatasketchesKLLAccumulator previously
held the wire-format facade types (`asap_sketchlib::CountMinSketch`,
`asap_sketchlib::KllSketch`), each of which is a thin shim around an
underlying `sketches::*` runtime sketch. With Go byte parity proven
for the underlying types in `asap_sketchlib::tests::sketches_go_parity_probe`,
the facade layer is no longer required.

- New `precompute_operators::sketchlib_runtime` module exposes thin
  adapters (`cms_*` / `kll_*`) that translate the accumulator surface
  (string keys, `Vec<Vec<f64>>` matrices, Go-compatible MessagePack
  envelopes) onto `sketches::CountMin<Vector2D<f64>, FastPath,
  DefaultXxHasher>` and `sketches::KLL<f64>` directly.
- `CountMinSketchAccumulator.inner` becomes
  `sketches::CountMin<.., FastPath>`; `DatasketchesKLLAccumulator.inner`
  becomes `sketches::KLL<f64>`. JSON / msgpack / merge paths route
  through the new helpers.
- Wire format stays unchanged: the `CountMinSketchWire { sketch,
  row_num, col_num }` and `KllSketchData { k, sketch_bytes }`
  envelopes are still emitted, just constructed inline instead of
  through the facade's `to_msgpack`.
- A handful of `inner.k` field accesses change to `inner.k()` (KLL's
  `k` is private in `sketches::*` and exposed via accessor).
- Bump `asap_sketchlib` pin to `aea7d05` for the new
  `sketches::KLL::k()` accessor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@GordonYuanyc GordonYuanyc marked this pull request as ready for review May 12, 2026 04:48
GordonYuanyc and others added 3 commits May 12, 2026 16:31
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The refactor/module-restructure branch has been merged and deleted on
asap_sketchlib. Point at the default branch and lock to current main
HEAD so a future cargo fetch doesn't fail on a missing branch ref.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zzylol
zzylol previously approved these changes May 15, 2026

@milindsrivastava1997 milindsrivastava1997 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cms_from_matrix doesn't validate that rows/cols match the matrix shape (sketchlib_runtime.rs:61-70)

rows/cols come from the JSON/binary envelope while matrix is parsed separately. If they mismatch (malformed payload), from_fn silently uses the envelope dimensions and pads/truncates with 0.0 — invisible corruption. A debug_assert! or early return error at the call sites in deserialize_from_json and deserialize_from_bytes would catch this.

@milindsrivastava1997 milindsrivastava1997 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cms_to_msgpack swallows serialization errors (sketchlib_runtime.rs:82-86)

rmp_serde::to_vec(&wire).unwrap_or_default() returns an empty Vec<u8> on failure. The caller serialize_to_bytes propagates that silently, so a downstream deserializer will fail with a confusing "buffer too short" error instead of pointing at the actual failure site. Change the return type to Result or use expect — silent empty bytes are worse than a panic.

@milindsrivastava1997 milindsrivastava1997 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regression test context comments removed (count_min_sketch_accumulator.rs, test_update_and_query_use_same_key_encoding)

The removed multi-line comment explained a real past bug — key encoding mismatch between _update and query_key that caused silent lookup failures. That context is exactly what makes the test legible to a future reader wondering why this edge case is tested at all. Worth restoring.

@milindsrivastava1997 milindsrivastava1997 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GordonYuanyc some review comments from Claude, pls check

@GordonYuanyc GordonYuanyc marked this pull request as draft May 19, 2026 18:01
Address review feedback on PR #335:

- cms_from_matrix now returns Result and rejects a matrix whose shape
  disagrees with the envelope's rows/cols instead of silently padding or
  truncating with 0.0 (invisible corruption). Propagated through
  deserialize_from_json, deserialize_from_bytes and cms_from_msgpack.
- cms_to_msgpack uses expect instead of unwrap_or_default, so a
  serialization bug panics loudly rather than emitting empty bytes that
  surface downstream as a misleading "buffer too short".
- Restore the regression-test context comments documenting the past
  key-encoding mismatch in test_update_and_query_use_same_key_encoding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@GordonYuanyc GordonYuanyc marked this pull request as ready for review May 23, 2026 05:31
GordonYuanyc and others added 2 commits May 23, 2026 05:35
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cms_from_matrix already returns the right Result shape; the extra Ok/?
trips clippy::needless_question_mark under -D warnings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@GordonYuanyc

Copy link
Copy Markdown
Contributor Author

Thank you for the suggestion! Updated!

@milindsrivastava1997 milindsrivastava1997 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overview

This PR adapts asap-query-engine to the new asap_sketchlib module layout, where types that were previously accessed via deep paths (sketches::countminsketch::CountMinSketch, sketches::kll::KllSketch, etc.) are now re-exported at the crate root. A new sketchlib_runtime shim module is introduced to centralize wire-format encoding/decoding and to hold the concrete generic type aliases (RuntimeCountMin, RuntimeKll). The wire byte format is documented as unchanged. The PR is in draft, pinned to a non-merged upstream branch.


Bugs / Correctness

1. Silent failure in kll_to_msgpack vs. panic in cms_to_msgpack (inconsistency + latent bug)

sketchlib_runtime.rs:937

// cms_to_msgpack — correct:
rmp_serde::to_vec(&wire).expect("CountMinSketchWire msgpack serialization is infallible")

// kll_to_msgpack — wrong:
rmp_serde::to_vec(&wire).unwrap_or_default()

The CMS path correctly panics because a well-formed struct has no unrepresentable state. The same reasoning applies to KllSketchData. Using unwrap_or_default() for KLL silently produces empty bytes, which will be fed downstream and surface as a misleading "buffer too short" error at the reader — the exact failure mode the CMS comment warns against.

Also affects kll_sketch_bytes (line 927): if sk.serialize_to_bytes() fails, the JSON output becomes {"sketch": ""} — silently corrupted data.

Fix: Replace both with .expect("KllSketchData msgpack serialization is infallible").


2. _update on CountMinSketchAccumulator is now dead code

count_min_sketch_accumulator.rs:189

fn _update(&mut self, key: &KeyByLabelValues, value: f64) {
    cms_update(&mut self.inner, &key.to_semicolon_str(), value);
}

accumulator_factory.rs:112 now calls sketchlib_runtime::cms_update directly, bypassing _update. The method is private with no other callers visible in this diff. If confirmed dead, it should be removed.


3. cms_update zero/negative guard — behavioral change to verify

sketchlib_runtime.rs:801-804

pub fn cms_update(sk: &mut RuntimeCountMin, key: &str, value: f64) {
    if value <= 0.0 {
        return;
    }
    ...
}

The old CountMinSketch::update path did not have this explicit guard in the QE layer. If the upstream CountMin::insert_many accepts zero or negative values, this guard silently drops those updates and changes semantics. Needs a comment explaining the invariant, or a test verifying this was always the intended contract.


4. unsafe impl Send/Sync on DatasketchesKLLAccumulator may now be unnecessary

datasketches_kll_accumulator.rs:542-543

unsafe impl Send for DatasketchesKLLAccumulator {}
unsafe impl Sync for DatasketchesKLLAccumulator {}

The original unsafe impls were needed because KllSketch wrapped a C++ library. RuntimeKll = KLL<f64> is a pure-Rust type. If the compiler would auto-derive Send + Sync for it, these are unnecessary boilerplate (though harmless). Worth checking with static_assertions::assert_impl_all!(RuntimeKll: Send, Sync) — if that compiles, the unsafe impls can be dropped.


Code Quality

5. Abstraction boundary is incomplete — accumulator code still reaches into inner

The sketchlib_runtime shim is supposed to isolate the upstream sketch internals, but callers still access inner.k(), inner.count(), inner.rows(), inner.cols() directly across multiple accumulator files. Either:

  • Complete the isolation: add kll_count(), kll_k(), cms_rows(), cms_cols() helpers to sketchlib_runtime, or
  • Drop the shim abstraction and just use the new module-path imports directly.

The current half-measure means future upstream API changes still break N callsites across multiple files.


6. Error type inconsistency across sketchlib_runtime functions

  • cms_merge_refs, kll_merge_refsBox<dyn std::error::Error + Send + Sync>
  • cms_from_matrix, cms_from_msgpack, kll_from_msgpackBox<dyn std::error::Error>

Pick one convention for the whole module. The Send + Sync variant is safer for async contexts.


7. cms_matrix silent zero-fill on storage.get() returning None

sketchlib_runtime.rs:818-821

if let Some(v) = storage.get(r, c) {
    *cell = *v;
}
// else: leaves 0.0 silently

If storage.get() can return None for valid (r, c) within the declared dimensions (e.g. sparse storage), this is a silent correctness issue. If it cannot, unwrap() / expect() would be more appropriate.


Test Coverage

8. No unit tests for error paths in merge functions

kll_merge_refs checks for k mismatch and cms_merge_refs checks for dimension mismatch — both return errors. Neither error path has a test in this diff. These are the failure modes most likely to surface in prod; each deserves a dedicated test case.

9. Go-wire compatibility test pointer missing

The PR notes asap_sketchlib::tests::sketches_go_parity_probe covers cross-language byte parity upstream. A comment in sketchlib_runtime.rs pointing to it would help future readers understand why there's no explicit cross-language test in this crate.


Nits

  • The two pre-existing needless_range_loop clippy warnings in cms_matrix are trivially fixable now since the function is new: iterate with .iter().enumerate() instead of index ranges.
  • windows-sys downgrade (0.61.20.48.0) is an indirect dep change from the upstream bump — fine, but worth a sanity check if this ever targets Windows.

Summary

The migration is mechanically correct and wire format is preserved. Issues to fix before marking ready:

Priority Issue
Bug kll_to_msgpack / kll_sketch_bytes should panic, not silently return empty bytes
Bug Dead _update method should be removed
Verify cms_update zero/negative guard — confirm behavioral equivalence with old path
Verify unsafe impl Send/Sync may be removable now that FFI is gone
Design Decide: complete the sketchlib_runtime abstraction boundary, or flatten it
Tests Add negative cases: KLL k-mismatch merge, CMS dimension-mismatch merge

- kll_sketch_bytes / kll_to_msgpack now panic via .expect() instead of
  unwrap_or_default(), matching the cms_* path: silent empty bytes
  surface downstream as a misleading "buffer too short" (review #1).
- cms_matrix uses .expect() instead of silently leaving 0.0 when
  storage.get() returns None, since None is only out-of-bounds and we
  iterate within reported dimensions (review #7).
- _update on CountMinSketchAccumulator is now #[cfg(test)] with a comment
  documenting that production goes straight through cms_update; it exists
  only to exercise the same key-encoding path in a regression test (#2).
- Document the Box<dyn Error> vs Box<dyn Error + Send + Sync> split in
  sketchlib_runtime so the inconsistency reads as deliberate (#6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@GordonYuanyc GordonYuanyc marked this pull request as draft June 2, 2026 01:00
GordonYuanyc and others added 2 commits June 2, 2026 07:37
- #3: document the `value <= 0.0` guard in `cms_update` as an invariant
  (count-min cells are monotonic non-negative counters; a negative update
  would corrupt the queried key and every colliding key) and add the
  regression test `cms_update_drops_non_positive_values`.
- #4: drop the hand-written `unsafe impl Send/Sync` for
  `DatasketchesKLLAccumulator` — `RuntimeKll` (`KLL<f64>`) is pure-Rust
  with no FFI handle, so the bounds auto-derive. New test
  `accumulator_is_send_sync` statically pins this against future fields.
- #8: add the missing KLL merge error-path tests — k-mismatch and empty
  input — alongside the existing CMS dimension-mismatch coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#5)

Resolves the half-applied abstraction the reviewer flagged: the module no
longer pretends to isolate the upstream sketch types. Accumulators now call
the concrete `RuntimeCountMin`/`RuntimeKll` API directly for trivial
operations, and `sketchlib_runtime` keeps only the logic worth sharing.

Inlined (pure single-call delegations) into the accumulators:
- cms_new      -> RuntimeCountMin::with_dimensions(..)
- cms_estimate -> self.inner.estimate(&DataInput::String(..))
- kll_new      -> RuntimeKll::init_kll(k as i32)
- kll_update   -> self.inner.update(&value)

Kept in sketchlib_runtime (real logic, not pass-throughs):
- wire codec: cms_/kll_ to/from_msgpack, cms_matrix, cms_from_matrix
- validated merge: cms_merge_refs, kll_merge_refs
- guarded ops shared by >1 caller / carrying an invariant: cms_update
  (non-negative guard), kll_quantile (empty-sketch default), kll_sketch_bytes

Module doc rewritten to state plainly that this is not an isolation layer,
so the direct `self.inner.rows()/.cols()/.k()` accesses are intentional
rather than a leak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@GordonYuanyc

Copy link
Copy Markdown
Contributor Author

Closing this draft PR.

After rebasing onto current main, I don’t think this refactor is the right direction anymore. main already uses crates.io asap_sketchlib = "0.2.2", and the current accumulator code relies on the portable asap_sketchlib::CountMinSketch / KllSketch APIs.

Those wrapper types are important because they preserve the MessagePack/wire-format boundary used for Rust/Go sketch interoperability. This PR moves parts of that responsibility into ASAPQuery by using lower-level runtime sketch types, which is probably not a cleaner ownership boundary.

So I’m closing this draft and keeping the current main approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants