refactor: migrate to asap_sketchlib module-restructure layout#335
refactor: migrate to asap_sketchlib module-restructure layout#335GordonYuanyc wants to merge 13 commits into
Conversation
asap_sketchlib reorganized in the refactor/module-restructure branch:
- Wire-format-aligned types (CountMinSketch, CountSketch, KllSketch,
HydraKllSketch, SetAggregator, DeltaResult, CmsHeapItem, etc.) moved
out of `sketches::*` submodules and now live in `wrapper::*`,
re-exported at the crate root.
- `sketches::delta_set_aggregator::{serialize,deserialize}_msgpack` shims
were dropped in favor of the `MessagePackCodec` trait on `DeltaResult`.
Changes:
- Pin asap_sketchlib to the refactor/module-restructure branch.
- Switch all `use asap_sketchlib::sketches::*` imports to the
crate-root re-exports (`use asap_sketchlib::CountMinSketch`, etc.).
- `delta_set_aggregator_accumulator` uses
`DeltaResult::{from_msgpack,to_msgpack}` directly through the
`MessagePackCodec` trait instead of the removed module-level shims.
No behavior change — the wire format and runtime semantics of the
underlying sketches are identical (locked in by sketchlib-go parity
goldens in asap_sketchlib's tests/sketches_go_parity_probe.rs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the asap_sketchlib `bfe8f37` cleanup that deletes `mod wrapper` and moves all wire-format-aligned sketch types (`CountMinSketch`, `CountSketch`, `KllSketch`, `HllSketch`, `DdSketch`, `HydraKllSketch`, `CountMinSketchWithHeap`, `SetAggregator`, `DeltaResult`) into `asap_sketchlib::message_pack_format::portable::*`. The crate-root re-exports that this crate imports are preserved unchanged, so no source changes are needed on this side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CountMinSketchAccumulator and DatasketchesKLLAccumulator previously
held the wire-format facade types (`asap_sketchlib::CountMinSketch`,
`asap_sketchlib::KllSketch`), each of which is a thin shim around an
underlying `sketches::*` runtime sketch. With Go byte parity proven
for the underlying types in `asap_sketchlib::tests::sketches_go_parity_probe`,
the facade layer is no longer required.
- New `precompute_operators::sketchlib_runtime` module exposes thin
adapters (`cms_*` / `kll_*`) that translate the accumulator surface
(string keys, `Vec<Vec<f64>>` matrices, Go-compatible MessagePack
envelopes) onto `sketches::CountMin<Vector2D<f64>, FastPath,
DefaultXxHasher>` and `sketches::KLL<f64>` directly.
- `CountMinSketchAccumulator.inner` becomes
`sketches::CountMin<.., FastPath>`; `DatasketchesKLLAccumulator.inner`
becomes `sketches::KLL<f64>`. JSON / msgpack / merge paths route
through the new helpers.
- Wire format stays unchanged: the `CountMinSketchWire { sketch,
row_num, col_num }` and `KllSketchData { k, sketch_bytes }`
envelopes are still emitted, just constructed inline instead of
through the facade's `to_msgpack`.
- A handful of `inner.k` field accesses change to `inner.k()` (KLL's
`k` is private in `sketches::*` and exposed via accessor).
- Bump `asap_sketchlib` pin to `aea7d05` for the new
`sketches::KLL::k()` accessor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The refactor/module-restructure branch has been merged and deleted on asap_sketchlib. Point at the default branch and lock to current main HEAD so a future cargo fetch doesn't fail on a missing branch ref. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
milindsrivastava1997
left a comment
There was a problem hiding this comment.
cms_from_matrix doesn't validate that rows/cols match the matrix shape (sketchlib_runtime.rs:61-70)
rows/cols come from the JSON/binary envelope while matrix is parsed separately. If they mismatch (malformed payload), from_fn silently uses the envelope dimensions and pads/truncates with 0.0 — invisible corruption. A debug_assert! or early return error at the call sites in deserialize_from_json and deserialize_from_bytes would catch this.
milindsrivastava1997
left a comment
There was a problem hiding this comment.
cms_to_msgpack swallows serialization errors (sketchlib_runtime.rs:82-86)
rmp_serde::to_vec(&wire).unwrap_or_default() returns an empty Vec<u8> on failure. The caller serialize_to_bytes propagates that silently, so a downstream deserializer will fail with a confusing "buffer too short" error instead of pointing at the actual failure site. Change the return type to Result or use expect — silent empty bytes are worse than a panic.
milindsrivastava1997
left a comment
There was a problem hiding this comment.
Regression test context comments removed (count_min_sketch_accumulator.rs, test_update_and_query_use_same_key_encoding)
The removed multi-line comment explained a real past bug — key encoding mismatch between _update and query_key that caused silent lookup failures. That context is exactly what makes the test legible to a future reader wondering why this edge case is tested at all. Worth restoring.
milindsrivastava1997
left a comment
There was a problem hiding this comment.
@GordonYuanyc some review comments from Claude, pls check
Address review feedback on PR #335: - cms_from_matrix now returns Result and rejects a matrix whose shape disagrees with the envelope's rows/cols instead of silently padding or truncating with 0.0 (invisible corruption). Propagated through deserialize_from_json, deserialize_from_bytes and cms_from_msgpack. - cms_to_msgpack uses expect instead of unwrap_or_default, so a serialization bug panics loudly rather than emitting empty bytes that surface downstream as a misleading "buffer too short". - Restore the regression-test context comments documenting the past key-encoding mismatch in test_update_and_query_use_same_key_encoding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cms_from_matrix already returns the right Result shape; the extra Ok/? trips clippy::needless_question_mark under -D warnings. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thank you for the suggestion! Updated! |
milindsrivastava1997
left a comment
There was a problem hiding this comment.
Code Review
Overview
This PR adapts asap-query-engine to the new asap_sketchlib module layout, where types that were previously accessed via deep paths (sketches::countminsketch::CountMinSketch, sketches::kll::KllSketch, etc.) are now re-exported at the crate root. A new sketchlib_runtime shim module is introduced to centralize wire-format encoding/decoding and to hold the concrete generic type aliases (RuntimeCountMin, RuntimeKll). The wire byte format is documented as unchanged. The PR is in draft, pinned to a non-merged upstream branch.
Bugs / Correctness
1. Silent failure in kll_to_msgpack vs. panic in cms_to_msgpack (inconsistency + latent bug)
sketchlib_runtime.rs:937
// cms_to_msgpack — correct:
rmp_serde::to_vec(&wire).expect("CountMinSketchWire msgpack serialization is infallible")
// kll_to_msgpack — wrong:
rmp_serde::to_vec(&wire).unwrap_or_default()The CMS path correctly panics because a well-formed struct has no unrepresentable state. The same reasoning applies to KllSketchData. Using unwrap_or_default() for KLL silently produces empty bytes, which will be fed downstream and surface as a misleading "buffer too short" error at the reader — the exact failure mode the CMS comment warns against.
Also affects kll_sketch_bytes (line 927): if sk.serialize_to_bytes() fails, the JSON output becomes {"sketch": ""} — silently corrupted data.
Fix: Replace both with .expect("KllSketchData msgpack serialization is infallible").
2. _update on CountMinSketchAccumulator is now dead code
count_min_sketch_accumulator.rs:189
fn _update(&mut self, key: &KeyByLabelValues, value: f64) {
cms_update(&mut self.inner, &key.to_semicolon_str(), value);
}accumulator_factory.rs:112 now calls sketchlib_runtime::cms_update directly, bypassing _update. The method is private with no other callers visible in this diff. If confirmed dead, it should be removed.
3. cms_update zero/negative guard — behavioral change to verify
sketchlib_runtime.rs:801-804
pub fn cms_update(sk: &mut RuntimeCountMin, key: &str, value: f64) {
if value <= 0.0 {
return;
}
...
}The old CountMinSketch::update path did not have this explicit guard in the QE layer. If the upstream CountMin::insert_many accepts zero or negative values, this guard silently drops those updates and changes semantics. Needs a comment explaining the invariant, or a test verifying this was always the intended contract.
4. unsafe impl Send/Sync on DatasketchesKLLAccumulator may now be unnecessary
datasketches_kll_accumulator.rs:542-543
unsafe impl Send for DatasketchesKLLAccumulator {}
unsafe impl Sync for DatasketchesKLLAccumulator {}The original unsafe impls were needed because KllSketch wrapped a C++ library. RuntimeKll = KLL<f64> is a pure-Rust type. If the compiler would auto-derive Send + Sync for it, these are unnecessary boilerplate (though harmless). Worth checking with static_assertions::assert_impl_all!(RuntimeKll: Send, Sync) — if that compiles, the unsafe impls can be dropped.
Code Quality
5. Abstraction boundary is incomplete — accumulator code still reaches into inner
The sketchlib_runtime shim is supposed to isolate the upstream sketch internals, but callers still access inner.k(), inner.count(), inner.rows(), inner.cols() directly across multiple accumulator files. Either:
- Complete the isolation: add
kll_count(),kll_k(),cms_rows(),cms_cols()helpers tosketchlib_runtime, or - Drop the shim abstraction and just use the new module-path imports directly.
The current half-measure means future upstream API changes still break N callsites across multiple files.
6. Error type inconsistency across sketchlib_runtime functions
cms_merge_refs,kll_merge_refs→Box<dyn std::error::Error + Send + Sync>cms_from_matrix,cms_from_msgpack,kll_from_msgpack→Box<dyn std::error::Error>
Pick one convention for the whole module. The Send + Sync variant is safer for async contexts.
7. cms_matrix silent zero-fill on storage.get() returning None
sketchlib_runtime.rs:818-821
if let Some(v) = storage.get(r, c) {
*cell = *v;
}
// else: leaves 0.0 silentlyIf storage.get() can return None for valid (r, c) within the declared dimensions (e.g. sparse storage), this is a silent correctness issue. If it cannot, unwrap() / expect() would be more appropriate.
Test Coverage
8. No unit tests for error paths in merge functions
kll_merge_refs checks for k mismatch and cms_merge_refs checks for dimension mismatch — both return errors. Neither error path has a test in this diff. These are the failure modes most likely to surface in prod; each deserves a dedicated test case.
9. Go-wire compatibility test pointer missing
The PR notes asap_sketchlib::tests::sketches_go_parity_probe covers cross-language byte parity upstream. A comment in sketchlib_runtime.rs pointing to it would help future readers understand why there's no explicit cross-language test in this crate.
Nits
- The two pre-existing
needless_range_loopclippy warnings incms_matrixare trivially fixable now since the function is new: iterate with.iter().enumerate()instead of index ranges. windows-sysdowngrade (0.61.2→0.48.0) is an indirect dep change from the upstream bump — fine, but worth a sanity check if this ever targets Windows.
Summary
The migration is mechanically correct and wire format is preserved. Issues to fix before marking ready:
| Priority | Issue |
|---|---|
| Bug | kll_to_msgpack / kll_sketch_bytes should panic, not silently return empty bytes |
| Bug | Dead _update method should be removed |
| Verify | cms_update zero/negative guard — confirm behavioral equivalence with old path |
| Verify | unsafe impl Send/Sync may be removable now that FFI is gone |
| Design | Decide: complete the sketchlib_runtime abstraction boundary, or flatten it |
| Tests | Add negative cases: KLL k-mismatch merge, CMS dimension-mismatch merge |
- kll_sketch_bytes / kll_to_msgpack now panic via .expect() instead of unwrap_or_default(), matching the cms_* path: silent empty bytes surface downstream as a misleading "buffer too short" (review #1). - cms_matrix uses .expect() instead of silently leaving 0.0 when storage.get() returns None, since None is only out-of-bounds and we iterate within reported dimensions (review #7). - _update on CountMinSketchAccumulator is now #[cfg(test)] with a comment documenting that production goes straight through cms_update; it exists only to exercise the same key-encoding path in a regression test (#2). - Document the Box<dyn Error> vs Box<dyn Error + Send + Sync> split in sketchlib_runtime so the inconsistency reads as deliberate (#6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- #3: document the `value <= 0.0` guard in `cms_update` as an invariant (count-min cells are monotonic non-negative counters; a negative update would corrupt the queried key and every colliding key) and add the regression test `cms_update_drops_non_positive_values`. - #4: drop the hand-written `unsafe impl Send/Sync` for `DatasketchesKLLAccumulator` — `RuntimeKll` (`KLL<f64>`) is pure-Rust with no FFI handle, so the bounds auto-derive. New test `accumulator_is_send_sync` statically pins this against future fields. - #8: add the missing KLL merge error-path tests — k-mismatch and empty input — alongside the existing CMS dimension-mismatch coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#5) Resolves the half-applied abstraction the reviewer flagged: the module no longer pretends to isolate the upstream sketch types. Accumulators now call the concrete `RuntimeCountMin`/`RuntimeKll` API directly for trivial operations, and `sketchlib_runtime` keeps only the logic worth sharing. Inlined (pure single-call delegations) into the accumulators: - cms_new -> RuntimeCountMin::with_dimensions(..) - cms_estimate -> self.inner.estimate(&DataInput::String(..)) - kll_new -> RuntimeKll::init_kll(k as i32) - kll_update -> self.inner.update(&value) Kept in sketchlib_runtime (real logic, not pass-throughs): - wire codec: cms_/kll_ to/from_msgpack, cms_matrix, cms_from_matrix - validated merge: cms_merge_refs, kll_merge_refs - guarded ops shared by >1 caller / carrying an invariant: cms_update (non-negative guard), kll_quantile (empty-sketch default), kll_sketch_bytes Module doc rewritten to state plainly that this is not an isolation layer, so the direct `self.inner.rows()/.cols()/.k()` accesses are intentional rather than a leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Closing this draft PR. After rebasing onto current Those wrapper types are important because they preserve the MessagePack/wire-format boundary used for Rust/Go sketch interoperability. This PR moves parts of that responsibility into ASAPQuery by using lower-level runtime sketch types, which is probably not a cleaner ownership boundary. So I’m closing this draft and keeping the current |
Summary
asap_sketchlibpin to therefactor/module-restructurebranch and adapt to the new module layout.sketches::*types directly instead of going through the wire-format facade; serialization is centralized in a newsketchlib_runtimemodule.Status
Draft — pinned to a branch in
asap_sketchlib, not a tagged release. Holding as draft until the upstream branch lands on its target.Verification
cargo buildcleancargo test -p query_engine_rust sketch→ 41 passed, 0 failedcargo clippy -p query_engine_rust --libclean apart from 2 pre-existingneedless_range_loopstyle warnings insketchlib_runtime.rsTest plan
asap_sketchlibto a merged commit on its target branch before marking readysketchlib_runtime.rs🤖 Generated with Claude Code