Skip to content

refactor: consolidate duplicated code (~150 LOC reduction)#71

Merged
tonyalaribe merged 6 commits into
masterfrom
refactor/consolidate-loc-reduction
Jun 15, 2026
Merged

refactor: consolidate duplicated code (~150 LOC reduction)#71
tonyalaribe merged 6 commits into
masterfrom
refactor/consolidate-loc-reduction

Conversation

@tonyalaribe

@tonyalaribe tonyalaribe commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Audit-driven consolidation pass to reduce duplicated code and boilerplate, per the repo's "minimal code / zero boilerplate" priority. ~150 fewer lines of code, no behavior change. Durability paths (WAL append, flush, Delta commit) are untouched except message-preserving error wrapping.

A multi-agent audit surfaced 99 candidate opportunities; these were verified against the code and distilled to the low-risk, net-positive set below. Rejected ideas (e.g. a generic get_or_compute cache that would add ~110 LOC, merging perform_delta_update/delete) were considered and rejected (net-negative LOC or correctness risk) so they aren't re-attempted.

Changes

Area Change
errors (new src/errors.rs) arrow_err/exec_err/wal_err helpers replace 34 inline .map_err closures across 7 files
functions reg_from! macro for UDF registration; inline 6 thin create_*_udf wrappers; drop duplicate scalar_to_string (use extract_utf8_string)
wal 3 deserialize fns → generic decode_payload<T>; drop 2 *_public wrappers (inner fns now pub(crate))
object_store_cache record_miss_with_fetch/record_meta_miss_with_fetch collapse repeated stat blocks
optimizers / config shared extract_utf8_string; GIB constant
tests shared insert_at/insert_for (e2e harness), array_get_str + minio_test_config (test_helpers) replace 10 duplicated local helpers

Verification

  • cargo build + cargo build --tests clean (only pre-existing vendored-walrus dead-code warnings)
  • cargo fmt + cargo clippy --tests clean
  • ✅ Lib unit tests pass, incl. WAL decode_payload round-trips (test_batch_queue_under_load is a pre-existing flake — fails identically on master)
  • test_custom_functions passes (exercises the UDF macro + extract_utf8_string + array_get_str)
  • ✅ Two rounds of automated code review; both findings (a duplicate string extractor and duplicated test-config setup) fixed

Note

MinIO/Docker integration + e2e suites compile cleanly but were not run locally — please run make test-minio / cargo test --test e2e in CI to fully validate the test-helper changes.

This PR intentionally excludes the unrelated cargo fmt reflows that stable rustfmt produced on files outside this change (the repo formats with nightly options).

Audit-driven consolidation pass. No behavior change; durability paths
(WAL append, flush, Delta commit) only gain message-preserving error
wrapping.

- errors: add arrow_err/exec_err/wal_err helpers, replace 34 inline
  .map_err closures across database, mem_buffer, dml, buffered_write_layer,
  plan_cache, stats_table, functions
- functions: reg_from! macro for UDF registration; inline 6 thin create_*_udf
  wrappers; drop duplicate scalar_to_string in favor of extract_utf8_string
- wal: collapse 3 deserialize fns into generic decode_payload<T>; drop the
  two *_public wrappers (inner fns now pub(crate))
- object_store_cache: record_miss_with_fetch/record_meta_miss_with_fetch
  collapse repeated stat-update blocks
- optimizers: shared extract_utf8_string; config: GIB constant
- tests: shared insert_at/insert_for (e2e harness), array_get_str and
  minio_test_config (test_helpers) replacing 10 duplicated local helpers

docs/plans: consolidation audit + rejected-ideas record
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — PR #71: refactor: consolidate duplicated code (~150 LOC reduction)

Overall: this is a clean, well-scoped refactor. The claimed behavior-preservation holds up on close inspection — every error message, type, and stat counter is reproduced exactly. The delta is net-negative LOC with no correctness regressions found. A few minor notes below.


✅ What's correct

src/errors.rs — error helpers

  • arrow_err is correctly a bare fn (not a closure factory): ArrowError maps 1:1 to DataFusionError::ArrowError(Box::new(e), None), so using it as .map_err(arrow_err) is idiomatic and exact.
  • exec_err/wal_err returning impl Fn(E) -> DataFusionError is the right idiom for higher-order map_err helpers. All message strings ("Failed to execute Delta UPDATE: {e}", "WAL append_delete failed: {e}", etc.) are preserved verbatim.

decode_payload<T: Decode<()>> in src/wal.rs

The three original deserialize functions were byte-for-byte identical except return type. The generic replacement is fully correct — same bincode::decode_from_slice call, same BINCODE_CONFIG, same ?-propagated WalError. All three payload types (DeletePayload, UpdatePayload, UpdateWithSourcePayload) derive Decode, satisfying the bound. The turbofish syntax is used correctly at all call sites.

reg_from! macro in src/functions.rs

The $(,)? trailing-comma arm and $+ one-or-more expansion are proper Rust macro style. I manually cross-checked all 25 UDF registrations — all are preserved in the new form (some via reg_from!, the rest kept as standalone register_udf / UDAF calls). No silent drops.

record_miss_with_fetch / record_meta_miss_with_fetch in src/object_store_cache.rs

Four identical stat-increment blocks (two on stats, two on metadata_stats) are correctly identified and collapsed. Counter semantics (misses += 1; inner_gets += 1) are preserved exactly.

pub(crate) visibility on serialize_record_batch / deserialize_record_batch

Safe — these are not part of any external public API. The deleted _public wrappers were pure pass-throughs with no extra logic.

GIB constant

Confirmed defined at src/config.rs:10 as const GIB: usize = 1024 * 1024 * 1024. The substitution is exact.


Minor notes

exec_err context is &'static str

pub fn exec_err<E: Display>(ctx: &'static str) -> impl Fn(E) -> DataFusionError {

This compiles fine for all current call sites (all use string literals). It's a reasonable choice, but does mean the helper can't be used with dynamically-built context strings. Not a bug — just worth noting if a caller ever needs it.

extract_utf8_string visibility is pub fn (not pub(crate))

The function at optimizers/mod.rs is declared pub fn extract_utf8_string. If optimizers is re-exported publicly this becomes part of the library API. If this is a binary-only crate it doesn't matter in practice, but a pub(crate) annotation would express the intent more precisely. Low priority.

array_get_str is a strict superset of all replaced helpers

Some of the replaced local get_str helpers in test files only handled StringViewArray + StringArray. The new shared array_get_str also handles LargeStringArray. This is additive and safe — just confirming it's intentional.

Whitespace reformatting

The diff contains a large number of alignment-only changes (struct field alignment, etc.). These are cosmetic and have zero functional impact, but they increase diff noise. The PR note about nightly cargo fmt explains this — no action needed, just a heads-up for reviewers scanning the diff.


CI reminder (from PR description)

MinIO/Docker integration + e2e suites compile cleanly but were not run locally

The test-helper changes (insert_at/insert_for consolidation, minio_test_config) are the highest-risk surface for integration test breakage. Recommend confirming make test-minio and cargo test --test e2e pass in CI before merging.


Summary: No correctness issues found. The refactor achieves its stated goal cleanly. The two minor notes (exec_err static lifetime, extract_utf8_string visibility) are non-blocking style observations. Ready to merge once CI integration tests confirm green.

@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — refactor: consolidate duplicated code (~150 LOC reduction)

Overall: This is a well-scoped, low-risk refactor that delivers what it promises. The changes are mechanical equivalences, the PR description is thorough, and the rejected-ideas doc is a nice touch. A few things worth noting before merge.


✅ What's working well

src/errors.rs
arrow_err, exec_err, and wal_err are clean and correct. All error variants and message text are preserved exactly ("WAL {op} failed: {e}" produces the same strings as the former format!("WAL append_delete failed: {e}") inline closures). The &'static str constraint on exec_err/wal_err is appropriate — every call site uses a string literal.

decode_payload<T: Decode<()>>
Elegant consolidation. The three near-identical bincode functions were a textbook generics opportunity. Promoting serialize_record_batch/deserialize_record_batch to pub(crate) and deleting the _public wrapper pair is the right call.

extract_utf8_string in optimizers/mod.rs
Good home for it — already imported in functions.rs via the optimizers crate path. Dropping the private scalar_to_string duplicate and the extract_utf8_literal duplicate in tantivy_rewriter.rs is correct.

Test helper consolidation
array_get_str, minio_test_config, insert_at/insert_for — all straightforward. The insert_atinsert_for delegation avoids the second copy of the SQL string. The data-dir path for test_dml_operations is preserved (/tmp/timefusion-dml-{id} still uses the dml prefix via the local create_test_config wrapper). ✓


⚠️ Issues worth addressing

1. GIB constant definition not visible in the diff
config.rs is changed to use GIB but only one @@ hunk appears in the diff (the usage site). If GIB is defined in a separate location in that file it should show as a second hunk. Please confirm the constant is actually present in the compiled output — cargo build passing is a necessary but not sufficient signal if the constant was already defined elsewhere in the file from a prior commit.

2. buffer_consistency_test.rs not updated
The consolidation plan (workstream 4, step 2) lists buffer_consistency_test as a file to migrate to array_get_str, alongside the four files that were updated. The test file still contains its own local get_str helper. Either apply the migration or note the intentional exclusion.

3. array_get_str silently returns default on null cells
Arrow's .value(idx) on a null cell returns the type's zero value (empty string for string arrays) rather than panicking. For a test helper that's acceptable, but it means a null assertion bug will silently pass as "" rather than failing loudly. A one-line null check with panic! would make test failures more diagnosable:

assert!(!arr.is_null(idx), "array_get_str: cell {idx} is null");

4. docs/plans/ in version control
The rejected-ideas note is genuinely useful, but committing it as a planning document to docs/plans/ is unusual. The information would be just as durable in the PR description (which is already thorough) or a GitHub wiki page, without adding a file that will be stale once the codebase evolves. Minor concern — not a blocker.


🔍 Nit-level observations

  • reg_from! macro comment says "UDFs built straight from a unit/default struct" but several entries use ::new() constructors (ToCharUDF::new(), etc.). The comment is slightly misleading though the code is correct.
  • exec_err/wal_err return closures (impl Fn) which are heap-allocated when stored. In error paths this is fine; just noting it differs from arrow_err which is a bare function pointer and avoids any allocation.
  • The minio_base_config refactor changes timefusion_table_prefix from format!("test-{}-{}", test_name, uuid) to format!("test-{table_id}") where id = format!("{}-{}", test_name, uuid_prefix). End result is identical. ✓

Verdict

Approve with minor follow-ups. The three items above are: (1) verify GIB is defined, (2) complete or explicitly skip buffer_consistency_test, (3) optionally add a null guard to array_get_str. None are blocking if CI passes clean.

test_variant_column_round_trips_as_json read the attributes Variant
column as String, but bare Variant columns surface PG jsonb OID 3802
(required by monoscope's hasql decoder; enforced by
jsonb_oid_test::bare_variant_column_returns_jsonb_oid). String FromSql
rejects jsonb, so the read failed. Decode as serde_json::Value instead.

Pre-existing failure on master since the 2026-06-12 jsonb OID change;
the test assertion was not updated then.
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review: refactor/consolidate-loc-reduction

Overall: This is a high-quality, well-planned refactoring. The changes are mechanical and low-risk, the plan document is a thoughtful artifact, and the rejected-approaches section will save future reviewers from re-litigating bad ideas. A few items worth discussing before merging.


What's good

  • src/errors.rs is clean and correct. The three helpers preserve original variant and message text exactly — I checked every call site manually.
    • wal_err("source serialize")"WAL source serialize failed: {e}"
    • exec_err("Failed to execute Delta UPDATE")"Failed to execute Delta UPDATE: {e}"
  • decode_payload<T: Decode<()>> is a straightforward and safe generic collapse; all three payload types are structurally identical from bincode's perspective so the abstraction holds.
  • reg_from! macro is appropriately scoped. Keeping the factory-fn UDFs (create_time_bucket_udf, create_jsonb_array_elements_udf, etc.) outside the macro is the right call since they carry real construction logic.
  • Test helper consolidation is excellent — 5 near-identical get_str helpers and 3 identical insert_at bodies collapsed cleanly.

Issues to address

1. extract_utf8_string is in the wrong module

optimizers/mod.rs is a semantically odd home for a pure scalar-value utility. The consequence is that functions.rs now imports from optimizers, which is a backwards dependency: functions are typically more primitive than optimizers.

Suggestion: move it to a src/scalar_utils.rs (or a crate::utils submodule) and have both optimizers and functions import from there. Small change, avoids this cross-cutting coupling.

2. Missing doc comment on record_meta_miss_with_fetch

record_miss_with_fetch has a doc comment; record_meta_miss_with_fetch has none. The existing doc also says "data / metadata stats" which is confusing given there are separate data and metadata stat variants. Suggest:

/// Cache miss for a data range — increments misses + inner_gets in data stats.
async fn record_miss_with_fetch(&self) { ... }

/// Cache miss for a metadata range — increments misses + inner_gets in metadata stats.
async fn record_meta_miss_with_fetch(&self) { ... }

3. Un-flagged behavior changes in integration_test.rs

The PR description says "no behavior change" but integration_test.rs has two non-trivial test changes:

  • Type of row.get(0) changed from String (decoded then serde_json::from_str) to serde_json::Value (tokio-postgres binary jsonb decode). This is a correct improvement, but it's a wire-protocol assumption — the test now relies on tokio-postgres decoding OID 3802 as serde_json::Value directly. If a driver version bump drops that FromSql impl the error will be cryptic. The added comment documents the contract, which helps.

  • New assertion added to the Sort+Limit path:

    assert_eq!(parsed["http"]["status"], 200, "Sort+Limit path must round-trip variant as jsonb");

    This is a genuine test improvement, but it's scope creep from a "no behavior change" refactor. Worth noting in the PR description so reviewers don't miss it.

4. GIB constant — confirm location

The diff shows config.rs using GIB but the constant definition isn't in the diff. Please confirm it's defined in non-test code (not only in #[cfg(test)]). If it's test-only, the release build breaks. The PR says cargo build passes so this is likely fine, but worth a quick callout.


Minor nits (non-blocking)

  • array_get_str panics on unexpected types — consistent with all the local helpers it replaces. Correct for a test utility.
  • create_test_config wrapper in test_dml_operations.rs is correctly kept as a thin wrapper (distinct /tmp/timefusion-dml-{id} path prefix). The plan doc's note about this divergence was handled correctly.
  • Plan doc rejected-items section is genuinely useful. The generic get_or_compute rejection reason ("net −110 LOC") buries the lede — the real issue is hiding per-cache semantic differences. Already documented but could be sharper.

Verdict

Items 1–4 above; 1–3 are the substantive ones. Happy to approve once those are addressed (or discussed if there's a reason extract_utf8_string intentionally lives in optimizers).

100ms TTL made the put->get hit window too tight: a loaded CI runner
could stall past expiry, turning the first expected hit into a miss
(intermittent 'First get should hit' failures). Widen TTL to 2s and
expiry sleeps to 2.5s; same assertions, robust timing.
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — PR #71: refactor: consolidate duplicated code (~150 LOC reduction)

Summary: Clean, mechanical consolidation pass. The six workstreams are independent, well-scoped, and the diff is easy to verify change-by-change. No behavior changes in production paths. A few observations below.


What works well

  • src/errors.rsarrow_err / exec_err / wal_err are clear, correctly typed, and preserve the original DataFusionError variants (ArrowError, Execution, External respectively). The format strings produce identical output to what they replace (verified by inspecting the {ctx}: {e} pattern against each removed inline closure).
  • decode_payload<T: Decode<()>> — the generic is tight: it's bounded exactly as bincode v2 requires and replaces three structurally-identical functions. Correct.
  • extract_utf8_string promotion — consolidating three local copies (scalar_to_string in functions.rs, extract_utf8_literal in tantivy_rewriter.rs, the inline match in optimizers/mod.rs) into one pub fn in optimizers/mod.rs is the right home for it.
  • reg_from! macro — uses $udf:expr not $T:ty, so it handles both T::new() and T::default() callers correctly. Registration order change is safe (all UDFs have distinct names, no inter-registration dependencies).
  • array_get_str — adds LargeStringArray coverage that some of the replaced local helpers were missing. Strictly more correct.
  • Test isolation preservedtest_dml_operations.rs keeps its create_test_config wrapper that sets the /tmp/timefusion-dml-{id} path prefix distinct from other tests. Good.
  • Rejected ideas documented — the docs/plans/ file explaining what was evaluated and why specific consolidations were declined is valuable. Prevents the same ground from being re-audited.

Observations / minor issues

1. delta_checkpoint_cache_test.rs — flaky fix adds ~5s of wall-clock sleep

// Before: 150ms sleep × 2 = ~300ms
// After:  2500ms sleep × 2 = 5000ms

The fix is correct (100ms TTL was too tight under CI load), but the test now holds a serial slot for 5+ seconds. If tokio::time::pause() + tokio::time::advance() are viable for this cache's TTL mechanism, that would avoid real-time sleeps entirely. If the underlying TTL relies on std::time::Instant (common in foyer/moka), virtual time won't help — in that case the wider margin is the right call. Worth a quick check.

2. exec_err requires &'static str context

pub fn exec_err<E: Display>(ctx: &'static str) -> impl Fn(E) -> DataFusionError {

All current call sites pass string literals, so this compiles fine. Future callers that need a dynamic context string (e.g., including a column name) would have to fall back to DataFusionError::Execution(format!(...)) directly. This is an acceptable tradeoff — just worth knowing. If this becomes awkward, consider impl Into<String> for ctx.

3. GIB constant not visible in the diff

src/config.rs now references GIB but the constant definition isn't shown in the diff. Assuming it's defined earlier in the same file (outside the shown hunk) and cargo build passes, this is fine — but worth double-checking that the constant is in config.rs and not expected from a missing import.

4. integration_test.rs — jsonb deserialization behavior change

// Before: round-trip via String → serde_json::from_str
// After:  direct tokio-postgres binary jsonb decoding → serde_json::Value
let parsed: serde_json::Value = row.get(0);

This is technically a test behavior change (not just cleanup), though it's correct if tokio_postgres has the serde_json feature enabled. The tighter assertion added on the Sort+Limit path (assert_eq!(parsed["http"]["status"], 200)) is strictly better. Confirming tokio-postgres in Cargo.toml has with-serde_json-1 or equivalent would be reassuring.

5. docs/plans/ file committed to repo

Shipping a planning/audit document as a source artifact is unconventional but explicitly intentional per the PR description. The "do not re-attempt" list is the valuable part. Consider whether docs/decisions/ (ADR-style) is the right long-term home, or if this document should live in the wiki. Either way, committing it is better than losing it.


No concerns on

  • Security: pure refactor, no new I/O, crypto, or auth paths touched.
  • WAL durability: error paths are wrapped not changed; append_delete/update/update_with_source still propagate errors to callers.
  • Visibility: serialize/deserialize_record_batch moved to pub(crate) (was effectively crate-internal via _public wrappers anyway).

Overall: Approve with the observations above noted for follow-up. The delta_checkpoint_cache_test timing concern and the GIB constant visibility are the only things worth a quick look before merge; everything else is either clearly correct or already covered by CI.

@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Code Review — refactor: consolidate duplicated code (~150 LOC reduction)

Overall: This is a clean, well-scoped refactor. The deduplication is genuine, the equivalences are correct, and the PR description is admirably transparent about what was rejected and why. A few items worth discussion below.


✅ What's well done

src/errors.rs — The three helpers (arrow_err, exec_err, wal_err) are exactly the right shape: simple, named, and the resulting error messages are character-for-character identical to what they replace. &'static str for context is a good constraint that prevents accidentally passing computed strings.

src/wal.rsdecode_payload<T> — Elegant. The three original functions were textually identical except for T; the generic unification is correct because bincode::decode_from_slice is generic over the output type anyway. Making serialize_record_batch/deserialize_record_batch pub(crate) and dropping the _public shims is the right move — the shims were pure indirection.

src/functions.rsreg_from! macro — Cuts register_custom_functions from ~40 register_udf lines to ~20 without sacrificing readability. The macro is correctly scoped (module-local, no #[macro_export]). UDF registration order is preserved where it matters (Variant expr planner before JSON planner stays at the top).

Test deduplicationarray_get_str in test_utils.rs is strictly better than the per-file get_str copies: it adds LargeStringArray support that two of the original copies (in test_custom_functions.rs and test_postgres_json_functions.rs) were missing. minio_test_config eliminates four identical 12-line config builders.


🟡 Minor items

record_miss_with_fetch / record_meta_miss_with_fetch naming (src/object_store_cache.rs:643-657)

The method names imply the methods perform a fetch, but they only record stats for a miss-that-triggered-a-fetch. A reader encountering self.record_miss_with_fetch().await in the cache hot path might think the fetch is happening inside the method. Consider record_cache_miss_stats() or just record_miss_stats() to make the scope clear.

extract_utf8_string lives in src/optimizers/mod.rs (src/optimizers/mod.rs:723-728)

This is a scalar-value utility with no semantic tie to the optimizer layer — it's now imported by functions.rs, tantivy_rewriter.rs, and optimizers/mod.rs itself. A future reader looking for "where do I put string-scalar utilities?" may not guess to look in optimizers. This is a minor cohesion nitpick, but worth a comment if a src/utils.rs ever materialises.

TTL test is now 5 seconds slower (tests/delta_checkpoint_cache_test.rs:1022-1048)

The rationale (CI runners can stall past a 100ms TTL) is valid. A 2s TTL + 2.5s sleep is very conservative though — that's ~5× the original wait. If this test is in the hot path for local dev, it might be worth exploring tokio::time::pause() + tokio::time::advance() for time-sensitive tests instead of wall-clock sleeps. Not a blocker.

integration_test.rs — added assertion is a behavior change (tests/integration_test.rs:395)

// Before (just validated JSON was parseable):
serde_json::from_str::<serde_json::Value>(&got).unwrap_or_else(...)

// After (asserts a specific value):
assert_eq!(parsed["http"]["status"], 200, "Sort+Limit path must round-trip variant as jsonb");

This is strictly a test improvement, but it's bundled with the refactor. Worth calling out so reviewers don't miss that this test now verifies round-trip correctness on the Sort+Limit path, not just parsability.


🔍 One thing to verify before merge

The reg_from! macro unifies UDF registration, but create_jsonb_array_elements_udf() stays separate because it has non-trivial construction logic. Confirm that JsonBuildArrayUDF::new() / ToJsonUDF::new() etc. return the same UDF names as the old thin wrappers (create_json_build_array_udfScalarUDF::from(JsonBuildArrayUDF::new())). If any wrapper overrode the name or added aliases, the macro silently drops that. Given the tests pass this is likely fine, but worth a quick ScalarUDFImpl::name() scan on each type.


Summary

No correctness issues found. The two naming suggestions (record_miss_with_fetch, extract_utf8_string location) are polish-level, not blockers. The TTL test slowdown is a reasonable trade-off. Approve with confidence once CI is green on make test-minio and e2e.

@tonyalaribe tonyalaribe merged commit 8bbe86a into master Jun 15, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant