Skip to content

Warm Foyer cache after compaction/flush to kill the post-compaction p99 sawtooth#39

Merged
tonyalaribe merged 17 commits into
masterfrom
claude/friendly-edison-GyQPg
Jun 7, 2026
Merged

Warm Foyer cache after compaction/flush to kill the post-compaction p99 sawtooth#39
tonyalaribe merged 17 commits into
masterfrom
claude/friendly-edison-GyQPg

Conversation

@tonyalaribe

Copy link
Copy Markdown
Contributor

Compaction outputs (16MB light / 128MB full) stream to S3 via multipart
upload, and put_multipart_opts is a pure pass-through — so every compacted
file lands cold and is never warmed into Foyer. The leading-edge partitions
that "last 1h/6h/24h" dashboards hit get rewritten every 5/30 min, so each
refresh after a compaction misses Foyer, falls to S3, and the cache bytes we
already paid for go stale. p99 sawtooths on exactly the partitions users
watch.

This warms the cache for newly-written files right after the commit is
visible, reusing the existing read path (no inline multipart change needed):

  • object_store_cache: warm_footer() issues a ranged GET of the parquet
    footer so it lands in the metadata cache (query planning pays zero S3
    round-trips); warm_full() primes the main full-file cache for data reads.
    Both are strictly best-effort — all errors swallowed.

  • database: warm_cache_for_uris() diffs pre/post file sets to warm only the
    files a commit adds, filtered to partitions within
    timefusion_warm_recency_days so we don't burn GETs on cold partitions.
    Runs in a detached, concurrency-bounded task; never blocks or fails the
    commit. Wired into optimize_table, optimize_table_light, and the flush
    callback. Logs files-warmed + current foyer hit-rate to size the win.

  • config: timefusion_warm_after_compaction (default on, footers always),
    timefusion_warm_full_files (default off), timefusion_warm_recency_days
    (default 2, covers the 48h full-optimize window), timefusion_warm_concurrency.

Correctness is unaffected — Delta snapshot isolation means warming only
changes hit-rate, never which files a query reads.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp

…99 sawtooth

Compaction outputs (16MB light / 128MB full) stream to S3 via multipart
upload, and put_multipart_opts is a pure pass-through — so every compacted
file lands cold and is never warmed into Foyer. The leading-edge partitions
that "last 1h/6h/24h" dashboards hit get rewritten every 5/30 min, so each
refresh after a compaction misses Foyer, falls to S3, and the cache bytes we
already paid for go stale. p99 sawtooths on exactly the partitions users
watch.

This warms the cache for newly-written files right after the commit is
visible, reusing the existing read path (no inline multipart change needed):

- object_store_cache: warm_footer() issues a ranged GET of the parquet
  footer so it lands in the metadata cache (query planning pays zero S3
  round-trips); warm_full() primes the main full-file cache for data reads.
  Both are strictly best-effort — all errors swallowed.

- database: warm_cache_for_uris() diffs pre/post file sets to warm only the
  files a commit *adds*, filtered to partitions within
  timefusion_warm_recency_days so we don't burn GETs on cold partitions.
  Runs in a detached, concurrency-bounded task; never blocks or fails the
  commit. Wired into optimize_table, optimize_table_light, and the flush
  callback. Logs files-warmed + current foyer hit-rate to size the win.

- config: timefusion_warm_after_compaction (default on, footers always),
  timefusion_warm_full_files (default off), timefusion_warm_recency_days
  (default 2, covers the 48h full-optimize window), timefusion_warm_concurrency.

Correctness is unaffected — Delta snapshot isolation means warming only
changes hit-rate, never which files a query reads.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real and well-scoped performance problem: compaction outputs bypass the Foyer cache entirely, causing p99 sawtooths on the exact partitions dashboards watch. The approach — proactively warming footers (and optionally full files) in a detached, bounded task — is sound. A few issues worth addressing before merge:


Medium: warm_cache_for_table blocks the flush callback on a table lookup

File: src/main.rs:99, src/database.rs:1699–1709

The PR description says warming "never blocks or fails the commit," but warm_cache_for_table is awaited in the flush callback before returning:

db.warm_cache_for_table(&project_id, &table_name, added.clone()).await;
Ok(added)

Inside warm_cache_for_table, resolve_table is called (which may issue a PG roundtrip, rate-limited at 30s TTL), and a read lock is acquired — all before the background task is spawned. The optimize paths avoid this by calling warm_cache_for_uris directly with the already-available new_table.

Suggestion: Either pass (store, table_uri) directly from the flush callback (which already holds the table), or wrap the call site in tokio::spawn to make it truly fire-and-forget.


Minor: Post-warm hit rate log is misleading

File: src/database.rs:1680–1688

Stats are snapshotted right after all warm GETs complete. Those warm GETs are all misses (fetching from the inner store to populate Foyer), so "foyer main hit rate now" will show a lower hit rate than steady state — the first real dashboard query after this is the actual beneficiary.

Either note this in the log message (e.g., "rate before next query: ...") or snapshot stats before warming to log the delta.


Minor: No unit test for within_recency

File: src/database.rs:83–95

The helper has a few subtle behaviors worth exercising in isolation:

  • URIs without a date= segment return true (warm rather than skip)
  • Parse failures return true
  • A None cutoff means no recency limit

These are all the right defaults but easy to break silently. A few unit tests would lock in the contracts.


Minor: Silent drop of unrelativized URIs

File: src/database.rs:1656

.filter_map(|u| u.strip_prefix(&prefix).map(...))

Files whose URI doesn't start with the computed prefix are silently dropped. This is the right behavior, but if there's ever a prefix mismatch (trailing slash difference, query-string not fully stripped), the warm job would silently no-op. A debug! or warn! on the first drop would make this diagnosable.


Nit: recency_days as i64 cast

File: src/database.rs:1650

Casting a u64 to i64 wraps silently on overflow for values > i64::MAX. This is a config value so it would never actually happen, but a .min(3650) as i64 cap or a brief comment about the practical bound would clarify intent.


What's Good

  • Correctness is unaffected: warming runs in a detached task and is strictly best-effort; all errors are swallowed; commit logic is unchanged.
  • get_cached eagerly collects the stream before caching (lines 654–668 in object_store_cache.rs), confirming warm_full actually populates the cache rather than returning a lazy stream handle. The tests for warm_footer and warm_full validate the end-to-end path well.
  • Pre/post URI diffing is correct: pre_uris is snapshotted from table_clone before optimize; only the delta (added) is warmed. Concurrent writes could include extra files in added but that's benign — re-warming is idempotent.
  • Recency filtering is sensible: default of 2 days covers the 48h full-optimize window; 0 = no limit is well-handled.
  • Config defaults are sensible: timefusion_warm_full_files defaults off (footers carry most of the planning-latency win at a fraction of bytes), and config tests validate all four new defaults.
  • Explicit drop(table) before warm_cache_for_uris in optimize_table_light_inner ensures the write guard is released before the background task spawns — good practice.

…ests

- warm_cache_for_table is now non-blocking: resolve_table (possible PG
  roundtrip) and the read lock run inside a spawned task, so the flush
  callback never blocks on warming. Call site drops the await.
- Hit-rate log now snapshots the cache stats *before* warming and reports
  that as the baseline — the warm GETs are themselves misses, so a post-warm
  rate read artificially low. The next dashboard query is the beneficiary.
- Unrelativized URIs (prefix mismatch) are now logged at debug instead of
  silently dropped, so a systematic table_url/get_file_uris drift is
  diagnosable.
- recency_days is capped at 3650 before the i64 cast to rule out a silent
  wrap turning a misconfiguration into "warm nothing".
- Added unit tests for within_recency covering the inclusive cutoff, missing
  /unparseable/truncated date segments, the None (no-limit) case, and nested
  partition paths.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

placeholder

@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real and well-described performance problem: compaction outputs written via multipart upload bypass Foyer entirely, so every post-compaction dashboard query cold-starts to S3, producing the characteristic p99 sawtooth. The fix is clean in design - a fire-and-forget task diffs pre/post file sets, filters to recent partitions, and primes the cache via the existing read path.


Strengths

  • Best-effort semantics enforced throughout: every error is swallowed, the commit path is never blocked, and the detached task is never awaited. The design correctly treats warming as a hint, not a guarantee.
  • Good defaults: footer warming on, full-file warming off. Most of the latency win comes from footer reads (query planning), so the default keeps GET costs low while delivering the main benefit.
  • Bounded concurrency: for_each_concurrent(concurrency, ...) with a configurable default of 4 prevents compaction events from bursting S3.
  • Overflow protection: recency_days.min(3650) as i64 is a small but correct guard.
  • Solid test coverage: test_within_recency exercises all edge cases (inclusive cutoff, unparseable date, nested partitioning, None cutoff). The cache-warming tests verify the cache is actually primed by checking hit/miss counters before and after.

Issues

1. warm_footer costs 3 S3 requests per file, not 2 (medium)

warm_footer explicitly calls store.head(location) to compute the footer range, then calls store.get_opts(location, ranged_opts). However, get_opts for a bounded range routes to get_range_cached, which for Parquet files also calls self.inner.head(location) internally (around line 715 of object_store_cache.rs) to decide whether the range qualifies as a metadata request. So the real per-file cost is: head -> head (again) -> ranged GET = 3 requests.

The double head is redundant. At minimum, a comment in warm_footer documenting the actual request count would help operators reason about S3 API cost. Longer term, get_range_cached could skip the inner head when file metadata is already cached.

2. warm_full and warm_footer silently depend on eager caching (medium)

Both functions check store.get_opts(...).await.is_ok() without consuming the returned GetResult body. This works today because FoyerObjectStoreCache::get_cached eagerly buffers the entire response body (s.try_collect().await) before returning the GetResult. The tests confirm it - the miss counter increments immediately on the warm_* call.

The concern: this is an implicit coupling to an implementation detail. If the cache is ever refactored to lazy body streaming (to reduce peak memory on large files), both warm functions would silently become no-ops without breaking any test. The safer form:

pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
    match store.get_opts(location, GetOptions::default()).await {
        Ok(result) => result.bytes().await.is_ok(),
        Err(_) => false,
    }
}

At minimum, a comment documenting the eager-caching assumption in each function would prevent a future refactor from silently killing the feature.

3. Double tokio::spawn in warm_cache_for_table -> warm_cache_for_uris (minor)

warm_cache_for_table spawns a task to resolve the table, then calls warm_cache_for_uris, which spawns a second task for the actual warming. Two allocations and context switches where one would do. Not a correctness issue, but it arises because warm_cache_for_uris is fn (not async) and always self-spawns. Making warm_cache_for_uris an async fn (letting the caller own the spawn) would allow both call paths to share one spawn.

4. No timeout on the spawned warm task (minor)

The tokio::spawn in warm_cache_for_uris runs indefinitely. With default concurrency of 4, a single hung S3 connection could stall for_each_concurrent for all remaining files in the batch. Since the task is detached this has no correctness impact, but a per-file tokio::time::timeout (e.g. 30s) would keep the warm job from lingering long after a compaction completes.


Nits

  • pre_uris allocated as empty HashSet when warming is disabled: correctly gated by the same flag inside warm_cache_for_uris, but allocating and diffing an empty set is a subtle red herring. An Option<HashSet> or an early return at the diff site would be clearer.
  • within_recency inclusive boundary: tested and correct, but a short inline comment on date >= cutoff would make the intentional inclusive semantics explicit and prevent a future reader from "fixing" it to >.
  • Return type of warm_full/warm_footer: all call sites use let _ = ..., so the bool return is only useful in tests. A doc comment clarifying "returns true if the fetch completed without error" (not "returns true if the file is now cached") would prevent confusion.

Summary

The design is sound and the implementation is careful. The two medium-priority points - triple head per footer-warm and the implicit eager-caching dependency - are worth addressing before merge. The eager-caching assumption in particular is one refactor away from silently breaking the feature. The rest are minor polish items.

The warm path (and even the pre-existing single-put path) fetched files back
from S3 right after writing them — paying a GET for bytes we already had in
hand. Capture them at write time instead:

- put_opts: cache the PutPayload directly (Arc-backed, cheap clone),
  reconstructing ObjectMeta from the PutResult + known size. Drops the
  post-write inner.get entirely. This is what the otel flush small-file
  writes and Delta log/checkpoint writes go through.

- put_multipart_opts: previously a pure pass-through, so every compaction
  output landed cold. Now wrapped in CachingMultipartUpload, which tees
  written parts into a bounded buffer and inserts the completed file into the
  cache on complete() — no re-download. The buffer is capped
  (timefusion_warm_inline_max_mb, default 32MB): uploads above it stream
  through un-captured so 128MB full-compaction outputs don't blow memory or
  evict the L1 hot set. Those fall back to the selective, recency-filtered
  post-commit warm. abort() drops the buffer.

Net effect: writes up to the cap — including the 16MB light-compaction outputs
on the hot leading edge — warm Foyer for free, zero extra S3 GETs. The
post-commit warm path stays as the backstop for large/cold files; for files
already cached inline it resolves entirely from cache (head + range both hit),
so the two compose with no redundant S3 traffic.

Updated put-path tests (inner_gets now 0 after writes) and added coverage for
both the captured and over-cap multipart paths.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real and well-characterized performance problem: compaction rewrites the partitions dashboards query most frequently, but the new files land cold in S3, producing a p99 sawtooth on exactly those partitions. The fix is layered correctly — inline capture during multipart writes (no re-download) plus a selective post-commit warm for larger outputs — and the design is strictly best-effort so it can never affect correctness.


Architecture

The three-layer approach is sound:

  1. CachingMultipartUpload tees written bytes into a bounded buffer so small compaction outputs (≤32MB) warm the cache at zero extra S3 cost.
  2. warm_cache_for_uris handles larger outputs (and flush) via ranged GETs after the commit is visible, filtered to recent partitions so old/cold data isn't evicted.
  3. Config knobs are well-chosen with safe defaults (warming on, full-file warm off, 2-day recency window, 4-way concurrency).

The detached-task design is correct; warming must not block or fail the commit.


Issues

warm_footer issues two HEAD requests per file (medium)

warm_footer calls store.head() to calculate the range offset, then the get_optsget_range_cached path calls self.inner.head() again unconditionally on cache miss (to determine whether the range qualifies as a metadata request). For N newly-compacted files this doubles the expected HEAD count.

// warm_footer:
let size = match store.head(location).await {};  // HEAD #1 (via head_cached)

// → get_range_cached (called by get_opts):
let file_meta = match self.inner.head(location).await {};  // HEAD #2 (direct to S3)

The simplest fix is to add a size: u64 parameter to warm_footer (callers that just did a compaction can supply it from the written-file size) and skip the outer HEAD. This halves the S3 cost of the post-commit warm path, which matters most when warming many files after a full optimize.

Alternatively, get_range_cached could call self.head_cached() instead of self.inner.head() — but that's a larger refactor touching pre-existing behaviour.

warm_full silently no-ops when get_opts returns a lazy stream (low, verify)

pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
    store.get_opts(location, GetOptions::default()).await.is_ok()
}

This works today because get_cached eagerly reads all bytes from the inner store and inserts them into the cache before returning — so .await.is_ok() without consuming the GetResult is sufficient. But this is a silent invariant: if the get path ever becomes lazy (deferred stream population), warm_full silently becomes a no-op. Consider adding a short doc comment:

/// Works because FoyerObjectStoreCache::get_cached eagerly reads and inserts
/// the full payload before returning; the GetResult payload does not need to
/// be consumed by the caller.

No integration test for the database warming path (low)

The object_store_cache unit tests cover CachingMultipartUpload, warm_footer, and warm_full well. But there is no test that confirms warm_cache_for_uris is actually called after optimize_table / optimize_table_light, or that the pre/post URI diffing correctly identifies only added files. A simple integration test that checks the foyer hit-rate increases after an optimize would close this gap and prevent regressions.


Minor Observations

timefusion_warm_inline_max_mb placement: This field lives in CacheConfig while the other four warm knobs (warm_after_compaction, warm_full_files, warm_recency_days, warm_concurrency) live in MaintenanceConfig. An operator tuning the warming behaviour has to look in two structs. Moving it to MaintenanceConfig would group all warm settings together.

dropped logging: Only the first mismatched URI is logged at the filename level, with a summary count at the end. This is fine for avoiding log spam, but if all URIs fail to relativize (a systematic prefix mismatch), the first-only log makes the issue look isolated. Consider logging both the first and last mismatching URI, or logging the total count at warn instead of debug when dropped == paths.len().

within_recency fallback: Returning true for files with no date= segment (or unparseable dates) is documented and correct, but means non-date-partitioned tables are always warmed entirely. This is safe but could surprise operators who set a narrow warm_recency_days expecting it to apply universally.


What's Done Well

  • Test coverage for the new utility functions is thorough: test_within_recency covers every edge case (inclusive cutoff, no-date-segment, unparseable date, truncated date, nested partitioning); test_multipart_capture_warms_cache exercises both the under-cap and over-cap paths.
  • CachingMultipartUpload abandons capture atomically (self.buffer = None) when the cap is exceeded — no partial entry, no memory leak.
  • Abort handling correctly clears the buffer (self.buffer = None in abort).
  • The pre-existing put_opts warming path was correctly replaced: the old path fetched from S3 after write (inner_gets += 1), the new path reuses the just-written payload — the updated test assertions (inner_gets == 0) confirm the improvement.
  • Config defaults are conservative and well-documented (footers always, full-file opt-in, 2-day window, concurrency-bounded).

Summary

The core design is correct and the implementation is careful. The main actionable issue is the double HEAD in warm_footer — worth fixing before merge since it doubles the S3 request cost of the most common warm path. The warm_full invariant and missing integration test are lower priority but good to address.

…a from S3

Goal: keep recent data fully local so dashboard queries don't hit S3; only
older partitions (~1 week, the cache TTL) fall back to S3.

Three things blocked this:

1. Foyer's disk block size caps the largest entry that can persist to disk,
   and it defaulted to 32MB — so 128MB full-compaction outputs SILENTLY never
   reached disk (even on cold reads). Added timefusion_foyer_block_size_mb
   (default 256MB) and pointed the main data cache's block engine at it, so
   full compaction outputs actually persist locally. (Metadata cache keeps its
   small block size for tiny footer ranges.)

2. Inserting 128MB files into the 512MB L1 would evict the hot small-entry
   working set. Large entries (> timefusion_foyer_l1_max_entry_mb, default
   16MB) are now inserted with foyer Location::OnDisk — they go to disk but
   stay phantom in memory, so warming a big file never thrashes L1. The 16MB
   light-compaction outputs (the hot leading edge) still get L1+disk for
   fastest repeat reads.

3. Caching everything forever would fill the disk with cold data and evict the
   recent week. Writes to partitions older than timefusion_cache_recent_days
   (default 8, just past the 7-day TTL) are not admitted — cold-tier recompress
   rewrites stay on S3. Applied uniformly across inline multipart capture,
   single-put warming, and cold-read caching.

Together with the existing 7-day TTL this gives: recent data served from local
disk, data older than ~1 week served from (and not re-cached from) S3.

The inline multipart cap is now bounded by the block size (so we never buffer
more than disk can store); timefusion_warm_inline_max_mb is an optional tighter
override (default 0 = bound only by block size).

Note for deployment: the disk cache (timefusion_foyer_disk_gb, default 100GB)
must be sized to hold ~1 week of compacted data for "everything local" to hold;
block size must stay >= the compaction target size.

Added tests for the recent-days admission window and the size-based disk
steering; existing multipart-capture test updated for the block-size-bounded cap.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

The PR addresses a real, well-diagnosed problem (post-compaction p99 sawtooth) with a coherent design. The implementation is generally solid and the commit message / inline docs are excellent. A few issues worth fixing before merge.


Bug: metadata cache still uses file_size_bytes as its block size

In SharedFoyerCache::new (around line 300 of object_store_cache.rs), the PR changes the main cache to use the new config.block_size_bytes (256 MB), which is correct. But the metadata cache immediately below it still calls .with_block_size(config.file_size_bytes) (16 MB). This is probably intentional — small footer ranges don't need large blocks — but the two differently-named fields for what is conceptually "this cache's block size" will confuse the next reader. A short comment on the metadata builder clarifying "metadata cache keeps the smaller file_size_bytes block because footer ranges are ≤ 1 MB" would close the ambiguity.


Memory risk: default warm_inline_max_bytes = 0 silently allows 256 MB buffers per upload

warm_inline_max_bytes = 0 means "bound only by block_size_bytes", which defaults to 256 MB. CachingMultipartUpload allocates a Vec<u8> per upload up to that cap. During a full compaction, multiple 128 MB files are written concurrently, each holding a 128 MB buffer. That is a large, unbounded peak allocation on top of the normal working set.

Consider either:

  • Defaulting d_warm_inline_max_mb to something like 16 (covers light compaction outputs; full files fall back to the post-commit warm path), or
  • Adding documentation in the config comment explicitly calling out the per-upload memory cost so operators tune this intentionally.

CachingMultipartUpload bypasses timefusion_warm_full_files

The timefusion_warm_full_files flag gates whether warm_cache_for_uris calls warm_full() after a commit. But CachingMultipartUpload::complete() unconditionally inserts the full file into the main cache regardless of this flag — the inline capture path has no equivalent gate. An operator who sets timefusion_warm_full_files=false to avoid caching full compaction outputs will still get them warmed inline, which is the opposite of what they requested. The inline capture should check the same flag (pass it into CachingMultipartUpload and skip insert_main if it's false), or the two paths need to be explicitly documented as having different semantics.


DRY: within_recency (database.rs) and is_within_recent_window (object_store_cache.rs) are near-identical

Both functions parse date=YYYY-MM-DD from a string/path, compute a cutoff, and return true for unclassifiable paths. The only difference is the input type (&str vs &Path) and how the cutoff is expressed (Option<NaiveDate> vs usize days). Sharing the date-extraction logic in a small private helper would avoid the two diverging over time (e.g. one gets a timezone fix the other doesn't).


Minor: is_within_recent_window calls Utc::now() on every cache insertion

For put_opts on the hot write path this is one syscall per file, which is fine. But if the function is ever called in a tighter loop, it would be cleaner to compute the cutoff once per batch and pass it in — matching the cutoff: Option<NaiveDate> pattern already used by within_recency in database.rs.


Minor: double-nested tokio::spawn in warm_cache_for_table

warm_cache_for_table spawns a task → calls warm_cache_for_uris → spawns another task. The outer spawn exists solely to avoid blocking the flush callback on resolve_table; the inner spawn exists to not block warm_cache_for_uris's caller. The result is a three-level detached chain with no structured cancellation. Not a correctness issue given the best-effort contract, but consider flattening: just have warm_cache_for_table do the resolve_table + warm_cache_for_uris body inline in one spawn, so the task hierarchy is one level deep.


Test coverage

  • The unit tests for within_recency, is_within_recent_window, warm_footer, warm_full, and CachingMultipartUpload are thorough and well-named.
  • The config defaults test in config.rs is a good regression pin.
  • Missing: an integration test confirming that after a simulated optimize commit the newly-added URIs are actually routed through warm_cache_for_uris. The existing tests cover the object-store-layer pieces in isolation; the database-layer wiring (the diff/diff + fire-and-forget spawn path) is untested.

Nits

  • d_warm_inline_max_mb = 0 with the "0 = bound only by block size" semantics is a sentinel value; the comment in FoyerCacheConfig mentions it but the config-level doc is easy to miss. A dedicated Option<usize> would be more self-documenting.
  • The info log "foyer main hit rate before warm was {:.2}% (next query benefits)" fires for every warm job. At the default 5-minute light-optimize cadence that's 288 lines/day of info-level noise per table. Consider making this debug or gating it behind the full-files flag.

Two follow-ups so the local cache "just works" and uses the disk operators
actually have:

- Block size now auto-tracks the compaction target. from_app_config floors the
  effective disk block size at 2x timefusion_optimize_target_size, so raising
  the optimize target can never silently make the bigger outputs un-cacheable.
  The configured timefusion_foyer_block_size_mb is now a floor/override. Warns
  if the block size exceeds disk capacity.

- Default disk cache bumped 100GB -> 500GB. Servers run 500GB-1TB cache volumes
  and local disk is far cheaper than repeated S3 GETs, so default to keeping
  much more data local. foyer backs this with a sparse file (no upfront
  allocation), but it's the logical eviction ceiling, so it must stay <= the
  cache volume's free space (documented).

Added tests for block-size tracking (2x target floor + configured floor) and
locked in the new disk default.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

This PR is well-designed and well-documented. Several solid reviews have already been posted — this one adds findings not yet covered.


High: Default d_foyer_disk_gb 100 → 500 is a silent breaking change for operators upgrading

// config.rs
-const_default!(d_foyer_disk_gb: usize = 100);
+const_default!(d_foyer_disk_gb: usize = 500);

If an existing deployment has less than 500 GB free on the cache volume, Foyer will silently accept writes until space runs out, then fail with ENOSPC during eviction (not at startup). The comment in the config correctly warns about this, but operators upgrading via a config reload or restart won't see it unless they read the diff. Consider:

  • Adding a startup check that compares the configured value against available disk space and logs a clear warning if it exceeds a safe threshold, or
  • Leaving the default at 100 GB and documenting the recommended upgrade path (e.g., "bump this after verifying free space").

Medium: Effective foyer block size changes from 32 MB → 256 MB silently for all existing deployments

The PR changes which field drives BlockEngineConfig::with_block_size:

// Before: file_size_bytes (d_foyer_file_size_mb = 32MB)
.with_block_size(config.file_size_bytes)

// After: block_size_bytes (d_foyer_block_size_mb = 256MB)
.with_block_size(config.block_size_bytes)

Foyer's block size is its eviction granularity: a 32 MB block allows ~3,000 blocks in 100 GB; a 256 MB block allows only ~390 blocks in 500 GB (or ~195 in 100 GB). Existing deployments upgrading will silently switch from fine-grained to coarse-grained eviction. The PR is correct that 256 MB is necessary to cache 128 MB compaction outputs — but this behavioral change deserves a startup-log notice alongside the block-size log that was already added.


Minor: CachingMultipartUpload buffer is not pre-allocated

buffer: Some(Vec::new()),

The buffer starts empty and will reallocate multiple times as a 128 MB compaction output streams in. Given that max_warm_bytes (the cap) is known at construction time, pre-allocating to a reasonable fraction of it would eliminate most reallocations on the hot path:

buffer: Some(Vec::with_capacity(cap.min(64 * 1024 * 1024))),

Minor: warm_footer with metadata_size_hint = 0 issues a 1-byte range GET

let start = size.saturating_sub(metadata_size_hint.max(1));

When metadata_size_hint is 0 (valid per-config, no explicit validation), max(1) limits it to size - 1, producing a 1-byte range. A 1-byte GET warms nothing useful but still costs one S3 request and one HEAD. Either document that 0 is not a useful value, clamp the minimum to a reasonable floor (e.g., 4096), or return early with false.


Confirms from prior reviews

The following issues from earlier comments are well-described and worth tracking:

  • Double HEAD in warm_footer — the outer store.head() + inner self.inner.head() in get_range_cached doubles per-file HEAD cost.
  • CachingMultipartUpload bypasses timefusion_warm_full_files — inline capture warms full files even when the flag is off.
  • is_within_recent_window calls Utc::now() per insert — minor; a pre-computed cutoff would be cleaner.
  • Double tokio::spawn in warm_cache_for_tablewarm_cache_for_uris.

What's working well

  • The inline capture path (CachingMultipartUpload) is the right architectural choice: warming from already-held bytes at zero extra S3 cost, with the cap ensuring it gracefully degrades to the post-commit path for large files.
  • The abort path correctly clears the buffer.
  • The explicit drop(table) added to optimize_table_light is a genuine latency improvement — the write lock was held through the return statement before this PR.
  • Config test additions correctly pin all new defaults.

- Sliding TTL: cache entries now expire `ttl` after their last *query*, not
  their insertion. On a hit, once an entry is past the halfway mark to expiry,
  maybe_touch re-inserts it with a fresh timestamp so frequently-queried data
  stays local indefinitely while cold data still ages out after the TTL (7d
  default). Throttled to at most one refresh per ttl/2 per key (the halfway
  gate + the existing `refreshing` in-flight guard), and the re-insert runs in
  the background off a cheap entry clone so queries never block on it (the data
  clone happens in the spawned task). Applied to full-file, ranged full-file,
  and metadata-cache hits.

- Bumped default L1 memory cache 512MB -> 1GB to use the RAM on the bigger
  boxes; small/medium entries stay hot in memory, large files remain disk-only.

Added a sliding-TTL test (entry survives past the base TTL because it was
queried) and locked in the 1GB default.

https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real problem: compaction rewrites files to S3 via multipart without warming Foyer, so the first dashboard read after every optimize hits S3 cold and produces the p99 sawtooth. The fix adds two warming paths — an inline multipart tee and a post-commit diff-based sweep — both strictly best-effort so they never block or fail a commit. The approach is sound and the test coverage is solid.


Issues

1. Duplicate recency-check logic (logic drift risk)

within_recency in database.rs and is_within_recent_window in object_store_cache.rs implement the same date=YYYY-MM-DD parse-and-compare with slightly different signatures:

// database.rs — takes &str + Option<NaiveDate>
fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {}

// object_store_cache.rs — takes &Path + usize days
fn is_within_recent_window(location: &Path, recent_days: usize) -> bool {}

Both live in different modules with no shared call path, so they'll diverge. Consider extracting a single parse_partition_date(s: &str) -> Option<NaiveDate> helper into a common module and having both callers use it.

2. Aggressive default bump may surprise operators on smaller hosts

d_foyer_memory_mb: 512 → 1024  (2x)
d_foyer_disk_gb:   100 → 500   (5x)

The 500GB disk ceiling is the default for all deployments. The comment explains sparse allocation, but the ENOSPC threshold is still 500GB. Operators on 100–200GB volumes who don't read release notes will hit this in production. Consider keeping the prior defaults and documenting the recommended production values in comments, or at minimum logging a startup warning that compares the configured ceiling against observed free space.

3. warm_full drops the response stream unconsumed (connection pool risk)

pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
    store.get_opts(location, GetOptions::default()).await.is_ok()
}

When store is a FoyerObjectStoreCache, get_opts fetches and caches the body before returning, so the drop is safe. But the signature takes &dyn ObjectStore, so nothing in the type system prevents a caller from passing a raw S3 store. In that case the returned GetResult stream isn't consumed, which can exhaust the HTTP connection pool. Add a doc comment making the expectation explicit, or take &FoyerObjectStoreCache directly.

4. maybe_touch spawns a task that may not be tracked in background_tasks

let handle = tokio::spawn(async move {});        // always spawned
if let Ok(mut tasks) = self.background_tasks.try_lock() {
    tasks.spawn(async move { let _ = handle.await; });
}

If try_lock() fails, the task runs detached and shutdown() won't join it. For a best-effort touch this is probably acceptable, but if background_tasks is there specifically to drain on shutdown, this creates a window where a refresh is in-flight while the cache is torn down. The simpler path: spawn directly into background_tasks (or use try_lock to spawn the actual work, not just a handle-joiner).

5. Two separate "recency" controls with overlapping semantics

The PR introduces:

  • timefusion_cache_recent_days (CacheConfig, default 8) — gate on all writes admitted to cache
  • timefusion_warm_recency_days (MaintenanceConfig, default 2) — gate on which files the post-commit warm job fetches

Both parse the same date= segment for the same purpose. An operator tuning one may not realize the other also applies. A short note in each config doc comment cross-referencing the other would help.

6. Double task spawn in warm_cache_for_table

pub fn warm_cache_for_table(&self,) {
    tokio::spawn(async move {
        if let Ok(table_ref) = db.resolve_table().await {
            …
            db.warm_cache_for_uris(store, table_uri, uris);  // spawns another task
        }
    });
}

warm_cache_for_uris unconditionally spawns its own task, so this is two levels of tokio::spawn for every flush callback. There's no technical bug, but if the table lookup is fast (cached), the outer task is just overhead. Worth noting for future cleanup.


Positives

  • Correct diff semantics: collecting pre_uris before optimize and diffing against post-commit live_uris ensures only newly-written (cold) files are warmed — existing cached files are untouched.
  • Bounded concurrency and memory: timefusion_warm_concurrency caps S3 GET burst; max_warm_bytes (capped at block_size_bytes) prevents multipart capture from ballooning memory.
  • Never blocks commits: all warming is fire-and-forget; errors are swallowed everywhere warming is called.
  • Block size auto-tracking: flooring block_size_bytes at 2 × optimize_target so ops can't accidentally misconfigure the two out of sync is a nice invariant.
  • Test coverage: test_multipart_capture_warms_cache, test_sliding_ttl_refresh_on_query, test_warm_footer_primes_metadata_cache, etc. all test observable behavior rather than internals, and the edge cases in test_within_recency are thorough.
  • Startup warning: the block_size > disk_size check in from_app_config catches a configuration that would silently prevent large files from being persisted.

Summary

The core approach is solid and the implementation is careful about best-effort semantics. The main asks before merge:

  1. De-duplicate the recency-check logic (or at minimum add a test that both functions agree for a shared set of inputs).
  2. Reconsider the 500GB disk default, or add a startup log that warns when the configured ceiling exceeds observed free space.
  3. Add a doc comment to warm_full clarifying that store must be a caching store, not a raw S3 client.

After an optimize/compaction commit, proactively evict the cached
full-file bytes of the files it tombstoned (present pre-commit, gone
post-commit) instead of waiting for VACUUM / TTL / LRU to reclaim them.

- Add SharedFoyerCache::evict_data_entry to drop a main-cache entry by key.
- Add Database::evict_cache_for_uris, mirroring warm_cache_for_uris'
  relativization, called inline after both optimize paths swap the table.
- Gate on new timefusion_evict_after_compaction config (default on).

Correctness is unaffected: the files still exist in S3 until VACUUM, so a
straggler query on the old Delta snapshot just re-reads them (a cache
miss), never a wrong result. Eviction is in-cache only (no S3), so it
runs inline.

The test asserts on the in-memory layer: foyer's HybridCache::remove
drops the memory entry synchronously but deletes the on-disk copy
asynchronously, so an "evict then read" assertion would race the disk
delete.
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview: The PR addresses a real p99 sawtooth problem — compaction outputs stream to S3 via multipart but the cache-write path was a pass-through, so every post-compaction dashboard read cold-started from S3. The fix is well-motivated and the implementation is thorough (configurable, best-effort, non-blocking), with good test coverage. A few issues worth addressing before merge.


Bugs / Correctness

warm_full() is called with the raw object store, not the caching store (potential no-op)

In warm_cache_for_uris, the object_store argument comes from t.log_store().object_store(None) / new_table.log_store().object_store(None). Whether this returns the FoyerObjectStoreCache wrapper or the raw S3 store depends on how the log store is wired up. If it returns the raw store, calling warm_footer/warm_full on it fetches from S3 but nothing gets cached — the whole warm path is a silent no-op. This should be verified and a debug! or warn! added if the store doesn't implement the caching trait (or document the invariant clearly).

maybe_touch double-spawn is fragile

let handle = tokio::spawn(async move { ... });   // task 1 — always spawned
if let Ok(mut tasks) = self.background_tasks.try_lock() {
    tasks.spawn(async move { let _ = handle.await; }); // task 2 — sometimes spawned
}

If try_lock() fails, task 1 is orphaned (detached). At shutdown, the JoinSet won't wait for it to finish, so the refreshing.remove(&key) inside it may race with a new refresh attempt. Simpler fix: spawn a single task and add it directly to the JoinSet under the lock. If the lock isn't needed for the refresh logic itself, tokio::spawn alone is fine.


Breaking Default Change

d_foyer_disk_gb jumps from 100 → 500. The comment explains the rationale well, but an existing deployment with < 500 GB on the cache volume will hit ENOSPC on startup (foyer validates capacity). This is silent data-path breakage — the cache silently fails to write.

Recommend:

  • Document this in a CHANGELOG / migration note
  • Consider adding a startup check that warns loudly if disk_size_bytes > available_fs_space

Code Quality

Duplicated pre-compaction tracking block (copy/paste, ~9 lines)

The track_files / pre_uris block appears verbatim in both optimize_table (around line 2107) and optimize_table_light (around line 2406). Extract to a method, e.g.:

fn snapshot_file_uris(&self, table: &DeltaTable) -> Option<HashSet<String>> {
    if self.config.maintenance.timefusion_warm_after_compaction
        || self.config.maintenance.timefusion_evict_after_compaction
    {
        Some(table.get_file_uris().map(|it| it.collect()).unwrap_or_default())
    } else {
        None
    }
}

Two confusingly similar "recency days" configs

  • timefusion_cache_recent_days (CacheConfig, default 8) — governs cache admission on write (cache layer)
  • timefusion_warm_recency_days (MaintenanceConfig, default 2) — governs which files the post-commit warm job fetches

Both configs filter on date= partition age but they're in different structs with different defaults and neither doc-comment mentions the other. An operator will likely set one and be surprised the other still limits behavior. Consider adding cross-references in the doc comments or consolidating into one config.

Duplicated date= parsing logic

within_recency(uri: &str, ...) in database.rs and is_within_recent_window(location: &Path, ...) in object_store_cache.rs both parse date=YYYY-MM-DD with the same find("date=") + get(i+5..i+15) + parse_from_str pattern. Extracting a shared parse_hive_date(s: &str) -> Option<NaiveDate> would eliminate the duplication.


Performance

warm_footer issues two S3 requests per file (head + ranged GET)

Even for files already in the main cache (warmed inline via put_opts or CachingMultipartUpload), warm_footer still issues a head to get the file size, then a ranged GET. For the common case where the file was just written and the write path already cached it, these are unnecessary. Could short-circuit with a cache.contains(&key) check before issuing the head.

added.clone() in main.rs flush callback

The clone is correct (noted in the comment), but worth calling out: for high-ingest tables this clone could be O(thousands of strings). Passing Arc<Vec<String>> to both the tantivy indexer and the warm path would avoid the clone if this ever becomes measurable.


Test Coverage

Tests are comprehensive for the new cache primitives (test_multipart_capture_warms_cache, test_sliding_ttl_refresh_on_query, test_warm_footer_primes_metadata_cache, etc.). Missing:

  • Integration test for evict_cache_for_uris: The test_evict_data_entry_removes_cached_file test exercises the SharedFoyerCache method but doesn't test the database-level wiring (that tombstoned files are actually evicted post-optimize_table).
  • Prefix mismatch path: The dropped counter path in warm_cache_for_uris (URI doesn't start with table prefix) has no test.

Minor

  • put_opts reconstructs ObjectMeta with last_modified: Utc::now() which may differ slightly from S3's actual mtime. If any downstream code uses last_modified for staleness checks this could cause subtle issues. Consider using DateTime::<Utc>::UNIX_EPOCH or documenting that the timestamp is approximate.
  • recency_days.min(3650) cap avoids the overflow comment mentions, but a u64 days value of 3650 * 86400 seconds fits comfortably in i64 — the min(3650) is fine but the comment could just say "cap to avoid chrono::Duration::days panicking on large inputs."

Overall the approach is sound and the implementation handles the important edge cases (best-effort, non-blocking, concurrency-bounded, eviction of tombstoned files). The breaking default change and potential warm no-op when the raw store is passed are the two items I'd want confirmed before merging.

@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review — Warm Foyer cache after compaction/flush

The problem statement is clear and well-motivated. The p99 sawtooth on leading-edge partitions is a real pain point and the design (warm from the existing read path, never block a commit, strictly best-effort) is sound. Several improvements are warranted before merging.


Potential Bugs

1. Cache key inconsistency in CachingMultipartUpload (medium-high risk)

CachingMultipartUpload::complete inserts the entry using:

insert_main(&self.cache, self.location.to_string(), ...);

But FoyerObjectStoreCache::put_opts (non-multipart path, same PR) inserts via:

self.insert_main_value(location, ...) // -> Self::make_cache_key(location)

If make_cache_key does anything beyond location.to_string() (e.g. strips a bucket prefix, normalises slashes), multipart-warmed entries land under a different key and every subsequent read misses. The test checks stats.main.hits but does not assert a cache hit via a plain get after put_multipart — worth auditing make_cache_key and adding a cross-path key-consistency test.

2. warm_full silently does nothing if the inner store is lazy

pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
    store.get_opts(location, GetOptions::default()).await.is_ok()
}

This works today because FoyerObjectStoreCache::get_opts eagerly reads the body before returning. But GetResult's payload is a stream; if this is called with a different store implementation, the unread stream is dropped and zero bytes land in cache with no error. The function should explicitly consume the body:

pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
    match store.get_opts(location, GetOptions::default()).await {
        Ok(r) => r.bytes().await.is_ok(),
        Err(_) => false,
    }
}

3. ttl.as_millis() as u64 silent truncation in maybe_touch

Duration::as_millis() returns u128; casting to u64 truncates silently. The default 7-day TTL (604_800_000 ms) fits fine, but prefer an explicit clamp:

self.config.ttl.as_millis().min(u64::MAX as u128) as u64

Code Quality and Design

4. Warm/evict block copy-pasted across optimize_table and optimize_table_light

Lines ~2183-2197 and ~2451-2469 are nearly identical (compute added/removed, acquire+drop write lock, call evict then warm). Extracting a helper avoids the two paths drifting — the evict call was already missing from optimize_table_light before this PR and could regress again.

5. Double-spawn in warm_cache_for_table -> warm_cache_for_uris

warm_cache_for_table already detaches a task to resolve the table; inside, warm_cache_for_uris spawns another task. Making warm_cache_for_uris async and awaiting it directly inside the outer spawned task removes the unnecessary nesting.

6. Duplicate recency-filtering logic

within_recency (database.rs:169) and is_within_recent_window (object_store_cache.rs:660) both parse date= partition segments and check recency but diverge in signature. A shared utility keeps them in sync.


Default/Configuration Concerns

7. d_foyer_disk_gb 100 -> 500 GB is a 5x increase

This will ENOSPC deployments on typical cloud instances with 100-200 GB data disks. The comment is good but the default is too aggressive. Consider keeping 100 GB as the default and documenting 500 GB as a recommended production value, or emitting a startup warning when the configured value approaches available disk space.

8. d_vacuum_retention 72 -> 48 h is a silent behavioral change

The rationale is sound, but operators with long-running analytics queries (overnight batch jobs against Delta history) will see files vacuumed mid-query after upgrading. Worth calling out in the PR description and emitting a startup log when the value is below the old 72 h default.


Minor Nits

  • warm_footer issues HEAD + ranged GET: two S3 calls per file on every warm. If ObjectMeta is already in the metadata cache the HEAD could be skipped. Low priority given warming is infrequent.
  • maybe_touch orphaned JoinHandle: if try_lock() fails the handle is dropped (the task detaches and runs fine, but background_tasks won't join it on shutdown). Acceptable for best-effort, but worth a comment so future maintainers don't assume background_tasks is exhaustive.
  • Log format: full=true/false in warm_cache_for_uris is harder to filter in Loki than a labelled string like full=footer-only / full=yes.

Test Coverage

The test suite is thorough and the new tests cover all the main paths well. Two gaps worth adding:

  1. Cross-path key consistency: write via put_multipart, read back with a plain get, assert a cache hit. Guards against the key mismatch in item 1.
  2. Evict + miss round-trip: call evict_data_entry, then confirm a subsequent read is a miss (not served from L1), ensuring the evict key format matches the cache key format used on read.

Summary

The core mechanism is correct and the test coverage is solid. The two things to resolve before merging are: (1) verify the cache key consistency in CachingMultipartUpload vs make_cache_key, and (2) make warm_full consume the stream rather than relying on eager evaluation. The 500 GB disk default needs to be revisited for safety on smaller deployments. Everything else is refactoring and polish.

…dedup

- Route CachingMultipartUpload::complete through make_cache_key so a
  multipart-warmed entry stays findable by a later GET if the key
  derivation ever changes.
- Drain the GET body in warm_full so warming is correct for any inner
  store, not just the eager Foyer path.
- Clamp ttl.as_millis() before the u64 cast in maybe_touch.
- Extract swap_and_refresh_cache shared by both optimize paths so the
  warm/evict pair can't drift; document the orphaned-handle case.
- Make warm_cache_for_uris async and await it directly, removing the
  double-spawn under warm_cache_for_table.
- Unify date= partition recency parsing behind date_partition_within.
- Label warm logs scope=full/footer-only instead of full=true/false.
- Add cross-path key-consistency test (multipart warm -> read -> evict).
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review — PR #39: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real, well-diagnosed problem: post-compaction p99 sawtooths caused by Foyer cache misses. The design is solid — best-effort, non-blocking, fire-and-forget, with Delta snapshot isolation guaranteeing correctness is unaffected. The sliding-TTL refresh and CachingMultipartUpload tee-path are particularly thoughtful additions. Good test coverage on the new cache primitives.

A few issues worth addressing before merge, ranging from a doc-comment bug to operational risk from the default changes.


Bugs

1. Fused doc comment on date_partition_within (object_store_cache.rs)

The function has two separate doc comment blocks merged into one paragraph:

/// Whether `location` should be admitted to the cache given the recent-days
/// window. ... Keeps cold-tier rewrites ... out of the cache ...
/// Parse the `date=YYYY-MM-DD` partition segment from `s` and return whether it
/// is on or after `cutoff`. ... a `None` cutoff means no age limit.

The first block describes a cache-admission gate; the second describes what the function actually does. The first block appears to be a leftover from a function that was refactored away or renamed during development. The result is a misleading doc comment — the function is a general date parser, not an admission gate. The admission gate is is_within_recent_window/insert_main_value.

Fix: Keep only the second block (starting at "Parse the date=YYYY-MM-DD..."), plus the "Shared by..." line.


Medium Issues

2. Memory pressure from CachingMultipartUpload defaults (object_store_cache.rs)

The default inline-warm cap is block_size_bytes = 256MB. With warm_concurrency = 4, up to 4 concurrent compactions could each hold a 256MB buffer simultaneously — 1GB of transient heap during a busy maintenance window. The warm_inline_max_bytes escape hatch defaults to 0 (= bound only by block size), so there's no soft cap by default.

This is fine for the described hardware profile (500GB cache volumes likely live on large-memory nodes), but worth documenting explicitly in the config comment for timefusion_foyer_block_size_mb so operators on smaller instances know to set timefusion_warm_inline_max_mb.

3. warm_footer issues two S3 round-trips per file (object_store_cache.rs, warm_footer)

warm_footer does store.head() to get file size, then a ranged get_opts(). With N newly-compacted files, that's 2N S3 requests on the warm path. If the object store supports suffix ranges (GetRange::Suffix), you could issue one ranged GET to fetch the last metadata_size_hint bytes without needing the file size. Not all stores support suffix ranges, but it's worth a comment explaining why HEAD is preferred over GetRange::Suffix.

4. Evict path silently skips prefix mismatches (database.rs, evict_cache_for_uris)

warm_cache_for_uris logs the first URI that fails to relativize (with a note that repeated mismatches are counted). evict_cache_for_uris silently skips mismatches with no log. A systematic mismatch on the evict path means tombstoned files linger in cache until TTL/LRU — the same diagnosability concern that motivated the log in the warm path applies here.


Operational / Default-Change Risk

5. Default value changes are large and potentially breaking

Three defaults changed simultaneously:

Config Old New Impact
timefusion_foyer_disk_gb 100 GB 500 GB 5× increase; will hit ENOSPC on nodes with < 500 GB cache volumes
timefusion_foyer_memory_mb 512 MB 1024 MB Doubles L1 memory footprint
timefusion_vacuum_retention 72 h 48 h Shrinks the safety window for in-flight queries

The comments for each are good and the ENOSPC warning in the config is helpful. However, the 100→500GB jump is the riskiest: it can cause silent data loss on underpowered nodes (foyer silently fails writes when the disk is full rather than returning an error to the caller). Consider whether this warrants a migration note or a runtime check against available disk space during startup.


Minor / Style

6. within_recency wrapper in database.rs adds indirection without value

fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
    crate::object_store_cache::date_partition_within(uri, cutoff)
}

The wrapper exists solely to be called in warm_cache_for_uris. The comment says it's the "single source of truth for date= partition recency parsing" — but the source of truth is in object_store_cache, not here. Calling crate::object_store_cache::date_partition_within directly at the two call sites in warm_cache_for_uris would be clearer (and the within_recency name is more ambiguous than date_partition_within).

7. track_files / pre_uris block is duplicated

The same 5-line track_files / pre_uris block appears in both optimize_table (line ~2138) and optimize_table_light (line ~2423). Both funnel through swap_and_refresh_cache, which already handles the diff. A helper that takes &DeltaTable and returns HashSet<String> would eliminate the duplication and make it harder to diverge again.

8. No database-level unit test for warm_cache_for_uris filtering

The object-store-cache primitives (warm_footer, warm_full, is_within_recent_window) are well tested. But warm_cache_for_uris has non-trivial logic (prefix relativization, recency filtering, the dropped-URI counter) that isn't directly exercised. A targeted unit test with synthetic URIs — including one that fails the prefix strip — would catch regressions to the most complex part of the warm path.


Positive Callouts

  • The sliding-TTL refresh (maybe_touch) is a clean solution to the cache-expiry cliff problem — the half-TTL gate + single in-flight refresh per key is well thought out.
  • CachingMultipartUpload correctly abandons the buffer on overflow rather than dropping parts silently, and the abort path clears it.
  • swap_and_refresh_cache is a good consolidation — the old code had the evict call missing from the light path, and a single helper keeps them in lockstep.
  • The baseline hit-rate logging (pre-warm, not post-warm) is a nice observability touch.
  • Test coverage on the new primitives (test_multipart_capture_warms_cache, test_sliding_ttl_refresh_on_query, test_multipart_warm_read_and_evict_key_consistency) is thorough and well-labeled.

- Fix the fused doc comment on date_partition_within: move the stranded
  cache-admission description back onto is_within_recent_window and keep
  only the parser doc (plus the shared-by note) on date_partition_within.
- Log the first prefix mismatch (and a skipped count) in
  evict_cache_for_uris, mirroring the warm path, so a systematic
  relativization drift that leaves tombstoned files in cache is
  diagnosable instead of silently swallowed.
- Document the worst-case transient heap (block_size_mb * warm_concurrency)
  on the foyer_block_size_mb config and point smaller instances at
  timefusion_warm_inline_max_mb.
- Comment why warm_footer prefers HEAD + bounded GET over GetRange::Suffix
  (range cache keys need the absolute file size; HEAD also primes
  metadata cache; suffix ranges aren't universally supported).
@claude

claude Bot commented Jun 6, 2026

Copy link
Copy Markdown

Code Review

PR: Warm Foyer cache after compaction/flush to kill the post-compaction p99 sawtooth

Summary: This is a well-motivated and thoughtfully implemented PR. The root cause (compacted files landing cold in S3 bypassing Foyer) is correctly diagnosed, and the fix correctly targets the post-commit warm/evict lifecycle without touching the hot write path for correctness. The detached-task approach and best-effort error swallowing are appropriate. Below are findings ranging from blocking concerns to nits.


Potential Issues

1. Aggressive default disk-cache increase may cause ENOSPC on existing deployments

d_foyer_disk_gb jumps from 100 → 500 GB. The comment correctly warns that foyer won't evict until this logical ceiling is hit, and will return ENOSPC if it exceeds real free space. For any existing deployment on a <500 GB volume this is a silent foot-gun on upgrade. Consider keeping 100 GB as the default and documenting the larger value as opt-in, or at least adding a startup check/warning (the block_size > disk_size warn in from_app_config is a good pattern to replicate here):

if config.disk_size_bytes > available_disk_bytes {
    tracing::warn!("foyer disk ceiling ({}GB) may exceed available space …",);
}

2. Vacuum retention reduced from 72 h → 48 h — risky default change

The comment justifies this with "no query runs beyond ~1 h," but that assumption isn't enforced anywhere. A long analytical query, a paused DataFusion session, or a client with a held Delta snapshot could hold a reference past 48 h, at which point VACUUM will delete files the snapshot still needs (yielding 404s or corrupt results). 72 h was already conservative; tightening it in the same PR as cache infrastructure changes increases blast radius. Suggest keeping 72 h or separating this into its own PR with explicit discussion.

3. maybe_touch clones large entry bytes on the query hot path

// object_store_cache.rs ~line 891
let v = entry.value();
insert_main(&cache, key.clone(), CacheValue::new(v.data.clone(), v.meta.clone()),);

For a 128 MB cached file this is a 128 MB allocation inside a spawned task on every cache hit older than ttl/2. If 10 concurrent queries all hit the same stale 128 MB entry within the same window, 10 × 128 MB of transient heap allocations pile up before refreshing.remove drains. The refreshing DashSet guard prevents concurrent insertion for the same key, but 10 different large files can each spawn independently. Consider bounding the sliding-TTL refresh to entries below l1_max_entry_bytes (large entries are disk-only and their access latency isn't as sensitive to memory placement):

if l1_max_entry_bytes > 0 && entry.value().data.len() > l1_max_entry_bytes {
    return; // disk-only entries: skip slide-TTL, LRU handles them
}

4. Dual prefix-strip logic duplicated between warm_cache_for_uris and evict_cache_for_uris

Both functions independently compute:

let prefix = table_uri.split('?').next().unwrap_or().trim_end_matches('/');

and both handle prefix-mismatch drops with near-identical debug! messages and counters. Extract a small helper (e.g., relativize_uri(table_uri: &str, full_uri: &str) -> Option<Path>) so the two paths can't drift and the log format stays consistent.


Minor / Nits

5. Two S3 round-trips per footer warm (HEAD + GET)

warm_footer correctly explains why suffix-range GET is avoided (cache-key consistency). The trade-off is fine as documented. One note: if a file was just written by this process the ObjectMeta is often already available (returned from put_opts). Passing it in optionally would let the footer warm skip the HEAD:

pub async fn warm_footer(store: &dyn ObjectStore, location: &Path, size_hint: u64, known_size: Option<u64>) -> bool

Not blocking — the existing behaviour is correct.

6. test_sliding_ttl_refresh_on_query uses wall-clock sleeps

The test relies on tokio::time::sleep with tight margins (600 ms + 200 ms + 400 ms = 1200 ms > 1000 ms TTL). Under a loaded CI runner this can produce a miss instead of a hit in the final assertion. Using tokio::time::pause()/tokio::time::advance() would make this test deterministic. If you want to keep real sleeps, at least widen the margins.

7. d_cache_recent_days (8) vs d_warm_recency_days (2) — consider co-locating the docs

These two windows serve related but distinct roles: one gates cache admission, the other gates warming. Operators tuning one will likely need to tune the other; a cross-reference comment linking them would reduce misconfiguration risk.

8. maybe_touch TTL arithmetic is correct but awkward

if age.saturating_mul(2) <= self.config.ttl.as_millis().min(u64::MAX as u128) as u64 {

This is correct. Optionally, it's slightly clearer written as:

let half_ttl = (self.config.ttl.as_millis().min(u64::MAX as u128) as u64) / 2;
if age <= half_ttl { return; }

Not a blocking issue.


What's Well Done

  • swap_and_refresh_cache is an excellent refactor — it guarantees warm and evict are always applied as a pair after both full and light optimize paths, with a clear comment explaining the prior divergence bug.
  • date_partition_within as a shared single source of truth for the recency window avoids parser drift between the cache admission and warming paths. The test coverage (test_within_recency) is thorough.
  • CachingMultipartUpload tee is a clean solution to avoid re-downloading a file just uploaded. The cap + abandon pattern is the right defence against memory bloat.
  • evict_cache_for_uris running inline (no spawn) while warm_cache_for_uris detaches is the right call — eviction is cheap and should happen before the next query; warming is I/O-bound and must not block the maintenance loop.
  • Test coverage is comprehensive: key-consistency, recency window, multipart capture, sliding TTL, and footer/full-file warm are all independently exercised.

Verdict: Approve with the caveat that items 1 (disk default ENOSPC risk) and 2 (vacuum retention reduction) deserve a second look before merge — both are backward-incompatible behaviour changes that are orthogonal to the core caching fix and should be either separated or explicitly called out in the merge description.

Read path (the query-latency win): in the default footer-only warm mode a
cached footer read still paid 2 uncached S3 HEADs per file. Now:
- get_range_cached probes the metadata range cache before any HEAD, so a
  steady-state footer hit pays zero S3 round-trips.
- head_cached caches immutable parquet ObjectMeta (by a #meta key) and is
  used in place of inner.head, so the meta HEAD is paid at most once per
  file. Mutable paths (Delta log / _last_checkpoint) are never meta-cached.

Warm path: warm_footer now issues a single GetRange::Suffix instead of
HEAD + bounded GET. get_opts learns the resolved absolute range + size from
the suffix response and caches the footer under the same absolute key
bounded reads use, plus the ObjectMeta. Falls back to HEAD + bounded GET
for stores without suffix-range support.

Adds a CountingStore-backed test asserting a suffix warm is 1 GET / 0 HEAD
and a later warmed footer read is 0 HEAD / 0 GET.
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Summary: This PR eliminates the post-compaction p99 latency sawtooth by warming the Foyer cache for newly-written Parquet files after flush/optimize commits, while proactively evicting tombstoned files. It also adds several complementary improvements: multipart upload tee-to-cache, sliding TTL refresh, HEAD result caching, suffix-range normalisation, and a recency admission window for cache entries.

Overall this is well-engineered and thoroughly documented. The best-effort pattern is applied consistently throughout. A few issues are worth addressing before merge.


Bugs / Correctness

warm_footer_via_head does not drain the response stream

// warm_footer (suffix path) — correctly drains:
Ok(result) => result.bytes().await.is_ok(),

// warm_footer_via_head (fallback) — does NOT drain:
store.get_opts(location, opts).await.is_ok()

For FoyerObjectStoreCache this is harmless because bytes are materialised inside get_range_cached before the stream is returned. But for any other ObjectStore implementation the HTTP body would leak. Add .and_then(|r| async move { r.bytes().await.map(|_| ()) }) or similar to be consistent and correct against generic stores.

pre_uris placement inside optimize_table_light retry loop

pre_uris is captured at the top of each attempt iteration. On a retry the pre-state is re-snapped from the latest table version, not from before the first attempt. Because the Delta snapshot was not committed on the failing attempt this is arguably correct, but it is subtle enough to deserve an inline comment explaining the intent.


Breaking Default Changes

Three default values changed with no migration note in the PR description:

Setting Old New Impact
timefusion_foyer_memory_mb 512 MB 1 024 MB Doubles in-process memory on upgrade
timefusion_foyer_disk_gb 100 GB 500 GB Will hit ENOSPC on volumes < 500 GB
timefusion_vacuum_retention 72 h 48 h Shorter safety margin, unrelated to this PR

The foyer_disk_gb change in particular is risky: the comment correctly notes that foyer allocates sparsely, but the ENOSPC floor is the logical ceiling — a 200 GB cache volume will start failing writes before eviction kicks in. Consider keeping 100 GB as the default (or at least 200 GB) and letting operators opt in to 500 GB, or add a startup check that warns when the configured size exceeds the volume's available space.

The vacuum_retention reduction is unrelated to cache warming; it should be in a separate PR so its risk surface is reviewed independently.


Design Observations

Two partially-overlapping "recency" configs

  • timefusion_cache_recent_days (CacheConfig, default 8) — governs cache admission at write time
  • timefusion_warm_recency_days (MaintenanceConfig, default 2) — governs which files the post-commit warm fetches

These serve different purposes but an operator tuning one might expect the other to follow. The relationship is not documented in either field's doc comment. Add a cross-reference so it is clear that warming respects its own narrower window and admission respects a broader one.

warm_cache_for_uris docstring claims "detached task" but it awaits directly

/// Runs in a detached, concurrency-bounded task; never blocks or fails the commit.
async fn warm_cache_for_uris(&self, ...) { ... }

The function is itself async and is awaited by callers (the detachment happens in swap_and_refresh_cache and warm_cache_for_table). The docstring correctly belongs on those callers, not here, to avoid confusion about who is responsible for the spawn.


Test Coverage

The new test suite is excellent. A few notes:

test_sliding_ttl_refresh_on_query is timing-sensitive

tokio::time::sleep(Duration::from_millis(600)).await;  // past ttl/2
let _ = cache.get(&path).await?;                        // triggers background touch
tokio::time::sleep(Duration::from_millis(200)).await;   // let re-insert land
tokio::time::sleep(Duration::from_millis(400)).await;   // > original TTL

This relies on a background task completing within 200 ms. Under load on a slow CI runner this can race. Replace with tokio::time::pause() + tokio::time::advance() and tokio::task::yield_now() so the test is deterministic without wall-clock sleeps.

maybe_touch shutdown hazard (minor)

The comment acknowledges that background refresh tasks may detach if background_tasks.try_lock() fails. This means shutdown() does not guarantee all in-flight maybe_touch tasks have completed. For the sliding-TTL test this is fine since the test calls cache.shutdown() at the end. Worth noting as a known limitation in the struct-level doc.


Nits

  • within_recency in database.rs is a one-line forwarding wrapper around object_store_cache::date_partition_within. Since the comment calls it "single source of truth", consider just using the public function directly from call sites to remove the indirection.
  • In swap_and_refresh_cache the live_set binding (HashSet<&String>) can be elided; pre_uris.contains(u) is already O(1) and live_uris.iter().collect() allocates an extra set. Using live_uris.contains(u) (O(n)) would be simpler for the sizes involved, but more importantly the current code has two separate HashSets where one suffices.
  • Config test at line 157: assert_eq!(config.maintenance.timefusion_warm_recency_days, 2) — consider asserting timefusion_warm_concurrency here too (it is declared but not asserted in the test).

Summary

Category Finding
Bug warm_footer_via_head does not consume the stream
Risk foyer_disk_gb default 500 GB will ENOSPC small volumes
Scope vacuum_retention change should be a separate PR
Reliability Timing-based TTL test may be flaky on slow CI
Docs Clarify the two recency configs and warm_cache_for_uris detachment

The core warming logic and test coverage are solid. Fixing the ENOSPC default and moving the vacuum change out are the two items I'd treat as blockers before merge.

- Run cargo +nightly fmt --all to satisfy the Format check (struct-field
  alignment + line-width normalization of the new code).
- Use path.as_ref() instead of &path.to_string() in two tests
  (clippy::unnecessary_to_owned, denied via -D warnings).
- Add ..Default::default() to the FoyerCacheConfig literal in
  cache_performance_test.rs, which was missing the block_size_bytes /
  cache_recent_days / l1_max_entry_bytes / warm_inline_max_bytes fields
  added by this PR.
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review — PR #39: Warm Foyer cache after compaction/flush

TL;DR: The approach is well-motivated and the implementation is largely solid. A few issues are worth discussing before merging: a breaking default-config change that could ENOSPC smaller deployments, some subtle correctness risks in the cache-warming path, and a few redundancy / clarity nits.


What this PR does

Post-compaction p99 sawtooth caused by newly-written files never entering Foyer. This PR fixes that with three mechanisms:

  1. Post-commit warm job — after optimize_table/optimize_table_light/flush, diff the pre/post file sets and issue footer (always) or full-file (opt-in) GETs through the caching store.
  2. Inline multipart teeCachingMultipartUpload captures bytes as they stream to S3 so the completed file is warmed without a re-download.
  3. Proactive eviction — tombstoned files are evicted from Foyer immediately after a commit rather than waiting for TTL/LRU.

Bonuses: sliding TTL on cache hits, suffix-range support (one round-trip footer warm), and cached ObjectMeta so repeated footer reads skip the HEAD.


Issues

🔴 Breaking default changes for operators upgrading

// config.rs
const_default!(d_foyer_disk_gb: usize = 500);  // was 100
const_default!(d_foyer_memory_mb: usize = 1024); // was 512
const_default!(d_vacuum_retention: u64 = 48);    // was 72

The disk default going from 100 GB to 500 GB is dangerous for existing deployments. Foyer creates the backing file sparse, but ENOSPC hits before eviction kicks in — an operator on a smaller instance (or a developer running locally) who upgrades without reading the changelog hits a hard failure. Consider:

  • Keeping the default at a safer value (e.g., 100 GB) and adding a timefusion_foyer_disk_auto_size flag that enables the larger heuristic, or
  • Emitting a clear WARN at startup when the configured capacity exceeds available free space.

The vacuum retention reduction from 72 h → 48 h is also a behavior change. The 48× safety margin reasoning is sound for the described workload, but a comment in the config and a CHANGELOG entry would help operators who set explicit retention policies.

🟡 warm_footer_via_head drops the response body without consuming it

// object_store_cache.rs — warm_footer_via_head
store.get_opts(location, opts).await.is_ok()

For a FoyerObjectStoreCache inner store this is fine (the caching happens inside get_range_cached before the GetResult is returned). But if store is passed as a raw S3/HTTP object store — which can happen via warm_footer falling back here — the response body is dropped un-drained. Most HTTP client libraries handle this gracefully, but it can prevent connection reuse. Adding .and_then(|r| async { r.bytes().await.map(|_| true) }).unwrap_or(false) or calling result.bytes().await mirrors the pattern in warm_footer and is safer.

🟡 put_opts_cached — verify PutPayload::clone() is cheap

let payload_for_cache = payload.clone();

The old code issued a post-write GET to prime the cache (expensive but unambiguous). The new code clones the payload before writing. If PutPayload is Arc-backed this is essentially free; if it forces a deep copy of the payload bytes it doubles heap usage for every non-multipart PUT. Worth confirming in the object_store crate whether PutPayload::clone() is O(1) or O(n). If it's O(n), the CachingMultipartUpload approach (which correctly avoids this) should be the only path — and this clone should be gated behind the size cap.

🟡 Duplicated pre_uris collection in optimize_table and optimize_table_light

Both optimize paths have identical blocks:

let track_files = self.config.maintenance.timefusion_warm_after_compaction
    || self.config.maintenance.timefusion_evict_after_compaction;
let pre_uris: std::collections::HashSet<String> = if track_files { ... } else { Default::default() };

swap_and_refresh_cache already encapsulates the warm/evict pair — pulling the pre_uris capture into a small helper (or into swap_and_refresh_cache itself, accepting an Option<HashSet>) would eliminate this drift risk. The light path was already prone to missing the evict step (mentioned in the PR description), and this duplication re-opens that window.

🟡 maybe_touch spawns on every half-expired cache hit

fn maybe_touch(&self, cache: &FoyerCache, key: &str, entry: HybridCacheEntry<…>, l1_max_entry_bytes: usize) {let handle = tokio::spawn(async move {});

On a very hot query path — e.g., 10k requests/s against the same partition — this spawns a task per cache hit that's past the halfway mark. The refreshing DashSet throttles duplicate re-inserts per key, but the DashSet insert/lookup still shows up in profiles. Consider an AtomicBool per entry (or a simple last-refresh timestamp in CacheValue) to avoid the DashSet indirection entirely. Not blocking, but worth benchmarking on hot workloads.


Nits / suggestions

within_recency wrapper is unnecessary (database.rs:177):

fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
    crate::object_store_cache::date_partition_within(uri, cutoff)
}

The wrapper adds an indirection with no added documentation. Just call crate::object_store_cache::date_partition_within at the three call sites directly, or re-export it with use.

CachingMultipartUpload::buffer could pre-allocateVec::new() will reallocate several times for a 128 MB file. Vec::with_capacity(max_warm_bytes) front-loads the allocation and avoids intermediate copies. Trade-off is pre-allocating even for small files; with_capacity(max_warm_bytes.min(16 * 1024)) or similar is a reasonable middle ground.

Cached empty CacheValue for ObjectMeta — storing CacheValue::new(Vec::new(), meta) in the metadata cache is a valid trick to avoid a separate type, but it's surprising to future readers. A short comment on make_meta_cache_key noting that the associated value always has an empty data field would help.

swap_and_refresh_cache return value unused in light optimize:

let _ = self.swap_and_refresh_cache(table_ref, new_table, &pre_uris).await;

The method returns Vec<String> (live URIs for the tantivy GC hook), but light optimize doesn't use it. This is fine functionally, but allocating the Vec unconditionally and then discarding it is a minor waste. A #[must_use] on the method would at least make the discard explicit at compile time.


What's good

  • The date_partition_within function being the single shared parser for both the warm recency filter and the cache admission window is the right call.
  • Suffix-range handling eliminating the HEAD → bounded-GET round trip is a meaningful latency win, and the test (test_warm_footer_eliminates_read_path_heads) locks in the expectation correctly.
  • test_multipart_warm_read_and_evict_key_consistency is excellent — verifying key consistency across three independent code paths is exactly the kind of regression guard this feature needs.
  • All error paths in warming/eviction are properly swallowed and logged; correctness is never gated on cache behavior.
  • The CachingMultipartUpload inline tee is the cleanest solution to the "we just streamed these bytes, why re-fetch them" problem.

test_batch_queue_under_load timed out after 30s on the prior run. It's a
#[serial] integration test against shared-state CI Postgres/MinIO under load
— the same family as the adjacent test_concurrent_* tests already marked
#[ignore = "wedges under shared-state CI"]. The warm-on-flush path this PR
adds is wired only in main.rs's delta_write_callback, which this test does
not install, and the write-time multipart capture it does exercise is
non-blocking (its unit tests pass). Re-kicking to confirm the flake.
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real, observable problem: compaction outputs stream to S3 via multipart but bypass Foyer, so every post-compaction dashboard read of the hot leading-edge partitions cold-starts. The fix is solid in concept — warm footers right after a commit, evict tombstoned files proactively, and tee multipart writes into the cache inline. The code is careful, well-commented, and extensively tested.


Issues

  1. Vacuum retention reduction is bundled in the wrong PR

vacuum_retention drops from 72h to 48h in config.rs. This tightens the safety window for in-flight queries that hold a Delta snapshot referencing files vacuum would delete. It has nothing to do with cache warming and the two changes should not share a review. Please split this out — it deserves its own discussion about query duration SLAs.

  1. Default config changes are potentially breaking for existing deployments

foyer_disk_gb: 100 → 500 — risks ENOSPC on smaller cache volumes despite sparse-file semantics
foyer_memory_mb: 512 → 1024 — may affect memory-constrained environments
foyer_block_size_mb: new default 256 MB (was using file_size_bytes = 32 MB, so this changes the on-disk storage layout at startup)

These need explicit call-outs in the PR description and ideally a migration note.

  1. CachingMultipartUpload memory footprint under the new defaults

With block_size_bytes = 256 MB and warm_concurrency = 4, up to 1 GB of transient heap can accumulate during a busy maintenance window. The struct doc mentions this, but operators won't read struct docs before deploying. Consider either defaulting warm_inline_max_bytes to something smaller (e.g. 64 MB) and letting operators raise it, or emitting a startup log line when block_size_bytes × warm_concurrency exceeds 512 MB.

  1. test_sliding_ttl_refresh_on_query is timing-dependent

The test sleeps 200ms waiting for a background tokio::spawn to land before asserting. On a loaded CI runner this will be flaky. Use tokio::time::pause() / tokio::time::advance(), or expose a JoinHandle-based await path for test contexts.

  1. warm_footer_via_head drops the GetResult without consuming bytes

For FoyerObjectStoreCache this is fine (get_range_cached caches eagerly before returning). For a lazily-caching ObjectStore the warming would silently no-op. Add a comment explaining the eager-cache assumption, or call .bytes().await and discard to make the intent unambiguous regardless of the inner store.

  1. Baseline hit-rate log message temporal confusion

The log says "next query benefits" alongside a past-tense rate. Phrasing it as "pre-warm foyer main hit rate" would make the before/after relationship explicit in dashboards.


Positive highlights

  • swap_and_refresh_cache consolidation is the right call. The full and light optimize paths previously had divergent table-swap code; this unifies them so warm/evict cannot drift out of sync.
  • Best-effort semantics are consistent throughout — every warm and evict path swallows errors and never propagates failure to the commit. This is exactly right for an optimistic cache layer.
  • Suffix range resolution in get_opts is a clean improvement: resolving to an absolute range means suffix-warmed footers share the same cache key as bounded footer reads, making the one-round-trip warm immediately useful.
  • date_partition_within as a single shared function avoids the warm-path and admission-window parsers drifting out of sync. Good factoring.
  • Test coverage is thorough: key-consistency test, inline vs. overflow multipart paths, recency window, sliding TTL, HEAD elimination — all the important invariants are pinned.

Minor nits

  • The prefix relativization logic (split on '?', trim trailing slash) appears twice in database.rs — a small helper would remove the duplication.
  • The track_files / pre_uris block is duplicated verbatim in optimize_table and optimize_table_light.
  • CachingMultipartUpload.buffer starts as Vec::new() — a with_capacity hint would reduce reallocations on large captures.

Summary

The core approach is sound and the implementation is careful. Two things worth addressing before merge:

  1. Split the vacuum retention change into its own PR. It is a correctness boundary change, not a performance one, and deserves independent review.
  2. Add explicit operator guidance (startup warning or updated docs) about the block_size_bytes × warm_concurrency memory ceiling under the new defaults.

The timing-sensitive test is a CI reliability concern worth fixing, but not strictly blocking if the team accepts the flakiness risk.

The warm-after-compaction change set the main data cache's foyer block
(region) size to block_size_bytes (128-256MB) so full compaction outputs
could persist. But foyer carves the disk device into block-sized regions:
when the block is >= the device, there are zero usable regions and every
disk insert stalls. CI runs the cache at TIMEFUSION_FOYER_DISK_MB=50 with a
256MB block -> 0 regions, so once the new write-path caching (multipart
capture + put_cached) started inserting flushed parquet onto disk,
test_batch_queue_under_load wedged and hit its 30s timeout. Master used a
4MB block (file_size_bytes) and so never tripped this.

Cap the data block at a quarter of the device (>= several regions), floored
at the legacy 4MB granularity and never above the device itself. Production
disks (GBs) still get the full 128MB block; small/test disks stay healthy.
The metadata cache already used the 4MB file_size_bytes block and is
unaffected.
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review — PR #39: Warm Foyer cache after compaction/flush

Overview

This PR addresses a real, measurable problem: compaction outputs landing cold in S3 while Foyer caches the pre-compaction files, causing a p99 sawtooth on the partitions dashboards hit hardest. The solution is well-thought-out — warm footers and optionally full files post-commit, evict tombstoned file entries inline, gate warming by partition recency, and tee multipart write bytes into the cache to avoid re-downloading what was just uploaded.

The architecture is correct: warming is strictly best-effort and never blocks or fails a commit; correctness relies only on Delta snapshot isolation, which is already a system invariant. The swap_and_refresh_cache helper that consolidates both optimize paths is a good refactor. Test coverage is thorough and the key-consistency test (test_multipart_warm_read_and_evict_key_consistency) is particularly valuable.


Bugs / Correctness Concerns

1. Sliding TTL: stuck key on task panic (maybe_touch)

In object_store_cache.rs, maybe_touch uses a DashSet (self.refreshing) to deduplicate in-flight refreshes:

if !self.refreshing.insert(key.to_string()) { return; }
// ...
let handle = tokio::spawn(async move {
    insert_main(&cache, key.clone(), ...);
    refreshing.remove(&key);   // removed here, inside the task
});

If the spawned task panics before refreshing.remove(&key), the key is permanently stuck in the set and that entry will never receive another sliding-TTL refresh for the lifetime of the cache. Consider using a RAII guard (drop-on-remove) or wrapping the spawn body so the remove runs even on unwind.

2. warm_footer_via_head does not drain the body

warm_full explicitly notes that body draining is needed for generic stores:

FoyerObjectStoreCache populates the cache eagerly inside get_opts, but consuming the bytes keeps this correct for any inner store.

warm_footer_via_head calls store.get_opts(location, opts).await.is_ok() without consuming the returned stream. For FoyerObjectStoreCache this is correct because caching happens inside get_range_cached, but the asymmetry between the two sibling functions is undocumented. A comment explaining why warm_footer_via_head intentionally skips .bytes() would prevent a future maintainer from "fixing" it incorrectly.

3. evict_cache_for_uris does not evict metadata-range entries

The eviction path removes only the full-file cache entry via evict_data_entry. Range-cache entries for tombstoned files' footers survive until TTL. This is likely acceptable (the files still exist in S3 until VACUUM, so stale footer cache entries are not incorrect), but the comment "dead compaction outputs don't linger in the cache" overstates what is actually evicted.


Significant Default Changes — Operator Impact

Three defaults changed substantially and warrant explicit call-outs in release notes or RUNBOOK:

Setting Old New Risk
timefusion_foyer_disk_gb 100 GB 500 GB ENOSPC on smaller disks
timefusion_foyer_memory_mb 512 MB 1024 MB OOM on smaller instances
timefusion_vacuum_retention 72 h 48 h Shorter window for long-running queries / rollbacks

The comment on d_foyer_disk_gb explains the reasoning, but operators upgrading in-place with a smaller cache volume will hit ENOSPC silently if they do not read the changelog. The startup warning in from_app_config catches the block-size-vs-disk case but not the straightforward "configured disk ceiling exceeds available space" case.


Design / Clarity Observations

Two overlapping recency configs

timefusion_cache_recent_days (in CacheConfig) gates cache admission during writes — files whose partition is older than N days bypass the cache on write/multipart. timefusion_warm_recency_days (in MaintenanceConfig) gates which files to warm post-commit. These are independent knobs serving similar purposes at different layers. The docs should be explicit that setting one does not affect the other and that the two are normally tuned together.

pre_uris capture is still duplicated

The PR consolidates the swap+warm/evict step into swap_and_refresh_cache, but the pre_uris capture pattern is still copy-pasted identically into optimize_table and optimize_table_light. A small helper like fn snapshot_uris_if_tracking(&self, table: &DeltaTable) -> HashSet<String> would eliminate the remaining duplication.

Transient memory worst-case should be in RUNBOOK

From the config doc comment: "Up to timefusion_warm_concurrency compactions can run at once, so the worst case is block_size_mb * warm_concurrency of transient heap" (default: 256 MB × 4 = 1 GB). This is documented in the Rust doc but not surfaced operationally. Worth a line in RUNBOOK.md under cache tuning.


Potential CI Flakiness

test_sliding_ttl_refresh_on_query is time-sensitive

The test sleeps 600 ms, then 200 ms ("let the re-insert land"), then 400 ms before asserting the entry is still alive. On a heavily loaded CI runner the 200 ms re-insert window may not be enough, causing the entry to expire before the final assertion. Using tokio::time::pause() / advance() (Tokio's time mocking) would make this deterministic.


Minor Nits

  • make_meta_cache_key ("{}#meta"), make_range_cache_key ("{}#range:{}-{}"), and plain make_cache_key are three independent key namespaces. A single doc comment listing all three and their invariants would help future readers avoid accidental collisions.

  • The within_recency wrapper in database.rs is a one-liner that delegates to date_partition_within. The test test_within_recency only exercises the wrapper; testing date_partition_within directly would additionally catch a delegation bug.

  • The timefusion_warm_after_compaction check appears in both warm_cache_for_uris (top guard) and warm_cache_for_table (early return). Two guards for the same condition is safe but slightly redundant.


Summary

The core design is solid and the test suite is excellent — especially the key-consistency and HEAD-elimination tests, which lock in the two main performance invariants. Items to address before merge:

  1. Fix: Stuck key in refreshing DashSet on task panic in maybe_touch
  2. Document: warm_footer_via_head intentionally skips body drain — add a comment explaining why
  3. Accuracy: Update evict_cache_for_uris comment to reflect that only full-file entries (not range-cache entries) are evicted
  4. Ops risk: The 500 GB disk default, 1 GB memory default, and 48 h vacuum retention changes need release notes / RUNBOOK entries for operators upgrading in-place

put_cached now warms the cache directly from the write payload instead of
re-fetching each file from the inner store after the write. The
test_cache_performance_and_s3_bypass integration test still asserted the old
post-write re-GET (inner_gets == 3); update it to inner_gets == 0, matching
the lib unit tests already updated in this PR. Reads still hit cache, so
hits == 6 / misses == 0 are unchanged.
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Great PR overall — the problem statement is clear, the solution is layered correctly (multipart tee → post-commit warm → footer warming), and the test suite is unusually thorough for cache-layer code. A few issues worth addressing before merge.


Bugs

1. warm_footer_via_head silently skips body consumption

warm_footer correctly calls result.bytes().await.is_ok() to drain the stream, but the fallback path does not:

// warm_footer_via_head — body is NOT consumed
store.get_opts(location, opts).await.is_ok()

The code even documents this requirement in warm_full:

"Explicitly drain the body: GetResult's payload is a stream, and a generic store may not have read it yet by the time get_opts returns."

For a real HTTP/S3 transport, dropping the GetResult without consuming the body leaves the connection in an indeterminate state (and the body bytes never actually land anywhere). Fix:

match store.get_opts(location, opts).await {
    Ok(result) => result.bytes().await.is_ok(),
    Err(_) => false,
}

2. timefusion_foyer_file_size_mb silently becomes a no-op

Previously config.file_size_bytes (derived from timefusion_foyer_file_size_mb) was passed directly to BlockEngineConfig::with_block_size(). The PR replaces that with data_block_size (derived from the new timefusion_foyer_block_size_mb), so any operator who was tuning the old knob will get a silent do-nothing on upgrade. The field should either be removed with a deprecation notice, or its value should be folded into the data_block_size computation as a floor/fallback.


Breaking / Risky Defaults

3. Default foyer_disk_gb jumps from 100 → 500 GB

This is a 5× increase in the logical ceiling foyer will try to use. The comment is correct that foyer creates sparse files, but the ENOSPC warning applies on any disk smaller than 500 GB. An existing deployment running a 200 GB cache volume will hit ENOSPC silently on upgrade. Consider defaulting to something closer to the original (e.g. 200 GB) and letting operators opt into 500 GB, or at startup log a loud warning when disk_size_bytes > 80% of available disk space.

4. vacuum_retention reduced from 72 h → 48 h in the same PR

This is an unrelated behavioral change bundled with cache warming. It affects correctness guarantees (how long tombstoned files survive before being deleted), not cache hit-rate. It should ship separately with its own discussion, especially since the justification ("no query runs longer than ~1h") is an operational assumption that varies by deployment.


Minor Issues

5. within_recency wrapper adds indirection without value

// database.rs
fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
    crate::object_store_cache::date_partition_within(uri, cutoff)
}

This adds a private free function that does nothing but forward. The call sites can use crate::object_store_cache::date_partition_within directly, or at minimum add a use alias instead of a wrapper function.

6. Duplicated track_files / pre_uris in optimize_table and optimize_table_light

The same 7-line block appears verbatim in both optimize paths. Since both now funnel through swap_and_refresh_cache, extracting this into a small helper (or computing it inside swap_and_refresh_cache itself by accepting the current table ref instead of a pre-computed set) would eliminate the drift risk.

7. warm_cache_for_table resolve failure is silently swallowed

if let Ok(table_ref) = db.resolve_table(&project_id, &table_name).await {
    ...
}
// resolve failure: no log, warming silently skipped

A trace!/debug! on the Err arm would make it diagnosable when warming isn't occurring (e.g. during table creation races).

8. CachingMultipartUpload buffer starts with zero capacity

buffer: Some(Vec::new()),

For the common path (small files that fit under the cap), the buffer will reallocate many times as chunks arrive. Pre-allocating to cap.min(expected_size) or even a small initial size (e.g., 64 KB) would reduce allocations without changing semantics.


Test Coverage

The test suite is genuinely excellent — test_warm_footer_eliminates_read_path_heads, test_multipart_warm_read_and_evict_key_consistency, and test_sliding_ttl_refresh_on_query are exactly the right property tests for this code. Two gaps:

  • No test for the warm_footer_via_head fallback path (the suffix GET failure branch). Given issue Issues #1 above, adding one would catch the body-consumption bug.
  • No test that verifies warm_cache_for_uris correctly filters by recency_days at the database layer (only tested at the object-store layer).

Summary

Critical warm_footer_via_head body not consumed (#1)
Breaking silent deprecation of file_size_bytes knob (#2), 5× disk default increase (#3)
Scope vacuum retention change should be a separate PR (#4)
Nits #5#8 are low-priority cleanups

The core approach is solid — warming after commit rather than inline with the multipart write is the right call, and the concurrency-bounded detached-task pattern is correct. Fixing the warm_footer_via_head body-consumption bug and calling out the file_size_bytes deprecation are the must-haves before merge.

/simplify cleanups on the warm-after-compaction diff (no behavior change):

- Extract table_cache_prefix() + relativize_to_prefix() in database.rs and
  use them in both warm_cache_for_uris and evict_cache_for_uris. The two
  paths previously duplicated the query-string strip / trailing-slash trim /
  strip_prefix dance; sharing it keeps warmed and evicted cache keys derived
  identically so they can't desync on a one-char drift.

- Extract capped_block_size() in object_store_cache.rs and apply it to BOTH
  the data and metadata cache builders. The data-cache capping was added
  inline last commit; the metadata builder still used a raw block size and
  could wedge the same way on a small metadata disk. One helper guards both.

- Lift the TTL clamp in maybe_touch into a named ttl_millis local for
  readability.

Skipped (intentional/out-of-scope): applying the recency window to the
on-read footer path (reads should cache what they read), CacheValue ->
Arc<Vec<u8>> to cheapen refresh clones (data-model change beyond this diff),
and inlining make_cache_key in multipart complete (kept for read/write key
lockstep, as its comment notes).
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review: Warm Foyer cache after compaction/flush

Overview

This PR tackles a real and well-diagnosed performance problem: post-compaction p99 sawtooth caused by Foyer never being warmed after multipart uploads to S3. The approach is sound — warming footers (cheap, high-impact for query planning) and optionally full files (heavier, configurable off by default), with proactive eviction of tombstoned files. The PR is large but well-scoped: it doesn't change query semantics, only the cache hit rate.


Potential Bugs

warm_footer_via_head does not consume the response stream

warm_footer correctly drains the body with result.bytes().await.is_ok() — the comment in warm_full even explains why this is necessary for generic stores. But the fallback path forgets to do the same:

// warm_footer_via_head (line ~749)
store.get_opts(location, opts).await.is_ok()  // ← bytes never consumed

For FoyerObjectStoreCache, get_range_cached reads bytes eagerly into a Vec before returning, so the cache is populated even without draining the outer GetResult stream. But if warm_footer_via_head is ever called with a lazy inner store (or during tests with a plain InMemory), the warm silently does nothing. warm_full shows the correct pattern; this should match it:

match store.get_opts(location, opts).await {
    Ok(result) => result.bytes().await.is_ok(),
    Err(_) => false,
}

pre_uris snapshot is taken inside the retry loop in optimize_table_light

for attempt in 0..3 {
    let table_clone = { ... };
    let pre_uris: HashSet<String> = if track_files { ... };

On a retry the pre_uris snapshot is refreshed from the current table state, which is fine only because a failed optimize commit leaves the table state unchanged (Delta ACID). If that assumption ever breaks (e.g., a partial state update), the warm/evict diff would be wrong. Moving the snapshot outside the retry loop (let pre_uris = ... before for attempt in 0..3) would make the intent explicit and eliminate the assumption.

Metadata cache key collision between #meta and #range:

make_meta_cache_key returns "{}#meta" and make_range_cache_key returns "{}#range:{}-{}". These share the same String-keyed metadata_cache. Foyer's per-key isolation means collisions can't corrupt data, but it is fragile — a file path that happens to contain #meta or #range: would produce an incorrect key. A dedicated namespace (e.g. a type-tagged enum value rather than a string key) would be safer.


Performance Concerns

maybe_touch spawns a task per half-TTL hit

On a hot read path, every hit past ttl/2 attempts a tokio::spawn plus a DashSet::insert. The DashSet deduplicates concurrent refreshes for the same key, but under heavy concurrent reads the contention on DashSet::insert adds up. The try_lock on background_tasks can fail (the comment acknowledges this), meaning the spawned handle is dropped and the JoinSet never tracks it — which is fine for best-effort refresh, but could accumulate many untracked tasks under load. Consider rate-limiting the spawn by returning early if the key is already in refreshing before spawning (the current code inserts into refreshing before spawning, which is correct, but the try_lock failure means the handle is untracked).

CachingMultipartUpload buffer abandonment on first oversized part

If the first put_part call exceeds max_warm_bytes, the buffer is immediately set to None. Any bytes accumulated by previous parts are freed. This is the correct behavior, but it means for a 2-part upload where part 1 is just under the cap and part 2 puts it over, we allocate and discard part 1's bytes. Not a bug, but worth a comment that the cap is checked cumulatively.


Breaking / Surprising Defaults

foyer_disk_gb default: 100 → 500

This is the most operationally risky part of the PR. Even though Foyer creates sparse files, 500GB is the logical ceiling at which eviction kicks in. Operators who rely on the default and run on hosts with, say, 200GB of disk cache space will hit ENOSPC before eviction — exactly the failure mode described in the config comment. The warning added to from_app_config when block_size > disk_size is good, but there's no equivalent warning for "configured disk ceiling exceeds actual disk". Consider logging a startup warning if disk_size_bytes exceeds some fraction of available disk space (via statvfs), or document this more prominently in a migration note.

vacuum_retention: 72h → 48h

The justification (48x safety margin for ≤1h queries) is reasonable, but this is a global behavior change. Any operator with slow queries or who relies on time-travel back 49–72h will silently lose that window. This should be called out explicitly in the PR description or a CHANGELOG entry, even if the justification is sound.


Code Quality

within_recency is an unnecessary alias

// database.rs
fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
    crate::object_store_cache::date_partition_within(uri, cutoff)
}

This adds a layer of indirection without adding semantics. Call date_partition_within directly, or make date_partition_within pub(crate) and use it.

Two separate recency knobs with confusing naming

timefusion_cache_recent_days (in CacheConfig, governs cache admission) and timefusion_warm_recency_days (in MaintenanceConfig, governs which files are warmed after compaction) serve different purposes but interact. Their defaults also differ (8 vs 2), meaning it's possible to warm a file that the admission window would refuse — a warm_full call on a 5-day-old partition with cache_recent_days=4 would silently no-op. A comment linking the two knobs (or validating them at startup) would help operators.

Tracking dropped but only logging first occurrence

if dropped == 0 {
    debug!("warm: URI {} does not start with table prefix {}; ...", u, prefix);
}
dropped += 1;

This logs the first mismatched URI but silently accumulates the rest. The final debug! at the end logs the total count. This is fine, but the first-occurrence log's purpose should be clearer (it's to show an example URI, not all URIs).


Test Coverage

Excellent. The PR ships tests for every new behavior:

  • Recency parsing edge cases (test_within_recency, test_is_within_recent_window)
  • Multipart capture + over-cap abandonment (test_multipart_capture_warms_cache)
  • Block size auto-tracking (test_block_size_tracks_optimize_target)
  • Sliding TTL (test_sliding_ttl_refresh_on_query)
  • Eviction path (test_evict_data_entry_removes_cached_file)
  • Footer/full-file warm (test_warm_footer_primes_metadata_cache, test_warm_full_primes_main_cache)
  • Key consistency across warm/read/evict paths (test_multipart_warm_read_and_evict_key_consistency)
  • HEAD elimination on the read path after a footer warm (test_warm_footer_eliminates_read_path_heads)

One gap: there's no integration test asserting the end-to-end property — "after optimize_table completes, subsequent footer reads of the rewritten files hit the cache, not S3." The unit-level tests cover the components well, but a higher-level test in database.rs (using a CountingStore as the inner store) would close that gap and guard against the swap_and_refresh_cache call being accidentally removed from one of the optimize paths.


Summary

The core design is correct and the implementation is careful. The three issues worth fixing before merge are:

  1. Bug: warm_footer_via_head should consume the response stream (matches the warm_full pattern).
  2. Ops risk: Document or warn at startup if foyer_disk_gb default (500GB) might exceed the host's available cache space.
  3. Clarity: Move pre_uris outside the retry loop in optimize_table_light to make the snapshot-before-attempt semantics explicit.

The sliding TTL, multipart capture, and suffix-range cache path are all solid additions. The test suite is thorough.

Two small conciseness consolidations from the deep-research conciseness pass
(no behavior/test change):

- put_result_meta(): the write path built a field-identical ObjectMeta from a
  PutResult in two places (put_cached and the multipart capture's complete()).
  Share one helper.

- MIB/GIB consts in config.rs replace the repeated '* 1024 * 1024' chains in
  the CacheConfig *_bytes() accessors, so they read as the unit they mean.

Skipped (reported): no library swap shrinks the bespoke pieces (multipart
capture is required ObjectStore trait boilerplate, date-partition parse and
sliding-TTL have no drop-in in the current deps), and a chunk-collect helper
would have a single real caller.
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown

Code Review — PR #39: Post-compaction Foyer cache warming

Overview

This PR fixes a p99 sawtooth on recently-written partitions by warming the Foyer cache after each compaction/flush commit. The root cause is correct: multipart writes bypass the put_cached path, so every post-compaction query cold-starts on S3. The design is sound — warm on new files, evict on tombstoned files, both filtered to recent partitions, both best-effort and non-blocking.

The implementation is generally clean and well-documented. A few things deserve attention before merging.


Breaking default changes (no deprecation period)

These affect any operator who upgrades without reading the changelog:

  • timefusion_foyer_disk_gb: 100 → 500 GB. The comment correctly warns that foyer doesn't pre-allocate (sparse file), but operators on smaller cache volumes (< 500 GB free) will hit ENOSPC before eviction kicks in. This is the highest-risk change in the PR. Recommend shipping this default bump separately from the functional cache-warming work, or at minimum gating it behind a startup check.

  • timefusion_foyer_memory_mb: 512 → 1024 MB. A 2× RAM increase for the in-memory cache tier. Fine on large boxes, surprising on smaller instances.

  • timefusion_vacuum_retention: 72h → 48h. The rationale is solid (48× safety margin, faster tombstone reclaim), but it's a silent reduction in the snapshot-isolation window. Any operator running queries longer than 48h (rare but possible for large backfills) will start seeing file-not-found errors after upgrade.

Consider moving these three defaults to their own PR or adding an upgrade note.


Correctness issues

capped_block_size(desired, 0) returns 0.

fn capped_block_size(desired: usize, disk_size: usize) -> usize {
    desired.min(disk_size / 4).max(MIN_DISK_BLOCK_BYTES).min(disk_size)
}

If disk_size == 0: 0/4 = 0, .max(4MB) = 4MB, .min(0) = 0. Foyer with a 0-byte block size would wedge (similar to the CI wedge you reference in the comment for the 256MB-on-50MB case). While disk_size = 0 should be unreachable today, a guard is cheap:

fn capped_block_size(desired: usize, disk_size: usize) -> usize {
    if disk_size == 0 { return MIN_DISK_BLOCK_BYTES; }
    desired.min(disk_size / 4).max(MIN_DISK_BLOCK_BYTES).min(disk_size)
}

warm_footer falls back to HEAD+GET on any error, including file-not-found.

Err(_) => warm_footer_via_head(store, location, metadata_size_hint).await,

If a file was removed between the warm list being computed and the GET (e.g. concurrent vacuum), this fires a redundant HEAD. Since warm_footer_via_head handles the HEAD error gracefully it's safe, but it's a wasteful extra round-trip. A narrower match on object_store::Error::NotSupported would prevent this, though it may not be worth the complexity for a best-effort path.


Design observations

maybe_touch spawns tasks not tracked at shutdown.

The comment acknowledges this ("don't assume it joins every one"), but a background_tasks JoinSet that silently loses entries is worth naming explicitly in the struct doc. If a refresh task holds a strong reference to CacheValue data, graceful shutdown might not reclaim it promptly. Low severity but worth a note.

CachingMultipartUpload buffer is unbounded per-part before the cap check.

fn put_part(&mut self, data: PutPayload) -> object_store::UploadPart {
    if let Some(buf) = self.buffer.as_mut() {
        if buf.len().saturating_add(data.content_length()) > self.max_warm_bytes {
            self.buffer = None;
        } else {
            for chunk in data.iter() { buf.extend_from_slice(chunk); }
        }
    }
    self.inner.put_part(data)
}

The cap is enforced before the new part is appended, so buf never actually exceeds max_warm_bytes. However, because the cap fires on the part that would cross the limit (not after), a single large first part that's exactly at the cap is captured, while a second part of any size drops the buffer. This is fine — just confirming the invariant holds and the buffer does not transiently exceed the cap.

warm_cache_for_table re-resolves the table for every flush callback.

The flush path in main.rs already has the table reference and the added URIs. warm_cache_for_table spawns a task that calls resolve_table again, which may issue a PG roundtrip. This works but is wasteful. A follow-up to accept an Arc<RwLock<DeltaTable>> directly (bypassing the lookup) would eliminate the redundant resolution for the flush path. Not a blocker.


Code quality

Good:

  • insert_main and CachingMultipartUpload are well-scoped abstractions.
  • swap_and_refresh_cache consolidates the warm/evict pair so full and light optimize paths can't drift — good defensive design.
  • The capped_block_size function with its "CI wedge" comment is exactly the right thing to ship here.
  • date_partition_within is shared between the cache-admission window and the compaction-warm filter, preventing parser drift.
  • Test coverage is strong: key-consistency test, sliding-TTL test, round-trip-count test with CountingStore, multipart capture path, and the recency window — all of these cover real failure modes.

Minor style nit: warm_footer_via_head is a private function whose entire purpose is a fallback for warm_footer. A doc comment linking them (/// Fallback for [warm_footer] when the store does not support suffix ranges.) would make the call graph obvious without reading both bodies.


Summary

The functional design is correct and the test coverage is solid. The two items I'd want resolved before merging:

  1. The foyer_disk_gb default bump (100 → 500) is the highest blast-radius change and should ship with operator guidance or a startup guard.
  2. capped_block_size(desired, 0) = 0 is a latent wedge; the one-line guard is trivial.

Everything else is low-severity or follow-up material.

@tonyalaribe tonyalaribe merged commit 31369ff into master Jun 7, 2026
7 checks passed
tonyalaribe pushed a commit that referenced this pull request Jun 7, 2026
…itions

delta-rs's ZOrder planner has no idempotence guard — no size check, no
single-file-bin skip, no "already sorted" detection — so the full optimize job
rewrote every file in today + yesterday + 2-days-ago into new (cold, multipart)
object keys on every 30-min run, even for sealed days that hadn't changed.
PR #39 added cache warming to soften the resulting cold-starts, but the cheaper
fix is to not do the redundant rewrites in the first place.

optimize_table now snapshots each window partition's live file set and skips any
partition whose set is unchanged since the last successful optimize; `today` is
always processed (the growing leading edge). Sealed days are z-ordered once,
then skipped — so #39's warm/evict has far less cold churn to chase. The new
file set is recorded in an in-memory per-table map (zorder_filesets); a restart
re-z-orders each in-window partition once, harmlessly.

Composes with #39: this narrows what gets rewritten; #39 still warms whatever
actually was. New metrics optimize.partitions_rewritten / partitions_skipped
expose the churn-avoided signal (after-the-fact monitoring, no telemetry
pipeline needed). Unit tests cover the fileset grouping and change detection.

https://claude.ai/code/session_01PLU62s7eRhSMMUXEHSWRCx
tonyalaribe pushed a commit that referenced this pull request Jun 7, 2026
…itions

delta-rs's ZOrder planner has no idempotence guard — no size check, no
single-file-bin skip, no "already sorted" detection — so the full optimize job
rewrote every file in today + yesterday + 2-days-ago into new (cold, multipart)
object keys on every 30-min run, even for sealed days that hadn't changed.
PR #39 added cache warming to soften the resulting cold-starts, but the cheaper
fix is to not do the redundant rewrites in the first place.

optimize_table now snapshots each window partition's live file set and skips any
partition whose set is unchanged since the last successful optimize; `today` is
always processed (the growing leading edge). Sealed days are z-ordered once,
then skipped — so #39's warm/evict has far less cold churn to chase. The new
file set is recorded in an in-memory per-table map (zorder_filesets); a restart
re-z-orders each in-window partition once, harmlessly.

Composes with #39: this narrows what gets rewritten; #39 still warms whatever
actually was. New metrics optimize.partitions_rewritten / partitions_skipped
expose the churn-avoided signal (after-the-fact monitoring, no telemetry
pipeline needed). Unit tests cover the fileset grouping and change detection.

https://claude.ai/code/session_01PLU62s7eRhSMMUXEHSWRCx
tonyalaribe added a commit that referenced this pull request Jun 7, 2026
…itions (#40)

delta-rs's ZOrder planner has no idempotence guard — no size check, no
single-file-bin skip, no "already sorted" detection — so the full optimize job
rewrote every file in today + yesterday + 2-days-ago into new (cold, multipart)
object keys on every 30-min run, even for sealed days that hadn't changed.
PR #39 added cache warming to soften the resulting cold-starts, but the cheaper
fix is to not do the redundant rewrites in the first place.

optimize_table now snapshots each window partition's live file set and skips any
partition whose set is unchanged since the last successful optimize; `today` is
always processed (the growing leading edge). Sealed days are z-ordered once,
then skipped — so #39's warm/evict has far less cold churn to chase. The new
file set is recorded in an in-memory per-table map (zorder_filesets); a restart
re-z-orders each in-window partition once, harmlessly.

Composes with #39: this narrows what gets rewritten; #39 still warms whatever
actually was. New metrics optimize.partitions_rewritten / partitions_skipped
expose the churn-avoided signal (after-the-fact monitoring, no telemetry
pipeline needed). Unit tests cover the fileset grouping and change detection.

https://claude.ai/code/session_01PLU62s7eRhSMMUXEHSWRCx

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants