Warm Foyer cache after compaction/flush to kill the post-compaction p99 sawtooth#39
Conversation
…99 sawtooth Compaction outputs (16MB light / 128MB full) stream to S3 via multipart upload, and put_multipart_opts is a pure pass-through — so every compacted file lands cold and is never warmed into Foyer. The leading-edge partitions that "last 1h/6h/24h" dashboards hit get rewritten every 5/30 min, so each refresh after a compaction misses Foyer, falls to S3, and the cache bytes we already paid for go stale. p99 sawtooths on exactly the partitions users watch. This warms the cache for newly-written files right after the commit is visible, reusing the existing read path (no inline multipart change needed): - object_store_cache: warm_footer() issues a ranged GET of the parquet footer so it lands in the metadata cache (query planning pays zero S3 round-trips); warm_full() primes the main full-file cache for data reads. Both are strictly best-effort — all errors swallowed. - database: warm_cache_for_uris() diffs pre/post file sets to warm only the files a commit *adds*, filtered to partitions within timefusion_warm_recency_days so we don't burn GETs on cold partitions. Runs in a detached, concurrency-bounded task; never blocks or fails the commit. Wired into optimize_table, optimize_table_light, and the flush callback. Logs files-warmed + current foyer hit-rate to size the win. - config: timefusion_warm_after_compaction (default on, footers always), timefusion_warm_full_files (default off), timefusion_warm_recency_days (default 2, covers the 48h full-optimize window), timefusion_warm_concurrency. Correctness is unaffected — Delta snapshot isolation means warming only changes hit-rate, never which files a query reads. https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
Code Review: Warm Foyer cache after compaction/flushOverview This PR addresses a real and well-scoped performance problem: compaction outputs bypass the Foyer cache entirely, causing p99 sawtooths on the exact partitions dashboards watch. The approach — proactively warming footers (and optionally full files) in a detached, bounded task — is sound. A few issues worth addressing before merge: Medium:
|
…ests - warm_cache_for_table is now non-blocking: resolve_table (possible PG roundtrip) and the read lock run inside a spawned task, so the flush callback never blocks on warming. Call site drops the await. - Hit-rate log now snapshots the cache stats *before* warming and reports that as the baseline — the warm GETs are themselves misses, so a post-warm rate read artificially low. The next dashboard query is the beneficiary. - Unrelativized URIs (prefix mismatch) are now logged at debug instead of silently dropped, so a systematic table_url/get_file_uris drift is diagnosable. - recency_days is capped at 3650 before the i64 cast to rule out a silent wrap turning a misconfiguration into "warm nothing". - Added unit tests for within_recency covering the inclusive cutoff, missing /unparseable/truncated date segments, the None (no-limit) case, and nested partition paths. https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
|
placeholder |
Code Review: Warm Foyer cache after compaction/flushOverviewThis PR addresses a real and well-described performance problem: compaction outputs written via multipart upload bypass Foyer entirely, so every post-compaction dashboard query cold-starts to S3, producing the characteristic p99 sawtooth. The fix is clean in design - a fire-and-forget task diffs pre/post file sets, filters to recent partitions, and primes the cache via the existing read path. Strengths
Issues1.
|
The warm path (and even the pre-existing single-put path) fetched files back from S3 right after writing them — paying a GET for bytes we already had in hand. Capture them at write time instead: - put_opts: cache the PutPayload directly (Arc-backed, cheap clone), reconstructing ObjectMeta from the PutResult + known size. Drops the post-write inner.get entirely. This is what the otel flush small-file writes and Delta log/checkpoint writes go through. - put_multipart_opts: previously a pure pass-through, so every compaction output landed cold. Now wrapped in CachingMultipartUpload, which tees written parts into a bounded buffer and inserts the completed file into the cache on complete() — no re-download. The buffer is capped (timefusion_warm_inline_max_mb, default 32MB): uploads above it stream through un-captured so 128MB full-compaction outputs don't blow memory or evict the L1 hot set. Those fall back to the selective, recency-filtered post-commit warm. abort() drops the buffer. Net effect: writes up to the cap — including the 16MB light-compaction outputs on the hot leading edge — warm Foyer for free, zero extra S3 GETs. The post-commit warm path stays as the backstop for large/cold files; for files already cached inline it resolves entirely from cache (head + range both hit), so the two compose with no redundant S3 traffic. Updated put-path tests (inner_gets now 0 after writes) and added coverage for both the captured and over-cap multipart paths. https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
Code Review: Warm Foyer cache after compaction/flushOverviewThis PR addresses a real and well-characterized performance problem: compaction rewrites the partitions dashboards query most frequently, but the new files land cold in S3, producing a p99 sawtooth on exactly those partitions. The fix is layered correctly — inline capture during multipart writes (no re-download) plus a selective post-commit warm for larger outputs — and the design is strictly best-effort so it can never affect correctness. ArchitectureThe three-layer approach is sound:
The detached-task design is correct; warming must not block or fail the commit. Issues
|
…a from S3 Goal: keep recent data fully local so dashboard queries don't hit S3; only older partitions (~1 week, the cache TTL) fall back to S3. Three things blocked this: 1. Foyer's disk block size caps the largest entry that can persist to disk, and it defaulted to 32MB — so 128MB full-compaction outputs SILENTLY never reached disk (even on cold reads). Added timefusion_foyer_block_size_mb (default 256MB) and pointed the main data cache's block engine at it, so full compaction outputs actually persist locally. (Metadata cache keeps its small block size for tiny footer ranges.) 2. Inserting 128MB files into the 512MB L1 would evict the hot small-entry working set. Large entries (> timefusion_foyer_l1_max_entry_mb, default 16MB) are now inserted with foyer Location::OnDisk — they go to disk but stay phantom in memory, so warming a big file never thrashes L1. The 16MB light-compaction outputs (the hot leading edge) still get L1+disk for fastest repeat reads. 3. Caching everything forever would fill the disk with cold data and evict the recent week. Writes to partitions older than timefusion_cache_recent_days (default 8, just past the 7-day TTL) are not admitted — cold-tier recompress rewrites stay on S3. Applied uniformly across inline multipart capture, single-put warming, and cold-read caching. Together with the existing 7-day TTL this gives: recent data served from local disk, data older than ~1 week served from (and not re-cached from) S3. The inline multipart cap is now bounded by the block size (so we never buffer more than disk can store); timefusion_warm_inline_max_mb is an optional tighter override (default 0 = bound only by block size). Note for deployment: the disk cache (timefusion_foyer_disk_gb, default 100GB) must be sized to hold ~1 week of compacted data for "everything local" to hold; block size must stay >= the compaction target size. Added tests for the recent-days admission window and the size-based disk steering; existing multipart-capture test updated for the block-size-bounded cap. https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
Code Review: Warm Foyer cache after compaction/flushThe PR addresses a real, well-diagnosed problem (post-compaction p99 sawtooth) with a coherent design. The implementation is generally solid and the commit message / inline docs are excellent. A few issues worth fixing before merge. Bug: metadata cache still uses
|
Two follow-ups so the local cache "just works" and uses the disk operators actually have: - Block size now auto-tracks the compaction target. from_app_config floors the effective disk block size at 2x timefusion_optimize_target_size, so raising the optimize target can never silently make the bigger outputs un-cacheable. The configured timefusion_foyer_block_size_mb is now a floor/override. Warns if the block size exceeds disk capacity. - Default disk cache bumped 100GB -> 500GB. Servers run 500GB-1TB cache volumes and local disk is far cheaper than repeated S3 GETs, so default to keeping much more data local. foyer backs this with a sparse file (no upfront allocation), but it's the logical eviction ceiling, so it must stay <= the cache volume's free space (documented). Added tests for block-size tracking (2x target floor + configured floor) and locked in the new disk default. https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
Code Review: Warm Foyer cache after compaction/flushThis PR is well-designed and well-documented. Several solid reviews have already been posted — this one adds findings not yet covered. High: Default
|
- Sliding TTL: cache entries now expire `ttl` after their last *query*, not their insertion. On a hit, once an entry is past the halfway mark to expiry, maybe_touch re-inserts it with a fresh timestamp so frequently-queried data stays local indefinitely while cold data still ages out after the TTL (7d default). Throttled to at most one refresh per ttl/2 per key (the halfway gate + the existing `refreshing` in-flight guard), and the re-insert runs in the background off a cheap entry clone so queries never block on it (the data clone happens in the spawned task). Applied to full-file, ranged full-file, and metadata-cache hits. - Bumped default L1 memory cache 512MB -> 1GB to use the RAM on the bigger boxes; small/medium entries stay hot in memory, large files remain disk-only. Added a sliding-TTL test (entry survives past the base TTL because it was queried) and locked in the 1GB default. https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp
Code Review: Warm Foyer cache after compaction/flushOverviewThis PR addresses a real problem: compaction rewrites files to S3 via multipart without warming Foyer, so the first dashboard read after every optimize hits S3 cold and produces the p99 sawtooth. The fix adds two warming paths — an inline multipart tee and a post-commit diff-based sweep — both strictly best-effort so they never block or fail a commit. The approach is sound and the test coverage is solid. Issues1. Duplicate recency-check logic (logic drift risk)
// database.rs — takes &str + Option<NaiveDate>
fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool { … }
// object_store_cache.rs — takes &Path + usize days
fn is_within_recent_window(location: &Path, recent_days: usize) -> bool { … }Both live in different modules with no shared call path, so they'll diverge. Consider extracting a single 2. Aggressive default bump may surprise operators on smaller hostsThe 500GB disk ceiling is the default for all deployments. The comment explains sparse allocation, but the ENOSPC threshold is still 500GB. Operators on 100–200GB volumes who don't read release notes will hit this in production. Consider keeping the prior defaults and documenting the recommended production values in comments, or at minimum logging a startup warning that compares the configured ceiling against observed free space. 3.
|
After an optimize/compaction commit, proactively evict the cached full-file bytes of the files it tombstoned (present pre-commit, gone post-commit) instead of waiting for VACUUM / TTL / LRU to reclaim them. - Add SharedFoyerCache::evict_data_entry to drop a main-cache entry by key. - Add Database::evict_cache_for_uris, mirroring warm_cache_for_uris' relativization, called inline after both optimize paths swap the table. - Gate on new timefusion_evict_after_compaction config (default on). Correctness is unaffected: the files still exist in S3 until VACUUM, so a straggler query on the old Delta snapshot just re-reads them (a cache miss), never a wrong result. Eviction is in-cache only (no S3), so it runs inline. The test asserts on the in-memory layer: foyer's HybridCache::remove drops the memory entry synchronously but deletes the on-disk copy asynchronously, so an "evict then read" assertion would race the disk delete.
Code Review: Warm Foyer cache after compaction/flushOverview: The PR addresses a real p99 sawtooth problem — compaction outputs stream to S3 via multipart but the cache-write path was a pass-through, so every post-compaction dashboard read cold-started from S3. The fix is well-motivated and the implementation is thorough (configurable, best-effort, non-blocking), with good test coverage. A few issues worth addressing before merge. Bugs / Correctness
In
let handle = tokio::spawn(async move { ... }); // task 1 — always spawned
if let Ok(mut tasks) = self.background_tasks.try_lock() {
tasks.spawn(async move { let _ = handle.await; }); // task 2 — sometimes spawned
}If Breaking Default Change
Recommend:
Code QualityDuplicated pre-compaction tracking block (copy/paste, ~9 lines) The fn snapshot_file_uris(&self, table: &DeltaTable) -> Option<HashSet<String>> {
if self.config.maintenance.timefusion_warm_after_compaction
|| self.config.maintenance.timefusion_evict_after_compaction
{
Some(table.get_file_uris().map(|it| it.collect()).unwrap_or_default())
} else {
None
}
}Two confusingly similar "recency days" configs
Both configs filter on Duplicated
Performance
Even for files already in the main cache (warmed inline via
The clone is correct (noted in the comment), but worth calling out: for high-ingest tables this clone could be O(thousands of strings). Passing Test CoverageTests are comprehensive for the new cache primitives (
Minor
Overall the approach is sound and the implementation handles the important edge cases (best-effort, non-blocking, concurrency-bounded, eviction of tombstoned files). The breaking default change and potential warm no-op when the raw store is passed are the two items I'd want confirmed before merging. |
Code Review — Warm Foyer cache after compaction/flushThe problem statement is clear and well-motivated. The p99 sawtooth on leading-edge partitions is a real pain point and the design (warm from the existing read path, never block a commit, strictly best-effort) is sound. Several improvements are warranted before merging. Potential Bugs1. Cache key inconsistency in
insert_main(&self.cache, self.location.to_string(), ...);But self.insert_main_value(location, ...) // -> Self::make_cache_key(location)If 2. pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
store.get_opts(location, GetOptions::default()).await.is_ok()
}This works today because pub async fn warm_full(store: &dyn ObjectStore, location: &Path) -> bool {
match store.get_opts(location, GetOptions::default()).await {
Ok(r) => r.bytes().await.is_ok(),
Err(_) => false,
}
}3.
self.config.ttl.as_millis().min(u64::MAX as u128) as u64Code Quality and Design4. Warm/evict block copy-pasted across Lines ~2183-2197 and ~2451-2469 are nearly identical (compute 5. Double-spawn in
6. Duplicate recency-filtering logic
Default/Configuration Concerns7. This will ENOSPC deployments on typical cloud instances with 100-200 GB data disks. The comment is good but the default is too aggressive. Consider keeping 100 GB as the default and documenting 500 GB as a recommended production value, or emitting a startup warning when the configured value approaches available disk space. 8. The rationale is sound, but operators with long-running analytics queries (overnight batch jobs against Delta history) will see files vacuumed mid-query after upgrading. Worth calling out in the PR description and emitting a startup log when the value is below the old 72 h default. Minor Nits
Test CoverageThe test suite is thorough and the new tests cover all the main paths well. Two gaps worth adding:
SummaryThe core mechanism is correct and the test coverage is solid. The two things to resolve before merging are: (1) verify the cache key consistency in |
…dedup - Route CachingMultipartUpload::complete through make_cache_key so a multipart-warmed entry stays findable by a later GET if the key derivation ever changes. - Drain the GET body in warm_full so warming is correct for any inner store, not just the eager Foyer path. - Clamp ttl.as_millis() before the u64 cast in maybe_touch. - Extract swap_and_refresh_cache shared by both optimize paths so the warm/evict pair can't drift; document the orphaned-handle case. - Make warm_cache_for_uris async and await it directly, removing the double-spawn under warm_cache_for_table. - Unify date= partition recency parsing behind date_partition_within. - Label warm logs scope=full/footer-only instead of full=true/false. - Add cross-path key-consistency test (multipart warm -> read -> evict).
Code Review — PR #39: Warm Foyer cache after compaction/flushOverviewThis PR addresses a real, well-diagnosed problem: post-compaction p99 sawtooths caused by Foyer cache misses. The design is solid — best-effort, non-blocking, fire-and-forget, with Delta snapshot isolation guaranteeing correctness is unaffected. The sliding-TTL refresh and A few issues worth addressing before merge, ranging from a doc-comment bug to operational risk from the default changes. Bugs1. Fused doc comment on The function has two separate doc comment blocks merged into one paragraph: /// Whether `location` should be admitted to the cache given the recent-days
/// window. ... Keeps cold-tier rewrites ... out of the cache ...
/// Parse the `date=YYYY-MM-DD` partition segment from `s` and return whether it
/// is on or after `cutoff`. ... a `None` cutoff means no age limit.
The first block describes a cache-admission gate; the second describes what the function actually does. The first block appears to be a leftover from a function that was refactored away or renamed during development. The result is a misleading doc comment — the function is a general date parser, not an admission gate. The admission gate is Fix: Keep only the second block (starting at "Parse the Medium Issues2. Memory pressure from The default inline-warm cap is This is fine for the described hardware profile (500GB cache volumes likely live on large-memory nodes), but worth documenting explicitly in the config comment for 3.
4. Evict path silently skips prefix mismatches (
Operational / Default-Change Risk5. Default value changes are large and potentially breaking Three defaults changed simultaneously:
The comments for each are good and the ENOSPC warning in the config is helpful. However, the 100→500GB jump is the riskiest: it can cause silent data loss on underpowered nodes (foyer silently fails writes when the disk is full rather than returning an error to the caller). Consider whether this warrants a migration note or a runtime check against available disk space during startup. Minor / Style6. fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
crate::object_store_cache::date_partition_within(uri, cutoff)
}The wrapper exists solely to be called in 7. The same 5-line 8. No database-level unit test for The object-store-cache primitives ( Positive Callouts
|
- Fix the fused doc comment on date_partition_within: move the stranded cache-admission description back onto is_within_recent_window and keep only the parser doc (plus the shared-by note) on date_partition_within. - Log the first prefix mismatch (and a skipped count) in evict_cache_for_uris, mirroring the warm path, so a systematic relativization drift that leaves tombstoned files in cache is diagnosable instead of silently swallowed. - Document the worst-case transient heap (block_size_mb * warm_concurrency) on the foyer_block_size_mb config and point smaller instances at timefusion_warm_inline_max_mb. - Comment why warm_footer prefers HEAD + bounded GET over GetRange::Suffix (range cache keys need the absolute file size; HEAD also primes metadata cache; suffix ranges aren't universally supported).
Code ReviewPR: Warm Foyer cache after compaction/flush to kill the post-compaction p99 sawtooth Summary: This is a well-motivated and thoughtfully implemented PR. The root cause (compacted files landing cold in S3 bypassing Foyer) is correctly diagnosed, and the fix correctly targets the post-commit warm/evict lifecycle without touching the hot write path for correctness. The detached-task approach and best-effort error swallowing are appropriate. Below are findings ranging from blocking concerns to nits. Potential Issues1. Aggressive default disk-cache increase may cause ENOSPC on existing deployments
if config.disk_size_bytes > available_disk_bytes {
tracing::warn!("foyer disk ceiling ({}GB) may exceed available space …", …);
}2. Vacuum retention reduced from 72 h → 48 h — risky default change The comment justifies this with "no query runs beyond ~1 h," but that assumption isn't enforced anywhere. A long analytical query, a paused DataFusion session, or a client with a held Delta snapshot could hold a reference past 48 h, at which point VACUUM will delete files the snapshot still needs (yielding 404s or corrupt results). 72 h was already conservative; tightening it in the same PR as cache infrastructure changes increases blast radius. Suggest keeping 72 h or separating this into its own PR with explicit discussion. 3. // object_store_cache.rs ~line 891
let v = entry.value();
insert_main(&cache, key.clone(), CacheValue::new(v.data.clone(), v.meta.clone()), …);For a 128 MB cached file this is a 128 MB allocation inside a spawned task on every cache hit older than if l1_max_entry_bytes > 0 && entry.value().data.len() > l1_max_entry_bytes {
return; // disk-only entries: skip slide-TTL, LRU handles them
}4. Dual prefix-strip logic duplicated between Both functions independently compute: let prefix = table_uri.split('?').next().unwrap_or(…).trim_end_matches('/');and both handle prefix-mismatch drops with near-identical Minor / Nits5. Two S3 round-trips per footer warm (HEAD + GET)
pub async fn warm_footer(store: &dyn ObjectStore, location: &Path, size_hint: u64, known_size: Option<u64>) -> boolNot blocking — the existing behaviour is correct. 6. The test relies on 7. These two windows serve related but distinct roles: one gates cache admission, the other gates warming. Operators tuning one will likely need to tune the other; a cross-reference comment linking them would reduce misconfiguration risk. 8. if age.saturating_mul(2) <= self.config.ttl.as_millis().min(u64::MAX as u128) as u64 {This is correct. Optionally, it's slightly clearer written as: let half_ttl = (self.config.ttl.as_millis().min(u64::MAX as u128) as u64) / 2;
if age <= half_ttl { return; }Not a blocking issue. What's Well Done
Verdict: Approve with the caveat that items 1 (disk default ENOSPC risk) and 2 (vacuum retention reduction) deserve a second look before merge — both are backward-incompatible behaviour changes that are orthogonal to the core caching fix and should be either separated or explicitly called out in the merge description. |
Read path (the query-latency win): in the default footer-only warm mode a cached footer read still paid 2 uncached S3 HEADs per file. Now: - get_range_cached probes the metadata range cache before any HEAD, so a steady-state footer hit pays zero S3 round-trips. - head_cached caches immutable parquet ObjectMeta (by a #meta key) and is used in place of inner.head, so the meta HEAD is paid at most once per file. Mutable paths (Delta log / _last_checkpoint) are never meta-cached. Warm path: warm_footer now issues a single GetRange::Suffix instead of HEAD + bounded GET. get_opts learns the resolved absolute range + size from the suffix response and caches the footer under the same absolute key bounded reads use, plus the ObjectMeta. Falls back to HEAD + bounded GET for stores without suffix-range support. Adds a CountingStore-backed test asserting a suffix warm is 1 GET / 0 HEAD and a later warmed footer read is 0 HEAD / 0 GET.
Code Review: Warm Foyer cache after compaction/flushSummary: This PR eliminates the post-compaction p99 latency sawtooth by warming the Foyer cache for newly-written Parquet files after flush/optimize commits, while proactively evicting tombstoned files. It also adds several complementary improvements: multipart upload tee-to-cache, sliding TTL refresh, HEAD result caching, suffix-range normalisation, and a recency admission window for cache entries. Overall this is well-engineered and thoroughly documented. The best-effort pattern is applied consistently throughout. A few issues are worth addressing before merge. Bugs / Correctness
// warm_footer (suffix path) — correctly drains:
Ok(result) => result.bytes().await.is_ok(),
// warm_footer_via_head (fallback) — does NOT drain:
store.get_opts(location, opts).await.is_ok()For
Breaking Default ChangesThree default values changed with no migration note in the PR description:
The The Design ObservationsTwo partially-overlapping "recency" configs
These serve different purposes but an operator tuning one might expect the other to follow. The relationship is not documented in either field's doc comment. Add a cross-reference so it is clear that warming respects its own narrower window and admission respects a broader one.
/// Runs in a detached, concurrency-bounded task; never blocks or fails the commit.
async fn warm_cache_for_uris(&self, ...) { ... }The function is itself Test CoverageThe new test suite is excellent. A few notes:
tokio::time::sleep(Duration::from_millis(600)).await; // past ttl/2
let _ = cache.get(&path).await?; // triggers background touch
tokio::time::sleep(Duration::from_millis(200)).await; // let re-insert land
tokio::time::sleep(Duration::from_millis(400)).await; // > original TTLThis relies on a background task completing within 200 ms. Under load on a slow CI runner this can race. Replace with
The comment acknowledges that background refresh tasks may detach if Nits
Summary
The core warming logic and test coverage are solid. Fixing the ENOSPC default and moving the vacuum change out are the two items I'd treat as blockers before merge. |
- Run cargo +nightly fmt --all to satisfy the Format check (struct-field alignment + line-width normalization of the new code). - Use path.as_ref() instead of &path.to_string() in two tests (clippy::unnecessary_to_owned, denied via -D warnings). - Add ..Default::default() to the FoyerCacheConfig literal in cache_performance_test.rs, which was missing the block_size_bytes / cache_recent_days / l1_max_entry_bytes / warm_inline_max_bytes fields added by this PR.
Code Review — PR #39: Warm Foyer cache after compaction/flushTL;DR: The approach is well-motivated and the implementation is largely solid. A few issues are worth discussing before merging: a breaking default-config change that could ENOSPC smaller deployments, some subtle correctness risks in the cache-warming path, and a few redundancy / clarity nits. What this PR doesPost-compaction p99 sawtooth caused by newly-written files never entering Foyer. This PR fixes that with three mechanisms:
Bonuses: sliding TTL on cache hits, suffix-range support (one round-trip footer warm), and cached Issues🔴 Breaking default changes for operators upgrading// config.rs
const_default!(d_foyer_disk_gb: usize = 500); // was 100
const_default!(d_foyer_memory_mb: usize = 1024); // was 512
const_default!(d_vacuum_retention: u64 = 48); // was 72The disk default going from 100 GB to 500 GB is dangerous for existing deployments. Foyer creates the backing file sparse, but ENOSPC hits before eviction kicks in — an operator on a smaller instance (or a developer running locally) who upgrades without reading the changelog hits a hard failure. Consider:
The vacuum retention reduction from 72 h → 48 h is also a behavior change. The 48× safety margin reasoning is sound for the described workload, but a comment in the config and a CHANGELOG entry would help operators who set explicit retention policies. 🟡
|
test_batch_queue_under_load timed out after 30s on the prior run. It's a #[serial] integration test against shared-state CI Postgres/MinIO under load — the same family as the adjacent test_concurrent_* tests already marked #[ignore = "wedges under shared-state CI"]. The warm-on-flush path this PR adds is wired only in main.rs's delta_write_callback, which this test does not install, and the write-time multipart capture it does exercise is non-blocking (its unit tests pass). Re-kicking to confirm the flake.
|
Code Review: Warm Foyer cache after compaction/flush Overview This PR addresses a real, observable problem: compaction outputs stream to S3 via multipart but bypass Foyer, so every post-compaction dashboard read of the hot leading-edge partitions cold-starts. The fix is solid in concept — warm footers right after a commit, evict tombstoned files proactively, and tee multipart writes into the cache inline. The code is careful, well-commented, and extensively tested. Issues
vacuum_retention drops from 72h to 48h in config.rs. This tightens the safety window for in-flight queries that hold a Delta snapshot referencing files vacuum would delete. It has nothing to do with cache warming and the two changes should not share a review. Please split this out — it deserves its own discussion about query duration SLAs.
foyer_disk_gb: 100 → 500 — risks ENOSPC on smaller cache volumes despite sparse-file semantics These need explicit call-outs in the PR description and ideally a migration note.
With block_size_bytes = 256 MB and warm_concurrency = 4, up to 1 GB of transient heap can accumulate during a busy maintenance window. The struct doc mentions this, but operators won't read struct docs before deploying. Consider either defaulting warm_inline_max_bytes to something smaller (e.g. 64 MB) and letting operators raise it, or emitting a startup log line when block_size_bytes × warm_concurrency exceeds 512 MB.
The test sleeps 200ms waiting for a background tokio::spawn to land before asserting. On a loaded CI runner this will be flaky. Use tokio::time::pause() / tokio::time::advance(), or expose a JoinHandle-based await path for test contexts.
For FoyerObjectStoreCache this is fine (get_range_cached caches eagerly before returning). For a lazily-caching ObjectStore the warming would silently no-op. Add a comment explaining the eager-cache assumption, or call .bytes().await and discard to make the intent unambiguous regardless of the inner store.
The log says "next query benefits" alongside a past-tense rate. Phrasing it as "pre-warm foyer main hit rate" would make the before/after relationship explicit in dashboards. Positive highlights
Minor nits
Summary The core approach is sound and the implementation is careful. Two things worth addressing before merge:
The timing-sensitive test is a CI reliability concern worth fixing, but not strictly blocking if the team accepts the flakiness risk. |
The warm-after-compaction change set the main data cache's foyer block (region) size to block_size_bytes (128-256MB) so full compaction outputs could persist. But foyer carves the disk device into block-sized regions: when the block is >= the device, there are zero usable regions and every disk insert stalls. CI runs the cache at TIMEFUSION_FOYER_DISK_MB=50 with a 256MB block -> 0 regions, so once the new write-path caching (multipart capture + put_cached) started inserting flushed parquet onto disk, test_batch_queue_under_load wedged and hit its 30s timeout. Master used a 4MB block (file_size_bytes) and so never tripped this. Cap the data block at a quarter of the device (>= several regions), floored at the legacy 4MB granularity and never above the device itself. Production disks (GBs) still get the full 128MB block; small/test disks stay healthy. The metadata cache already used the 4MB file_size_bytes block and is unaffected.
Code Review — PR #39: Warm Foyer cache after compaction/flushOverviewThis PR addresses a real, measurable problem: compaction outputs landing cold in S3 while Foyer caches the pre-compaction files, causing a p99 sawtooth on the partitions dashboards hit hardest. The solution is well-thought-out — warm footers and optionally full files post-commit, evict tombstoned file entries inline, gate warming by partition recency, and tee multipart write bytes into the cache to avoid re-downloading what was just uploaded. The architecture is correct: warming is strictly best-effort and never blocks or fails a commit; correctness relies only on Delta snapshot isolation, which is already a system invariant. The Bugs / Correctness Concerns1. Sliding TTL: stuck key on task panic ( In if !self.refreshing.insert(key.to_string()) { return; }
// ...
let handle = tokio::spawn(async move {
insert_main(&cache, key.clone(), ...);
refreshing.remove(&key); // removed here, inside the task
});If the spawned task panics before 2.
3. The eviction path removes only the full-file cache entry via Significant Default Changes — Operator ImpactThree defaults changed substantially and warrant explicit call-outs in release notes or RUNBOOK:
The comment on Design / Clarity ObservationsTwo overlapping recency configs
The PR consolidates the swap+warm/evict step into Transient memory worst-case should be in RUNBOOK From the config doc comment: "Up to Potential CI Flakiness
The test sleeps 600 ms, then 200 ms ("let the re-insert land"), then 400 ms before asserting the entry is still alive. On a heavily loaded CI runner the 200 ms re-insert window may not be enough, causing the entry to expire before the final assertion. Using Minor Nits
SummaryThe core design is solid and the test suite is excellent — especially the key-consistency and HEAD-elimination tests, which lock in the two main performance invariants. Items to address before merge:
|
put_cached now warms the cache directly from the write payload instead of re-fetching each file from the inner store after the write. The test_cache_performance_and_s3_bypass integration test still asserted the old post-write re-GET (inner_gets == 3); update it to inner_gets == 0, matching the lib unit tests already updated in this PR. Reads still hit cache, so hits == 6 / misses == 0 are unchanged.
Code Review: Warm Foyer cache after compaction/flushGreat PR overall — the problem statement is clear, the solution is layered correctly (multipart tee → post-commit warm → footer warming), and the test suite is unusually thorough for cache-layer code. A few issues worth addressing before merge. Bugs1.
// warm_footer_via_head — body is NOT consumed
store.get_opts(location, opts).await.is_ok()The code even documents this requirement in
For a real HTTP/S3 transport, dropping the match store.get_opts(location, opts).await {
Ok(result) => result.bytes().await.is_ok(),
Err(_) => false,
}2. Previously Breaking / Risky Defaults3. Default This is a 5× increase in the logical ceiling foyer will try to use. The comment is correct that foyer creates sparse files, but the ENOSPC warning applies on any disk smaller than 500 GB. An existing deployment running a 200 GB cache volume will hit ENOSPC silently on upgrade. Consider defaulting to something closer to the original (e.g. 200 GB) and letting operators opt into 500 GB, or at startup log a loud warning when 4. This is an unrelated behavioral change bundled with cache warming. It affects correctness guarantees (how long tombstoned files survive before being deleted), not cache hit-rate. It should ship separately with its own discussion, especially since the justification ("no query runs longer than ~1h") is an operational assumption that varies by deployment. Minor Issues5. // database.rs
fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
crate::object_store_cache::date_partition_within(uri, cutoff)
}This adds a private free function that does nothing but forward. The call sites can use 6. Duplicated The same 7-line block appears verbatim in both optimize paths. Since both now funnel through 7. if let Ok(table_ref) = db.resolve_table(&project_id, &table_name).await {
...
}
// resolve failure: no log, warming silently skippedA 8. buffer: Some(Vec::new()),For the common path (small files that fit under the cap), the buffer will reallocate many times as chunks arrive. Pre-allocating to Test CoverageThe test suite is genuinely excellent —
Summary
The core approach is solid — warming after commit rather than inline with the multipart write is the right call, and the concurrency-bounded detached-task pattern is correct. Fixing the |
/simplify cleanups on the warm-after-compaction diff (no behavior change): - Extract table_cache_prefix() + relativize_to_prefix() in database.rs and use them in both warm_cache_for_uris and evict_cache_for_uris. The two paths previously duplicated the query-string strip / trailing-slash trim / strip_prefix dance; sharing it keeps warmed and evicted cache keys derived identically so they can't desync on a one-char drift. - Extract capped_block_size() in object_store_cache.rs and apply it to BOTH the data and metadata cache builders. The data-cache capping was added inline last commit; the metadata builder still used a raw block size and could wedge the same way on a small metadata disk. One helper guards both. - Lift the TTL clamp in maybe_touch into a named ttl_millis local for readability. Skipped (intentional/out-of-scope): applying the recency window to the on-read footer path (reads should cache what they read), CacheValue -> Arc<Vec<u8>> to cheapen refresh clones (data-model change beyond this diff), and inlining make_cache_key in multipart complete (kept for read/write key lockstep, as its comment notes).
Code Review: Warm Foyer cache after compaction/flushOverviewThis PR tackles a real and well-diagnosed performance problem: post-compaction p99 sawtooth caused by Foyer never being warmed after multipart uploads to S3. The approach is sound — warming footers (cheap, high-impact for query planning) and optionally full files (heavier, configurable off by default), with proactive eviction of tombstoned files. The PR is large but well-scoped: it doesn't change query semantics, only the cache hit rate. Potential Bugs
// warm_footer_via_head (line ~749)
store.get_opts(location, opts).await.is_ok() // ← bytes never consumedFor match store.get_opts(location, opts).await {
Ok(result) => result.bytes().await.is_ok(),
Err(_) => false,
}
for attempt in 0..3 {
let table_clone = { ... };
let pre_uris: HashSet<String> = if track_files { ... };On a retry the Metadata cache key collision between
Performance Concerns
On a hot read path, every hit past
If the first Breaking / Surprising Defaults
This is the most operationally risky part of the PR. Even though Foyer creates sparse files, 500GB is the logical ceiling at which eviction kicks in. Operators who rely on the default and run on hosts with, say, 200GB of disk cache space will hit ENOSPC before eviction — exactly the failure mode described in the config comment. The warning added to
The justification (48x safety margin for ≤1h queries) is reasonable, but this is a global behavior change. Any operator with slow queries or who relies on time-travel back 49–72h will silently lose that window. This should be called out explicitly in the PR description or a CHANGELOG entry, even if the justification is sound. Code Quality
// database.rs
fn within_recency(uri: &str, cutoff: Option<chrono::NaiveDate>) -> bool {
crate::object_store_cache::date_partition_within(uri, cutoff)
}This adds a layer of indirection without adding semantics. Call Two separate recency knobs with confusing naming
Tracking if dropped == 0 {
debug!("warm: URI {} does not start with table prefix {}; ...", u, prefix);
}
dropped += 1;This logs the first mismatched URI but silently accumulates the rest. The final Test CoverageExcellent. The PR ships tests for every new behavior:
One gap: there's no integration test asserting the end-to-end property — "after SummaryThe core design is correct and the implementation is careful. The three issues worth fixing before merge are:
The sliding TTL, multipart capture, and suffix-range cache path are all solid additions. The test suite is thorough. |
Two small conciseness consolidations from the deep-research conciseness pass (no behavior/test change): - put_result_meta(): the write path built a field-identical ObjectMeta from a PutResult in two places (put_cached and the multipart capture's complete()). Share one helper. - MIB/GIB consts in config.rs replace the repeated '* 1024 * 1024' chains in the CacheConfig *_bytes() accessors, so they read as the unit they mean. Skipped (reported): no library swap shrinks the bespoke pieces (multipart capture is required ObjectStore trait boilerplate, date-partition parse and sliding-TTL have no drop-in in the current deps), and a chunk-collect helper would have a single real caller.
Code Review — PR #39: Post-compaction Foyer cache warmingOverviewThis PR fixes a p99 sawtooth on recently-written partitions by warming the Foyer cache after each compaction/flush commit. The root cause is correct: multipart writes bypass the The implementation is generally clean and well-documented. A few things deserve attention before merging. Breaking default changes (no deprecation period)These affect any operator who upgrades without reading the changelog:
Consider moving these three defaults to their own PR or adding an upgrade note. Correctness issues
fn capped_block_size(desired: usize, disk_size: usize) -> usize {
desired.min(disk_size / 4).max(MIN_DISK_BLOCK_BYTES).min(disk_size)
}If fn capped_block_size(desired: usize, disk_size: usize) -> usize {
if disk_size == 0 { return MIN_DISK_BLOCK_BYTES; }
desired.min(disk_size / 4).max(MIN_DISK_BLOCK_BYTES).min(disk_size)
}
Err(_) => warm_footer_via_head(store, location, metadata_size_hint).await,If a file was removed between the warm list being computed and the GET (e.g. concurrent vacuum), this fires a redundant HEAD. Since Design observations
The comment acknowledges this ("don't assume it joins every one"), but a
fn put_part(&mut self, data: PutPayload) -> object_store::UploadPart {
if let Some(buf) = self.buffer.as_mut() {
if buf.len().saturating_add(data.content_length()) > self.max_warm_bytes {
self.buffer = None;
} else {
for chunk in data.iter() { buf.extend_from_slice(chunk); }
}
}
self.inner.put_part(data)
}The cap is enforced before the new part is appended, so
The flush path in Code qualityGood:
Minor style nit: SummaryThe functional design is correct and the test coverage is solid. The two items I'd want resolved before merging:
Everything else is low-severity or follow-up material. |
…itions delta-rs's ZOrder planner has no idempotence guard — no size check, no single-file-bin skip, no "already sorted" detection — so the full optimize job rewrote every file in today + yesterday + 2-days-ago into new (cold, multipart) object keys on every 30-min run, even for sealed days that hadn't changed. PR #39 added cache warming to soften the resulting cold-starts, but the cheaper fix is to not do the redundant rewrites in the first place. optimize_table now snapshots each window partition's live file set and skips any partition whose set is unchanged since the last successful optimize; `today` is always processed (the growing leading edge). Sealed days are z-ordered once, then skipped — so #39's warm/evict has far less cold churn to chase. The new file set is recorded in an in-memory per-table map (zorder_filesets); a restart re-z-orders each in-window partition once, harmlessly. Composes with #39: this narrows what gets rewritten; #39 still warms whatever actually was. New metrics optimize.partitions_rewritten / partitions_skipped expose the churn-avoided signal (after-the-fact monitoring, no telemetry pipeline needed). Unit tests cover the fileset grouping and change detection. https://claude.ai/code/session_01PLU62s7eRhSMMUXEHSWRCx
…itions delta-rs's ZOrder planner has no idempotence guard — no size check, no single-file-bin skip, no "already sorted" detection — so the full optimize job rewrote every file in today + yesterday + 2-days-ago into new (cold, multipart) object keys on every 30-min run, even for sealed days that hadn't changed. PR #39 added cache warming to soften the resulting cold-starts, but the cheaper fix is to not do the redundant rewrites in the first place. optimize_table now snapshots each window partition's live file set and skips any partition whose set is unchanged since the last successful optimize; `today` is always processed (the growing leading edge). Sealed days are z-ordered once, then skipped — so #39's warm/evict has far less cold churn to chase. The new file set is recorded in an in-memory per-table map (zorder_filesets); a restart re-z-orders each in-window partition once, harmlessly. Composes with #39: this narrows what gets rewritten; #39 still warms whatever actually was. New metrics optimize.partitions_rewritten / partitions_skipped expose the churn-avoided signal (after-the-fact monitoring, no telemetry pipeline needed). Unit tests cover the fileset grouping and change detection. https://claude.ai/code/session_01PLU62s7eRhSMMUXEHSWRCx
…itions (#40) delta-rs's ZOrder planner has no idempotence guard — no size check, no single-file-bin skip, no "already sorted" detection — so the full optimize job rewrote every file in today + yesterday + 2-days-ago into new (cold, multipart) object keys on every 30-min run, even for sealed days that hadn't changed. PR #39 added cache warming to soften the resulting cold-starts, but the cheaper fix is to not do the redundant rewrites in the first place. optimize_table now snapshots each window partition's live file set and skips any partition whose set is unchanged since the last successful optimize; `today` is always processed (the growing leading edge). Sealed days are z-ordered once, then skipped — so #39's warm/evict has far less cold churn to chase. The new file set is recorded in an in-memory per-table map (zorder_filesets); a restart re-z-orders each in-window partition once, harmlessly. Composes with #39: this narrows what gets rewritten; #39 still warms whatever actually was. New metrics optimize.partitions_rewritten / partitions_skipped expose the churn-avoided signal (after-the-fact monitoring, no telemetry pipeline needed). Unit tests cover the fileset grouping and change detection. https://claude.ai/code/session_01PLU62s7eRhSMMUXEHSWRCx Co-authored-by: Claude <noreply@anthropic.com>
Compaction outputs (16MB light / 128MB full) stream to S3 via multipart
upload, and put_multipart_opts is a pure pass-through — so every compacted
file lands cold and is never warmed into Foyer. The leading-edge partitions
that "last 1h/6h/24h" dashboards hit get rewritten every 5/30 min, so each
refresh after a compaction misses Foyer, falls to S3, and the cache bytes we
already paid for go stale. p99 sawtooths on exactly the partitions users
watch.
This warms the cache for newly-written files right after the commit is
visible, reusing the existing read path (no inline multipart change needed):
object_store_cache: warm_footer() issues a ranged GET of the parquet
footer so it lands in the metadata cache (query planning pays zero S3
round-trips); warm_full() primes the main full-file cache for data reads.
Both are strictly best-effort — all errors swallowed.
database: warm_cache_for_uris() diffs pre/post file sets to warm only the
files a commit adds, filtered to partitions within
timefusion_warm_recency_days so we don't burn GETs on cold partitions.
Runs in a detached, concurrency-bounded task; never blocks or fails the
commit. Wired into optimize_table, optimize_table_light, and the flush
callback. Logs files-warmed + current foyer hit-rate to size the win.
config: timefusion_warm_after_compaction (default on, footers always),
timefusion_warm_full_files (default off), timefusion_warm_recency_days
(default 2, covers the 48h full-optimize window), timefusion_warm_concurrency.
Correctness is unaffected — Delta snapshot isolation means warming only
changes hit-rate, never which files a query reads.
https://claude.ai/code/session_01GzJmskCVjxaSVJhER2dAFp