perf: Cache `Transaction.table_metadata` between staged updates #3302
rynewang wants to merge 2 commits into `apache:main`
Conversation
`Transaction.table_metadata` calls `update_table_metadata` on every access, which replays every staged update through `model_copy(deep=True)`. Code that reads the property repeatedly within a single logical operation (e.g. the snapshot producer's `schema()`/`spec()`/`new_manifest_writer()` helpers inside per-manifest loops) pays that cost each time.

This change caches the result, keyed on the identity of the base metadata and the staged-updates tuple. Since `_updates` is an immutable tuple, every `self._updates += (...)` rebinds it to a new object, so the identity check self-invalidates without any explicit cache-clearing at mutation sites. The cache also invalidates when the underlying `Table.metadata` is refreshed after a commit.

The only observable difference is that `last_updated_ms` on the returned metadata is now stable across repeated reads of the same logical state, instead of being re-stamped with `now()` on each access. The timestamp written at commit time is unaffected.

Adds a test that asserts repeated reads compute once, and that staging an update invalidates and recomputes.
**geruh** left a comment:
Why not just do this:
```python
@property
def table_metadata(self) -> TableMetadata:
    if not self._updates:
        return self._table.metadata
    return update_table_metadata(self._table.metadata, self._updates)
```

Instead of introducing a cache here we could just return the metadata if there are no pending updates. That avoids the cache-lookup complexity and should be a (small, lol) bit faster than this for a simple append. I also don't see a world where we stage updates and read frequently. The fix above will handle the normal commit case, unless I'm missing something.
Hi @geruh, thanks for the fast review! On the simplified design though, the case it doesn't cover is reading the property repeatedly while updates are already staged. Same story for any multi-op transaction where something stages an update before the append (e.g. `set_properties` then `fast_append`). My use case is: create a … Keeping a ref to the update tuple won't copy the memory, it just increments a refcount, so I think it's fine?
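The rebinding point in the reply can be seen directly in plain Python (an illustrative snippet, not PyIceberg code):

```python
# Why `self._updates += (...)` self-invalidates the identity check:
# += on a tuple builds a brand-new tuple object and rebinds the name,
# so a cached reference still points at the old object and an `is`
# comparison fails on the next read.

staged = ()                      # staged updates, as in Transaction._updates
cached_ref = staged              # what the cache would have captured

staged += ("set_properties",)    # rebinds: new tuple, new identity

print(cached_ref is staged)      # False: the cache sees a different object
print(cached_ref)                # (): the old tuple is unchanged
```

Holding `cached_ref` only bumps the old tuple's refcount; no element is copied until a new tuple is actually built by `+=`.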
Summary
`Transaction.table_metadata` calls `update_table_metadata` on every access, which replays every staged update through `model_copy(deep=True)`. Code that reads the property repeatedly within a single logical operation — e.g. the snapshot producer's `schema()`/`spec()`/`new_manifest_writer()` helpers inside per-manifest loops, or the ~5 reads in `_SnapshotProducer.__init__` — pays that cost each time.

Change
Cache the result, keyed on the identity of the two inputs (`self._table.metadata`, `self._updates`). Since `_updates` is an immutable tuple, every `self._updates += (...)` rebinds it to a new object, so the `is` check self-invalidates without any explicit cache-clearing at mutation sites (`_stage`, `_apply`, `CreateTableTransaction._initial_changes`, `commit_transaction`). Likewise the cache misses when `Table.metadata` is refreshed after a commit.

Behavior note
The only observable difference is that `last_updated_ms` on the returned metadata is now stable across repeated reads of the same logical state, instead of being re-stamped with `now()` on each access. The timestamp written at commit time is unaffected.

Testing
New `test_transaction_table_metadata_cached` asserts that 11 consecutive reads trigger exactly one `update_table_metadata` call, and that staging an update invalidates and recomputes once.

Relationship to #2674 / #3301
#2674 and #3301 hoist the property access out of specific loops in `snapshot.py`. This PR fixes the root cause, so those hoists become unnecessary (though harmless), and future call sites don't need to remember to hoist.
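The contract pinned down by the test above can be illustrated with a self-contained toy that counts recomputations (`rebuild` and `Txn` are hypothetical names, not the real test code):

```python
# Toy re-creation of the cached-property contract asserted by the test:
# repeated reads compute once; staging an update recomputes exactly once.
calls = 0

def rebuild(base: dict, updates: tuple) -> dict:
    """Stand-in for update_table_metadata; counts invocations."""
    global calls
    calls += 1
    merged = dict(base)
    merged.update(dict(updates))
    return merged

class Txn:
    def __init__(self, base: dict) -> None:
        self._base = base
        self._updates: tuple = ()
        self._key = None       # (base, updates) refs backing the cache
        self._result = None

    @property
    def table_metadata(self) -> dict:
        if (self._key is not None
                and self._key[0] is self._base
                and self._key[1] is self._updates):
            return self._result
        self._key = (self._base, self._updates)
        self._result = rebuild(self._base, self._updates)
        return self._result

txn = Txn({"format-version": 2})
for _ in range(11):                        # 11 consecutive reads...
    txn.table_metadata
assert calls == 1                          # ...compute once

txn._updates += (("owner", "rynewang"),)   # staging rebinds the tuple
txn.table_metadata
txn.table_metadata
assert calls == 2                          # one recompute, then cached again
```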