
Fix coordinator stalls from large role transactions#37380

Open
SangJunBak wants to merge 3 commits into
MaterializeInc:mainfrom
SangJunBak:jun/sql-421

@SangJunBak SangJunBak commented Jun 30, 2026

Contributor

Let n be the number of catalog items.

This PR makes two optimizations:

  1. Every time we grant a privilege, we effectively update an item by mutating its privilege map. However, every time we update an item, we check whether the change could cause a name collision with another item, which is an O(n) operation. We now first check, in O(1), whether the operation is an update (as opposed to a create/delete) and whether the primary/unique key has changed at all. If it is an update and the key is unchanged, it cannot cause a collision, so we skip the scan.
  2. We batch the GRANT operations by object. Before, granting a privilege on one object from one grantor to 1000 grantees led to 1000 operations against the durable catalog. Now we batch them into one operation per object.

Motivation

Fixes sql-421

Verification

Verified by running Dennis' repro from the ticket. I also added a parallel workload for it and ran it before and after the change. It passes after the change, which indicates the coordinator is no longer blocked for minutes.

TableTransaction ran an O(n) uniqueness scan over the whole collection on every
update_by_key/set/set_many call, even when the update could not change uniqueness
(a privilege or owner change leaves name/schema/type untouched). Bulk operations
that update the same objects repeatedly, such as GRANT ... ON ALL TABLES IN
SCHEMA, call this in a loop, turning a single catalog transaction into
O(ops * catalog_items) work and wedging the single-threaded coordinator.

Add a second per-collection predicate, has_unique_key_changed, alongside
uniqueness_violation. An update is scanned only when it changed a field
uniqueness_violation reads.
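
The gating described above can be sketched roughly as follows. This is a minimal illustration, not the actual `TableTransaction` code: the `Item` and `Table` types, the `scans` counter, and the choice of schema plus name as the unique key are assumptions made for the example. Only the predicate names `uniqueness_violation` and `has_unique_key_changed` come from the commit message.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a catalog item; the real types differ.
#[derive(Clone, Debug, PartialEq)]
struct Item {
    schema: String,
    name: String,
    privileges: Vec<String>,
}

/// Two items collide when they share a schema and a name (assumed key).
fn uniqueness_violation(a: &Item, b: &Item) -> bool {
    a.schema == b.schema && a.name == b.name
}

/// O(1) check: true only if the update touched a field that
/// `uniqueness_violation` reads.
fn has_unique_key_changed(old: &Item, new: &Item) -> bool {
    old.schema != new.schema || old.name != new.name
}

// Hypothetical collection; `scans` counts how many O(n) scans ran.
struct Table {
    items: BTreeMap<u64, Item>,
    scans: usize,
}

impl Table {
    fn update(&mut self, id: u64, new: Item) -> Result<(), String> {
        let old = self
            .items
            .get(&id)
            .ok_or_else(|| format!("unknown id {id}"))?;
        // Only pay for the full scan when the unique key actually changed.
        if has_unique_key_changed(old, &new) {
            self.scans += 1;
            let collides = self
                .items
                .iter()
                .any(|(other, item)| *other != id && uniqueness_violation(item, &new));
            if collides {
                return Err(format!("name collision on {}.{}", new.schema, new.name));
            }
        }
        self.items.insert(id, new);
        Ok(())
    }
}
```

In this sketch a privilege-only update leaves `scans` at zero, while a rename still triggers the scan and can be rejected.
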
GRANT/REVOKE expands to one Op::UpdatePrivilege per (target, grantee), so
GRANT ... ON ALL TABLES IN SCHEMA s TO r1,...,rN produced N_tables * N_grantees
ops, each rewriting its target object durably. Carry a batch of MzAclItems per
target in Op::UpdatePrivilege and apply them with a single durable write per
target, collapsing the cross product to one write per object. Audit events stay
one per grantee.
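
As a rough sketch of the batching idea, the (target, grantee) cross product can be grouped by target so that each object gets a single op. The `AclItem` and `UpdatePrivilege` types and the `batch_by_target` helper below are hypothetical stand-ins for `MzAclItem` and `Op::UpdatePrivilege`, which have different shapes in the real code.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for MzAclItem.
#[derive(Clone, Debug, PartialEq)]
struct AclItem {
    grantee: String,
    privilege: String,
}

// Hypothetical stand-in for Op::UpdatePrivilege carrying a batch per target.
#[derive(Debug, PartialEq)]
struct UpdatePrivilege {
    target: String,
    acl_items: Vec<AclItem>,
}

/// Groups one (target, acl_item) pair per grantee into one op per target,
/// so each target needs only a single durable write.
fn batch_by_target(grants: Vec<(String, AclItem)>) -> Vec<UpdatePrivilege> {
    let mut by_target: BTreeMap<String, Vec<AclItem>> = BTreeMap::new();
    for (target, acl_item) in grants {
        by_target.entry(target).or_default().push(acl_item);
    }
    by_target
        .into_iter()
        .map(|(target, acl_items)| UpdatePrivilege { target, acl_items })
        .collect()
}
```

With this shape, the number of ops equals the number of targets, and the per-grantee audit events can still be emitted by iterating over each op's `acl_items`.
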
… during bulk GRANT

Adds a BulkPrivilegeGrant parallel-benchmark scenario: a background thread runs
GRANT/REVOKE SELECT ON ALL TABLES IN SCHEMA to 150 roles back to back over 150
tables, while ten closed loops measure SELECT 1 latency. Asserts SELECT 1 max
latency stays under 30s. Before the durable-layer and op-batching fixes a single
bulk grant wedged the coordinator for minutes.
@SangJunBak SangJunBak requested review from def- and ggevay June 30, 2026 23:40
@SangJunBak SangJunBak marked this pull request as ready for review June 30, 2026 23:40
@SangJunBak SangJunBak requested a review from a team as a code owner June 30, 2026 23:40
@SangJunBak SangJunBak changed the title from "catalog: skip uniqueness scan for uniqueness-neutral durable updates" to "Fix coordinator stalls from large role transactions" Jun 30, 2026