Environment
self-hosted (https://develop.sentry.dev/self-hosted/)
Steps to Reproduce
OrganizationOnboardingTaskManager.record() can create severe PostgreSQL row-lock contention when many workers miss the onboarding-task cache for the same (organization_id, task) key at the same time.
In my case it produced a thundering herd of workers queued on the same small set of sentry_organizationonboardingtask rows. My self-hosted Sentry hit a production database incident in which queries like SELECT ... FROM sentry_organizationonboardingtask ... FOR UPDATE dominated the PostgreSQL master. They consumed around 95% of all DB CPU time, with a mean execution time of around 1.85s.
The problematic workload came from onboarding-task recording on high-volume signal paths, especially events with sourcemaps. Once a hot onboarding task row already exists and is complete, repeated cache misses still drive workers into update_or_create() instead of taking a non-locking no-op path.
Mitigations have been added since 25.5.1 (a large cache TTL and a completed-onboarding bypass), but a cache miss can still fall through to update_or_create(), which performs a locking read/update of the target row. Also, the skip option defaults to False and only helps once the organization option onboarding:complete exists.
Expected Result
When the onboarding task row already exists in a terminal state, record() should be able to return without issuing a SELECT ... FOR UPDATE or updating the row.
For an already-complete task, a cache miss should be a cheap non-locking read followed by a cache refill, not a locking write path.
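To make the expected behavior concrete, here is a minimal sketch of the control flow I have in mind. It is not Sentry's actual code: FakeTaskStore, read_status, the COMPLETE constant, and the set used as a cache are all hypothetical stand-ins, and the in-memory store only counts how often the plain-read path and the locking-write path would be taken.

```python
from dataclasses import dataclass, field

COMPLETE = "complete"  # hypothetical name for the terminal task status


@dataclass
class FakeTaskStore:
    """In-memory stand-in for the sentry_organizationonboardingtask table."""

    rows: dict = field(default_factory=dict)
    plain_reads: int = 0
    locking_writes: int = 0

    def read_status(self, key):
        # Models a plain SELECT, which takes no row lock.
        self.plain_reads += 1
        return self.rows.get(key)

    def update_or_create(self, key, status):
        # Models the SELECT ... FOR UPDATE plus INSERT/UPDATE path.
        self.locking_writes += 1
        self.rows[key] = status


def record(store, cache, organization_id, task, status=COMPLETE):
    """Return True only if a row was written."""
    key = (organization_id, task)
    if key in cache:
        return False  # cache hit: no database access at all
    if store.read_status(key) == COMPLETE:
        cache.add(key)  # refill the cache without touching the locking path
        return False
    store.update_or_create(key, status)
    cache.add(key)
    return True


store = FakeTaskStore()
cache = set()
first = record(store, cache, 1, "sourcemaps")   # row missing: one locking write
cache.clear()                                   # simulate a cache miss
second = record(store, cache, 1, "sourcemaps")  # row complete: plain read only
```

With this shape, a herd of workers missing the cache for an already-complete task would each issue one non-locking read instead of queuing on the same row lock. Whether the real task states and cache keys map cleanly onto this is something I have not verified.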
Actual Result
Heavy DB load and lock contention on sentry_organizationonboardingtask.
Product Area
Other
Link
No response
DSN
No response
Version
25.5.1