EAP distributed reads cancelled: Code 394 "Query was cancelled" (SNUBA-A23) #8076

Description

@phacops

Summary

EndpointTraceItemTable (and occasionally other EAP endpoints) intermittently fail with:

QueryException: Code: 394. DB::Exception: Received from snuba-events-analytics-platform-arm-5-3:9000. DB::Exception: Query was cancelled.

Sentry: SNUBA-A23 (https://sentry.sentry.io/issues/SNUBA-A23), ≈126 occurrences, ongoing since 2025-12-12.

Why this is tracked separately from the __set_* fix (#8074)

PR #8074 fixes the mixed-version Code: 10 ... Not found column ... While executing Remote. failures caused by constant IN-sets leaking unstable __set_<hash> identifiers into SELECT-clause column names (SNUBA-9W6, SNUBA-A1W, SNUBA-B6C, SNUBA-B62, SNUBA-B63, SNUBA-B67, SNUBA-A13).

SNUBA-A23 is a different failure. It is a distributed-query cancellation (QUERY_WAS_CANCELLED, code 394), not a query-construction error. The query in the failing event has no constant IN-set — its aggregate conditions are and(has(mapKeys(attributes_string_2), 'user'), True) — so the membership_as_has rewrite in #8074 does not apply and will not resolve it.
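To make the distinction between the two failure classes concrete, here is a minimal triage sketch that buckets an exception string by its ClickHouse error code. The names `classify` and `ParsedError` and the bucket labels are hypothetical and not part of Snuba; the only inputs assumed are the message shapes quoted in this issue.

```python
import re
from typing import NamedTuple, Optional

# "Code: <n>" as it appears at the start of the ClickHouse exception text.
_CODE_RE = re.compile(r"Code:\s*(\d+)")
# "Received from <host>:<port>" identifies the remote shard that reported the error.
_REMOTE_RE = re.compile(r"Received from ([\w.\-]+):(\d+)")


class ParsedError(NamedTuple):
    code: Optional[int]
    remote_host: Optional[str]
    bucket: str  # hypothetical triage label, not a Snuba concept


def classify(message: str) -> ParsedError:
    """Bucket an exception message into the #8074 class, the A23 class, or other."""
    code_match = _CODE_RE.search(message)
    remote_match = _REMOTE_RE.search(message)
    code = int(code_match.group(1)) if code_match else None
    host = remote_match.group(1) if remote_match else None

    if code == 394:
        # QUERY_WAS_CANCELLED: the failure tracked in this issue.
        bucket = "cancelled"
    elif code == 10 and "Not found column" in message:
        # Candidate for the __set_* class addressed by #8074.
        bucket = "set_identifier_candidate"
    else:
        bucket = "other"
    return ParsedError(code, host, bucket)


if __name__ == "__main__":
    example = (
        "QueryException: Code: 394. DB::Exception: Received from "
        "snuba-events-analytics-platform-arm-5-3:9000. "
        "DB::Exception: Query was cancelled."
    )
    print(classify(example))
```

Matching on the error code alone is deliberately coarse: a Code: 10 message is only a candidate for the #8074 class and would still need its column name checked for a `__set_` identifier.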

Observed context:

  • Endpoint: EndpointTraceItemTable/v1, referrer tagstore.get_groups_user_counts (a count_unique(user) aggregation).
  • duration_group: <10s — cancelled quickly, so not a long max_execution_time timeout.
  • The cancellation is reported by a remote shard (...arm-5-3:9000) via DB::QueryStatus::throwQueryWasCancelled.

Likely causes to investigate

  • A sibling shard/replica failing or erroring, causing the coordinator to cancel the query on the remaining shards.
  • A memory limit / overcommit tracker killing the query on a shard.
  • Upstream cancellation (client disconnect, rate limiter, or max_threads/concurrency limits).
  • Possible (unconfirmed) indirect link: if a co-located query hit the __set_* bug and made a shard error, in-flight sibling queries could be cancelled. Worth re-checking once #8074 (Inline constant IN-sets in SELECT-clause filters to fix mixed-version distributed reads) is deployed, to see whether A23's frequency drops.
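One way to test the first two hypotheses is to look at what each shard logged for the same distributed query. The sketch below only builds a SQL string against ClickHouse's `system.query_log` via the `clusterAllReplicas` table function; the function name `build_cancellation_probe` and the cluster name passed to it are hypothetical, and it assumes `query_log` is enabled and readable on the EAP cluster.

```python
import re
from datetime import datetime


def build_cancellation_probe(cluster: str, since: datetime, limit: int = 200) -> str:
    """Build SQL listing failed query_log entries on every replica.

    Rows are ordered by initial_query_id so that a code 394 entry on one
    shard sits next to whatever its sibling shards logged for the same
    distributed query.
    """
    # The cluster name is interpolated into SQL, so accept only a plain identifier.
    if not re.fullmatch(r"[A-Za-z0-9_]+", cluster):
        raise ValueError(f"unexpected cluster name: {cluster!r}")
    since_str = since.strftime("%Y-%m-%d %H:%M:%S")
    return f"""
SELECT
    hostName() AS host,
    event_time,
    initial_query_id,
    query_id,
    is_initial_query,
    exception_code,
    query_duration_ms,
    memory_usage,
    exception
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE event_time >= toDateTime('{since_str}')
  AND type IN ('ExceptionBeforeStart', 'ExceptionWhileProcessing')
  AND exception_code != 0
ORDER BY initial_query_id, event_time
LIMIT {int(limit)}
""".strip()


if __name__ == "__main__":
    # 'eap_cluster' is a placeholder, not the real cluster name.
    print(build_cancellation_probe("eap_cluster", datetime(2025, 12, 12)))
```

If a group of rows sharing an `initial_query_id` shows a non-394 error (for example a memory-limit error) on one host shortly before the 394 entries on the others, that would point toward a sibling-shard failure. If only 394 entries appear, upstream cancellation becomes more plausible.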

Suggested next steps

Filed as a follow-up while triaging the last hour of QueryExceptions alongside #8074.

Metadata

Labels

bug: Something isn't working