[superlog] Retry analytics queries once on transient ClickHouse connection drops#505
[superlog] Retry analytics queries once on transient ClickHouse connection drops#505superlog-app[bot] wants to merge 1 commit into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
2 Skipped Deployments
|
|
The latest updates on your projects. Learn more about Unkey Deploy
|
Greptile SummaryThis PR adds a single immediate retry in
Confidence Score: 3/5Safe to merge after removing ECONNREFUSED from the transient pattern list; all other retry logic is well-scoped. The core retry mechanism is sound for the intended incident class (idle connection resets). The econnrefused pattern is the issue — it represents a hard refusal rather than a pool blip, so an immediate retry doubles query latency during real outages without any chance of success. packages/ai/src/query/batch-executor.ts — specifically the TRANSIENT_CH_ERROR_PATTERNS array and the retry catch block. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[runSingle called] --> B{config exists?}
B -- No --> C[return unknown type error]
B -- Yes --> D[attempt = 0]
D --> E[new SimpleQueryBuilder + execute]
E -- success --> F[mergeWideEvent success\nreturn data]
E -- catch error --> G{attempt == 0 AND\nisTransientChError?}
G -- Yes --> H[continue: attempt = 1]
H --> E
G -- No --> I[mergeWideEvent query_error\nreturn error]
H -- catch error on attempt 1 --> I
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[runSingle called] --> B{config exists?}
B -- No --> C[return unknown type error]
B -- Yes --> D[attempt = 0]
D --> E[new SimpleQueryBuilder + execute]
E -- success --> F[mergeWideEvent success\nreturn data]
E -- catch error --> G{attempt == 0 AND\nisTransientChError?}
G -- Yes --> H[continue: attempt = 1]
H --> E
G -- No --> I[mergeWideEvent query_error\nreturn error]
H -- catch error on attempt 1 --> I
Reviews (1): Last reviewed commit: "[superlog] Retry analytics queries once ..." | Re-trigger Greptile |
| const TRANSIENT_CH_ERROR_PATTERNS = [ | ||
| "socket connection was closed", | ||
| "econnreset", | ||
| "econnrefused", | ||
| "network error", | ||
| "connection reset", | ||
| ]; |
There was a problem hiding this comment.
ECONNREFUSED is semantically different from the other patterns here and is not safely retried without a delay. ECONNRESET and "socket connection was closed" are connection-pool blips where the server closed an idle socket — retrying immediately makes sense. ECONNREFUSED means nothing is listening on the port at all: ClickHouse is down, restarting, or misconfigured. An immediate retry will hit the same refusal, doubling query latency for every in-flight request during an outage. In a batch of N concurrent queries this means every query takes 2× as long before returning an error, potentially making a recoverable outage much worse (thundering herd on restart). Removing it keeps the retry focused on the actual incident class (idle-connection drops) without adding risk during full-service unavailability.
| const TRANSIENT_CH_ERROR_PATTERNS = [ | |
| "socket connection was closed", | |
| "econnreset", | |
| "econnrefused", | |
| "network error", | |
| "connection reset", | |
| ]; | |
| const TRANSIENT_CH_ERROR_PATTERNS = [ | |
| "socket connection was closed", | |
| "econnreset", | |
| "network error", | |
| "connection reset", | |
| ]; |
| } catch (e) { | ||
| if (attempt === 0 && isTransientChError(e)) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
When the first attempt fails with a transient error and the retry succeeds, no telemetry is emitted for the retry event. Without this, it's impossible to know in production how often the retry path is being exercised, whether it's masking a degraded ClickHouse pool, or if the retry rate is trending upward over time. A single
mergeWideEvent call before the continue gives full visibility at essentially zero cost.
| } catch (e) { | |
| if (attempt === 0 && isTransientChError(e)) { | |
| continue; | |
| } | |
| } catch (e) { | |
| if (attempt === 0 && isTransientChError(e)) { | |
| mergeWideEvent({ query_transient_retry: 1, query_type: req.type }); | |
| continue; | |
| } |
Summary
When ClickHouse drops the socket connection during an analytics query, the
outbound_links(or any) widget returns empty data to the user. The outer HTTP response is HTTP 200, so the browser does not retry, leaving the chart blank with "Query execution failed".The
runSinglehelper inpackages/ai/src/query/batch-executor.tspreviously had a singletry/catchthat immediately returned the error with no retry. Transient socket-level errors ("socket connection was closed unexpectedly", ECONNRESET, etc.) from ClickHouse are not data errors — they are connection pool blips that resolve on the next attempt.The fix wraps the execute call in a two-attempt loop. On the first attempt, if the error matches a known transient pattern and it is not an AbortError (user disconnected), the query is retried immediately. If the retry also fails, or if the error is non-transient, the error is propagated as before. This prevents the user from seeing empty charts and eliminates the false-positive ERROR log for these events.
An alternative approach would be to add retry logic at the ClickHouse client level (in
packages/db/clickhouse), which would cover union queries too. The call-site approach chosen here is more surgical and easier to reason about for the batch executor's retry budget.Incident on Superlog
Was this PR helpful? Leave feedback — goes straight to the Superlog team.
Summary by cubic
Retry analytics queries once when transient ClickHouse connections drop to prevent blank charts and reduce false error logs. Implemented targeted retry logic in
packages/ai/src/query/batch-executor.tswithinrunSingle.isTransientChError) with patterns like "socket connection was closed", "ECONNRESET", etc.AbortError; retries once on first failure only.mergeWideEventif retry fails or error is non-transient.Written for commit 8a8e895. Summary will update on new commits.