Skip to content

[superlog] Retry analytics queries once on transient ClickHouse connection drops#505

Open
superlog-app[bot] wants to merge 1 commit into
stagingfrom
superlog/retry-transient-ch-connection-drops
Open

[superlog] Retry analytics queries once on transient ClickHouse connection drops#505
superlog-app[bot] wants to merge 1 commit into
stagingfrom
superlog/retry-transient-ch-connection-drops

Conversation

@superlog-app

@superlog-app superlog-app Bot commented Jun 28, 2026

Copy link
Copy Markdown

Summary

When ClickHouse drops the socket connection during an analytics query, the outbound_links (or any) widget returns empty data to the user. The outer HTTP response is HTTP 200, so the browser does not retry, leaving the chart blank with "Query execution failed".

The runSingle helper in packages/ai/src/query/batch-executor.ts previously had a single try/catch that immediately returned the error with no retry. Transient socket-level errors ("socket connection was closed unexpectedly", ECONNRESET, etc.) from ClickHouse are not data errors — they are connection pool blips that resolve on the next attempt.

The fix wraps the execute call in a two-attempt loop. On the first attempt, if the error matches a known transient pattern and it is not an AbortError (user disconnected), the query is retried immediately. If the retry also fails, or if the error is non-transient, the error is propagated as before. This prevents the user from seeing empty charts and eliminates the false-positive ERROR log for these events.

An alternative approach would be to add retry logic at the ClickHouse client level (in packages/db/clickhouse), which would cover union queries too. The call-site approach chosen here is more surgical and easier to reason about for the batch executor's retry budget.

Incident on Superlog


Was this PR helpful? Leave feedback — goes straight to the Superlog team.


Summary by cubic

Retry analytics queries once when transient ClickHouse connections drop to prevent blank charts and reduce false error logs. Implemented targeted retry logic in packages/ai/src/query/batch-executor.ts within runSingle.

  • Bug Fixes
    • Added transient error detection (isTransientChError) with patterns like "socket connection was closed", "ECONNRESET", etc.
    • Skips retry on AbortError; retries once on first failure only.
    • Preserves existing error propagation and telemetry via mergeWideEvent if retry fails or error is non-transient.

Written for commit 8a8e895. Summary will update on new commits.

Review in cubic

@vercel vercel Bot temporarily deployed to Preview – documentation June 28, 2026 08:02 Inactive
@vercel

vercel Bot commented Jun 28, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
databuddy-status Ready Ready Preview, Comment Jun 28, 2026 8:02am
2 Skipped Deployments
Project Deployment Actions Updated (UTC)
dashboard Skipped Skipped Jun 28, 2026 8:02am
documentation Skipped Skipped Jun 28, 2026 8:02am

@vercel vercel Bot temporarily deployed to Preview – dashboard June 28, 2026 08:02 Inactive
@unkey-deploy

unkey-deploy Bot commented Jun 28, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Unkey Deploy

Name Status Preview Inspect Updated (UTC)
api (preview) Ready Visit Preview Inspect Jun 28, 2026 8:03am

@greptile-apps

greptile-apps Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a single immediate retry in runSingle for transient ClickHouse connection errors (socket resets, ECONNRESET, etc.) to prevent blank charts when the connection pool drops a socket between queries.

  • Introduces isTransientChError to classify retryable errors by pattern-matching the lowercased message, guarding against AbortErrors, and a for (attempt < 2) loop in runSingle that silently continues on the first transient failure.
  • The retry covers both direct single-query paths and the union-query fallback path (since union failures already delegate to runSingle).
  • ECONNREFUSED is included in the transient patterns but unlike the others it indicates a hard connection refusal (nothing listening on the port), so an immediate retry doubles query latency and retry load during an actual ClickHouse outage rather than during a pool blip.

Confidence Score: 3/5

Safe to merge after removing ECONNREFUSED from the transient pattern list; all other retry logic is well-scoped.

The core retry mechanism is sound for the intended incident class (idle connection resets). The econnrefused pattern is the issue — it represents a hard refusal rather than a pool blip, so an immediate retry doubles query latency during real outages without any chance of success.

packages/ai/src/query/batch-executor.ts — specifically the TRANSIENT_CH_ERROR_PATTERNS array and the retry catch block.

Important Files Changed

Filename Overview
packages/ai/src/query/batch-executor.ts Adds a two-attempt retry loop in runSingle and an isTransientChError classifier. ECONNREFUSED is included as a transient pattern but represents a hard connection refusal, which doubles query latency during real outages. No telemetry is emitted for successful retries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[runSingle called] --> B{config exists?}
    B -- No --> C[return unknown type error]
    B -- Yes --> D[attempt = 0]
    D --> E[new SimpleQueryBuilder + execute]
    E -- success --> F[mergeWideEvent success\nreturn data]
    E -- catch error --> G{attempt == 0 AND\nisTransientChError?}
    G -- Yes --> H[continue: attempt = 1]
    H --> E
    G -- No --> I[mergeWideEvent query_error\nreturn error]
    H -- catch error on attempt 1 --> I
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[runSingle called] --> B{config exists?}
    B -- No --> C[return unknown type error]
    B -- Yes --> D[attempt = 0]
    D --> E[new SimpleQueryBuilder + execute]
    E -- success --> F[mergeWideEvent success\nreturn data]
    E -- catch error --> G{attempt == 0 AND\nisTransientChError?}
    G -- Yes --> H[continue: attempt = 1]
    H --> E
    G -- No --> I[mergeWideEvent query_error\nreturn error]
    H -- catch error on attempt 1 --> I
Loading

Reviews (1): Last reviewed commit: "[superlog] Retry analytics queries once ..." | Re-trigger Greptile

Comment on lines +54 to +60
const TRANSIENT_CH_ERROR_PATTERNS = [
"socket connection was closed",
"econnreset",
"econnrefused",
"network error",
"connection reset",
];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ECONNREFUSED is semantically different from the other patterns here and is not safely retried without a delay. ECONNRESET and "socket connection was closed" are connection-pool blips where the server closed an idle socket — retrying immediately makes sense. ECONNREFUSED means nothing is listening on the port at all: ClickHouse is down, restarting, or misconfigured. An immediate retry will hit the same refusal, doubling query latency for every in-flight request during an outage. In a batch of N concurrent queries this means every query takes 2× as long before returning an error, potentially making a recoverable outage much worse (thundering herd on restart). Removing it keeps the retry focused on the actual incident class (idle-connection drops) without adding risk during full-service unavailability.

Suggested change
const TRANSIENT_CH_ERROR_PATTERNS = [
"socket connection was closed",
"econnreset",
"econnrefused",
"network error",
"connection reset",
];
const TRANSIENT_CH_ERROR_PATTERNS = [
"socket connection was closed",
"econnreset",
"network error",
"connection reset",
];

Comment on lines +282 to +285
} catch (e) {
if (attempt === 0 && isTransientChError(e)) {
continue;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 When the first attempt fails with a transient error and the retry succeeds, no telemetry is emitted for the retry event. Without this, it's impossible to know in production how often the retry path is being exercised, whether it's masking a degraded ClickHouse pool, or if the retry rate is trending upward over time. A single mergeWideEvent call before the continue gives full visibility at essentially zero cost.

Suggested change
} catch (e) {
if (attempt === 0 && isTransientChError(e)) {
continue;
}
} catch (e) {
if (attempt === 0 && isTransientChError(e)) {
mergeWideEvent({ query_transient_retry: 1, query_type: req.type });
continue;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants