feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497] by phacops · Pull Request #8060 · getsentry/snuba

phacops · 2026-06-18T09:04:33Z

Linear: EAP-497 — Evaluate clickhouse-driver upgrade

Summary

Introduces an HTTP-based ClickHouse client backed by clickhouse-connect as a drop-in alternative to the native clickhouse-driver protocol, selectable at runtime.

The goal is to migrate off the native TCP protocol and onto HTTP via clickhouse-connect, reusing the connection pooling and retry machinery that library gives us for free, and to do it incrementally so it can be rolled out (and rolled back) without a deploy.

Architecture

The two pools derive from an abstract base, so the native and HTTP drivers are siblings. There is a single, driver-agnostic reader.

Pools (ClickhousePool is the abstract base — the name is kept as the abstract type so existing annotations are unchanged):

ClickhousePool — abstract base (ABC) declaring execute / execute_robust / close and the common attributes.
ClickhouseNativePool (native.py) — native protocol implementation (the renamed former ClickhousePool).
ClickhouseConnectPool (connect.py) — HTTP via clickhouse-connect; connects on the node's HTTP port (default 8123, env CLICKHOUSE_HTTP_PORT) — the same port HTTPBatchWriter uses for inserts.

Reader:

ClickhouseReader (native.py) — the one reader. It adapts ClickhouseResult → Result and is driver-agnostic: it wraps the abstract ClickhousePool, so the same reader serves both drivers (both return the same ClickhouseResult). No native/HTTP reader split is needed.

Node: ClickhouseNode carries both native_port and an optional http_port, so a node is self-describing for either driver. Cluster-produced nodes (the query node and those discovered via system.clusters, which only reports the native port) are stamped with the cluster's HTTP port.

Acquisition flow: ConnectionCache is the single place pools are instantiated and the single place the driver is selected. Every caller goes through it and gets one shared, cached, runtime-selected pool behind the abstract ClickhousePool type:

ClickhouseCluster holds the ConnectionCache; get_node_connection() delegates to ConnectionCache.get_node_connection(), and get_reader()/get_deleter() wrap the returned pool in ClickhouseReader. The reader is built per query (get_reader() is called per execution), so it picks up the current driver each time and just holds the abstract pool — it does not touch the cache itself.
The admin by-host helpers (admin/clickhouse/common.py) and the cleanup / optimize / querylog_to_csv CLIs also acquire their pools through ConnectionCache.get_node_connection() rather than constructing one directly, so they get the same runtime-selected driver and the same shared singleton.

ConnectionCache.get_node_connection() reads the use_clickhouse_connect_driver runtime config itself and constructs the concrete ClickhouseNativePool / ClickhouseConnectPool accordingly (connecting on the node's native_port or http_port), caching both variants side by side (the driver is part of the cache key). Because every connection (reads via get_query_connection/get_reader, and migrations, replacer, optimize, admin, CLIs …) goes through get_node_connection, the choice applies everywhere, and ConnectionCache only ever exposes the abstract ClickhousePool.

Class hierarchy

classDiagram
    class Reader {
        <<abstract>>
    }
    class ClickhouseReader {
        driver-agnostic
        ClickhouseResult to Result
    }
    Reader <|-- ClickhouseReader

    class ClickhousePool {
        <<abstract>>
        execute / execute_robust / close
    }
    class ClickhouseNativePool {
        native protocol (clickhouse-driver)
    }
    class ClickhouseConnectPool {
        HTTP protocol (clickhouse-connect)
    }
    ClickhousePool <|-- ClickhouseNativePool
    ClickhousePool <|-- ClickhouseConnectPool

    ClickhouseReader o-- ClickhousePool : client (abstract)

    class ConnectionCache {
        single place pools are instantiated
        selects driver from runtime config
        get_node_connection returns abstract ClickhousePool
    }
    class ClickhouseCluster {
        get_node_connection delegates to cache
        get_reader wraps pool in ClickhouseReader
    }
    class AdminAndCliHelpers {
        admin common.py
        cleanup / optimize / querylog CLIs
    }
    ClickhouseCluster --> ConnectionCache : holds / gets pool via
    AdminAndCliHelpers ..> ConnectionCache : gets pool via
    ConnectionCache ..> ClickhousePool : creates and caches
    ClickhouseCluster ..> ClickhouseReader : wraps pool in

ClickhouseCluster (and the admin/CLI helpers) only ever touch the abstract ClickhousePool and the single ClickhouseReader; the native-vs-connect choice lives entirely inside ConnectionCache.

Other changes:

snuba/settings/__init__.py — new USE_CLICKHOUSE_CONNECT_DRIVER default (off).
pyproject.toml / uv.lock — add the clickhouse-connect dependency (+ lz4, zstandard) and mypy overrides.
Pool acquisition funneled through ConnectionCache — the cluster, the admin common.py helpers, and the cleanup/optimize/querylog CLIs all get pools via ConnectionCache.get_node_connection, so the driver is runtime-selected behind the abstract type everywhere; only the cache constructs ClickhouseNativePool / ClickhouseConnectPool. The admin credential-leak AST regression guard now treats both direct pool construction and cache acquisition (get_node_connection) as "acquiring a pool", so _validate_node must still run first, and the admin connection caches are driver-aware so runtime switching reaches admin too.

Timeouts

Timeouts are driven by the per-profile ClickhouseClientSettings timeout, applied identically by both drivers (the connect pool honors it as-is — no blanket cap):

Read queries (QUERY profile): 30s — set in the shared profile, so it applies to both drivers.
Migrations (MIGRATE), OPTIMIZE, and other long-running operations keep their existing (default or longer) timeouts.
Profiles without a timeout fall back to a 300s default on the HTTP path.

Note: setting the read timeout in the QUERY profile also gives the native read path a 30s socket timeout (previously unset) — intended, but a behavior change for native worth calling out.

Exception handling

Every clickhouse-connect error inherits from ClickHouseError (the analog of clickhouse_driver's errors.Error); the connect pool catches that base — DatabaseError, ProgrammingError, DataError, etc. — and re-raises a snuba ClickhouseError preserving the server code, exactly as the native pool wraps the errors.Error family. Connection/transport failures (OperationalError) additionally emit the connection_error metric.

Connection pool sizing

The HTTP (urllib3) pool size is always taken from the clickhouse_connect_pool_size runtime config, falling back to CLICKHOUSE_MAX_POOL_SIZE (same default as native). It is not a per-caller construction parameter: there is a single shared pool per connection (instantiated by ConnectionCache), sized from the config and read once when the cached client is first created — so the size can be tuned at runtime.

Retry behavior

The connect pool does not reimplement the native pool's retry logic — all retries are delegated to clickhouse-connect (stale keep-alive sockets, transport errors, HTTP 429/503/504). Deliberately dropped: the native TOO_MANY_SIMULTANEOUS_QUERIES (code 202) backoff — over HTTP ClickHouse returns it as a 500 + X-ClickHouse-Exception-Code: 202 header, which clickhouse-connect doesn't retry, so it's surfaced directly as a ClickhouseError preserving the code. execute_robust is therefore a thin alias for execute; retryable is kept only for interface parity.

How to roll out

Controlled by the use_clickhouse_connect_driver runtime config (default USE_CLICKHOUSE_CONNECT_DRIVER, off). get_node_connection() reads it per call (≤10s config memoization), so flipping it takes effect without a restart and applies to every connection.

Runtime configs added

use_clickhouse_connect_driver — route connections through clickhouse-connect (HTTP). Default off.
clickhouse_connect_pool_size — the HTTP connection pool size. Default CLICKHOUSE_MAX_POOL_SIZE.

Dependency / lockfile

clickhouse-connect (1.3.0, plus lz4 and zstandard) is now in the internal PyPI mirror (getsentry/pypi#2273), and uv.lock is regenerated to include it — so the dependency-install CI jobs can resolve it.

Known follow-ups (from automated review)

WITH TOTALS over HTTP is covered by a unit test (test_with_totals_handled_over_http): clickhouse-connect's Native-format reader concatenates the trailing totals block into result_set, so the trailing row is the totals row and the driver-agnostic reader's "totals = last row" assumption holds on the HTTP path too. Worth a final confirmation against a live server before relying on it. (If HTTP totals ever turn out to differ, that's the point at which a connect-specific reader subclass would be reintroduced.)
Tracing / trace_output: the admin API is driver-agnostic for query execution, but the admin trace view and its profile-events parsing depend on trace_output (server send_logs_level logs). clickhouse-connect doesn't surface those (it only reads the X-ClickHouse-Summary header for the profile), so trace_output is empty on the HTTP path and those two admin features return nothing when the connect driver is enabled. Accepted for now (documented in connect.py); reconstructing traces over HTTP would mean querying system.text_log by query_id — a separate feature.
No fail-fast pool-get timeout equivalent (CLICKHOUSE_POOL_GET_TIMEOUT_SECONDS) on the HTTP path.

Testing

New tests/clickhouse/test_connect.py (incl. result/profile mapping, error wrapping, timeout passthrough, pool sizing, the totals-over-HTTP test, and the type-name→reader-transform test), a driver-selection test in tests/clusters/test_cluster.py, the admin credential-leak AST guard, and existing native/reader tests pass.
mypy (strict) and ruff pass on the changed files (and the CI pre-commit check).
A live-ClickHouse-over-HTTP end-to-end test was not possible in the dev sandbox; the clickhouse-connect API usage was validated against the installed library.

🤖 Generated with Claude Code

… config Introduce an HTTP-based ClickHouse client backed by `clickhouse-connect` as a drop-in alternative to the native `clickhouse-driver` protocol. - New `ClickhouseConnectPool` (snuba/clickhouse/connect.py) exposes the same `execute`/`execute_robust`/`close` interface as `ClickhousePool` and returns the same `ClickhouseResult`, so callers and the `NativeDriverReader` need no changes. - It relies on clickhouse-connect's built-in urllib3 connection pool (sized to `max_pool_size`) instead of maintaining our own LifoQueue, with sessions disabled so the shared client can serve concurrent queries. - `ClickhousePool` transparently dispatches each query to the HTTP pool when the `use_clickhouse_connect_driver` runtime config is enabled and an HTTP port is known. The default comes from the new `USE_CLICKHOUSE_CONNECT_DRIVER` setting (off by default), so the migration can be rolled out and rolled back without a deploy. - The cluster's HTTP port is threaded through `ConnectionCache` so the main query/read path can select the driver at runtime. - Error and concurrency-limit retry semantics mirror the native pool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…nect Drop the hand-rolled retry logic from ClickhouseConnectPool and rely solely on clickhouse-connect's built-in retry behavior (stale keep-alive sockets, transport errors, and HTTP 429/503/504). This also abandons the native pool's TOO_MANY_SIMULTANEOUS_QUERIES backoff for the HTTP path: clickhouse-connect does not retry that error (ClickHouse returns it as an HTTP 500 with an exception-code header, not a 429/503/504), so it is now surfaced directly to the caller as a ClickhouseError that preserves the code. execute_robust becomes a thin alias for execute, and the `retryable` argument is kept only for interface parity with ClickhousePool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

- Wrap the full clickhouse-connect error family in the connect pool: catch the base ClickHouseError (DatabaseError, ProgrammingError, DataError, ...) in addition to OperationalError, and surface everything as a snuba ClickhouseError preserving the server code, mirroring how ClickhousePool wraps the clickhouse_driver errors.Error family. OperationalError still emits the connection_error metric like the native NetworkError path. - Cap the connect/send-receive timeouts at 30s. Settings profiles such as MIGRATE carry very large timeouts that are inappropriate for the HTTP query path, so they are clamped down (and an unset timeout defaults to 30s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

The clickhouse-connect HTTP pool defaults to CLICKHOUSE_MAX_POOL_SIZE, the same setting used by the native pool. Add a `clickhouse_connect_pool_size` runtime config that overrides the urllib3 pool's maxsize when set. The value is read once when the (cached) client is first created. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

cursor · 2026-06-18T09:24:36Z

+            results=results,
+            profile=profile_data,
+            trace_output="",
+        )


Debug trace logs always empty

Medium Severity

On the HTTP path, ClickhouseResult.trace_output is always set to an empty string even when per-query settings include send_logs_level trace. The native pool captures driver trace logs in that case and returns them in trace_output, which RPC debug metadata relies on.

^{Reviewed by Cursor Bugbot for commit 2cfaab9. Configure here.}

Keep snuba/clickhouse/native.py purely the native driver — it no longer knows about clickhouse-connect. The choice of which driver to use is made one level higher, in the connection cache. - native.py is reverted to its original state (no http_port, no dispatch). - ClickhouseConnectPool now subclasses ClickhousePool and overrides execute/execute_robust/close, so it is a true drop-in with no type changes rippling through the callers (the dependency direction is connect -> native, which is correct). - ConnectionCache.get_node_connection reads the use_clickhouse_connect_driver runtime config (via the new cluster.use_clickhouse_connect_driver helper) and builds either a native or an HTTP pool, keyed by driver in the cache. - get_reader/get_deleter no longer cache their reader so the runtime toggle takes effect on the read path without a restart; the connections themselves remain cached by the connection cache. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

Replace the per-call, cache-key-driven driver choice with a single DriverSelectingPool that owns both the native and the clickhouse-connect pool and routes each query to one of them based on the use_clickhouse_connect_driver runtime config. - Both underlying pools are now created once per connection target and cached together inside the selecting pool. Flipping the runtime config no longer creates a pool on the fly; it only changes which already-built pool a query is routed to. - DriverSelectingPool subclasses ClickhousePool (overriding execute/execute_robust/close), so the connection cache, the reader and all direct callers keep their existing ClickhousePool typing. The reader needs no driver knowledge: it just wraps the selecting connection. - get_reader/get_deleter cache their reader again, since dynamic driver switching now lives in the pool. - The use_clickhouse_connect_driver helper moves to snuba.clickhouse.connect alongside the pools that use it. native.py remains byte-for-byte identical to master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

The connection cache no longer splits between returning a bare ClickhousePool and a DriverSelectingPool. It always builds a DriverSelectingPool and lets that single object decide where each query goes. The selecting pool's connect pool is now optional: when the HTTP port is unknown the pool is built with connect_pool=None and every query is routed to the native pool. This keeps all routing logic in one place instead of splitting it between the cache and the pool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

cursor · 2026-06-18T09:57:25Z

+                settings,
+                columnar,
+                capture_trace,
+            )


HTTP pool lacks get timeout

Medium Severity

ClickhouseConnectPool.execute sends every query through a shared HTTP client with no equivalent of the native pool’s bounded queue and CLICKHOUSE_POOL_GET_TIMEOUT_SECONDS fail-fast. Under load or a slow ClickHouse, workers can block waiting for urllib3 pool slots instead of failing after the configured pool-get timeout like ClickhousePool does.

^{Reviewed by Cursor Bugbot for commit 1e8affa. Configure here.}

cursor · 2026-06-18T09:57:25Z

+            # native pool likewise wraps the whole clickhouse_driver errors.Error
+            # family into ClickhouseError, preserving the server error code when
+            # there is one.
+            raise ClickhouseError(str(e), code=getattr(e, "code", None) or -1) from e


Zero error code becomes negative

Low Severity

HTTP error wrapping uses getattr(e, "code", None) or -1, so a legitimate ClickHouse error code of 0 is stored as -1 instead of 0, unlike the native pool which passes through e.code directly.

^{Reviewed by Cursor Bugbot for commit 1e8affa. Configure here.}

The clickhouse-connect driver no longer takes a configurable HTTP port; it always connects on the default ClickHouse HTTP port (8123). This removes the http_port plumbing from ConnectionCache.get_node_connection and the cluster, and lets the connect pool always be built, so DriverSelectingPool's connect pool is no longer optional. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

Replace the DriverSelectingPool with a dedicated HTTPDriverReader that sits alongside NativeDriverReader, and move the driver choice up to ClickhouseCluster.get_reader. - HTTPDriverReader (connect.py) subclasses NativeDriverReader and wraps a ClickhouseConnectPool. Result handling is identical (both drivers return a ClickhouseResult); it exists as its own type so the driver is explicit. - get_reader picks NativeDriverReader or HTTPDriverReader based on the use_clickhouse_connect_driver runtime config; both readers are built at most once and cached, so flipping the config only changes which is returned. - The HTTP driver is now scoped to the read path only. Direct callers (migrations, replacer, optimize, ...) keep using the native pool returned by get_node_connection. - ConnectionCache caches native and HTTP pools side by side (driver in the cache key) via get_node_connection / get_http_node_connection. - use_clickhouse_connect_driver moves to cluster.py so the native path never imports clickhouse-connect just to read the flag. native.py remains byte-for-byte identical to master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

Replace `== True` with `is True` for the is_single_node assertions flagged by ruff (E712). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…just reads Move the driver choice into ClickhouseCluster.get_node_connection so every caller picks up the runtime config, not only the read path. When use_clickhouse_connect_driver is enabled, get_node_connection (and therefore get_query_connection, migrations, replacer, optimize, ...) returns the clickhouse-connect HTTP pool; otherwise the native pool. get_reader/get_deleter now wrap whatever pool get_node_connection returns in the matching reader type (HTTPDriverReader or NativeDriverReader) via a small __build_reader helper. Readers are no longer cached — they are cheap to build and the config is read on each call, so the driver switches at runtime. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

Stop capping the clickhouse-connect timeout at 30s for every connection. The connect pool now honors the per-profile send_receive_timeout exactly like the native driver (falling back to a 300s default when a profile sets none). Express the read-query timeout policy in the shared QUERY client settings profile: it now uses a 30s timeout, which applies to both the native and the HTTP driver. Migrations (MIGRATE), OPTIMIZE and other long-running operations keep their existing default/longer timeouts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

cursor · 2026-06-18T10:36:36Z

+                    user=user,
+                    password=password,
+                    database=database,
+                    client_settings=client_settings_dict,


None timeout breaks pool default

Medium Severity

ConnectionCache now passes send_receive_timeout=timeout even when the profile timeout is None. That overrides ClickhousePool's intentional 35s default with None (native driver default), changing socket timeouts for profiles like CLEANUP, INSERT, and REPLACE on every deploy—not only when the HTTP driver is enabled.

Additional Locations (1)

snuba/clickhouse/connect.py#L130-L135

^{Reviewed by Cursor Bugbot for commit a87be31. Configure here.}

Introduce a proper class hierarchy so the native and HTTP drivers are siblings rather than one extending the other: - ClickhousePool is now an abstract base (ABC) declaring the execute/execute_robust/close interface. The native implementation moves to a new ClickhouseNativePool(ClickhousePool); ClickhouseConnectPool also derives from the abstract base. The name ClickhousePool is kept as the abstract type so the ~50 type annotations referencing it are unchanged; only construction sites switch to ClickhouseNativePool. - Readers: a shared ClickhouseReader(Reader) holds the result-handling logic, with NativeDriverReader and HTTPDriverReader as thin siblings (HTTPDriverReader no longer subclasses NativeDriverReader). Updated all ClickhouseNativePool construction sites (cluster cache, admin common, cleanup/optimize/querylog CLIs, tests/fakes) and the admin credential-leak AST guard + mocks to target ClickhouseNativePool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

get_reader and get_deleter previously read use_clickhouse_connect_driver twice (once when resolving the pool, once when picking the reader type), so a config flip between the reads could pair an HTTPDriverReader with a native pool (or vice versa). Resolve the flag once and thread it to both get_node_connection and the reader builder so they always agree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

cursor · 2026-06-18T11:39:49Z

+            query_result = client.query(
+                query,
+                parameters=params if params else None,
+                settings=query_settings,


HTTP params placeholder mismatch

High Severity

The HTTP pool forwards parameters to clickhouse-connect while Snuba SQL still uses %(name)s placeholders written for clickhouse-driver. clickhouse-connect expects {name:Type} in the query text, so parameterized statements can fail or run with wrong bindings once use_clickhouse_connect_driver is enabled (e.g. system.clusters discovery on multi-node clusters).

Additional Locations (1)

snuba/clusters/cluster.py#L536-L541

^{Reviewed by Cursor Bugbot for commit fac0556. Configure here.}

… stray dump.rdb - FakeClickhouseCluster.get_node_connection now accepts the use_connect parameter added to ClickhouseCluster.get_node_connection, fixing the mypy override-incompatibility caught by the pre-commit hook. - Remove dump.rdb (a Redis dump accidentally committed) from the repo and gitignore it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…s abstract Add ClickhousePool.get_reader() so each pool returns its matching reader: ClickhouseNativePool -> NativeDriverReader, ClickhouseConnectPool -> HTTPDriverReader. ClickhouseCluster.get_reader/get_deleter now just call pool.get_reader(), so the cluster only ever touches the abstract ClickhousePool and Reader -- it no longer imports or constructs the concrete reader classes. This also removes the cluster's __build_reader and the use_connect threading: because the reader is built from the pool, the reader and pool can never disagree about the driver, so the double config read (and its race) is gone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

linear-code · 2026-06-18T12:04:00Z

EAP-497

…te uv.lock The two reader subclasses were identical -- ClickhouseReader is driver-agnostic (it only uses the abstract ClickhousePool and transforms ClickhouseResult). So drop NativeDriverReader/HTTPDriverReader and ClickhousePool.get_reader(): the cluster wraps whichever pool get_node_connection selects in the single ClickhouseReader. The native-vs-HTTP choice stays at the pool level (ConnectionCache), and ClickhouseCluster touches only the abstract ClickhousePool plus the one reader. Also regenerate uv.lock now that clickhouse-connect (+ lz4, zstandard) is available in the internal PyPI mirror (getsentry/pypi#2273), which unblocks the dependency-install CI jobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

The HTTP (clickhouse-connect) pool connects on http_port rather than the native node.port, but http_port was not part of the ConnectionCache key. Two clusters sharing the same node, credentials and settings but using different HTTP ports could collide on the same cached pool and silently reuse the wrong port. Add http_port to the cache key (and CacheKey type). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…onCache The admin _build_validated_pool helper and the cleanup/optimize/querylog CLIs constructed a ClickhouseNativePool directly, hardcoding the native driver and bypassing the runtime native-vs-connect selection that the cluster's connections go through. Funnel all of them through ConnectionCache.get_node_connection so the driver is chosen by the use_clickhouse_connect_driver runtime config and every caller only ever handles the abstract ClickhousePool type. The cache now accepts an optional max_pool_size (added to the cache key) so admin can keep its pool of 2; the OPTIMIZE/CLEANUP timeouts ride along via the client settings profile the cache already reads. The admin EAP-488 guard now treats both direct pool construction and cache acquisition (get_node_connection) as "acquiring a pool", so _validate_node must still run first; its tests are updated accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…n cache Pool sizing is no longer a per-caller concern: the connect pool always takes its size from the clickhouse_connect_pool_size runtime config (falling back to CLICKHOUSE_MAX_POOL_SIZE), so it is tunable at runtime and there is just one shared pool per connection, instantiated by ConnectionCache. Drop the max_pool_size plumbing added to ConnectionCache and the admin pool-of-2 override, and remove the now-unused max_pool_size constructor arg from ClickhouseConnectPool. Make the admin connection caches (NODE_CONNECTIONS / CLUSTER_CONNECTIONS) driver-aware: now that admin pools are driver-selected, the cache keys must include the driver token, otherwise flipping use_clickhouse_connect_driver would keep returning a pool pinned to the previously cached driver until restart. Fixes the regression flagged on the admin helpers. querylog_to_csv reads CLICKHOUSE_HTTP_PORT (the same env the cluster config uses) instead of hardcoding the default HTTP port. Add a focused test asserting WITH TOTALS works over the HTTP driver: the connect pool surfaces the totals block as the trailing row and the driver-agnostic ClickhouseReader splits it out as totals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

… close ConnectionCache.get_node_connection no longer takes a use_connect parameter: it reads the use_clickhouse_connect_driver runtime flag itself, so the driver decision lives in exactly one place. Callers (cluster, admin helper, the cleanup/optimize/querylog CLIs) drop the use_connect=use_clickhouse_connect_driver() boilerplate, and the now-dead use_connect override param is removed from the test FakeClickhouseCluster. Also take the pool's lock in ClickhouseConnectPool.close() so teardown can't race with the lazy client init in _get_client(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…nsforms clickhouse-connect's column_type.name returns the canonical ClickHouse type strings (identical to the native driver), so the reader's Date/DateTime/UUID regex transforms match identically on the HTTP path. Add an end-to-end test through the real connect pool + ClickhouseReader using clickhouse-connect's actual type names (including parametrized DateTime('UTC') and Nullable(UUID)) asserting the values are transformed, so a future format change would be caught instead of silently skipping transformations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…port Rename ClickhouseNode.port to native_port and add an optional http_port, so a node is self-describing for either driver. The connection cache no longer takes a separate http_port argument: it reads node.native_port for the native pool and node.http_port for the connect pool, and the node (which now includes http_port in its identity) subsumes the previous standalone http_port cache-key element. Cluster-produced nodes (the query node and those discovered via system.clusters, which only reports the native port) are stamped with the cluster's http_port. The admin/CLI by-host helpers pass http_port when constructing their node. Nodes built outside a cluster (tests, the replacer load balancer) leave http_port as None since they never open HTTP connections. Field accesses of node.port are updated to node.native_port across the cluster, admin, and migrations code; positional ClickhouseNode(...) constructions are unaffected by the rename. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

…river-migration-iarw98 # Conflicts: # uv.lock

Pre-commit mypy (no-implicit-reexport, full test coverage) flagged two issues the source-only runs missed: - test_connect.py: cast the FakeFormattedQuery stubs to FormattedQuery when calling ClickhouseReader.execute. - test_system_queries.py: import the connection_cache singleton from snuba.clusters.cluster instead of reaching through snuba.admin.clickhouse.common (which doesn't re-export it); patching the method on the shared object still covers the reference bound inside common. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

The admin API is driver-agnostic for query execution, but the trace view and its profile-events parsing depend on trace_output, which clickhouse-connect cannot capture (it only reads the X-ClickHouse-Summary header). Document this accepted limitation where trace logs are requested and where trace_output is emitted, noting that the system.text_log-by-query_id reconstruction would be a separate feature. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 6 total unresolved issues (including 5 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 9a71608. Configure here.}

cursor · 2026-06-18T13:42:02Z

 ) -> ClickhousePool:
    storage = _get_storage(storage_name)

-    key = f"{storage.get_storage_key()}-{clickhouse_host}"


Admin pool size limit removed

Medium Severity

Admin ClickHouse connections no longer cap the native pool at two connections. _build_validated_pool now uses the shared connection_cache, which builds pools with the default CLICKHOUSE_MAX_POOL_SIZE (25) or the HTTP runtime pool size, so targeted admin nodes can see far more simultaneous connections than before.

^{Reviewed by Cursor Bugbot for commit 9a71608. Configure here.}

This is intentional, not an oversight. Earlier in the PR the admin helper built its own ClickhouseNativePool(..., max_pool_size=2); we deliberately removed that per-caller cap so admin acquires the one shared, runtime-config-sized pool from ConnectionCache like every other caller (the connect pool always sizes from clickhouse_connect_pool_size, the native pool from CLICKHOUSE_MAX_POOL_SIZE). Keeping a hardcoded 2 would mean reintroducing the per-caller max_pool_size plumbing through the cache that we just took out, and would give admin a different-sized pool than everything else hitting the same node.

Admin is low-traffic (human operators running system queries / tracing), so sharing the standard sizing is acceptable, and pool size is now runtime-tunable via clickhouse_connect_pool_size if it ever needs limiting. Leaving as-is.

Generated by Claude Code

The QUERY profile timeout went None->30 to give user-facing reads a 30s cap, but the QUERY profile is also borrowed by non-read callers that can legitimately run longer: cluster topology discovery (system.clusters), storage-routing load lookups, delete-throttling system-table checks, the span-export manual job, and admin table copies. Capping those at 30s could break them on large/loaded clusters. Add an unbounded INTERNAL client-settings profile and move those callers onto it, restoring their pre-PR (unbounded) behavior, while the user read path (get_reader -> QUERY) keeps the 30s cap. Add a test asserting QUERY is 30s and INTERNAL is unbounded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

phacops requested a review from a team as a code owner June 18, 2026 09:04

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/native.py Outdated

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py Outdated

Comment thread snuba/clickhouse/native.py Outdated

claude added 2 commits June 18, 2026 09:14

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py

cursor Bot reviewed Jun 18, 2026

View reviewed changes

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py Outdated

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py Outdated

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py Outdated

claude added 2 commits June 18, 2026 10:15

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py Outdated

cursor Bot reviewed Jun 18, 2026

View reviewed changes

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py

phacops requested a review from a team as a code owner June 18, 2026 11:15

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py

cursor Bot reviewed Jun 18, 2026

View reviewed changes

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py Outdated

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread tests/clusters/fake_cluster.py

phacops changed the title ~~feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config~~ feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497] Jun 18, 2026

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py

claude added 2 commits June 18, 2026 12:15

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/admin/clickhouse/common.py Outdated

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/admin/clickhouse/common.py

Comment thread snuba/cli/querylog_to_csv.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py Outdated

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clickhouse/connect.py

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/cli/optimize.py Outdated

claude and others added 5 commits June 18, 2026 13:21

Merge remote-tracking branch 'origin/master' into claude/clickhouse-d…

c485a37

…river-migration-iarw98 # Conflicts: # uv.lock

[getsentry/action-github-commit] Auto commit

58ac330

cursor Bot reviewed Jun 18, 2026

View reviewed changes

sentry Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread snuba/clusters/cluster.py Outdated

Uh oh!

Conversation

phacops commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Class hierarchy

Timeouts

Exception handling

Connection pool sizing

Retry behavior

How to roll out

Runtime configs added

Dependency / lockfile

Known follow-ups (from automated review)

Testing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

Debug trace logs always empty

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

HTTP pool lacks get timeout

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

Zero error code becomes negative

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

None timeout breaks pool default

Uh oh!

Uh oh!

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

HTTP params placeholder mismatch

Uh oh!

Uh oh!

Uh oh!

linear-code Bot commented Jun 18, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

Admin pool size limit removed

Uh oh!

phacops Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

phacops commented Jun 18, 2026 •

edited

Loading