feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497]#8060
feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497]#8060phacops wants to merge 30 commits into
Conversation
… config Introduce an HTTP-based ClickHouse client backed by `clickhouse-connect` as a drop-in alternative to the native `clickhouse-driver` protocol. - New `ClickhouseConnectPool` (snuba/clickhouse/connect.py) exposes the same `execute`/`execute_robust`/`close` interface as `ClickhousePool` and returns the same `ClickhouseResult`, so callers and the `NativeDriverReader` need no changes. - It relies on clickhouse-connect's built-in urllib3 connection pool (sized to `max_pool_size`) instead of maintaining our own LifoQueue, with sessions disabled so the shared client can serve concurrent queries. - `ClickhousePool` transparently dispatches each query to the HTTP pool when the `use_clickhouse_connect_driver` runtime config is enabled and an HTTP port is known. The default comes from the new `USE_CLICKHOUSE_CONNECT_DRIVER` setting (off by default), so the migration can be rolled out and rolled back without a deploy. - The cluster's HTTP port is threaded through `ConnectionCache` so the main query/read path can select the driver at runtime. - Error and concurrency-limit retry semantics mirror the native pool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…nect Drop the hand-rolled retry logic from ClickhouseConnectPool and rely solely on clickhouse-connect's built-in retry behavior (stale keep-alive sockets, transport errors, and HTTP 429/503/504). This also abandons the native pool's TOO_MANY_SIMULTANEOUS_QUERIES backoff for the HTTP path: clickhouse-connect does not retry that error (ClickHouse returns it as an HTTP 500 with an exception-code header, not a 429/503/504), so it is now surfaced directly to the caller as a ClickhouseError that preserves the code. execute_robust becomes a thin alias for execute, and the `retryable` argument is kept only for interface parity with ClickhousePool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
- Wrap the full clickhouse-connect error family in the connect pool: catch the base ClickHouseError (DatabaseError, ProgrammingError, DataError, ...) in addition to OperationalError, and surface everything as a snuba ClickhouseError preserving the server code, mirroring how ClickhousePool wraps the clickhouse_driver errors.Error family. OperationalError still emits the connection_error metric like the native NetworkError path. - Cap the connect/send-receive timeouts at 30s. Settings profiles such as MIGRATE carry very large timeouts that are inappropriate for the HTTP query path, so they are clamped down (and an unset timeout defaults to 30s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
The clickhouse-connect HTTP pool defaults to CLICKHOUSE_MAX_POOL_SIZE, the same setting used by the native pool. Add a `clickhouse_connect_pool_size` runtime config that overrides the urllib3 pool's maxsize when set. The value is read once when the (cached) client is first created. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
| results=results, | ||
| profile=profile_data, | ||
| trace_output="", | ||
| ) |
There was a problem hiding this comment.
Debug trace logs always empty
Medium Severity
On the HTTP path, ClickhouseResult.trace_output is always set to an empty string even when per-query settings include send_logs_level trace. The native pool captures driver trace logs in that case and returns them in trace_output, which RPC debug metadata relies on.
Reviewed by Cursor Bugbot for commit 2cfaab9. Configure here.
Keep snuba/clickhouse/native.py purely the native driver — it no longer knows about clickhouse-connect. The choice of which driver to use is made one level higher, in the connection cache. - native.py is reverted to its original state (no http_port, no dispatch). - ClickhouseConnectPool now subclasses ClickhousePool and overrides execute/execute_robust/close, so it is a true drop-in with no type changes rippling through the callers (the dependency direction is connect -> native, which is correct). - ConnectionCache.get_node_connection reads the use_clickhouse_connect_driver runtime config (via the new cluster.use_clickhouse_connect_driver helper) and builds either a native or an HTTP pool, keyed by driver in the cache. - get_reader/get_deleter no longer cache their reader so the runtime toggle takes effect on the read path without a restart; the connections themselves remain cached by the connection cache. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Replace the per-call, cache-key-driven driver choice with a single DriverSelectingPool that owns both the native and the clickhouse-connect pool and routes each query to one of them based on the use_clickhouse_connect_driver runtime config. - Both underlying pools are now created once per connection target and cached together inside the selecting pool. Flipping the runtime config no longer creates a pool on the fly; it only changes which already-built pool a query is routed to. - DriverSelectingPool subclasses ClickhousePool (overriding execute/execute_robust/close), so the connection cache, the reader and all direct callers keep their existing ClickhousePool typing. The reader needs no driver knowledge: it just wraps the selecting connection. - get_reader/get_deleter cache their reader again, since dynamic driver switching now lives in the pool. - The use_clickhouse_connect_driver helper moves to snuba.clickhouse.connect alongside the pools that use it. native.py remains byte-for-byte identical to master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
The connection cache no longer splits between returning a bare ClickhousePool and a DriverSelectingPool. It always builds a DriverSelectingPool and lets that single object decide where each query goes. The selecting pool's connect pool is now optional: when the HTTP port is unknown the pool is built with connect_pool=None and every query is routed to the native pool. This keeps all routing logic in one place instead of splitting it between the cache and the pool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
| settings, | ||
| columnar, | ||
| capture_trace, | ||
| ) |
There was a problem hiding this comment.
HTTP pool lacks get timeout
Medium Severity
ClickhouseConnectPool.execute sends every query through a shared HTTP client with no equivalent of the native pool’s bounded queue and CLICKHOUSE_POOL_GET_TIMEOUT_SECONDS fail-fast. Under load or a slow ClickHouse, workers can block waiting for urllib3 pool slots instead of failing after the configured pool-get timeout like ClickhousePool does.
Reviewed by Cursor Bugbot for commit 1e8affa. Configure here.
| # native pool likewise wraps the whole clickhouse_driver errors.Error | ||
| # family into ClickhouseError, preserving the server error code when | ||
| # there is one. | ||
| raise ClickhouseError(str(e), code=getattr(e, "code", None) or -1) from e |
There was a problem hiding this comment.
Zero error code becomes negative
Low Severity
HTTP error wrapping uses getattr(e, "code", None) or -1, so a legitimate ClickHouse error code of 0 is stored as -1 instead of 0, unlike the native pool which passes through e.code directly.
Reviewed by Cursor Bugbot for commit 1e8affa. Configure here.
The clickhouse-connect driver no longer takes a configurable HTTP port; it always connects on the default ClickHouse HTTP port (8123). This removes the http_port plumbing from ConnectionCache.get_node_connection and the cluster, and lets the connect pool always be built, so DriverSelectingPool's connect pool is no longer optional. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Replace the DriverSelectingPool with a dedicated HTTPDriverReader that sits alongside NativeDriverReader, and move the driver choice up to ClickhouseCluster.get_reader. - HTTPDriverReader (connect.py) subclasses NativeDriverReader and wraps a ClickhouseConnectPool. Result handling is identical (both drivers return a ClickhouseResult); it exists as its own type so the driver is explicit. - get_reader picks NativeDriverReader or HTTPDriverReader based on the use_clickhouse_connect_driver runtime config; both readers are built at most once and cached, so flipping the config only changes which is returned. - The HTTP driver is now scoped to the read path only. Direct callers (migrations, replacer, optimize, ...) keep using the native pool returned by get_node_connection. - ConnectionCache caches native and HTTP pools side by side (driver in the cache key) via get_node_connection / get_http_node_connection. - use_clickhouse_connect_driver moves to cluster.py so the native path never imports clickhouse-connect just to read the flag. native.py remains byte-for-byte identical to master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Replace `== True` with `is True` for the is_single_node assertions flagged by ruff (E712). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…just reads Move the driver choice into ClickhouseCluster.get_node_connection so every caller picks up the runtime config, not only the read path. When use_clickhouse_connect_driver is enabled, get_node_connection (and therefore get_query_connection, migrations, replacer, optimize, ...) returns the clickhouse-connect HTTP pool; otherwise the native pool. get_reader/get_deleter now wrap whatever pool get_node_connection returns in the matching reader type (HTTPDriverReader or NativeDriverReader) via a small __build_reader helper. Readers are no longer cached — they are cheap to build and the config is read on each call, so the driver switches at runtime. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Stop capping the clickhouse-connect timeout at 30s for every connection. The connect pool now honors the per-profile send_receive_timeout exactly like the native driver (falling back to a 300s default when a profile sets none). Express the read-query timeout policy in the shared QUERY client settings profile: it now uses a 30s timeout, which applies to both the native and the HTTP driver. Migrations (MIGRATE), OPTIMIZE and other long-running operations keep their existing default/longer timeouts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
| user=user, | ||
| password=password, | ||
| database=database, | ||
| client_settings=client_settings_dict, |
There was a problem hiding this comment.
None timeout breaks pool default
Medium Severity
ConnectionCache now passes send_receive_timeout=timeout even when the profile timeout is None. That overrides ClickhousePool's intentional 35s default with None (native driver default), changing socket timeouts for profiles like CLEANUP, INSERT, and REPLACE on every deploy—not only when the HTTP driver is enabled.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a87be31. Configure here.
Introduce a proper class hierarchy so the native and HTTP drivers are siblings rather than one extending the other: - ClickhousePool is now an abstract base (ABC) declaring the execute/execute_robust/close interface. The native implementation moves to a new ClickhouseNativePool(ClickhousePool); ClickhouseConnectPool also derives from the abstract base. The name ClickhousePool is kept as the abstract type so the ~50 type annotations referencing it are unchanged; only construction sites switch to ClickhouseNativePool. - Readers: a shared ClickhouseReader(Reader) holds the result-handling logic, with NativeDriverReader and HTTPDriverReader as thin siblings (HTTPDriverReader no longer subclasses NativeDriverReader). Updated all ClickhouseNativePool construction sites (cluster cache, admin common, cleanup/optimize/querylog CLIs, tests/fakes) and the admin credential-leak AST guard + mocks to target ClickhouseNativePool. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
get_reader and get_deleter previously read use_clickhouse_connect_driver twice (once when resolving the pool, once when picking the reader type), so a config flip between the reads could pair an HTTPDriverReader with a native pool (or vice versa). Resolve the flag once and thread it to both get_node_connection and the reader builder so they always agree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
| query_result = client.query( | ||
| query, | ||
| parameters=params if params else None, | ||
| settings=query_settings, |
There was a problem hiding this comment.
HTTP params placeholder mismatch
High Severity
The HTTP pool forwards parameters to clickhouse-connect while Snuba SQL still uses %(name)s placeholders written for clickhouse-driver. clickhouse-connect expects {name:Type} in the query text, so parameterized statements can fail or run with wrong bindings once use_clickhouse_connect_driver is enabled (e.g. system.clusters discovery on multi-node clusters).
Additional Locations (1)
Reviewed by Cursor Bugbot for commit fac0556. Configure here.
… stray dump.rdb - FakeClickhouseCluster.get_node_connection now accepts the use_connect parameter added to ClickhouseCluster.get_node_connection, fixing the mypy override-incompatibility caught by the pre-commit hook. - Remove dump.rdb (a Redis dump accidentally committed) from the repo and gitignore it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…s abstract Add ClickhousePool.get_reader() so each pool returns its matching reader: ClickhouseNativePool -> NativeDriverReader, ClickhouseConnectPool -> HTTPDriverReader. ClickhouseCluster.get_reader/get_deleter now just call pool.get_reader(), so the cluster only ever touches the abstract ClickhousePool and Reader -- it no longer imports or constructs the concrete reader classes. This also removes the cluster's __build_reader and the use_connect threading: because the reader is built from the pool, the reader and pool can never disagree about the driver, so the double config read (and its race) is gone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…te uv.lock The two reader subclasses were identical -- ClickhouseReader is driver-agnostic (it only uses the abstract ClickhousePool and transforms ClickhouseResult). So drop NativeDriverReader/HTTPDriverReader and ClickhousePool.get_reader(): the cluster wraps whichever pool get_node_connection selects in the single ClickhouseReader. The native-vs-HTTP choice stays at the pool level (ConnectionCache), and ClickhouseCluster touches only the abstract ClickhousePool plus the one reader. Also regenerate uv.lock now that clickhouse-connect (+ lz4, zstandard) is available in the internal PyPI mirror (getsentry/pypi#2273), which unblocks the dependency-install CI jobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
The HTTP (clickhouse-connect) pool connects on http_port rather than the native node.port, but http_port was not part of the ConnectionCache key. Two clusters sharing the same node, credentials and settings but using different HTTP ports could collide on the same cached pool and silently reuse the wrong port. Add http_port to the cache key (and CacheKey type). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…onCache The admin _build_validated_pool helper and the cleanup/optimize/querylog CLIs constructed a ClickhouseNativePool directly, hardcoding the native driver and bypassing the runtime native-vs-connect selection that the cluster's connections go through. Funnel all of them through ConnectionCache.get_node_connection so the driver is chosen by the use_clickhouse_connect_driver runtime config and every caller only ever handles the abstract ClickhousePool type. The cache now accepts an optional max_pool_size (added to the cache key) so admin can keep its pool of 2; the OPTIMIZE/CLEANUP timeouts ride along via the client settings profile the cache already reads. The admin EAP-488 guard now treats both direct pool construction and cache acquisition (get_node_connection) as "acquiring a pool", so _validate_node must still run first; its tests are updated accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…n cache Pool sizing is no longer a per-caller concern: the connect pool always takes its size from the clickhouse_connect_pool_size runtime config (falling back to CLICKHOUSE_MAX_POOL_SIZE), so it is tunable at runtime and there is just one shared pool per connection, instantiated by ConnectionCache. Drop the max_pool_size plumbing added to ConnectionCache and the admin pool-of-2 override, and remove the now-unused max_pool_size constructor arg from ClickhouseConnectPool. Make the admin connection caches (NODE_CONNECTIONS / CLUSTER_CONNECTIONS) driver-aware: now that admin pools are driver-selected, the cache keys must include the driver token, otherwise flipping use_clickhouse_connect_driver would keep returning a pool pinned to the previously cached driver until restart. Fixes the regression flagged on the admin helpers. querylog_to_csv reads CLICKHOUSE_HTTP_PORT (the same env the cluster config uses) instead of hardcoding the default HTTP port. Add a focused test asserting WITH TOTALS works over the HTTP driver: the connect pool surfaces the totals block as the trailing row and the driver-agnostic ClickhouseReader splits it out as totals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
… close ConnectionCache.get_node_connection no longer takes a use_connect parameter: it reads the use_clickhouse_connect_driver runtime flag itself, so the driver decision lives in exactly one place. Callers (cluster, admin helper, the cleanup/optimize/querylog CLIs) drop the use_connect=use_clickhouse_connect_driver() boilerplate, and the now-dead use_connect override param is removed from the test FakeClickhouseCluster. Also take the pool's lock in ClickhouseConnectPool.close() so teardown can't race with the lazy client init in _get_client(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…nsforms
clickhouse-connect's column_type.name returns the canonical ClickHouse type
strings (identical to the native driver), so the reader's Date/DateTime/UUID
regex transforms match identically on the HTTP path. Add an end-to-end test
through the real connect pool + ClickhouseReader using clickhouse-connect's
actual type names (including parametrized DateTime('UTC') and Nullable(UUID))
asserting the values are transformed, so a future format change would be
caught instead of silently skipping transformations.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…port Rename ClickhouseNode.port to native_port and add an optional http_port, so a node is self-describing for either driver. The connection cache no longer takes a separate http_port argument: it reads node.native_port for the native pool and node.http_port for the connect pool, and the node (which now includes http_port in its identity) subsumes the previous standalone http_port cache-key element. Cluster-produced nodes (the query node and those discovered via system.clusters, which only reports the native port) are stamped with the cluster's http_port. The admin/CLI by-host helpers pass http_port when constructing their node. Nodes built outside a cluster (tests, the replacer load balancer) leave http_port as None since they never open HTTP connections. Field accesses of node.port are updated to node.native_port across the cluster, admin, and migrations code; positional ClickhouseNode(...) constructions are unaffected by the rename. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…river-migration-iarw98 # Conflicts: # uv.lock
Pre-commit mypy (no-implicit-reexport, full test coverage) flagged two issues the source-only runs missed: - test_connect.py: cast the FakeFormattedQuery stubs to FormattedQuery when calling ClickhouseReader.execute. - test_system_queries.py: import the connection_cache singleton from snuba.clusters.cluster instead of reaching through snuba.admin.clickhouse.common (which doesn't re-export it); patching the method on the shared object still covers the reference bound inside common. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
The admin API is driver-agnostic for query execution, but the trace view and its profile-events parsing depend on trace_output, which clickhouse-connect cannot capture (it only reads the X-ClickHouse-Summary header). Document this accepted limitation where trace logs are requested and where trace_output is emitted, noting that the system.text_log-by-query_id reconstruction would be a separate feature. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 6 total unresolved issues (including 5 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9a71608. Configure here.
| ) -> ClickhousePool: | ||
| storage = _get_storage(storage_name) | ||
|
|
||
| key = f"{storage.get_storage_key()}-{clickhouse_host}" |
There was a problem hiding this comment.
Admin pool size limit removed
Medium Severity
Admin ClickHouse connections no longer cap the native pool at two connections. _build_validated_pool now uses the shared connection_cache, which builds pools with the default CLICKHOUSE_MAX_POOL_SIZE (25) or the HTTP runtime pool size, so targeted admin nodes can see far more simultaneous connections than before.
Reviewed by Cursor Bugbot for commit 9a71608. Configure here.
There was a problem hiding this comment.
This is intentional, not an oversight. Earlier in the PR the admin helper built its own ClickhouseNativePool(..., max_pool_size=2); we deliberately removed that per-caller cap so admin acquires the one shared, runtime-config-sized pool from ConnectionCache like every other caller (the connect pool always sizes from clickhouse_connect_pool_size, the native pool from CLICKHOUSE_MAX_POOL_SIZE). Keeping a hardcoded 2 would mean reintroducing the per-caller max_pool_size plumbing through the cache that we just took out, and would give admin a different-sized pool than everything else hitting the same node.
Admin is low-traffic (human operators running system queries / tracing), so sharing the standard sizing is acceptable, and pool size is now runtime-tunable via clickhouse_connect_pool_size if it ever needs limiting. Leaving as-is.
Generated by Claude Code
The QUERY profile timeout went None->30 to give user-facing reads a 30s cap, but the QUERY profile is also borrowed by non-read callers that can legitimately run longer: cluster topology discovery (system.clusters), storage-routing load lookups, delete-throttling system-table checks, the span-export manual job, and admin table copies. Capping those at 30s could break them on large/loaded clusters. Add an unbounded INTERNAL client-settings profile and move those callers onto it, restoring their pre-PR (unbounded) behavior, while the user read path (get_reader -> QUERY) keeps the 30s cap. Add a test asserting QUERY is 30s and INTERNAL is unbounded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC


Linear: EAP-497 — Evaluate clickhouse-driver upgrade
Summary
Introduces an HTTP-based ClickHouse client backed by
clickhouse-connectas a drop-in alternative to the nativeclickhouse-driverprotocol, selectable at runtime.The goal is to migrate off the native TCP protocol and onto HTTP via
clickhouse-connect, reusing the connection pooling and retry machinery that library gives us for free, and to do it incrementally so it can be rolled out (and rolled back) without a deploy.Architecture
The two pools derive from an abstract base, so the native and HTTP drivers are siblings. There is a single, driver-agnostic reader.
Pools (
ClickhousePoolis the abstract base — the name is kept as the abstract type so existing annotations are unchanged):ClickhousePool— abstract base (ABC) declaringexecute/execute_robust/closeand the common attributes.ClickhouseNativePool(native.py) — native protocol implementation (the renamed formerClickhousePool).ClickhouseConnectPool(connect.py) — HTTP viaclickhouse-connect; connects on the node's HTTP port (default 8123, envCLICKHOUSE_HTTP_PORT) — the same portHTTPBatchWriteruses for inserts.Reader:
ClickhouseReader(native.py) — the one reader. It adaptsClickhouseResult→Resultand is driver-agnostic: it wraps the abstractClickhousePool, so the same reader serves both drivers (both return the sameClickhouseResult). No native/HTTP reader split is needed.Node:
ClickhouseNodecarries bothnative_portand an optionalhttp_port, so a node is self-describing for either driver. Cluster-produced nodes (the query node and those discovered viasystem.clusters, which only reports the native port) are stamped with the cluster's HTTP port.Acquisition flow:
ConnectionCacheis the single place pools are instantiated and the single place the driver is selected. Every caller goes through it and gets one shared, cached, runtime-selected pool behind the abstractClickhousePooltype:ClickhouseClusterholds theConnectionCache;get_node_connection()delegates toConnectionCache.get_node_connection(), andget_reader()/get_deleter()wrap the returned pool inClickhouseReader. The reader is built per query (get_reader()is called per execution), so it picks up the current driver each time and just holds the abstract pool — it does not touch the cache itself.admin/clickhouse/common.py) and thecleanup/optimize/querylog_to_csvCLIs also acquire their pools throughConnectionCache.get_node_connection()rather than constructing one directly, so they get the same runtime-selected driver and the same shared singleton.ConnectionCache.get_node_connection()reads theuse_clickhouse_connect_driverruntime config itself and constructs the concreteClickhouseNativePool/ClickhouseConnectPoolaccordingly (connecting on the node'snative_portorhttp_port), caching both variants side by side (the driver is part of the cache key). Because every connection (reads viaget_query_connection/get_reader, and migrations, replacer, optimize, admin, CLIs …) goes throughget_node_connection, the choice applies everywhere, andConnectionCacheonly ever exposes the abstractClickhousePool.Class hierarchy
classDiagram class Reader { <<abstract>> } class ClickhouseReader { driver-agnostic ClickhouseResult to Result } Reader <|-- ClickhouseReader class ClickhousePool { <<abstract>> execute / execute_robust / close } class ClickhouseNativePool { native protocol (clickhouse-driver) } class ClickhouseConnectPool { HTTP protocol (clickhouse-connect) } ClickhousePool <|-- ClickhouseNativePool ClickhousePool <|-- ClickhouseConnectPool ClickhouseReader o-- ClickhousePool : client (abstract) class ConnectionCache { single place pools are instantiated selects driver from runtime config get_node_connection returns abstract ClickhousePool } class ClickhouseCluster { get_node_connection delegates to cache get_reader wraps pool in ClickhouseReader } class AdminAndCliHelpers { admin common.py cleanup / optimize / querylog CLIs } ClickhouseCluster --> ConnectionCache : holds / gets pool via AdminAndCliHelpers ..> ConnectionCache : gets pool via ConnectionCache ..> ClickhousePool : creates and caches ClickhouseCluster ..> ClickhouseReader : wraps pool inClickhouseCluster(and the admin/CLI helpers) only ever touch the abstractClickhousePooland the singleClickhouseReader; the native-vs-connect choice lives entirely insideConnectionCache.Other changes:
snuba/settings/__init__.py— newUSE_CLICKHOUSE_CONNECT_DRIVERdefault (off).pyproject.toml/uv.lock— add theclickhouse-connectdependency (+lz4,zstandard) and mypy overrides.ConnectionCache— the cluster, the admincommon.pyhelpers, and the cleanup/optimize/querylog CLIs all get pools viaConnectionCache.get_node_connection, so the driver is runtime-selected behind the abstract type everywhere; only the cache constructsClickhouseNativePool/ClickhouseConnectPool. The admin credential-leak AST regression guard now treats both direct pool construction and cache acquisition (get_node_connection) as "acquiring a pool", so_validate_nodemust still run first, and the admin connection caches are driver-aware so runtime switching reaches admin too.Timeouts
Timeouts are driven by the per-profile
ClickhouseClientSettingstimeout, applied identically by both drivers (the connect pool honors it as-is — no blanket cap):QUERYprofile): 30s — set in the shared profile, so it applies to both drivers.MIGRATE),OPTIMIZE, and other long-running operations keep their existing (default or longer) timeouts.Note: setting the read timeout in the
QUERYprofile also gives the native read path a 30s socket timeout (previously unset) — intended, but a behavior change for native worth calling out.Exception handling
Every clickhouse-connect error inherits from
ClickHouseError(the analog ofclickhouse_driver'serrors.Error); the connect pool catches that base —DatabaseError,ProgrammingError,DataError, etc. — and re-raises a snubaClickhouseErrorpreserving the server code, exactly as the native pool wraps theerrors.Errorfamily. Connection/transport failures (OperationalError) additionally emit theconnection_errormetric.Connection pool sizing
The HTTP (
urllib3) pool size is always taken from theclickhouse_connect_pool_sizeruntime config, falling back toCLICKHOUSE_MAX_POOL_SIZE(same default as native). It is not a per-caller construction parameter: there is a single shared pool per connection (instantiated byConnectionCache), sized from the config and read once when the cached client is first created — so the size can be tuned at runtime.Retry behavior
The connect pool does not reimplement the native pool's retry logic — all retries are delegated to
clickhouse-connect(stale keep-alive sockets, transport errors, HTTP429/503/504). Deliberately dropped: the nativeTOO_MANY_SIMULTANEOUS_QUERIES(code 202) backoff — over HTTP ClickHouse returns it as a500+X-ClickHouse-Exception-Code: 202header, which clickhouse-connect doesn't retry, so it's surfaced directly as aClickhouseErrorpreserving the code.execute_robustis therefore a thin alias forexecute;retryableis kept only for interface parity.How to roll out
Controlled by the
use_clickhouse_connect_driverruntime config (defaultUSE_CLICKHOUSE_CONNECT_DRIVER, off).get_node_connection()reads it per call (≤10s config memoization), so flipping it takes effect without a restart and applies to every connection.Runtime configs added
use_clickhouse_connect_driver— route connections through clickhouse-connect (HTTP). Default off.clickhouse_connect_pool_size— the HTTP connection pool size. DefaultCLICKHOUSE_MAX_POOL_SIZE.Dependency / lockfile
clickhouse-connect(1.3.0, pluslz4andzstandard) is now in the internal PyPI mirror (getsentry/pypi#2273), anduv.lockis regenerated to include it — so the dependency-install CI jobs can resolve it.Known follow-ups (from automated review)
WITH TOTALSover HTTP is covered by a unit test (test_with_totals_handled_over_http): clickhouse-connect's Native-format reader concatenates the trailing totals block intoresult_set, so the trailing row is the totals row and the driver-agnostic reader's "totals = last row" assumption holds on the HTTP path too. Worth a final confirmation against a live server before relying on it. (If HTTP totals ever turn out to differ, that's the point at which a connect-specific reader subclass would be reintroduced.)trace_output: the admin API is driver-agnostic for query execution, but the admin trace view and its profile-events parsing depend ontrace_output(serversend_logs_levellogs). clickhouse-connect doesn't surface those (it only reads theX-ClickHouse-Summaryheader for the profile), sotrace_outputis empty on the HTTP path and those two admin features return nothing when the connect driver is enabled. Accepted for now (documented inconnect.py); reconstructing traces over HTTP would mean queryingsystem.text_logbyquery_id— a separate feature.CLICKHOUSE_POOL_GET_TIMEOUT_SECONDS) on the HTTP path.Testing
tests/clickhouse/test_connect.py(incl. result/profile mapping, error wrapping, timeout passthrough, pool sizing, the totals-over-HTTP test, and the type-name→reader-transform test), a driver-selection test intests/clusters/test_cluster.py, the admin credential-leak AST guard, and existing native/reader tests pass.mypy(strict) andruffpass on the changed files (and the CIpre-commitcheck).clickhouse-connectAPI usage was validated against the installed library.🤖 Generated with Claude Code