Skip to content

feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497]#8060

Open
phacops wants to merge 30 commits into
masterfrom
claude/clickhouse-driver-migration-iarw98
Open

feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497]#8060
phacops wants to merge 30 commits into
masterfrom
claude/clickhouse-driver-migration-iarw98

Conversation

@phacops

@phacops phacops commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Linear: EAP-497 — Evaluate clickhouse-driver upgrade

Summary

Introduces an HTTP-based ClickHouse client backed by clickhouse-connect as a drop-in alternative to the native clickhouse-driver protocol, selectable at runtime.

The goal is to migrate off the native TCP protocol and onto HTTP via clickhouse-connect, reusing the connection pooling and retry machinery that library gives us for free, and to do it incrementally so it can be rolled out (and rolled back) without a deploy.

Architecture

The two pools derive from an abstract base, so the native and HTTP drivers are siblings. There is a single, driver-agnostic reader.

Pools (ClickhousePool is the abstract base — the name is kept as the abstract type so existing annotations are unchanged):

  • ClickhousePool — abstract base (ABC) declaring execute / execute_robust / close and the common attributes.
  • ClickhouseNativePool (native.py) — native protocol implementation (the renamed former ClickhousePool).
  • ClickhouseConnectPool (connect.py) — HTTP via clickhouse-connect; connects on the node's HTTP port (default 8123, env CLICKHOUSE_HTTP_PORT) — the same port HTTPBatchWriter uses for inserts.

Reader:

  • ClickhouseReader (native.py) — the one reader. It adapts ClickhouseResultResult and is driver-agnostic: it wraps the abstract ClickhousePool, so the same reader serves both drivers (both return the same ClickhouseResult). No native/HTTP reader split is needed.

Node: ClickhouseNode carries both native_port and an optional http_port, so a node is self-describing for either driver. Cluster-produced nodes (the query node and those discovered via system.clusters, which only reports the native port) are stamped with the cluster's HTTP port.

Acquisition flow: ConnectionCache is the single place pools are instantiated and the single place the driver is selected. Every caller goes through it and gets one shared, cached, runtime-selected pool behind the abstract ClickhousePool type:

  • ClickhouseCluster holds the ConnectionCache; get_node_connection() delegates to ConnectionCache.get_node_connection(), and get_reader()/get_deleter() wrap the returned pool in ClickhouseReader. The reader is built per query (get_reader() is called per execution), so it picks up the current driver each time and just holds the abstract pool — it does not touch the cache itself.
  • The admin by-host helpers (admin/clickhouse/common.py) and the cleanup / optimize / querylog_to_csv CLIs also acquire their pools through ConnectionCache.get_node_connection() rather than constructing one directly, so they get the same runtime-selected driver and the same shared singleton.

ConnectionCache.get_node_connection() reads the use_clickhouse_connect_driver runtime config itself and constructs the concrete ClickhouseNativePool / ClickhouseConnectPool accordingly (connecting on the node's native_port or http_port), caching both variants side by side (the driver is part of the cache key). Because every connection (reads via get_query_connection/get_reader, and migrations, replacer, optimize, admin, CLIs …) goes through get_node_connection, the choice applies everywhere, and ConnectionCache only ever exposes the abstract ClickhousePool.

Class hierarchy

classDiagram
    class Reader {
        <<abstract>>
    }
    class ClickhouseReader {
        driver-agnostic
        ClickhouseResult to Result
    }
    Reader <|-- ClickhouseReader

    class ClickhousePool {
        <<abstract>>
        execute / execute_robust / close
    }
    class ClickhouseNativePool {
        native protocol (clickhouse-driver)
    }
    class ClickhouseConnectPool {
        HTTP protocol (clickhouse-connect)
    }
    ClickhousePool <|-- ClickhouseNativePool
    ClickhousePool <|-- ClickhouseConnectPool

    ClickhouseReader o-- ClickhousePool : client (abstract)

    class ConnectionCache {
        single place pools are instantiated
        selects driver from runtime config
        get_node_connection returns abstract ClickhousePool
    }
    class ClickhouseCluster {
        get_node_connection delegates to cache
        get_reader wraps pool in ClickhouseReader
    }
    class AdminAndCliHelpers {
        admin common.py
        cleanup / optimize / querylog CLIs
    }
    ClickhouseCluster --> ConnectionCache : holds / gets pool via
    AdminAndCliHelpers ..> ConnectionCache : gets pool via
    ConnectionCache ..> ClickhousePool : creates and caches
    ClickhouseCluster ..> ClickhouseReader : wraps pool in
Loading

ClickhouseCluster (and the admin/CLI helpers) only ever touch the abstract ClickhousePool and the single ClickhouseReader; the native-vs-connect choice lives entirely inside ConnectionCache.

Other changes:

  • snuba/settings/__init__.py — new USE_CLICKHOUSE_CONNECT_DRIVER default (off).
  • pyproject.toml / uv.lock — add the clickhouse-connect dependency (+ lz4, zstandard) and mypy overrides.
  • Pool acquisition funneled through ConnectionCache — the cluster, the admin common.py helpers, and the cleanup/optimize/querylog CLIs all get pools via ConnectionCache.get_node_connection, so the driver is runtime-selected behind the abstract type everywhere; only the cache constructs ClickhouseNativePool / ClickhouseConnectPool. The admin credential-leak AST regression guard now treats both direct pool construction and cache acquisition (get_node_connection) as "acquiring a pool", so _validate_node must still run first, and the admin connection caches are driver-aware so runtime switching reaches admin too.

Timeouts

Timeouts are driven by the per-profile ClickhouseClientSettings timeout, applied identically by both drivers (the connect pool honors it as-is — no blanket cap):

  • Read queries (QUERY profile): 30s — set in the shared profile, so it applies to both drivers.
  • Migrations (MIGRATE), OPTIMIZE, and other long-running operations keep their existing (default or longer) timeouts.
  • Profiles without a timeout fall back to a 300s default on the HTTP path.

Note: setting the read timeout in the QUERY profile also gives the native read path a 30s socket timeout (previously unset) — intended, but a behavior change for native worth calling out.

Exception handling

Every clickhouse-connect error inherits from ClickHouseError (the analog of clickhouse_driver's errors.Error); the connect pool catches that base — DatabaseError, ProgrammingError, DataError, etc. — and re-raises a snuba ClickhouseError preserving the server code, exactly as the native pool wraps the errors.Error family. Connection/transport failures (OperationalError) additionally emit the connection_error metric.

Connection pool sizing

The HTTP (urllib3) pool size is always taken from the clickhouse_connect_pool_size runtime config, falling back to CLICKHOUSE_MAX_POOL_SIZE (same default as native). It is not a per-caller construction parameter: there is a single shared pool per connection (instantiated by ConnectionCache), sized from the config and read once when the cached client is first created — so the size can be tuned at runtime.

Retry behavior

The connect pool does not reimplement the native pool's retry logic — all retries are delegated to clickhouse-connect (stale keep-alive sockets, transport errors, HTTP 429/503/504). Deliberately dropped: the native TOO_MANY_SIMULTANEOUS_QUERIES (code 202) backoff — over HTTP ClickHouse returns it as a 500 + X-ClickHouse-Exception-Code: 202 header, which clickhouse-connect doesn't retry, so it's surfaced directly as a ClickhouseError preserving the code. execute_robust is therefore a thin alias for execute; retryable is kept only for interface parity.

How to roll out

Controlled by the use_clickhouse_connect_driver runtime config (default USE_CLICKHOUSE_CONNECT_DRIVER, off). get_node_connection() reads it per call (≤10s config memoization), so flipping it takes effect without a restart and applies to every connection.

Runtime configs added

  • use_clickhouse_connect_driver — route connections through clickhouse-connect (HTTP). Default off.
  • clickhouse_connect_pool_size — the HTTP connection pool size. Default CLICKHOUSE_MAX_POOL_SIZE.

Dependency / lockfile

clickhouse-connect (1.3.0, plus lz4 and zstandard) is now in the internal PyPI mirror (getsentry/pypi#2273), and uv.lock is regenerated to include it — so the dependency-install CI jobs can resolve it.

Known follow-ups (from automated review)

  • WITH TOTALS over HTTP is covered by a unit test (test_with_totals_handled_over_http): clickhouse-connect's Native-format reader concatenates the trailing totals block into result_set, so the trailing row is the totals row and the driver-agnostic reader's "totals = last row" assumption holds on the HTTP path too. Worth a final confirmation against a live server before relying on it. (If HTTP totals ever turn out to differ, that's the point at which a connect-specific reader subclass would be reintroduced.)
  • Tracing / trace_output: the admin API is driver-agnostic for query execution, but the admin trace view and its profile-events parsing depend on trace_output (server send_logs_level logs). clickhouse-connect doesn't surface those (it only reads the X-ClickHouse-Summary header for the profile), so trace_output is empty on the HTTP path and those two admin features return nothing when the connect driver is enabled. Accepted for now (documented in connect.py); reconstructing traces over HTTP would mean querying system.text_log by query_id — a separate feature.
  • No fail-fast pool-get timeout equivalent (CLICKHOUSE_POOL_GET_TIMEOUT_SECONDS) on the HTTP path.

Testing

  • New tests/clickhouse/test_connect.py (incl. result/profile mapping, error wrapping, timeout passthrough, pool sizing, the totals-over-HTTP test, and the type-name→reader-transform test), a driver-selection test in tests/clusters/test_cluster.py, the admin credential-leak AST guard, and existing native/reader tests pass.
  • mypy (strict) and ruff pass on the changed files (and the CI pre-commit check).
  • A live-ClickHouse-over-HTTP end-to-end test was not possible in the dev sandbox; the clickhouse-connect API usage was validated against the installed library.

🤖 Generated with Claude Code

… config

Introduce an HTTP-based ClickHouse client backed by `clickhouse-connect`
as a drop-in alternative to the native `clickhouse-driver` protocol.

- New `ClickhouseConnectPool` (snuba/clickhouse/connect.py) exposes the
  same `execute`/`execute_robust`/`close` interface as `ClickhousePool`
  and returns the same `ClickhouseResult`, so callers and the
  `NativeDriverReader` need no changes.
- It relies on clickhouse-connect's built-in urllib3 connection pool
  (sized to `max_pool_size`) instead of maintaining our own LifoQueue,
  with sessions disabled so the shared client can serve concurrent
  queries.
- `ClickhousePool` transparently dispatches each query to the HTTP pool
  when the `use_clickhouse_connect_driver` runtime config is enabled and
  an HTTP port is known. The default comes from the new
  `USE_CLICKHOUSE_CONNECT_DRIVER` setting (off by default), so the
  migration can be rolled out and rolled back without a deploy.
- The cluster's HTTP port is threaded through `ConnectionCache` so the
  main query/read path can select the driver at runtime.
- Error and concurrency-limit retry semantics mirror the native pool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
@phacops phacops requested a review from a team as a code owner June 18, 2026 09:04
Comment thread snuba/clickhouse/native.py Outdated
Comment thread snuba/clickhouse/connect.py Outdated
Comment thread snuba/clickhouse/native.py Outdated
claude added 2 commits June 18, 2026 09:14
…nect

Drop the hand-rolled retry logic from ClickhouseConnectPool and rely
solely on clickhouse-connect's built-in retry behavior (stale keep-alive
sockets, transport errors, and HTTP 429/503/504).

This also abandons the native pool's TOO_MANY_SIMULTANEOUS_QUERIES
backoff for the HTTP path: clickhouse-connect does not retry that error
(ClickHouse returns it as an HTTP 500 with an exception-code header, not
a 429/503/504), so it is now surfaced directly to the caller as a
ClickhouseError that preserves the code. execute_robust becomes a thin
alias for execute, and the `retryable` argument is kept only for
interface parity with ClickhousePool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
- Wrap the full clickhouse-connect error family in the connect pool: catch
  the base ClickHouseError (DatabaseError, ProgrammingError, DataError, ...)
  in addition to OperationalError, and surface everything as a snuba
  ClickhouseError preserving the server code, mirroring how ClickhousePool
  wraps the clickhouse_driver errors.Error family. OperationalError still
  emits the connection_error metric like the native NetworkError path.
- Cap the connect/send-receive timeouts at 30s. Settings profiles such as
  MIGRATE carry very large timeouts that are inappropriate for the HTTP
  query path, so they are clamped down (and an unset timeout defaults to 30s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clickhouse/connect.py
Comment thread snuba/clickhouse/connect.py
The clickhouse-connect HTTP pool defaults to CLICKHOUSE_MAX_POOL_SIZE, the
same setting used by the native pool. Add a `clickhouse_connect_pool_size`
runtime config that overrides the urllib3 pool's maxsize when set. The
value is read once when the (cached) client is first created.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clickhouse/native.py Outdated
results=results,
profile=profile_data,
trace_output="",
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug trace logs always empty

Medium Severity

On the HTTP path, ClickhouseResult.trace_output is always set to an empty string even when per-query settings include send_logs_level trace. The native pool captures driver trace logs in that case and returns them in trace_output, which RPC debug metadata relies on.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2cfaab9. Configure here.

Comment thread snuba/clickhouse/connect.py Outdated
Keep snuba/clickhouse/native.py purely the native driver — it no longer
knows about clickhouse-connect. The choice of which driver to use is made
one level higher, in the connection cache.

- native.py is reverted to its original state (no http_port, no dispatch).
- ClickhouseConnectPool now subclasses ClickhousePool and overrides
  execute/execute_robust/close, so it is a true drop-in with no type
  changes rippling through the callers (the dependency direction is
  connect -> native, which is correct).
- ConnectionCache.get_node_connection reads the use_clickhouse_connect_driver
  runtime config (via the new cluster.use_clickhouse_connect_driver helper)
  and builds either a native or an HTTP pool, keyed by driver in the cache.
- get_reader/get_deleter no longer cache their reader so the runtime toggle
  takes effect on the read path without a restart; the connections
  themselves remain cached by the connection cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clusters/cluster.py
Comment thread snuba/clickhouse/connect.py
Replace the per-call, cache-key-driven driver choice with a single
DriverSelectingPool that owns both the native and the clickhouse-connect
pool and routes each query to one of them based on the
use_clickhouse_connect_driver runtime config.

- Both underlying pools are now created once per connection target and
  cached together inside the selecting pool. Flipping the runtime config
  no longer creates a pool on the fly; it only changes which already-built
  pool a query is routed to.
- DriverSelectingPool subclasses ClickhousePool (overriding
  execute/execute_robust/close), so the connection cache, the reader and
  all direct callers keep their existing ClickhousePool typing. The reader
  needs no driver knowledge: it just wraps the selecting connection.
- get_reader/get_deleter cache their reader again, since dynamic driver
  switching now lives in the pool.
- The use_clickhouse_connect_driver helper moves to snuba.clickhouse.connect
  alongside the pools that use it.

native.py remains byte-for-byte identical to master.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clusters/cluster.py Outdated
The connection cache no longer splits between returning a bare
ClickhousePool and a DriverSelectingPool. It always builds a
DriverSelectingPool and lets that single object decide where each query
goes.

The selecting pool's connect pool is now optional: when the HTTP port is
unknown the pool is built with connect_pool=None and every query is routed
to the native pool. This keeps all routing logic in one place instead of
splitting it between the cache and the pool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
settings,
columnar,
capture_trace,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP pool lacks get timeout

Medium Severity

ClickhouseConnectPool.execute sends every query through a shared HTTP client with no equivalent of the native pool’s bounded queue and CLICKHOUSE_POOL_GET_TIMEOUT_SECONDS fail-fast. Under load or a slow ClickHouse, workers can block waiting for urllib3 pool slots instead of failing after the configured pool-get timeout like ClickhousePool does.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 1e8affa. Configure here.

# native pool likewise wraps the whole clickhouse_driver errors.Error
# family into ClickhouseError, preserving the server error code when
# there is one.
raise ClickhouseError(str(e), code=getattr(e, "code", None) or -1) from e

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero error code becomes negative

Low Severity

HTTP error wrapping uses getattr(e, "code", None) or -1, so a legitimate ClickHouse error code of 0 is stored as -1 instead of 0, unlike the native pool which passes through e.code directly.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 1e8affa. Configure here.

The clickhouse-connect driver no longer takes a configurable HTTP port; it
always connects on the default ClickHouse HTTP port (8123). This removes the
http_port plumbing from ConnectionCache.get_node_connection and the cluster,
and lets the connect pool always be built, so DriverSelectingPool's connect
pool is no longer optional.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clusters/cluster.py Outdated
claude added 2 commits June 18, 2026 10:15
Replace the DriverSelectingPool with a dedicated HTTPDriverReader that sits
alongside NativeDriverReader, and move the driver choice up to
ClickhouseCluster.get_reader.

- HTTPDriverReader (connect.py) subclasses NativeDriverReader and wraps a
  ClickhouseConnectPool. Result handling is identical (both drivers return a
  ClickhouseResult); it exists as its own type so the driver is explicit.
- get_reader picks NativeDriverReader or HTTPDriverReader based on the
  use_clickhouse_connect_driver runtime config; both readers are built at
  most once and cached, so flipping the config only changes which is
  returned.
- The HTTP driver is now scoped to the read path only. Direct callers
  (migrations, replacer, optimize, ...) keep using the native pool returned
  by get_node_connection.
- ConnectionCache caches native and HTTP pools side by side (driver in the
  cache key) via get_node_connection / get_http_node_connection.
- use_clickhouse_connect_driver moves to cluster.py so the native path never
  imports clickhouse-connect just to read the flag.

native.py remains byte-for-byte identical to master.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Replace `== True` with `is True` for the is_single_node assertions flagged
by ruff (E712).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clickhouse/connect.py
…just reads

Move the driver choice into ClickhouseCluster.get_node_connection so every
caller picks up the runtime config, not only the read path. When
use_clickhouse_connect_driver is enabled, get_node_connection (and therefore
get_query_connection, migrations, replacer, optimize, ...) returns the
clickhouse-connect HTTP pool; otherwise the native pool.

get_reader/get_deleter now wrap whatever pool get_node_connection returns in
the matching reader type (HTTPDriverReader or NativeDriverReader) via a small
__build_reader helper. Readers are no longer cached — they are cheap to
build and the config is read on each call, so the driver switches at runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clusters/cluster.py Outdated
Stop capping the clickhouse-connect timeout at 30s for every connection.
The connect pool now honors the per-profile send_receive_timeout exactly
like the native driver (falling back to a 300s default when a profile sets
none).

Express the read-query timeout policy in the shared QUERY client settings
profile: it now uses a 30s timeout, which applies to both the native and the
HTTP driver. Migrations (MIGRATE), OPTIMIZE and other long-running
operations keep their existing default/longer timeouts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clusters/cluster.py Outdated
user=user,
password=password,
database=database,
client_settings=client_settings_dict,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None timeout breaks pool default

Medium Severity

ConnectionCache now passes send_receive_timeout=timeout even when the profile timeout is None. That overrides ClickhousePool's intentional 35s default with None (native driver default), changing socket timeouts for profiles like CLEANUP, INSERT, and REPLACE on every deploy—not only when the HTTP driver is enabled.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a87be31. Configure here.

Comment thread snuba/clusters/cluster.py
Introduce a proper class hierarchy so the native and HTTP drivers are
siblings rather than one extending the other:

- ClickhousePool is now an abstract base (ABC) declaring the
  execute/execute_robust/close interface. The native implementation moves to
  a new ClickhouseNativePool(ClickhousePool); ClickhouseConnectPool also
  derives from the abstract base. The name ClickhousePool is kept as the
  abstract type so the ~50 type annotations referencing it are unchanged;
  only construction sites switch to ClickhouseNativePool.
- Readers: a shared ClickhouseReader(Reader) holds the result-handling logic,
  with NativeDriverReader and HTTPDriverReader as thin siblings (HTTPDriverReader
  no longer subclasses NativeDriverReader).

Updated all ClickhouseNativePool construction sites (cluster cache, admin
common, cleanup/optimize/querylog CLIs, tests/fakes) and the admin
credential-leak AST guard + mocks to target ClickhouseNativePool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
@phacops phacops requested a review from a team as a code owner June 18, 2026 11:15
Comment thread snuba/clusters/cluster.py
get_reader and get_deleter previously read use_clickhouse_connect_driver
twice (once when resolving the pool, once when picking the reader type), so a
config flip between the reads could pair an HTTPDriverReader with a native
pool (or vice versa). Resolve the flag once and thread it to both
get_node_connection and the reader builder so they always agree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
query_result = client.query(
query,
parameters=params if params else None,
settings=query_settings,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP params placeholder mismatch

High Severity

The HTTP pool forwards parameters to clickhouse-connect while Snuba SQL still uses %(name)s placeholders written for clickhouse-driver. clickhouse-connect expects {name:Type} in the query text, so parameterized statements can fail or run with wrong bindings once use_clickhouse_connect_driver is enabled (e.g. system.clusters discovery on multi-node clusters).

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fac0556. Configure here.

Comment thread snuba/clusters/cluster.py Outdated
… stray dump.rdb

- FakeClickhouseCluster.get_node_connection now accepts the use_connect
  parameter added to ClickhouseCluster.get_node_connection, fixing the mypy
  override-incompatibility caught by the pre-commit hook.
- Remove dump.rdb (a Redis dump accidentally committed) from the repo and
  gitignore it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread tests/clusters/fake_cluster.py
…s abstract

Add ClickhousePool.get_reader() so each pool returns its matching reader:
ClickhouseNativePool -> NativeDriverReader, ClickhouseConnectPool ->
HTTPDriverReader. ClickhouseCluster.get_reader/get_deleter now just call
pool.get_reader(), so the cluster only ever touches the abstract ClickhousePool
and Reader -- it no longer imports or constructs the concrete reader classes.

This also removes the cluster's __build_reader and the use_connect threading:
because the reader is built from the pool, the reader and pool can never
disagree about the driver, so the double config read (and its race) is gone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
@phacops phacops changed the title feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config feat(clickhouse): Add clickhouse-connect (HTTP) driver behind a runtime config [EAP-497] Jun 18, 2026
@linear-code

linear-code Bot commented Jun 18, 2026

Copy link
Copy Markdown

EAP-497

…te uv.lock

The two reader subclasses were identical -- ClickhouseReader is driver-agnostic
(it only uses the abstract ClickhousePool and transforms ClickhouseResult). So
drop NativeDriverReader/HTTPDriverReader and ClickhousePool.get_reader(): the
cluster wraps whichever pool get_node_connection selects in the single
ClickhouseReader. The native-vs-HTTP choice stays at the pool level
(ConnectionCache), and ClickhouseCluster touches only the abstract ClickhousePool
plus the one reader.

Also regenerate uv.lock now that clickhouse-connect (+ lz4, zstandard) is
available in the internal PyPI mirror (getsentry/pypi#2273), which unblocks the
dependency-install CI jobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clusters/cluster.py
Comment thread snuba/clickhouse/connect.py
claude added 2 commits June 18, 2026 12:15
The HTTP (clickhouse-connect) pool connects on http_port rather than the
native node.port, but http_port was not part of the ConnectionCache key.
Two clusters sharing the same node, credentials and settings but using
different HTTP ports could collide on the same cached pool and silently
reuse the wrong port. Add http_port to the cache key (and CacheKey type).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…onCache

The admin _build_validated_pool helper and the cleanup/optimize/querylog
CLIs constructed a ClickhouseNativePool directly, hardcoding the native
driver and bypassing the runtime native-vs-connect selection that the
cluster's connections go through.

Funnel all of them through ConnectionCache.get_node_connection so the
driver is chosen by the use_clickhouse_connect_driver runtime config and
every caller only ever handles the abstract ClickhousePool type. The cache
now accepts an optional max_pool_size (added to the cache key) so admin can
keep its pool of 2; the OPTIMIZE/CLEANUP timeouts ride along via the client
settings profile the cache already reads.

The admin EAP-488 guard now treats both direct pool construction and cache
acquisition (get_node_connection) as "acquiring a pool", so _validate_node
must still run first; its tests are updated accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/admin/clickhouse/common.py Outdated
Comment thread snuba/admin/clickhouse/common.py
Comment thread snuba/cli/querylog_to_csv.py
…n cache

Pool sizing is no longer a per-caller concern: the connect pool always takes
its size from the clickhouse_connect_pool_size runtime config (falling back
to CLICKHOUSE_MAX_POOL_SIZE), so it is tunable at runtime and there is just
one shared pool per connection, instantiated by ConnectionCache. Drop the
max_pool_size plumbing added to ConnectionCache and the admin pool-of-2
override, and remove the now-unused max_pool_size constructor arg from
ClickhouseConnectPool.

Make the admin connection caches (NODE_CONNECTIONS / CLUSTER_CONNECTIONS)
driver-aware: now that admin pools are driver-selected, the cache keys must
include the driver token, otherwise flipping use_clickhouse_connect_driver
would keep returning a pool pinned to the previously cached driver until
restart. Fixes the regression flagged on the admin helpers.

querylog_to_csv reads CLICKHOUSE_HTTP_PORT (the same env the cluster config
uses) instead of hardcoding the default HTTP port.

Add a focused test asserting WITH TOTALS works over the HTTP driver: the
connect pool surfaces the totals block as the trailing row and the
driver-agnostic ClickhouseReader splits it out as totals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clickhouse/connect.py Outdated
… close

ConnectionCache.get_node_connection no longer takes a use_connect parameter:
it reads the use_clickhouse_connect_driver runtime flag itself, so the driver
decision lives in exactly one place. Callers (cluster, admin helper, the
cleanup/optimize/querylog CLIs) drop the use_connect=use_clickhouse_connect_driver()
boilerplate, and the now-dead use_connect override param is removed from the
test FakeClickhouseCluster.

Also take the pool's lock in ClickhouseConnectPool.close() so teardown can't
race with the lazy client init in _get_client().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/clickhouse/connect.py
…nsforms

clickhouse-connect's column_type.name returns the canonical ClickHouse type
strings (identical to the native driver), so the reader's Date/DateTime/UUID
regex transforms match identically on the HTTP path. Add an end-to-end test
through the real connect pool + ClickhouseReader using clickhouse-connect's
actual type names (including parametrized DateTime('UTC') and Nullable(UUID))
asserting the values are transformed, so a future format change would be
caught instead of silently skipping transformations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Comment thread snuba/cli/optimize.py Outdated
claude and others added 5 commits June 18, 2026 13:21
…port

Rename ClickhouseNode.port to native_port and add an optional http_port, so a
node is self-describing for either driver. The connection cache no longer
takes a separate http_port argument: it reads node.native_port for the native
pool and node.http_port for the connect pool, and the node (which now includes
http_port in its identity) subsumes the previous standalone http_port cache-key
element.

Cluster-produced nodes (the query node and those discovered via system.clusters,
which only reports the native port) are stamped with the cluster's http_port.
The admin/CLI by-host helpers pass http_port when constructing their node.
Nodes built outside a cluster (tests, the replacer load balancer) leave
http_port as None since they never open HTTP connections.

Field accesses of node.port are updated to node.native_port across the cluster,
admin, and migrations code; positional ClickhouseNode(...) constructions are
unaffected by the rename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
…river-migration-iarw98

# Conflicts:
#	uv.lock
Pre-commit mypy (no-implicit-reexport, full test coverage) flagged two issues
the source-only runs missed:
- test_connect.py: cast the FakeFormattedQuery stubs to FormattedQuery when
  calling ClickhouseReader.execute.
- test_system_queries.py: import the connection_cache singleton from
  snuba.clusters.cluster instead of reaching through snuba.admin.clickhouse.common
  (which doesn't re-export it); patching the method on the shared object still
  covers the reference bound inside common.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
The admin API is driver-agnostic for query execution, but the trace view and
its profile-events parsing depend on trace_output, which clickhouse-connect
cannot capture (it only reads the X-ClickHouse-Summary header). Document this
accepted limitation where trace logs are requested and where trace_output is
emitted, noting that the system.text_log-by-query_id reconstruction would be a
separate feature.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 6 total unresolved issues (including 5 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9a71608. Configure here.

) -> ClickhousePool:
storage = _get_storage(storage_name)

key = f"{storage.get_storage_key()}-{clickhouse_host}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admin pool size limit removed

Medium Severity

Admin ClickHouse connections no longer cap the native pool at two connections. _build_validated_pool now uses the shared connection_cache, which builds pools with the default CLICKHOUSE_MAX_POOL_SIZE (25) or the HTTP runtime pool size, so targeted admin nodes can see far more simultaneous connections than before.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9a71608. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, not an oversight. Earlier in the PR the admin helper built its own ClickhouseNativePool(..., max_pool_size=2); we deliberately removed that per-caller cap so admin acquires the one shared, runtime-config-sized pool from ConnectionCache like every other caller (the connect pool always sizes from clickhouse_connect_pool_size, the native pool from CLICKHOUSE_MAX_POOL_SIZE). Keeping a hardcoded 2 would mean reintroducing the per-caller max_pool_size plumbing through the cache that we just took out, and would give admin a different-sized pool than everything else hitting the same node.

Admin is low-traffic (human operators running system queries / tracing), so sharing the standard sizing is acceptable, and pool size is now runtime-tunable via clickhouse_connect_pool_size if it ever needs limiting. Leaving as-is.


Generated by Claude Code

Comment thread snuba/clusters/cluster.py Outdated
The QUERY profile timeout went None->30 to give user-facing reads a 30s cap,
but the QUERY profile is also borrowed by non-read callers that can legitimately
run longer: cluster topology discovery (system.clusters), storage-routing load
lookups, delete-throttling system-table checks, the span-export manual job, and
admin table copies. Capping those at 30s could break them on large/loaded
clusters.

Add an unbounded INTERNAL client-settings profile and move those callers onto
it, restoring their pre-PR (unbounded) behavior, while the user read path
(get_reader -> QUERY) keeps the 30s cap. Add a test asserting QUERY is 30s and
INTERNAL is unbounded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011MC5N7AYMhJDBkJ4P7engC
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants