
[BUG] Multiplexed LISTEN connection leaks pool slots on reconnect under server-side idle connection termination #3694

@rohan-tungsten

Describe the issue

pkg/repository/multiplexer.go's Connect callback hands the raw *pgx.Conn from an acquired pool slot to pgxlisten.Listener without retaining the *pgxpool.Conn wrapper:

https://github.com/hatchet-dev/hatchet/blob/v0.83.41/pkg/repository/multiplexer.go#L58-L66

Connect: func(ctx context.Context) (*pgx.Conn, error) {
    poolConn, err := m.pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }
    return poolConn.Conn(), nil          // *pgxpool.Conn wrapper falls out of scope
},

The wrapper goes out of scope with no Release() call, so pgxpool's internal bookkeeping permanently counts the slot as "acquired." When the server terminates the listener's conn (idle_session_timeout, pgbouncer server_idle_timeout, an L7 proxy idle kill), pgxlisten's listen() returns with an error and its defer conn.Close(ctx) closes the raw conn, but the pool wrapper was already orphaned.

On reconnect, pgxlisten calls Connect again and acquires a fresh slot. The previous slot is still marked as acquired from pgx's POV, but its underlying TCP is dead. Every reconnect cycle leaks one pool slot.

This is a separate bug from #2771 (fixed by #2772). #2772 ensured reconnect happens at all. This issue concerns the slot-leak consequence of how reconnect is wired — #2772 didn't address it.

Environment

  • Engine image: ghcr.io/hatchet-dev/hatchet/hatchet-lite:v0.79.16. Also reviewed main / v0.83.41: the multiplexer code is unchanged, so the same bug is present.
  • PostgreSQL: 15.10 (AWS RDS), with idle_session_timeout = 3600000 (1 h) set on the parameter group.
  • Database config: all Hatchet defaults — DATABASE_MAX_CONNS=50, DATABASE_MIN_CONNS=1, DATABASE_MAX_QUEUE_CONNS=50, DATABASE_MIN_QUEUE_CONNS=10, DATABASE_MAX_CONN_LIFETIME=15m, DATABASE_MAX_CONN_IDLE_TIME=1m.
  • Msgqueue kind: postgres (pg_notify via multiplexed listener).

Expected behavior

A long-running Hatchet engine should be stable under server-side connection termination policies. Each listener-conn death should cleanly return its pool slot and reconnect without any persistent resource accumulation.

Actual behavior

Hourly on the dot, the listener logs:

WRN error in listener
  error="waiting for notification: FATAL: terminating connection due to idle-session timeout (SQLSTATE 57P05)"
  service=database

pgxlisten reconnects cleanly (no error cascade visible; #2772's fix is working as intended). But the damage accumulates with every cycle:

  • DatabaseConnections on the Postgres server declines by ~1 per hour. In our case it went from ~30 at engine startup to ~12 over 60 hours: a clean, monotonic decline with no recovery.
  • pgxpool.Stat().AcquiredConns() grows monotonically as orphaned wrappers accumulate (see the illustrative snippet after this list).
  • After ~60 cycles, the pool's "acquired" counter saturates at MaxConns=50 while only ~12 actual TCP conns exist server-side. New pool.Acquire() calls block against the cap.
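
For triage, a small monitoring goroutine makes the divergence visible. This is an illustrative sketch, not Hatchet code: it assumes access to the engine's *pgxpool.Pool (called pool here) and the standard library log and time packages.

// AcquiredConns climbs by one per listener death while IdleConns/TotalConns stay flat or shrink.
go func() {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        s := pool.Stat()
        log.Printf("pgxpool: acquired=%d idle=%d total=%d max=%d",
            s.AcquiredConns(), s.IdleConns(), s.TotalConns(), s.MaxConns())
    }
}()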

Once the pool is effectively wedged, downstream failure is severe:

  1. Queue/scheduler/controller subsystems' periodic queries time out with context deadline exceeded.
  2. Partition-heartbeat UPDATEs into SchedulerPartition / ControllerPartition / TenantWorkerPartition stop landing.
  3. The periodic rebalanceInactive*Partitions job (internal/services/partition/partition.go) runs and — being a simpler single-row DELETE — sometimes still wins a live slot. It successfully deletes the engine's own stale partition rows.
  4. ON DELETE SET NULL on Tenant → Partition FKs cascades: Tenant.schedulerPartitionId, Tenant.controllerPartitionId, Tenant.workerPartitionId all go NULL.
  5. Worker RegisterWorker RPCs can no longer bind to a partition; worker heartbeat RPC handlers can't complete within the SDK's 5 s deadline.
  6. Workers' /health endpoint returns 503 (Python SDK action_listener_process._compute_health reports last_heartbeat_succeeded == False), the container health check fails, ECS (or equivalent) restarts the worker, and the cycle repeats every ~3 min.
  7. Restarting the worker doesn't help — the engine's control plane is dead. Only an engine restart (rebuilds pool from zero, re-runs partition bootstrap at startup) recovers.

Reproducer

  1. On a Postgres 14+ instance, set idle_session_timeout = 300000 (5 min) via parameter group or ALTER SYSTEM.
  2. Start hatchet-lite (any recent version, including v0.83.41) with SERVER_MSGQUEUE_KIND=postgres.
  3. Every 5 min you'll see WRN error in listener … FATAL: terminating connection due to idle-session timeout in engine logs. Reconnect happens after ReconnectDelay=10s.
  4. Query SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE '%hatchet%' — count drifts downward monotonically, ~1 per listener cycle.
  5. After enough cycles (depends on pool max + real demand), engine subsystems start logging context deadline exceeded on all operations.
  6. Eventually partition tables get emptied by the engine's own rebalance cron and the engine becomes functionally dead until restart.

Bonus failure amplifier: DATABASE_MIN_QUEUE_CONNS=10 (default) means 10 queue-pool conns are never recycled by the client-side idle timer, so they stay open until the 1 h Postgres kill and all die in lockstep once per hour, accelerating the leak.

Root cause

pgx v5's pgxpool uses explicit Acquire/Release semantics with no wrapper finalizer. The contract is "whoever acquires must release" — pgx has no background reaper for conns held indefinitely. In the multiplexer's Connect:

poolConn, err := m.pool.Acquire(ctx)
// ...
return poolConn.Conn(), nil

poolConn is a local variable. When the function returns, the *pgxpool.Conn wrapper is unreachable. pgxlisten holds only the raw *pgx.Conn. There is no path back to poolConn.Release().

When pgxlisten.Listener.listen() returns on error, its defer conn.Close(ctx) closes the raw conn, but this only sends Terminate on the wire; it does not call pgxpool.Conn.Release(). pgxpool's bookkeeping never decrements the acquired count, and the slot is never recycled into the idle queue or destroyed.

Neither MaxConnLifetime, MaxConnIdleTime, nor HealthCheckPeriod can recover this slot. All three operate on pgxpool's idle list via puddle.AcquireAllIdle() — the dead slot is in the acquired state and is invisible to these mechanisms.
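
To make that concrete, the following standalone sketch reproduces the accounting problem in isolation. It is not Hatchet code; it assumes only a reachable database whose URL is in DATABASE_URL.

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/jackc/pgx/v5/pgxpool"
)

func main() {
    ctx := context.Background()
    pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    poolConn, err := pool.Acquire(ctx)
    if err != nil {
        panic(err)
    }

    // Mimic what pgxlisten does when the listener conn dies: close the raw
    // conn and drop the *pgxpool.Conn wrapper without calling Release().
    _ = poolConn.Conn().Close(ctx)

    // The pool still counts the slot as acquired; no idle-list mechanism
    // (MaxConnLifetime, MaxConnIdleTime, HealthCheckPeriod) will ever see it.
    s := pool.Stat()
    fmt.Printf("acquired=%d total=%d idle=%d\n",
        s.AcquiredConns(), s.TotalConns(), s.IdleConns())
}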

Suggested fix

Two independent options; either works:

Option A — retain and release the wrapper

Keep a reference to the *pgxpool.Conn alive for the duration of the raw conn, and release it when the raw conn is closed. pgxlisten doesn't expose a release hook, so a minimal workaround is to track the previous pool connection in the Connect closure and release it on the next acquisition:

var currentPoolConn *pgxpool.Conn
var poolConnMu sync.Mutex

listener := &pgxlisten.Listener{
    Connect: func(ctx context.Context) (*pgx.Conn, error) {
        poolConnMu.Lock()
        defer poolConnMu.Unlock()

        // Previous conn is dead (that's why we're here); release its slot.
        if currentPoolConn != nil {
            currentPoolConn.Release()
        }

        poolConn, err := m.pool.Acquire(ctx)
        if err != nil {
            currentPoolConn = nil
            return nil, err
        }
        currentPoolConn = poolConn
        return poolConn.Conn(), nil
    },
    // ...
}

pgxpool's Release() on a conn that's been externally closed is safe: Release() inspects conn state (IsClosed() || IsBusy() || TxStatus() != 'I') and, if any is true, calls res.Destroy() which closes and removes the slot cleanly.

Option B — use a dedicated conn outside the pool

Replace pool.Acquire(ctx) in the Connect callback with pgx.Connect(ctx, connString):

Connect: func(ctx context.Context) (*pgx.Conn, error) {
    return pgx.Connect(ctx, m.connString)
},

The listener's conn is then independent of pgxpool. Its lifecycle is controlled by pgxlisten alone — defer conn.Close(ctx) closes the raw conn, the next Connect opens a fresh one. No pool accounting involved.

Trade-off: the listener conn no longer counts against DATABASE_MAX_CONNS. Operators need to know this when sizing Postgres max_connections for Hatchet. The addition is minimal (one conn per engine instance).

Option A is less invasive. Option B is conceptually cleaner. Either closes the leak.

Complementary recommendation — listener keepalive

Independent of the leak fix, the listener will still die every hour under idle_session_timeout because Postgres considers any session with no client-issued queries to be idle, regardless of incoming NotificationResponses. A periodic SELECT 1 from the listener goroutine (e.g. every 10 min) would reset the server-side idle timer and eliminate the death loop entirely. This is a common pattern for LISTEN/NOTIFY consumers.

pgxlisten doesn't offer a keepalive hook, so this lives in Hatchet. Option B composes well with this — once the listener owns its conn outright, adding a keepalive is straightforward.
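
As a minimal sketch of what that could look like once the listener owns its conn and drives the wait loop itself: the loop bounds each WaitForNotification with a deadline and issues SELECT 1 whenever it wakes up empty-handed. The function name, channel handling, and 10 min interval are illustrative, not Hatchet or pgxlisten API; it relies on pgx treating a deadline timeout in WaitForNotification as non-fatal, leaving the conn usable.

// Sketch only. Assumed imports: context, errors, time,
// github.com/jackc/pgx/v5, github.com/jackc/pgx/v5/pgconn.
// The caller reconnects (fresh conn) whenever this returns an error.
func listenWithKeepalive(ctx context.Context, conn *pgx.Conn, channel string,
    handle func(*pgconn.Notification)) error {

    if _, err := conn.Exec(ctx, "LISTEN "+pgx.Identifier{channel}.Sanitize()); err != nil {
        return err
    }

    const keepaliveInterval = 10 * time.Minute

    for {
        // Bound each wait so the loop wakes up often enough to reset the
        // server-side idle timer even when no notifications arrive.
        waitCtx, cancel := context.WithTimeout(ctx, keepaliveInterval)
        n, err := conn.WaitForNotification(waitCtx)
        cancel()

        switch {
        case err == nil:
            handle(n)
        case errors.Is(err, context.DeadlineExceeded) && ctx.Err() == nil:
            // No notification this interval: a trivial query resets
            // idle_session_timeout, then go back to waiting.
            if _, err := conn.Exec(ctx, "SELECT 1"); err != nil {
                return err
            }
        default:
            return err
        }
    }
}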

Evidence from a production-adjacent reproduction

Our dev environment matched the reproducer exactly. We set idle_session_timeout = 3600000 via infrastructure-as-code; the engine ran fine for ~60 hours (the workload was light enough that the leak took time to bite), then pool saturation → partition wipe → worker flap → engine wedge. Forensics:

  • The listener conn was killed once per hour, on the hour (UTC), for 60+ hours straight before the visible failure.
  • RDS DatabaseConnections metric: 30 (at engine startup) → 12 (by failure).
  • pgxpool acquired_connections=50 when the pool logged saturation, while RDS simultaneously reported 7 real conns — the 43-count gap is the zombie population.
  • Partition rows rebootstrap cleanly on engine restart. ON DELETE SET NULL cascade on Tenant FKs confirmed as the mechanism via DB inspection.

Happy to share sanitized logs, the full pg_stat_user_tables snapshot of the partition tables during the wedge, and the RDS connection-count decline graph if that helps triage.

Related

I'm happy to open a PR for either Option A or Option B if the project prefers one direction. Let me know.
