
[BUG] Multiplexed LISTEN connection leaks pool slots on reconnect under server-side idle connection termination #3694

@rohan-tungsten

Describe the issue

pkg/repository/multiplexer.go's Connect callback hands the raw *pgx.Conn from an acquired pool slot to pgxlisten.Listener without retaining the *pgxpool.Conn wrapper:

https://github.com/hatchet-dev/hatchet/blob/v0.83.41/pkg/repository/multiplexer.go#L58-L66

Connect: func(ctx context.Context) (*pgx.Conn, error) {
    poolConn, err := m.pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }
    return poolConn.Conn(), nil          // *pgxpool.Conn wrapper falls out of scope
},

The wrapper goes out of scope with no Release() call, so pgxpool's internal bookkeeping permanently counts the slot as "acquired." When the server terminates the listener's conn (idle_session_timeout, pgbouncer server_idle_timeout, an L7 proxy idle kill), pgxlisten's listen() returns with an error and its defer conn.Close(ctx) closes the raw conn, but the pool wrapper was already orphaned.

On reconnect, pgxlisten calls Connect again and acquires a fresh slot. The previous slot is still marked as acquired from pgx's POV, but its underlying TCP is dead. Every reconnect cycle leaks one pool slot.

This is a separate bug from #2771 (fixed by #2772). #2772 ensured reconnect happens at all. This issue concerns the slot-leak consequence of how reconnect is wired — #2772 didn't address it.

Environment

  • Engine image: ghcr.io/hatchet-dev/hatchet/hatchet-lite:v0.79.16. Also reviewed main / v0.83.41: the multiplexer code is unchanged, so the same bug is present.
  • PostgreSQL: 15.10 (AWS RDS), with idle_session_timeout = 3600000 (1 h) set on the parameter group.
  • Database config: all Hatchet defaults — DATABASE_MAX_CONNS=50, DATABASE_MIN_CONNS=1, DATABASE_MAX_QUEUE_CONNS=50, DATABASE_MIN_QUEUE_CONNS=10, DATABASE_MAX_CONN_LIFETIME=15m, DATABASE_MAX_CONN_IDLE_TIME=1m.
  • Msgqueue kind: postgres (pg_notify via multiplexed listener).

Expected behavior

A long-running Hatchet engine should be stable under server-side connection termination policies. Each listener-conn death should cleanly return its pool slot and reconnect without any persistent resource accumulation.

Actual behavior

Hourly on the dot, the listener logs:

WRN error in listener
  error="waiting for notification: FATAL: terminating connection due to idle-session timeout (SQLSTATE 57P05)"
  service=database

pgxlisten reconnects cleanly (no error cascade visible; #2772's fix is working as intended). But the damage accumulates with every cycle:

  • DatabaseConnections on the Postgres server declines by ~1 per hour. In our case it went from ~30 at engine startup to ~12 over 60 hours: a clean, monotonic decline with no recovery.
  • pgxpool.Stat().AcquiredConns() grows monotonically as orphaned wrappers accumulate (see the illustrative snippet after this list).
  • After ~60 cycles, the pool's "acquired" counter saturates at MaxConns=50 while only ~12 actual TCP conns exist server-side. New pool.Acquire() calls block against the cap.
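
For triage, a small monitoring goroutine makes the divergence visible. This is an illustrative sketch, not Hatchet code: it assumes access to the engine's *pgxpool.Pool (called pool here) and the standard library log and time packages.

// AcquiredConns climbs by one per listener death while IdleConns/TotalConns stay flat or shrink.
go func() {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        s := pool.Stat()
        log.Printf("pgxpool: acquired=%d idle=%d total=%d max=%d",
            s.AcquiredConns(), s.IdleConns(), s.TotalConns(), s.MaxConns())
    }
}()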

Once the pool is effectively wedged, downstream failure is severe:

  1. Queue/scheduler/controller subsystems' periodic queries time out with context deadline exceeded.
  2. Partition-heartbeat UPDATEs into SchedulerPartition / ControllerPartition / TenantWorkerPartition stop landing.
  3. The periodic rebalanceInactive*Partitions job (internal/services/partition/partition.go) runs and — being a simpler single-row DELETE — sometimes still wins a live slot. It successfully deletes the engine's own stale partition rows.
  4. ON DELETE SET NULL on Tenant → Partition FKs cascades: Tenant.schedulerPartitionId, Tenant.controllerPartitionId, Tenant.workerPartitionId all go NULL.
  5. Worker RegisterWorker RPCs can no longer bind to a partition; worker heartbeat RPC handlers can't complete within the SDK's 5 s deadline.
  6. Workers' /health endpoint returns 503 (Python SDK action_listener_process._compute_health reports last_heartbeat_succeeded == False), the container health check fails, ECS (or equivalent) restarts the worker, and the cycle repeats every ~3 min.
  7. Restarting the worker doesn't help — the engine's control plane is dead. Only an engine restart (rebuilds pool from zero, re-runs partition bootstrap at startup) recovers.

Reproducer

  1. On a Postgres 14+ instance, set idle_session_timeout = 300000 (5 min) via parameter group or ALTER SYSTEM.
  2. Start hatchet-lite (any recent version, including v0.83.41) with SERVER_MSGQUEUE_KIND=postgres.
  3. Every 5 min you'll see WRN error in listener … FATAL: terminating connection due to idle-session timeout in engine logs. Reconnect happens after ReconnectDelay=10s.
  4. Query SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE '%hatchet%' — count drifts downward monotonically, ~1 per listener cycle.
  5. After enough cycles (depends on pool max + real demand), engine subsystems start logging context deadline exceeded on all operations.
  6. Eventually partition tables get emptied by the engine's own rebalance cron and the engine becomes functionally dead until restart.

Bonus failure amplifier: DATABASE_MIN_QUEUE_CONNS=10 (default) means 10 queue-pool conns are never recycled by the client-side idle timer, so they stay open until the 1 h Postgres kill and all die in lockstep once per hour, accelerating the leak.

Root cause

pgx v5's pgxpool uses explicit Acquire/Release semantics with no wrapper finalizer. The contract is "whoever acquires must release" — pgx has no background reaper for conns held indefinitely. In the multiplexer's Connect:

poolConn, err := m.pool.Acquire(ctx)
// ...
return poolConn.Conn(), nil

poolConn is a local variable. When the function returns, the *pgxpool.Conn wrapper is unreachable. pgxlisten holds only the raw *pgx.Conn. There is no path back to poolConn.Release().

When pgxlisten.Listener.listen() returns on error, its defer conn.Close(ctx) closes the raw conn, but this only sends Terminate on the wire; it does not call pgxpool.Conn.Release(). pgxpool's bookkeeping never decrements the acquired count, and the slot is never recycled into the idle queue or destroyed.

Neither MaxConnLifetime, MaxConnIdleTime, nor HealthCheckPeriod can recover this slot. All three operate on pgxpool's idle list via puddle.AcquireAllIdle() — the dead slot is in the acquired state and is invisible to these mechanisms.
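
To make that concrete, the following standalone sketch reproduces the accounting problem in isolation. It is not Hatchet code; it assumes only a reachable database whose URL is in DATABASE_URL.

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/jackc/pgx/v5/pgxpool"
)

func main() {
    ctx := context.Background()
    pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    poolConn, err := pool.Acquire(ctx)
    if err != nil {
        panic(err)
    }

    // Mimic what pgxlisten does when the listener conn dies: close the raw
    // conn and drop the *pgxpool.Conn wrapper without calling Release().
    _ = poolConn.Conn().Close(ctx)

    // The pool still counts the slot as acquired; no idle-list mechanism
    // (MaxConnLifetime, MaxConnIdleTime, HealthCheckPeriod) will ever see it.
    s := pool.Stat()
    fmt.Printf("acquired=%d total=%d idle=%d\n",
        s.AcquiredConns(), s.TotalConns(), s.IdleConns())
}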

Suggested fix

Two independent options; either works:

Option A — retain and release the wrapper

Keep a reference to the *pgxpool.Conn alive for the duration of the raw conn, and release it when the raw conn is closed. pgxlisten doesn't expose a release hook, so a minimal workaround is to track the previous pool connection in the Connect closure and release it on the next acquisition:

var currentPoolConn *pgxpool.Conn
var poolConnMu sync.Mutex

listener := &pgxlisten.Listener{
    Connect: func(ctx context.Context) (*pgx.Conn, error) {
        poolConnMu.Lock()
        defer poolConnMu.Unlock()

        // Previous conn is dead (that's why we're here); release its slot.
        if currentPoolConn != nil {
            currentPoolConn.Release()
        }

        poolConn, err := m.pool.Acquire(ctx)
        if err != nil {
            currentPoolConn = nil
            return nil, err
        }
        currentPoolConn = poolConn
        return poolConn.Conn(), nil
    },
    // ...
}

pgxpool's Release() on a conn that's been externally closed is safe: Release() inspects conn state (IsClosed() || IsBusy() || TxStatus() != 'I') and, if any is true, calls res.Destroy() which closes and removes the slot cleanly.

Option B — use a dedicated conn outside the pool

Replace pool.Acquire(ctx) in the Connect callback with pgx.Connect(ctx, connString):

Connect: func(ctx context.Context) (*pgx.Conn, error) {
    return pgx.Connect(ctx, m.connString)
},

The listener's conn is then independent of pgxpool. Its lifecycle is controlled by pgxlisten alone — defer conn.Close(ctx) closes the raw conn, the next Connect opens a fresh one. No pool accounting involved.

Trade-off: the listener conn no longer counts against DATABASE_MAX_CONNS. Operators need to know this when sizing Postgres max_connections for Hatchet. The addition is minimal (one conn per engine instance).

Option A is less invasive. Option B is conceptually cleaner. Either closes the leak.

Complementary recommendation — listener keepalive

Independent of the leak fix, the listener will still die every hour under idle_session_timeout because Postgres considers any session with no client-issued queries to be idle, regardless of incoming NotificationResponses. A periodic SELECT 1 from the listener goroutine (e.g. every 10 min) would reset the server-side idle timer and eliminate the death loop entirely. This is a common pattern for LISTEN/NOTIFY consumers.

pgxlisten doesn't offer a keepalive hook, so this lives in Hatchet. Option B composes well with this — once the listener owns its conn outright, adding a keepalive is straightforward.
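
As a minimal sketch of what that could look like once the listener owns its conn and drives the wait loop itself: the loop bounds each WaitForNotification with a deadline and issues SELECT 1 whenever it wakes up empty-handed. The function name, channel handling, and 10 min interval are illustrative, not Hatchet or pgxlisten API; it relies on pgx treating a deadline timeout in WaitForNotification as non-fatal, leaving the conn usable.

// Sketch only. Assumed imports: context, errors, time,
// github.com/jackc/pgx/v5, github.com/jackc/pgx/v5/pgconn.
// The caller reconnects (fresh conn) whenever this returns an error.
func listenWithKeepalive(ctx context.Context, conn *pgx.Conn, channel string,
    handle func(*pgconn.Notification)) error {

    if _, err := conn.Exec(ctx, "LISTEN "+pgx.Identifier{channel}.Sanitize()); err != nil {
        return err
    }

    const keepaliveInterval = 10 * time.Minute

    for {
        // Bound each wait so the loop wakes up often enough to reset the
        // server-side idle timer even when no notifications arrive.
        waitCtx, cancel := context.WithTimeout(ctx, keepaliveInterval)
        n, err := conn.WaitForNotification(waitCtx)
        cancel()

        switch {
        case err == nil:
            handle(n)
        case errors.Is(err, context.DeadlineExceeded) && ctx.Err() == nil:
            // No notification this interval: a trivial query resets
            // idle_session_timeout, then go back to waiting.
            if _, err := conn.Exec(ctx, "SELECT 1"); err != nil {
                return err
            }
        default:
            return err
        }
    }
}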

Evidence from a production-adjacent reproduction

Our dev environment matched the reproducer exactly. We set idle_session_timeout = 3600000 via infrastructure-as-code; the engine ran fine for ~60 hours (the workload was light enough that the leak took time to bite), then pool saturation → partition wipe → worker flap → engine wedge. Forensics:

  • The listener conn was killed once per hour, on the hour (UTC), for 60+ hours straight before the visible failure.
  • RDS DatabaseConnections metric: 30 (at engine startup) → 12 (by failure).
  • pgxpool acquired_connections=50 when the pool logged saturation, while RDS simultaneously reported 7 real conns — the 43-count gap is the zombie population.
  • Partition rows rebootstrap cleanly on engine restart. ON DELETE SET NULL cascade on Tenant FKs confirmed as the mechanism via DB inspection.

Happy to share sanitized logs, the full pg_stat_user_tables snapshot of the partition tables during the wedge, and the RDS connection-count decline graph if that helps triage.

Related

I'm happy to open a PR for either Option A or Option B if the project prefers one direction. Let me know.
