Describe the issue
pkg/repository/multiplexer.go's Connect callback hands the raw *pgx.Conn from an acquired pool slot to pgxlisten.Listener without retaining the *pgxpool.Conn wrapper:
https://github.com/hatchet-dev/hatchet/blob/v0.83.41/pkg/repository/multiplexer.go#L58-L66
```go
Connect: func(ctx context.Context) (*pgx.Conn, error) {
    poolConn, err := m.pool.Acquire(ctx)
    if err != nil {
        return nil, err
    }

    return poolConn.Conn(), nil // *pgxpool.Conn wrapper falls out of scope
},
```
The wrapper goes out of scope with no Release() call, so pgxpool's internal bookkeeping permanently counts the slot as "acquired." When the server-side terminates the listener's conn (idle_session_timeout, pgbouncer server_idle_timeout, an L7 proxy idle kill), pgxlisten's listen() returns with an error, defer conn.Close(ctx) closes the raw conn — but the pool wrapper was already orphaned.
On reconnect, pgxlisten calls Connect again and acquires a fresh slot. The previous slot is still marked as acquired from pgx's POV, but its underlying TCP is dead. Every reconnect cycle leaks one pool slot.
This is a separate bug from #2771 (fixed by #2772). #2772 ensured reconnect happens at all. This issue concerns the slot-leak consequence of how reconnect is wired — #2772 didn't address it.
Environment
- Engine image: ghcr.io/hatchet-dev/hatchet/hatchet-lite:v0.79.16. Reviewed main / v0.83.41 — multiplexer code is unchanged, same bug present.
- PostgreSQL: 15.10 (AWS RDS), with idle_session_timeout = 3600000 (1 h) set on the parameter group.
- Database config: all Hatchet defaults — DATABASE_MAX_CONNS=50, DATABASE_MIN_CONNS=1, DATABASE_MAX_QUEUE_CONNS=50, DATABASE_MIN_QUEUE_CONNS=10, DATABASE_MAX_CONN_LIFETIME=15m, DATABASE_MAX_CONN_IDLE_TIME=1m.
- Msgqueue kind: postgres (pg_notify via multiplexed listener).
Expected behavior
A long-running Hatchet engine should be stable under server-side connection termination policies. Each listener-conn death should cleanly return its pool slot and reconnect without any persistent resource accumulation.
Actual behavior
Hourly on the dot, the listener logs:
```
WRN error in listener
error="waiting for notification: FATAL: terminating connection due to idle-session timeout (SQLSTATE 57P05)"
service=database
```
pgxlisten reconnects cleanly (no error cascade visible — #2772's fix is working as intended). But pgxpool's accounting drifts with every cycle:
- DatabaseConnections on the Postgres server drops by ~1 per hour. In our case it went from ~30 at engine startup to ~12 over 60 hours — a clean, monotonic decline with no recovery.
- pgxpool.Stat().AcquiredConns() grows monotonically as wrapper orphans accumulate (see the diagnostic sketch below).
- After ~60 cycles, the pool's "acquired" counter saturates at MaxConns=50 while only ~12 actual TCP conns exist server-side. New pool.Acquire() calls block against the cap.
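A hedged diagnostic sketch (not existing Hatchet code) of that check, assuming ctx, an initialized *pgxpool.Pool named pool, and the standard library log package. The application_name filter is the same one used in the reproducer below and may need adjusting per deployment:

```go
// Compare pgxpool's accounting with what Postgres actually sees.
var serverConns int
err := pool.QueryRow(ctx,
    "SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE '%hatchet%'",
).Scan(&serverConns)
if err != nil {
    log.Printf("pg_stat_activity check failed: %v", err)
} else {
    s := pool.Stat()
    log.Printf("pool: acquired=%d total=%d idle=%d; server: %d conns (gap = orphaned slots)",
        s.AcquiredConns(), s.TotalConns(), s.IdleConns(), serverConns)
}
```

Run periodically from a debug goroutine; a steadily widening gap between the acquired count and the server-side count is the signature of this bug.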
Once the pool is effectively wedged, downstream failure is severe:
- Queue/scheduler/controller subsystems' periodic queries time out with context deadline exceeded.
- Partition-heartbeat UPDATEs into SchedulerPartition / ControllerPartition / TenantWorkerPartition stop landing.
- The periodic rebalanceInactive*Partitions job (internal/services/partition/partition.go) runs and — being a simpler single-row DELETE — sometimes still wins a live slot. It successfully deletes the engine's own stale partition rows.
- ON DELETE SET NULL on the Tenant → Partition FKs cascades: Tenant.schedulerPartitionId, Tenant.controllerPartitionId, and Tenant.workerPartitionId all go NULL.
- Worker RegisterWorker RPCs can no longer bind to a partition; worker heartbeat RPC handlers can't complete within the SDK's 5 s deadline.
- Workers' /health endpoint returns 503 (Python SDK action_listener_process._compute_health → last_heartbeat_succeeded == False), the container health check fails, ECS (or equivalent) restarts the worker, and the cycle repeats every ~3 min.
- Restarting the worker doesn't help — the engine's control plane is dead. Only an engine restart (rebuilds the pool from zero, re-runs partition bootstrap at startup) recovers.
Reproducer
- On a Postgres 14+ instance, set idle_session_timeout = 300000 (5 min) via parameter group or ALTER SYSTEM.
- Start hatchet-lite (any recent version, including v0.83.41) with SERVER_MSGQUEUE_KIND=postgres.
- Every 5 min you'll see WRN error in listener … FATAL: terminating connection due to idle-session timeout in the engine logs. Reconnect happens after ReconnectDelay=10s.
- Query SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE '%hatchet%' — the count drifts downward monotonically, ~1 per listener cycle.
- After enough cycles (depends on pool max + real demand), engine subsystems start logging context deadline exceeded on all operations.
- Eventually the partition tables get emptied by the engine's own rebalance cron and the engine becomes functionally dead until restart. (A standalone Go sketch of the same wiring follows this list.)
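The same wiring can also be exercised outside Hatchet to watch the counters drift directly. This is a minimal, hedged sketch rather than Hatchet code: the DSN, channel name, and print interval are placeholders, and the listener setup mirrors the multiplexer's Connect shape.

```go
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgconn"
    "github.com/jackc/pgx/v5/pgxpool"
    "github.com/jackc/pgxlisten"
)

func main() {
    ctx := context.Background()

    pool, err := pgxpool.New(ctx, "postgres://localhost:5432/postgres") // placeholder DSN
    if err != nil {
        panic(err)
    }

    listener := &pgxlisten.Listener{
        // Same shape as multiplexer.go: the *pgxpool.Conn wrapper is dropped.
        Connect: func(ctx context.Context) (*pgx.Conn, error) {
            poolConn, err := pool.Acquire(ctx)
            if err != nil {
                return nil, err
            }
            return poolConn.Conn(), nil
        },
        LogError:       func(ctx context.Context, err error) { fmt.Println("listener error:", err) },
        ReconnectDelay: 10 * time.Second,
    }
    listener.Handle("repro", pgxlisten.HandlerFunc(
        func(ctx context.Context, n *pgconn.Notification, _ *pgx.Conn) error {
            fmt.Println("notification:", n.Payload)
            return nil
        }))

    go func() { _ = listener.Listen(ctx) }()

    // Each server-side kill + reconnect bumps AcquiredConns by one, permanently.
    for range time.Tick(time.Minute) {
        s := pool.Stat()
        fmt.Printf("acquired=%d total=%d idle=%d\n", s.AcquiredConns(), s.TotalConns(), s.IdleConns())
    }
}
```

With idle_session_timeout at 5 min, the acquired count grows by one roughly every 5 min while the matching count in pg_stat_activity drifts down.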
Bonus failure amplifier: DATABASE_MIN_QUEUE_CONNS=10 (default) ensures 10 queue-pool conns are never recycled by the client-side idle timer, so they pin up against the 1 h Postgres kill and all die in lockstep once per hour, accelerating the leak rate.
Root cause
pgx v5's pgxpool uses explicit Acquire/Release semantics with no wrapper finalizer. The contract is "whoever acquires must release" — pgx has no background reaper for conns held indefinitely. In the multiplexer's Connect:
```go
poolConn, err := m.pool.Acquire(ctx)
// ...
return poolConn.Conn(), nil
```
poolConn is a local variable. When the function returns, the *pgxpool.Conn wrapper is unreachable. pgxlisten holds only the raw *pgx.Conn. There is no path back to poolConn.Release().
When pgxlisten.Listener.listen() returns on error, its defer conn.Close(ctx) closes the raw conn — but this sends Terminate on the underlying protocol, it does not call pgxpool.Conn.Release(). pgxpool's bookkeeping never decrements acquired, and the slot is never recycled into the idle queue or destroyed.
Neither MaxConnLifetime, MaxConnIdleTime, nor HealthCheckPeriod can recover this slot. All three operate on pgxpool's idle list via puddle.AcquireAllIdle() — the dead slot is in the acquired state and is invisible to these mechanisms.
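To make the accounting gap concrete, a small hedged illustration (assuming ctx and an initialized *pgxpool.Pool named pool):

```go
poolConn, _ := pool.Acquire(ctx)
raw := poolConn.Conn()

before := pool.Stat().AcquiredConns()
_ = raw.Close(ctx) // what pgxlisten's deferred Close does: Terminate on the wire

// Same value before and after: pgxpool never saw a Release(), so the slot
// stays "acquired" even though the underlying connection is gone.
fmt.Println(before, pool.Stat().AcquiredConns())
```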
Suggested fix
Two independent options, either works:
Option A — retain and release the wrapper
Keep a reference to the *pgxpool.Conn alive for the duration of the raw conn, and release it when the raw conn is closed. pgxlisten doesn't expose a release hook, so a minimal workaround is to track the previous pool connection in the Connect closure and release it on the next acquisition:
```go
var (
    currentPoolConn *pgxpool.Conn
    poolConnMu      sync.Mutex
)

listener := &pgxlisten.Listener{
    Connect: func(ctx context.Context) (*pgx.Conn, error) {
        poolConnMu.Lock()
        defer poolConnMu.Unlock()

        // Previous conn is dead (that's why we're here); release its slot.
        if currentPoolConn != nil {
            currentPoolConn.Release()
        }

        poolConn, err := m.pool.Acquire(ctx)
        if err != nil {
            currentPoolConn = nil
            return nil, err
        }

        currentPoolConn = poolConn
        return poolConn.Conn(), nil
    },
    // ...
}
```
pgxpool's Release() on a conn that's been externally closed is safe: Release() inspects conn state (IsClosed() || IsBusy() || TxStatus() != 'I') and, if any is true, calls res.Destroy() which closes and removes the slot cleanly.
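A small hedged check of that claim, under the same pool/ctx assumptions as the illustration above:

```go
poolConn, _ := pool.Acquire(ctx)
_ = poolConn.Conn().Close(ctx) // simulate the server (or pgxlisten) killing the conn
poolConn.Release()             // detects IsClosed() and destroys the slot

// The acquired count drops back; the dead conn is not returned to the idle list.
fmt.Println(pool.Stat().AcquiredConns())
```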
Option B — use a dedicated conn outside the pool
Replace pool.Acquire(ctx) in the Connect callback with pgx.Connect(ctx, connString):
```go
Connect: func(ctx context.Context) (*pgx.Conn, error) {
    return pgx.Connect(ctx, m.connString)
},
```
The listener's conn is then independent of pgxpool. Its lifecycle is controlled by pgxlisten alone — defer conn.Close(ctx) closes the raw conn, the next Connect opens a fresh one. No pool accounting involved.
Trade-off: the listener conn no longer counts against DATABASE_MAX_CONNS. Operators need to know this when sizing Postgres max_connections for Hatchet. The addition is minimal (one conn per engine instance).
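One way to soften that operational blind spot is to tag the dedicated conn so it is easy to spot in pg_stat_activity. A hedged sketch of the Connect callback, assuming m.connString as in Option B; the application_name value is illustrative:

```go
Connect: func(ctx context.Context) (*pgx.Conn, error) {
    cfg, err := pgx.ParseConfig(m.connString)
    if err != nil {
        return nil, err
    }
    // Label the out-of-pool listener conn for operators.
    cfg.RuntimeParams["application_name"] = "hatchet-listener" // illustrative name
    return pgx.ConnectConfig(ctx, cfg)
},
```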
Option A is less invasive. Option B is conceptually cleaner. Either closes the leak.
Complementary recommendation — listener keepalive
Independent of the leak fix, the listener will still die every hour under idle_session_timeout because Postgres considers any session with no client-issued queries to be idle, regardless of incoming NotificationResponses. A periodic SELECT 1 from the listener goroutine (e.g. every 10 min) would reset the server-side idle timer and eliminate the death loop entirely. This is a common pattern for LISTEN/NOTIFY consumers.
pgxlisten doesn't offer a keepalive hook, so this lives in Hatchet. Option B composes well with this — once the listener owns its conn outright, adding a keepalive is straightforward.
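For reference, the usual shape of such a keepalive is a wait-with-timeout loop that pings between waits; since a pgx.Conn is not safe for concurrent use, the ping has to be interleaved with the wait rather than issued from a separate goroutine. A hedged sketch of that pattern, not Hatchet or pgxlisten code, assuming the loop owns its *pgx.Conn directly; handle() and the 10-minute interval are placeholders:

```go
// waitLoop waits for notifications and issues SELECT 1 whenever a wait
// interval elapses with no traffic, resetting the server-side idle timer.
// The caller owns the conn and reconnects when this returns an error.
func waitLoop(ctx context.Context, conn *pgx.Conn, handle func(*pgconn.Notification)) error {
    for {
        waitCtx, cancel := context.WithTimeout(ctx, 10*time.Minute)
        notification, err := conn.WaitForNotification(waitCtx)
        cancel()

        if notification != nil {
            handle(notification)
        }
        if err == nil {
            continue
        }
        if pgconn.Timeout(err) && ctx.Err() == nil {
            // Quiet interval: a trivial client-issued query keeps the session non-idle.
            if _, pingErr := conn.Exec(ctx, "SELECT 1"); pingErr != nil {
                return pingErr // conn is dead; caller should reconnect
            }
            continue
        }
        return err
    }
}
```

The interval only needs to be comfortably shorter than whatever server-side or proxy idle timeout is in effect.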
Evidence from a production-adjacent reproduction
Our dev environment matched the reproducer exactly. We added idle_session_timeout = 3600000 via infrastructure-as-code; the engine ran fine for ~60 hours (the workload was light enough that the leak took time to bite), then pool saturation → partition wipe → worker flap → engine wedge. Forensics:
- Listener kills occurred once per hour, on the hour (UTC), for 60+ hours straight before the visible failure.
- RDS DatabaseConnections metric: 30 (at engine startup) → 12 (by failure).
- pgxpool acquired_connections=50 when the pool logged saturation, while RDS simultaneously reported 7 real conns — the 43-count gap is the zombie population.
- Partition rows rebootstrap cleanly on engine restart; the ON DELETE SET NULL cascade on the Tenant FKs was confirmed as the mechanism via DB inspection.
Happy to share sanitized logs, the full pg_stat_user_tables snapshot of the partition tables during the wedge, and the RDS connection-count decline graph if that helps triage.
Related
- ListenV2 on the worker side: same shape of problem (an idle kill plus a client that doesn't detect it cleanly), but a different connection.

I'm happy to open a PR for either Option A or Option B if the project prefers one direction. Let me know.