Bug Description
On a setup that interacts with clusters through the SSHComputingElement (the new one based on fabric), the PilotStatusAgent grows to tens of GB of RAM (~45 GB observed)
and floods the SSH gateways with a burst of connections, overloading both the host running the
agent and the gateways it connects to.
Root cause: when declaring stalled pilots Deleted, PilotStatusAgent._killPilots() calls
DiracAdmin.killPilot() once per pilot. Each call goes through
WMSUtilities.killPilotsInQueues(), which builds a fresh ComputingElement via
ComputingElementFactory.getCE() and calls ce.killJob(). For an SSHComputingElement,
each such call opens a new SSH connection (and, with SSHTunnel, a second connection to the
gateway), and the CE/connection is never closed.
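The call pattern described above can be illustrated with a small, self-contained sketch. It does not use the real DIRAC classes: FakeSSHCE and kill_per_pilot are hypothetical stand-ins, and the counter only models "a connection was opened and never closed".

```python
class FakeSSHCE:
    """Hypothetical stand-in for an SSH-backed CE; not DIRAC code."""

    open_connections = 0  # class-wide count of connections that were never closed

    def __init__(self, queue):
        self.queue = queue
        # Building a CE is assumed to open its own SSH connection.
        FakeSSHCE.open_connections += 1

    def killJob(self, pilot_refs):
        # Pretend to kill the given pilots; returns how many were handled.
        return len(pilot_refs)

    def close(self):
        FakeSSHCE.open_connections -= 1


def kill_per_pilot(pilots):
    """Leaky pattern: one fresh CE per pilot, and close() is never called."""
    for ref, queue in pilots:
        ce = FakeSSHCE(queue)
        ce.killJob([ref])


# A backlog of 1000 stalled pilots, all on the same queue.
pilots = [(f"pilot-{i}", "queueA") for i in range(1000)]
kill_per_pilot(pilots)
leaked = FakeSSHCE.open_connections
print(leaked)  # 1000: one unclosed connection per pilot
```

Even though all pilots belong to a single queue, the number of open connections scales with the number of pilots rather than the number of queues.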
Consequences when a backlog of stalled pilots accumulates (e.g. after a site/queue stops
reporting pilot status):
- one (or two, with a gateway) new SSH connection per stalled pilot → connection burst on the gateways;
- the fabric/paramiko Connection objects and their background Transport threads are never released → unbounded memory growth. Fabric's documentation explicitly warns that relying on garbage collection to close connections "is not currently safe".
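Since garbage collection cannot be relied on, whoever creates a connection has to close it deterministically. A minimal sketch of that idea, using only the standard library's contextlib.closing and a hypothetical FakeConnection class in place of a real SSH connection:

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for an SSH connection object."""

    def __init__(self):
        self.is_open = True

    def run(self, command):
        if not self.is_open:
            raise RuntimeError("connection is closed")
        return f"ran: {command}"

    def close(self):
        self.is_open = False


conn = FakeConnection()
# closing() calls conn.close() when the block exits, even if run() raises.
with closing(conn) as c:
    result = c.run("kill 1234")
```

This only bounds the lifetime of each connection; it does not by itself reduce the number of connections opened per cycle.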
Steps to Reproduce
- Configure a queue served by an SSHComputingElement using an SSHTunnel (gateway).
- Let a backlog of pilots in transient states older than PilotStalledDays accumulate (e.g. a queue/CE that stops updating pilot status for a few days).
- Run PilotStatusAgent.
- Observe, during handleOldPilots:
  - a burst of SSH connections from the agent host to the gateway, one per stalled pilot;
  - the dirac-agent .../PilotStatusAgent process climbing into the tens of GB.
Expected Behavior
PilotStatusAgent declares stalled pilots Deleted and kills them on their CEs without
unbounded memory growth and without opening (and leaking) a separate SSH connection per
pilot. Connections to a given queue/gateway should be reused and released when
no longer valid.
Actual Behavior
- The agent process grows to ~45 GB and overloads the host.
- A large number of SSH connections are opened.
- Connections/threads are never closed, so the footprint persists/accumulates within and
across cycles.
Additional Context
Proposed fix (implemented on a branch):
Reuse CEs/connections across cycles instead of creating one per pilot:
- extract the CE-caching logic the SiteDirector already uses (hash-based invalidation) into a shared QueueCECache (in QueueUtilities), with getCE() (cached-or-rebuilt) and drop() (evict + close());
- PilotStatusAgent._killPilots() groups pilots by queue and issues one killJob() per queue on a cached CE, refreshing pilot credentials each cycle;
- migrate getQueuesResolved() (hence SiteDirector and PushJobAgent) onto QueueCECache, which additionally fixes a latent leak there (CEs were dropped from the cache on config change / invalid queue without being closed).
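A rough sketch of how such a cache and the per-queue grouping could fit together is shown below. It is not the code on the branch: only the names QueueCECache, getCE() and drop() come from the description above, while the constructor, the hashing scheme, FakeCE, kill_pilots and the parameter layout are assumptions made for illustration. Credential refresh is left out.

```python
import hashlib
import json
from collections import defaultdict


class FakeCE:
    """Hypothetical stand-in for a computing element."""

    def __init__(self, params):
        self.params = params
        self.closed = False
        self.kill_calls = []

    def killJob(self, pilot_refs):
        self.kill_calls.append(list(pilot_refs))

    def close(self):
        self.closed = True


class QueueCECache:
    """Caches one CE per queue and rebuilds it when the queue parameters change."""

    def __init__(self, factory):
        self._factory = factory  # assumed callable: params -> CE
        self._entries = {}       # queue name -> (params hash, CE)

    @staticmethod
    def _hash(params):
        payload = json.dumps(params, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def getCE(self, queue, params):
        digest = self._hash(params)
        entry = self._entries.get(queue)
        if entry is not None and entry[0] == digest:
            return entry[1]      # parameters unchanged: reuse the cached CE
        self.drop(queue)         # stale or missing: close the old CE before rebuilding
        ce = self._factory(params)
        self._entries[queue] = (digest, ce)
        return ce

    def drop(self, queue):
        entry = self._entries.pop(queue, None)
        if entry is not None:
            entry[1].close()     # evict and close, so nothing leaks


def kill_pilots(cache, pilots, queue_params):
    """Group pilots by queue and issue one killJob() per queue."""
    by_queue = defaultdict(list)
    for ref, queue in pilots:
        by_queue[queue].append(ref)
    for queue, refs in by_queue.items():
        cache.getCE(queue, queue_params[queue]).killJob(refs)


built = []


def factory(params):
    ce = FakeCE(params)
    built.append(ce)
    return ce


cache = QueueCECache(factory)
params = {"queueA": {"host": "a.example.org"}, "queueB": {"host": "b.example.org"}}
pilots = [(f"pilot-{i}", "queueA" if i % 2 else "queueB") for i in range(100)]

kill_pilots(cache, pilots, params)
kill_pilots(cache, pilots, params)       # second cycle reuses the cached CEs
ces_after_two_cycles = len(built)

params["queueA"] = {"host": "a2.example.org"}  # simulated configuration change
kill_pilots(cache, pilots, params)       # queueA CE is closed and rebuilt
```

In this sketch the number of CEs built depends on the number of queues and configuration changes, not on the number of pilots, and a CE that leaves the cache is always closed first.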