Skip to content

fix(registry): evict stale lease on heartbeat so re-registration can fire#1

Closed
atsyplikhin wants to merge 1 commit into
feat/nats-websocket-listenerfrom
fix/registry-stale-lease-eviction
Closed

fix(registry): evict stale lease on heartbeat so re-registration can fire#1
atsyplikhin wants to merge 1 commit into
feat/nats-websocket-listenerfrom
fix/registry-stale-lease-eviction

Conversation

@atsyplikhin
Copy link
Copy Markdown
Owner

Summary

Fixes a server-side bug where a device that TTL-expired in etcd would stay missing from the registry until some unrelated event forced a fresh register().

Stacked on arm#41 (which is stacked on arm#38). This PR's own diff is just the registry fix below.

DeviceRegistry.refresh() returned early whenever the process-wide in-memory leases dict still held a handle for the device -- even after etcd had already TTL-expired that lease. The handle is never evicted on its own, so on a "stranded" heartbeat:

  • etcd3gw's Lease.refresh() does not raise against an expired lease. Per the etcd keepalive spec the response omits the TTL field, and etcd3gw returns -1 in that case (see etcd3gw/lease.py).
  • So the stale handle survived, has_lease() kept reporting True, and the heartbeat path in registry/service/main.py never sent requestRegistration.
  • The device stayed absent from the registry until an unrelated fresh register() call (e.g. the edge's NATS-disconnect path) happened to fire.

Fix

refresh() now inspects the return value of lease.refresh() (and still catches transport/server errors): on TTL == -1 or an exception, it drops the stale handle and falls through to the existing recovery code.

if lease:
    try:
        new_ttl = lease.refresh()
    except Exception:
        new_ttl = -1
    if new_ttl >= 0:
        return
    self.leases.pop(lk, None)   # stale -> drop, fall through to recovery

After eviction has_lease() correctly reports False, the server asks the device to re-register, and the edge replies with its registration payload (the requestRegistration wire path already exists on both sides). Recovery completes on the next heartbeat -- no client-side change required.

The architectural choice ("registry asks the device to re-register on a stranded heartbeat") was already made; it was just gated behind a stale-state check that lied.

Test plan

  • test_refresh_evicts_expired_lease_no_raise -- refresh() -> -1, no raise: handle evicted, has_lease() False, no phantom lease created when etcd has no data
  • test_refresh_evicts_stale_lease_on_error -- refresh() raises: handle still evicted
  • test_refresh_recovers_when_data_survives_lease -- stale lease but doc still in etcd: re-issues lease + re-stores data
  • Existing TestRefreshHeartbeat tests updated to return a realistic int TTL from the mocked lease (a bare MagicMock return would not survive the new >= 0 comparison)
  • Full registry suite: 48 passed

🤖 Generated with Claude Code

…fire

DeviceRegistry.refresh() returned early whenever the in-memory leases
dict still held a handle, even after etcd had TTL-expired that lease.
etcd3gw's Lease.refresh() does NOT raise against an expired lease -- it
returns -1 (TTL field absent in the keepalive response). So the stale
handle survived, has_lease() kept reporting True, and the heartbeat path
in the registry service never sent requestRegistration. The device stayed
absent from the registry until an unrelated fresh register() call (e.g.
the edge's NATS-disconnect path) happened to fire.

refresh() now inspects the return value (and still catches transport/
server errors): on TTL == -1 or an exception, it drops the stale handle
and falls through to the existing recovery code. has_lease() then reports
False, the server asks the device to re-register, and the edge replies
with its registration payload -- recovery completes on the next heartbeat
with no client-side change.

Tests cover the no-raise expiry, the raising error, and the
data-survives-lease recovery paths; existing refresh tests updated to
return a realistic int TTL from the mocked lease.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@atsyplikhin
Copy link
Copy Markdown
Owner Author

Reopening against arm/device-connect instead of the fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant