fix(registry): evict stale lease on heartbeat so re-registration can fire#1
Closed
atsyplikhin wants to merge 1 commit into
Closed
Conversation
…fire DeviceRegistry.refresh() returned early whenever the in-memory leases dict still held a handle, even after etcd had TTL-expired that lease. etcd3gw's Lease.refresh() does NOT raise against an expired lease -- it returns -1 (TTL field absent in the keepalive response). So the stale handle survived, has_lease() kept reporting True, and the heartbeat path in the registry service never sent requestRegistration. The device stayed absent from the registry until an unrelated fresh register() call (e.g. the edge's NATS-disconnect path) happened to fire. refresh() now inspects the return value (and still catches transport/ server errors): on TTL == -1 or an exception, it drops the stale handle and falls through to the existing recovery code. has_lease() then reports False, the server asks the device to re-register, and the edge replies with its registration payload -- recovery completes on the next heartbeat with no client-side change. Tests cover the no-raise expiry, the raising error, and the data-survives-lease recovery paths; existing refresh tests updated to return a realistic int TTL from the mocked lease. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
Author
|
Reopening against arm/device-connect instead of the fork. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a server-side bug where a device that TTL-expired in etcd would stay missing from the registry until some unrelated event forced a fresh
register().Stacked on arm#41 (which is stacked on arm#38). This PR's own diff is just the registry fix below.
DeviceRegistry.refresh()returned early whenever the process-wide in-memoryleasesdict still held a handle for the device -- even after etcd had already TTL-expired that lease. The handle is never evicted on its own, so on a "stranded" heartbeat:Lease.refresh()does not raise against an expired lease. Per the etcd keepalive spec the response omits theTTLfield, and etcd3gw returns-1in that case (seeetcd3gw/lease.py).has_lease()kept reportingTrue, and the heartbeat path inregistry/service/main.pynever sentrequestRegistration.register()call (e.g. the edge's NATS-disconnect path) happened to fire.Fix
refresh()now inspects the return value oflease.refresh()(and still catches transport/server errors): onTTL == -1or an exception, it drops the stale handle and falls through to the existing recovery code.After eviction
has_lease()correctly reportsFalse, the server asks the device to re-register, and the edge replies with its registration payload (therequestRegistrationwire path already exists on both sides). Recovery completes on the next heartbeat -- no client-side change required.The architectural choice ("registry asks the device to re-register on a stranded heartbeat") was already made; it was just gated behind a stale-state check that lied.
Test plan
test_refresh_evicts_expired_lease_no_raise--refresh() -> -1, no raise: handle evicted,has_lease()False, no phantom lease created when etcd has no datatest_refresh_evicts_stale_lease_on_error--refresh()raises: handle still evictedtest_refresh_recovers_when_data_survives_lease-- stale lease but doc still in etcd: re-issues lease + re-stores dataTestRefreshHeartbeattests updated to return a realistic int TTL from the mocked lease (a bareMagicMockreturn would not survive the new>= 0comparison)🤖 Generated with Claude Code