[Node] Replace IP-based matching with instance-ID-based matching to handle EC2 eventual consistency #694
Open
hehe7318 wants to merge 10 commits into
…C2 eventual consistency

When launching large numbers of instances, DescribeInstances may return instances with missing PrivateIpAddress due to EC2 API eventual consistency. Previously, clustermgtd matched EC2 instances to Slurm nodes by IP address, causing these instances to be treated as non-existent and terminated. This commit replaces IP-based matching with instance-ID-based matching by leveraging Slurm's native scontrol update InstanceId support (since 23.11).
…ry configurable

Part 1 (Slurm 25.11.6 unblock): set InstanceId in the same batched scontrol update command as NodeAddr instead of one scontrol call per node. Slurm 25.11.6 fixes the bug (https://support.schedmd.com/show_bug.cgi?id=24886) where batched per-node InstanceId assignment treated a comma-separated list as a single literal string. Batching removes the x100 slurmctld RPC amplification and the launch/associate race window that the per-node loop introduced.

Part 2 (CreateFleet eventual consistency): replace the fixed 5-retry, ~11s-max DescribeInstances backoff in _get_instances_info with a configurable total timeout (instance_info_retrieval_timeout, default 120s) and a per-attempt backoff cap, per EC2 eventual-consistency guidance. The value is plumbed from slurm_resume/clustermgtd config through InstanceManager and FleetManagerFactory into Ec2CreateFleetManager. Attempt count is driven by un-jittered cumulative backoff so it stays deterministic.
…reateFleet retry timeout
e3594e0 to 7bbb2b0
Set INSTANCE_INFO_RETRIEVAL_TIMEOUT_DEFAULT to 90s (from 120s): it covers the ~27s EC2 eventual-consistency recovery observed in the scaling performance test with margin, while staying far below the default 1800s ResumeTimeout.
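The retry budget described in the commit messages above (a total timeout, a per-attempt backoff cap, and an attempt count derived from un-jittered cumulative backoff) can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the 1s base delay, the 10s cap, and the jitter formula are all assumptions; only the 90s default comes from the text.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_schedule(total_timeout: float, base_delay: float = 1.0, max_delay: float = 10.0) -> List[float]:
    """Return un-jittered delays whose sum never exceeds total_timeout.

    base_delay and max_delay are hypothetical values chosen for illustration.
    """
    delays: List[float] = []
    elapsed = 0.0
    attempt = 0
    while True:
        # Exponential growth, clamped by the per-attempt cap.
        delay = min(base_delay * (2 ** attempt), max_delay)
        if elapsed + delay > total_timeout:
            break
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


def retry_until_complete(
    fetch: Callable[[], T],
    is_complete: Callable[[T], bool],
    total_timeout: float = 90.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fetch() until is_complete(result) holds or the schedule is exhausted.

    The number of attempts depends only on the un-jittered schedule, so it is
    deterministic; jitter only shortens the individual sleeps.
    """
    result = fetch()
    for delay in backoff_schedule(total_timeout):
        if is_complete(result):
            return result
        # Sleep between 50% and 100% of the scheduled delay (assumed jitter).
        sleep(delay * (0.5 + 0.5 * rng()))
        result = fetch()
    return result
```

Because jitter only shrinks each sleep, the wall-clock time spent sleeping stays within the configured timeout, while the attempt count is the same on every run.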
Description of changes
When launching large numbers of instances, DescribeInstances may return instances with missing PrivateIpAddress due to EC2 API eventual consistency. Previously, clustermgtd matched EC2 instances to Slurm nodes by IP address, so these healthy nodes were treated as missing and replaced, and their instances were terminated as orphans, driving the queue into protected mode.
This commit replaces IP-based matching with instance-ID-based matching by leveraging Slurm's native scontrol update InstanceId support (since 23.11).
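As a rough illustration of the idea, the sketch below matches Slurm nodes to EC2 instances by instance ID, so an instance whose private IP has not yet appeared in DescribeInstances is still matched. The classes and function names here are hypothetical stand-ins, not the actual clustermgtd types; the handling of "(null)" follows the normalization described in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Ec2Instance:
    """Hypothetical view of one DescribeInstances result."""
    instance_id: str
    private_ip: str = ""  # may be empty while EC2 is still eventually consistent


@dataclass
class SlurmNode:
    """Hypothetical view of one node parsed from scontrol output."""
    name: str
    instance_id: Optional[str]


def normalize_instance_id(raw: Optional[str]) -> Optional[str]:
    """Map an unset InstanceId (None, empty, or the literal "(null)") to None."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in ("", "(null)") else value


def match_nodes_to_instances(
    nodes: Iterable[SlurmNode], instances: Iterable[Ec2Instance]
) -> Dict[str, Ec2Instance]:
    """Return {node name: instance}, keyed on instance ID rather than IP."""
    by_id = {instance.instance_id: instance for instance in instances}
    matched: Dict[str, Ec2Instance] = {}
    for node in nodes:
        instance_id = normalize_instance_id(node.instance_id)
        if instance_id is not None and instance_id in by_id:
            matched[node.name] = by_id[instance_id]
    return matched
```

With IP-based matching, the instance with an empty private_ip would have no key to match on; keyed on instance ID, it is found regardless of whether the IP has been reported yet.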
Changes
1. Match by InstanceId instead of private IP (clustermgtd)
- Set InstanceId in the same batched scontrol update as NodeAddr, and parse it back from scontrol show nodes.
- Match by InstanceId (always present in DescribeInstances) rather than PrivateIpAddress.
- get_cluster_instances no longer discards instances with missing IP info; they are kept with empty IP fields so they can still be matched by instance ID. (The only clustermgtd consumer of private_ip is an event log line, which tolerates an empty value.)
- An InstanceId reported by Slurm as (null) is normalized to None.

Batched per-node InstanceId assignment requires Slurm >= 25.11.6, which is already in develop.

2. Configurable CreateFleet DescribeInstances retry timeout
_get_instances_info retries DescribeInstances after launch to tolerate eventual consistency. Replaced the fixed ~11s budget with a configurable total timeout (instance_info_retrieval_timeout, default 90s) and a per-attempt backoff cap, per EC2 guidance.
Tests
Related PRs
Checklist
- For a branch other than develop, add the branch name as prefix in the PR title (e.g. [release-3.6]).

Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.