
[Node] Replace IP-based matching with instance-ID-based matching to handle EC2 eventual consistency #694

Open
hehe7318 wants to merge 10 commits into aws:develop from hehe7318:wip/replace-privateip-matching-with-instanceid


hehe7318 commented Mar 23, 2026

Description of changes

When launching large numbers of instances, DescribeInstances may return instances with a missing PrivateIpAddress due to EC2 API eventual consistency. Previously, clustermgtd matched EC2 instances to Slurm nodes by IP address, so these healthy nodes were treated as missing and replaced, and their instances were terminated as orphans, driving the queue into protected mode.

This commit replaces IP-based matching with instance-ID-based matching by leveraging Slurm's native scontrol update InstanceId support (since 23.11).

Changes

1. Match by InstanceId instead of private IP (clustermgtd)

  • Store the EC2 instance ID on each Slurm node during launch by setting InstanceId in the same batched scontrol update as NodeAddr, and parse it back from scontrol show nodes.
  • clustermgtd now associates instances to nodes by InstanceId (always present in DescribeInstances) rather than PrivateIpAddress.
  • get_cluster_instances no longer discards instances with missing IP info; they are kept with empty IP fields so they can still be matched by instance ID. (The only clustermgtd consumer of private_ip is an event log line, which tolerates empty.)
  • An unset InstanceId reported by Slurm as (null) is normalized to None.
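The matching described above can be sketched roughly as follows. This is an illustrative sketch, not the clustermgtd code: the `SlurmNode` and `Ec2Instance` classes and the `normalize_instance_id` and `associate` helpers are hypothetical names introduced here, and only the `(null)` marker and the empty-IP behavior come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlurmNode:  # hypothetical stand-in for the node object
    name: str
    instance_id: Optional[str]  # value parsed from the node's InstanceId field


@dataclass
class Ec2Instance:  # hypothetical stand-in for the instance object
    id: str
    private_ip: str = ""  # may be empty while DescribeInstances catches up


def normalize_instance_id(raw: Optional[str]) -> Optional[str]:
    """Map Slurm's unset marker "(null)" (and empty values) to None."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in ("", "(null)") else value


def associate(nodes, instances):
    """Return {node name: instance or None}, keyed on instance ID only."""
    by_id = {instance.id: instance for instance in instances}
    return {
        node.name: (by_id.get(node.instance_id) if node.instance_id else None)
        for node in nodes
    }
```

Because the lookup never touches `private_ip`, an instance whose IP has not yet appeared in DescribeInstances still resolves to its node.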

Batched per-node InstanceId assignment requires Slurm >= 25.11.6, which is already in develop.
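For illustration, a batched update might be assembled along these lines. The `build_batched_update` helper is hypothetical, and the exact argument shape (parallel comma-separated lists for NodeName, NodeAddr and InstanceId) is an assumption based on the description above, not a copy of the code in this PR. The sketch only builds the argument list and does not run scontrol.

```python
def build_batched_update(assignments):
    """Build one scontrol argv for many nodes.

    assignments: list of (node_name, private_ip, instance_id) tuples.
    Assumption: the three comma-separated lists are matched positionally.
    """
    if not assignments:
        raise ValueError("no nodes to update")
    names, addrs, ids = zip(*assignments)
    return [
        "scontrol",
        "update",
        "NodeName=" + ",".join(names),
        "NodeAddr=" + ",".join(addrs),
        "InstanceId=" + ",".join(ids),
    ]
```

A single call like this replaces one scontrol invocation per node, which is where the reduction in slurmctld RPCs comes from.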

2. Configurable CreateFleet DescribeInstances retry timeout

  • For CreateFleet (flexible instance types), _get_instances_info retries DescribeInstances after launch to tolerate eventual consistency. Replaced the fixed ~11s budget with a configurable total timeout (instance_info_retrieval_timeout, default 90s) and a per-attempt backoff cap, per EC2 guidance.
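A minimal sketch of a retry loop with a total timeout and a per-attempt backoff cap is shown below. It is not the `_get_instances_info` implementation: the function names, the 1s base delay and the 10s cap are assumptions chosen for illustration; only the 90s default comes from this PR. The number of attempts is derived from the un-jittered schedule, so jitter changes how long each sleep lasts but not how many attempts are made.

```python
import random
import time


def backoff_schedule(total_timeout, base=1.0, cap=10.0):
    """Un-jittered delays: double per attempt, capped, until the budget is spent."""
    delays, elapsed, attempt = [], 0.0, 0
    while elapsed < total_timeout:
        # Trim the last delay so the cumulative total never exceeds the budget.
        delay = min(cap, base * 2 ** attempt, total_timeout - elapsed)
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


def retry_until_complete(fetch, is_complete, total_timeout=90.0, sleep=time.sleep):
    """Call fetch() until is_complete(result) holds or the schedule runs out."""
    result = fetch()
    for delay in backoff_schedule(total_timeout):
        if is_complete(result):
            return result
        sleep(random.uniform(0, delay))  # jitter affects sleep time only
        result = fetch()
    return result  # may still be incomplete; the caller decides what to do
```

With these assumed values, a 90s budget yields delays of 1, 2, 4 and 8 seconds, then 10-second steps, with the final delay trimmed to 5 seconds.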

Tests

  • Unit tests passed
  • Manual tests ongoing

Related PRs

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…C2 eventual consistency

When launching large numbers of instances, DescribeInstances may return
instances with missing PrivateIpAddress due to EC2 API eventual consistency.
Previously, clustermgtd matched EC2 instances to Slurm nodes by IP address,
causing these instances to be treated as non-existent and terminated.

This commit replaces IP-based matching with instance-ID-based matching by
leveraging Slurm's native scontrol update InstanceId support (since 23.11).
hehe7318 added the 3.x label Mar 23, 2026
hehe7318 requested review from a team as code owners March 23, 2026 22:58
hehe7318 and others added 5 commits March 26, 2026 14:37
…ry configurable

Part 1 (Slurm 25.11.6 unblock): set InstanceId in the same batched
scontrol update command as NodeAddr instead of one scontrol call per node.
Slurm 25.11.6 fixes the bug (https://support.schedmd.com/show_bug.cgi?id=24886)
where batched per-node InstanceId assignment treated a comma-separated list as
a single literal string. Batching removes the x100 slurmctld RPC amplification
and the launch/associate race window that the per-node loop introduced.

Part 2 (CreateFleet eventual consistency): replace the fixed 5-retry,
~11s-max DescribeInstances backoff in _get_instances_info with a configurable
total timeout (instance_info_retrieval_timeout, default 120s) and a per-attempt
backoff cap, per EC2 eventual-consistency guidance. The value is plumbed from
slurm_resume/clustermgtd config through InstanceManager and FleetManagerFactory
into Ec2CreateFleetManager. Attempt count is driven by un-jittered cumulative
backoff so it stays deterministic.
hehe7318 force-pushed the wip/replace-privateip-matching-with-instanceid branch from e3594e0 to 7bbb2b0 June 2, 2026 17:16
hehe7318 and others added 3 commits June 3, 2026 12:10
Set INSTANCE_INFO_RETRIEVAL_TIMEOUT_DEFAULT to 90s (from 120s): it covers the
~27s EC2 eventual-consistency recovery observed in the scaling performance test
with margin, while staying far below the default 1800s ResumeTimeout.