[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates by gmarciani · Pull Request #7409 · aws/aws-parallelcluster

gmarciani · 2026-05-27T11:43:55Z

This PR depends on aws/aws-parallelcluster-cookbook#3188

Description of changes

Make login node use head-node-driven orchestration for the update workflow.

With this change, login nodes do not depend on cfn-hup/cfn-init anymore and mirror the same update mechanism already adopted for compute nodes. With this change we expect the update workflow to be more resilient.

The changes in this PR mirrors the changes made for compute fleet in the following PRs:

Replace cfn-hup on compute nodes with systemd timers to signal updates aws-parallelcluster-cookbook#3070: this is the core change that removes cfn-hup
[Scaling] Remove usage of cfn-init in Compute Fleet aws-parallelcluster-cookbook#2875: change in cookbook to remove cfn-init, required by change pasted help documentation into the readme #1
[Scaling] Removing usage of cfn-init for compute fleet #6655: change in cli to remove cfn-init, required by change pasted help documentation into the readme #1
Remove cfn-hup.log from computefleet cloudwatch agent config aws-parallelcluster-cookbook#3093: minor change related to logs

In particular:

files that used ot be written by cfn-init arer now written by user datab directives, such as dna.json.
parameters and permissions required by cfn-init are removed from the login nodes launch template
login nodes launch templates identifications are included in launch-templates-config.json
the validation of the chef attribute in_place_update_on_fleet_enabled is removed as the attribute is not used anymore (see [UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates aws-parallelcluster-cookbook#3188)

Notes

See comments "Note for the reviewer" for specific explanations about non obvious changes.

Tests

SUCCESS

test-suites:
  update:
    test_update.py::test_update_slurm:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
    test_update.py::test_dynamic_file_systems_update:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update.py::test_dynamic_file_systems_update_rollback:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update.py::test_login_nodes_update:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update_rollback_failure.py::test_update_rollback_failure:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update_race_conditions.py::test_update_race_conditions:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov · 2026-05-27T21:22:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.93%. Comparing base (6ea6edf) to head (5420ed3).
⚠️ Report is 11 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #7409      +/-   ##
===========================================
- Coverage    90.09%   89.93%   -0.16%     
===========================================
  Files          183      180       -3     
  Lines        16749    16177     -572     
===========================================
- Hits         15090    14549     -541     
+ Misses        1659     1628      -31

Flag	Coverage Δ
unittests	`89.93% <100.00%> (-0.16%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

gmarciani · 2026-05-28T13:31:41Z

        ds_config = self._config.directory_service
        ds_generate_keys = str(ds_config.generate_ssh_keys_for_users).lower() if ds_config else "false"

-        if self._pool.instance_profile:


Note for the reviewer: we are removing this profile because it was declared just to be consumed by cfn-init. Since we removed cfn-init, there is no reason to keep it

gmarciani · 2026-05-28T13:32:20Z

+            indent=None,  # Keep indent as None for compact sizing and proper parsing in user_data.sh
        )

-        cfn_init = {


Note for the reviewer: we can entirely remove this because we removed dependency to cfn-init.

gmarciani · 2026-05-28T13:34:21Z

+        # CloudFormation dependency. Instead, we emit deterministic identifiers (LaunchTemplateName
+        # and LogicalId) computed from the pool name; share_dna.py resolves the launch template
+        # at runtime via DescribeLaunchTemplates.
+        if self.config.login_nodes and self.config.login_nodes.pools:


Note for the reviewer: we need this basic info about login nodes because the script share_dna.py in the cookbook (see aws/aws-parallelcluster-cookbook#3188) needs it to retrieve the launch template user data and extract from it the dna.json body.

Non-blocking: Can you update the comment to mention CloudFormation deadlock or CloudFormation circular dependency rather than CloudFormation cycle? Ty!

I understand that we do not want to add the deadlock; can you add the code snippet of where this deadlock happens? Why is the deadlock happening for LoginNode Stack and not for ComputeFleet?

If I am not wrong its because of https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/templates/cluster_stack.py#L520-L521. Do we know why this is needed for LN and not ComputeFleet; from my understanding we need this dependency because of head_eni=self._head_eni, which is also needed in ComputeFleet stack, so why an explicit dependency is added only for LN Stack?

Both compute and login nodes, when started, require the head node to be running.
The difference between login nodes and compute nodes is when they are launched and this difference motivates the different dependency.

Login nodes are launched as soon as the login nodes ASG is created. So when the ASG is created, the head node must be running. As such, the actual dependency is: ASG depends on Head Node. However, since ASG lives in nested stack, you cannot establish such cross stack dependencies (there may be hacky tricks, but definitely out of scope for this PR). The result is that the entire login nodes nested stack must depend on the head node.

Compute nodes are launched by the head node. The compute fleet nested stack creates the launch templates, that are then used by the head node to launch the compute nodes. So, in this case, the head node depends on the compute fleet stack.

gmarciani · 2026-05-28T13:35:08Z

 output:
  all: "| tee -a /var/log/cloud-init-output.log | logger -t user-data -s 2>/dev/ttyS0"
 write_files:
+  - path: /tmp/dna.json


Note for the reviewer: the insertion in this file are all coming from what cfn-=init used to do. Now that we removed cfn-init, we need the user data directive/body to do that.

gmarciani · 2026-06-02T10:34:14Z

 - Fix race condition in load balancer lookup by using `tag:GetResources` to resolve load balancer ARNs by tags, 
  which used to cause cluster update failure on clusters with login nodes.
 - Fail `pcluster build-image` early when the downloaded cookbook version does not match the ParallelCluster CLI version.
+- Fix login nodes not mounting `/opt/parallelcluster/shared` when EFS is used as the internal shared storage type.


Note for the review: this fix is required by the current change. The fix is shipped in aws/aws-parallelcluster-cookbook#3188

himani2411 · 2026-06-02T20:11:09Z

+                                "AutoScalingGroupName": f"{self._login_nodes_stack_id}-AutoScalingGroup",
+                                "LaunchingLifecycleHookName": (
+                                    f"{self._login_nodes_stack_id}-LoginNodesLaunchingLifecycleHook"
+                                ),


Note To Self: we removed below from LT because those were needed for Cfn-init command that we were running from user_data.sh; and since we are removing its usage we do not need it in LT.

"LaunchTemplateResourceId": launch_template_id, "CloudFormationUrl": get_service_endpoint("cloudformation", self._config.region), "CfnInitRole": instance_role_name,

…or the update workflow. With this change, login nodes do not depend on cfn-hup/cfn-init anymore and mirror the same update mechanism already adopted for compute nodes. With this change we expect the update workflow to be more resilient. As part of this change we also: 1. Remove chef attribute `in_place_update_on_fleet_enabled` which does not make sense anymore because cfn-hup is now used only by the head node. 2. Clarified the scope of test_update_race_conditions. 3. Increase coverage of user data variables in unit test test_login_nodes_launch_template_properties.

gmarciani marked this pull request as ready for review May 27, 2026 15:45

gmarciani requested review from a team as code owners May 27, 2026 15:45

gmarciani changed the title ~~[UpdateWorkflow] Make login node use head-node-driven orchestration f…~~ [UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates May 27, 2026

gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch from fcab2e3 to f8b5484 Compare May 27, 2026 21:16

gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch 3 times, most recently from 2995cfe to bf40248 Compare May 28, 2026 10:52

gmarciani commented May 28, 2026

View reviewed changes

Comment thread cli/tests/pcluster/templates/test_cluster_stack.py

gmarciani commented May 28, 2026

View reviewed changes

Comment thread tests/integration-tests/tests/update/test_update_race_conditions.py

gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch 2 times, most recently from 6f315f3 to db2c236 Compare June 1, 2026 13:31

gmarciani mentioned this pull request Jun 2, 2026

[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates aws/aws-parallelcluster-cookbook#3188

Merged

gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch from 38f6190 to 5420ed3 Compare June 2, 2026 10:31

gmarciani commented Jun 2, 2026

View reviewed changes

himani2411 reviewed Jun 2, 2026

View reviewed changes

Comment thread cli/tests/pcluster/templates/test_login_nodes_stack.py Outdated

himani2411 approved these changes Jun 3, 2026

View reviewed changes

gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch from 5856c03 to 5f0d2ee Compare June 3, 2026 15:30

gmarciani merged commit 54d910a into aws:develop Jun 3, 2026
19 checks passed

gmarciani deleted the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch June 3, 2026 16:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates#7409

[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates#7409
gmarciani merged 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/uw-login-nodes-0526-1

gmarciani commented May 27, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

gmarciani May 28, 2026

Uh oh!

gmarciani May 28, 2026

Uh oh!

gmarciani May 28, 2026

Uh oh!

himani2411 Jun 2, 2026 •

edited

Loading

Uh oh!

himani2411 Jun 2, 2026 •

edited

Loading

Uh oh!

gmarciani Jun 3, 2026

Uh oh!

gmarciani May 28, 2026

Uh oh!

Uh oh!

gmarciani Jun 2, 2026

Uh oh!

himani2411 Jun 2, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gmarciani commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Notes

Tests

Uh oh!

codecov Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

gmarciani May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gmarciani May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gmarciani May 28, 2026

Choose a reason for hiding this comment

Uh oh!

himani2411 Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

himani2411 Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmarciani Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

gmarciani May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gmarciani Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

himani2411 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gmarciani commented May 27, 2026 •

edited

Loading

codecov Bot commented May 27, 2026 •

edited

Loading

himani2411 Jun 2, 2026 •

edited

Loading

himani2411 Jun 2, 2026 •

edited

Loading