Skip to content

[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates#7409

Merged
gmarciani merged 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/uw-login-nodes-0526-1
Jun 3, 2026
Merged

[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates#7409
gmarciani merged 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/uw-login-nodes-0526-1

Conversation

@gmarciani
Copy link
Copy Markdown
Contributor

@gmarciani gmarciani commented May 27, 2026

This PR depends on aws/aws-parallelcluster-cookbook#3188

Description of changes

Make login node use head-node-driven orchestration for the update workflow.

With this change, login nodes do not depend on cfn-hup/cfn-init anymore and mirror the same update mechanism already adopted for compute nodes. With this change we expect the update workflow to be more resilient.

The changes in this PR mirrors the changes made for compute fleet in the following PRs:

  1. Replace cfn-hup on compute nodes with systemd timers to signal updates aws-parallelcluster-cookbook#3070: this is the core change that removes cfn-hup
  2. [Scaling] Remove usage of cfn-init in Compute Fleet aws-parallelcluster-cookbook#2875: change in cookbook to remove cfn-init, required by change pasted help documentation into the readme #1
  3. [Scaling] Removing usage of cfn-init for compute fleet #6655: change in cli to remove cfn-init, required by change pasted help documentation into the readme #1
  4. Remove cfn-hup.log from computefleet cloudwatch agent config aws-parallelcluster-cookbook#3093: minor change related to logs

In particular:

  1. files that used ot be written by cfn-init arer now written by user datab directives, such as dna.json.
  2. parameters and permissions required by cfn-init are removed from the login nodes launch template
  3. login nodes launch templates identifications are included in launch-templates-config.json
  4. the validation of the chef attribute in_place_update_on_fleet_enabled is removed as the attribute is not used anymore (see [UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates aws-parallelcluster-cookbook#3188)

Notes

  1. See comments "Note for the reviewer" for specific explanations about non obvious changes.

Tests

  • SUCCESS
test-suites:
  update:
    test_update.py::test_update_slurm:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
    test_update.py::test_dynamic_file_systems_update:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update.py::test_dynamic_file_systems_update_rollback:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update.py::test_login_nodes_update:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update_rollback_failure.py::test_update_rollback_failure:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm
    test_update_race_conditions.py::test_update_race_conditions:
      dimensions:
        - instances:
            - c5.xlarge
          oss:
            - alinux2023
          regions:
            - us-east-1
          schedulers:
            - slurm

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani marked this pull request as ready for review May 27, 2026 15:45
@gmarciani gmarciani requested review from a team as code owners May 27, 2026 15:45
@gmarciani gmarciani changed the title [UpdateWorkflow] Make login node use head-node-driven orchestration f… [UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates May 27, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch from fcab2e3 to f8b5484 Compare May 27, 2026 21:16
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.93%. Comparing base (6ea6edf) to head (5420ed3).
⚠️ Report is 11 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7409      +/-   ##
===========================================
- Coverage    90.09%   89.93%   -0.16%     
===========================================
  Files          183      180       -3     
  Lines        16749    16177     -572     
===========================================
- Hits         15090    14549     -541     
+ Misses        1659     1628      -31     
Flag Coverage Δ
unittests 89.93% <100.00%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch 3 times, most recently from 2995cfe to bf40248 Compare May 28, 2026 10:52
Comment thread cli/tests/pcluster/templates/test_cluster_stack.py
ds_config = self._config.directory_service
ds_generate_keys = str(ds_config.generate_ssh_keys_for_users).lower() if ds_config else "false"

if self._pool.instance_profile:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: we are removing this profile because it was declared just to be consumed by cfn-init. Since we removed cfn-init, there is no reason to keep it

indent=None, # Keep indent as None for compact sizing and proper parsing in user_data.sh
)

cfn_init = {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: we can entirely remove this because we removed dependency to cfn-init.

# CloudFormation dependency. Instead, we emit deterministic identifiers (LaunchTemplateName
# and LogicalId) computed from the pool name; share_dna.py resolves the launch template
# at runtime via DescribeLaunchTemplates.
if self.config.login_nodes and self.config.login_nodes.pools:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: we need this basic info about login nodes because the script share_dna.py in the cookbook (see aws/aws-parallelcluster-cookbook#3188) needs it to retrieve the launch template user data and extract from it the dna.json body.

Copy link
Copy Markdown
Contributor

@himani2411 himani2411 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: Can you update the comment to mention CloudFormation deadlock or CloudFormation circular dependency rather than CloudFormation cycle? Ty!

Copy link
Copy Markdown
Contributor

@himani2411 himani2411 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that we do not want to add the deadlock; can you add the code snippet of where this deadlock happens? Why is the deadlock happening for LoginNode Stack and not for ComputeFleet?

If I am not wrong its because of https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/templates/cluster_stack.py#L520-L521. Do we know why this is needed for LN and not ComputeFleet; from my understanding we need this dependency because of head_eni=self._head_eni, which is also needed in ComputeFleet stack, so why an explicit dependency is added only for LN Stack?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both compute and login nodes, when started, require the head node to be running.
The difference between login nodes and compute nodes is when they are launched and this difference motivates the different dependency.

  • Login nodes are launched as soon as the login nodes ASG is created. So when the ASG is created, the head node must be running. As such, the actual dependency is: ASG depends on Head Node. However, since ASG lives in nested stack, you cannot establish such cross stack dependencies (there may be hacky tricks, but definitely out of scope for this PR). The result is that the entire login nodes nested stack must depend on the head node.
  • Compute nodes are launched by the head node. The compute fleet nested stack creates the launch templates, that are then used by the head node to launch the compute nodes. So, in this case, the head node depends on the compute fleet stack.

output:
all: "| tee -a /var/log/cloud-init-output.log | logger -t user-data -s 2>/dev/ttyS0"
write_files:
- path: /tmp/dna.json
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: the insertion in this file are all coming from what cfn-=init used to do. Now that we removed cfn-init, we need the user data directive/body to do that.

Comment thread tests/integration-tests/tests/update/test_update_race_conditions.py
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch 2 times, most recently from 6f315f3 to db2c236 Compare June 1, 2026 13:31
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch from 38f6190 to 5420ed3 Compare June 2, 2026 10:31
Comment thread CHANGELOG.md
- Fix race condition in load balancer lookup by using `tag:GetResources` to resolve load balancer ARNs by tags,
which used to cause cluster update failure on clusters with login nodes.
- Fail `pcluster build-image` early when the downloaded cookbook version does not match the ParallelCluster CLI version.
- Fix login nodes not mounting `/opt/parallelcluster/shared` when EFS is used as the internal shared storage type.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the review: this fix is required by the current change. The fix is shipped in aws/aws-parallelcluster-cookbook#3188

"AutoScalingGroupName": f"{self._login_nodes_stack_id}-AutoScalingGroup",
"LaunchingLifecycleHookName": (
f"{self._login_nodes_stack_id}-LoginNodesLaunchingLifecycleHook"
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note To Self: we removed below from LT because those were needed for Cfn-init command that we were running from user_data.sh; and since we are removing its usage we do not need it in LT.

"LaunchTemplateResourceId": launch_template_id,
                                "CloudFormationUrl": get_service_endpoint("cloudformation", self._config.region),
                                "CfnInitRole": instance_role_name,

Comment thread cli/tests/pcluster/templates/test_login_nodes_stack.py Outdated
…or the update workflow.

With this change, login nodes do not depend on cfn-hup/cfn-init anymore and mirror the same update mechanism already adopted for compute nodes.

With this change we expect the update workflow to be more resilient.

As part of this change we also:
1. Remove chef attribute `in_place_update_on_fleet_enabled` which does not make sense anymore because cfn-hup is now used only by the head node.
2. Clarified the scope of test_update_race_conditions.
3. Increase coverage of user data variables in unit test test_login_nodes_launch_template_properties.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch from 5856c03 to 5f0d2ee Compare June 3, 2026 15:30
@gmarciani gmarciani merged commit 54d910a into aws:develop Jun 3, 2026
19 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3160/uw-login-nodes-0526-1 branch June 3, 2026 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants