[UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates#7409
Conversation
fcab2e3 to
f8b5484
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #7409 +/- ##
===========================================
- Coverage 90.09% 89.93% -0.16%
===========================================
Files 183 180 -3
Lines 16749 16177 -572
===========================================
- Hits 15090 14549 -541
+ Misses 1659 1628 -31
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
2995cfe to
bf40248
Compare
| ds_config = self._config.directory_service | ||
| ds_generate_keys = str(ds_config.generate_ssh_keys_for_users).lower() if ds_config else "false" | ||
|
|
||
| if self._pool.instance_profile: |
There was a problem hiding this comment.
Note for the reviewer: we are removing this profile because it was declared just to be consumed by cfn-init. Since we removed cfn-init, there is no reason to keep it
| indent=None, # Keep indent as None for compact sizing and proper parsing in user_data.sh | ||
| ) | ||
|
|
||
| cfn_init = { |
There was a problem hiding this comment.
Note for the reviewer: we can entirely remove this because we removed dependency to cfn-init.
| # CloudFormation dependency. Instead, we emit deterministic identifiers (LaunchTemplateName | ||
| # and LogicalId) computed from the pool name; share_dna.py resolves the launch template | ||
| # at runtime via DescribeLaunchTemplates. | ||
| if self.config.login_nodes and self.config.login_nodes.pools: |
There was a problem hiding this comment.
Note for the reviewer: we need this basic info about login nodes because the script share_dna.py in the cookbook (see aws/aws-parallelcluster-cookbook#3188) needs it to retrieve the launch template user data and extract from it the dna.json body.
There was a problem hiding this comment.
Non-blocking: Can you update the comment to mention CloudFormation deadlock or CloudFormation circular dependency rather than CloudFormation cycle? Ty!
There was a problem hiding this comment.
I understand that we do not want to add the deadlock; can you add the code snippet of where this deadlock happens? Why is the deadlock happening for LoginNode Stack and not for ComputeFleet?
If I am not wrong its because of https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/templates/cluster_stack.py#L520-L521. Do we know why this is needed for LN and not ComputeFleet; from my understanding we need this dependency because of head_eni=self._head_eni, which is also needed in ComputeFleet stack, so why an explicit dependency is added only for LN Stack?
There was a problem hiding this comment.
Both compute and login nodes, when started, require the head node to be running.
The difference between login nodes and compute nodes is when they are launched and this difference motivates the different dependency.
- Login nodes are launched as soon as the login nodes ASG is created. So when the ASG is created, the head node must be running. As such, the actual dependency is: ASG depends on Head Node. However, since ASG lives in nested stack, you cannot establish such cross stack dependencies (there may be hacky tricks, but definitely out of scope for this PR). The result is that the entire login nodes nested stack must depend on the head node.
- Compute nodes are launched by the head node. The compute fleet nested stack creates the launch templates, that are then used by the head node to launch the compute nodes. So, in this case, the head node depends on the compute fleet stack.
| output: | ||
| all: "| tee -a /var/log/cloud-init-output.log | logger -t user-data -s 2>/dev/ttyS0" | ||
| write_files: | ||
| - path: /tmp/dna.json |
There was a problem hiding this comment.
Note for the reviewer: the insertion in this file are all coming from what cfn-=init used to do. Now that we removed cfn-init, we need the user data directive/body to do that.
6f315f3 to
db2c236
Compare
38f6190 to
5420ed3
Compare
| - Fix race condition in load balancer lookup by using `tag:GetResources` to resolve load balancer ARNs by tags, | ||
| which used to cause cluster update failure on clusters with login nodes. | ||
| - Fail `pcluster build-image` early when the downloaded cookbook version does not match the ParallelCluster CLI version. | ||
| - Fix login nodes not mounting `/opt/parallelcluster/shared` when EFS is used as the internal shared storage type. |
There was a problem hiding this comment.
Note for the review: this fix is required by the current change. The fix is shipped in aws/aws-parallelcluster-cookbook#3188
| "AutoScalingGroupName": f"{self._login_nodes_stack_id}-AutoScalingGroup", | ||
| "LaunchingLifecycleHookName": ( | ||
| f"{self._login_nodes_stack_id}-LoginNodesLaunchingLifecycleHook" | ||
| ), |
There was a problem hiding this comment.
Note To Self: we removed below from LT because those were needed for Cfn-init command that we were running from user_data.sh; and since we are removing its usage we do not need it in LT.
"LaunchTemplateResourceId": launch_template_id,
"CloudFormationUrl": get_service_endpoint("cloudformation", self._config.region),
"CfnInitRole": instance_role_name,
…or the update workflow. With this change, login nodes do not depend on cfn-hup/cfn-init anymore and mirror the same update mechanism already adopted for compute nodes. With this change we expect the update workflow to be more resilient. As part of this change we also: 1. Remove chef attribute `in_place_update_on_fleet_enabled` which does not make sense anymore because cfn-hup is now used only by the head node. 2. Clarified the scope of test_update_race_conditions. 3. Increase coverage of user data variables in unit test test_login_nodes_launch_template_properties.
5856c03 to
5f0d2ee
Compare
This PR depends on aws/aws-parallelcluster-cookbook#3188
Description of changes
Make login node use head-node-driven orchestration for the update workflow.
With this change, login nodes do not depend on cfn-hup/cfn-init anymore and mirror the same update mechanism already adopted for compute nodes. With this change we expect the update workflow to be more resilient.
The changes in this PR mirrors the changes made for compute fleet in the following PRs:
In particular:
launch-templates-config.jsonin_place_update_on_fleet_enabledis removed as the attribute is not used anymore (see [UpdateWorkflow] Make login node use head-node-driven orchestration for cluster updates aws-parallelcluster-cookbook#3188)Notes
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.