Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,10 @@ CHANGELOG
3.16.0
------

**ENHANCEMENTS**
- Improve cluster update resiliency on login nodes by reusing the head-node-driven orchestration already in place on compute nodes,
removing the dependency on cfn-hup and cfn-init.

**CHANGES**
- The validator `ClusterNameValidator` now enforces cluster names to be limited to 40 characters when using `ExternalSlurmdbd`,
consistent with the existing limit for `Database`. This prevents runtime failures caused by MySQL's table name length limit.
Expand All @@ -19,6 +23,7 @@ CHANGELOG
- Fix race condition in load balancer lookup by using `tag:GetResources` to resolve load balancer ARNs by tags,
which used to cause cluster update failure on clusters with login nodes.
- Fail `pcluster build-image` early when the downloaded cookbook version does not match the ParallelCluster CLI version.
- Fix login nodes not mounting `/opt/parallelcluster/shared` when EFS is used as the internal shared storage type.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the review: this fix is required by the current change. The fix is shipped in aws/aws-parallelcluster-cookbook#3188


**DEPRECATIONS**
- Amazon Linux 2 is no longer supported.
Expand Down
15 changes: 13 additions & 2 deletions cli/src/pcluster/resources/login_node/user_data.sh
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,16 @@ datasource_list: [ Ec2, None ]
output:
all: "| tee -a /var/log/cloud-init-output.log | logger -t user-data -s 2>/dev/ttyS0"
write_files:
- path: /tmp/dna.json
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: the insertion in this file are all coming from what cfn-=init used to do. Now that we removed cfn-init, we need the user data directive/body to do that.

permissions: '0644'
owner: root:root
content: |
${DnaJson}
- path: /tmp/extra.json
permissions: '0644'
owner: root:root
content: |
${ExtraJson}
- path: /tmp/bootstrap.sh
permissions: '0744'
owner: root:root
Expand Down Expand Up @@ -78,8 +88,6 @@ write_files:

[ -f /etc/profile.d/proxy.sh ] && . /etc/profile.d/proxy.sh

$CFN_BOOTSTRAP_VIRTUALENV_PATH/cfn-init -s ${AWS::StackName} -v -c deployFiles -r ${LaunchTemplateResourceId} --region ${AWS::Region} --url ${CloudFormationUrl} --role ${CfnInitRole} || error_exit 'Failed to bootstrap the login node. Please check /var/log/cfn-init.log in the login node or in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.'

# Configure AWS CLI using the expected overrides, if any.
[ -f /etc/profile.d/aws-cli-default-config.sh ] && . /etc/profile.d/aws-cli-default-config.sh

Expand Down Expand Up @@ -129,7 +137,10 @@ write_files:
vendor_cookbook
fi
cd /tmp
mkdir -p /etc/chef/ohai/hints
touch /etc/chef/ohai/hints/ec2.json

jq -s ".[0] * .[1]" /tmp/dna.json /tmp/extra.json > /etc/chef/dna.json || ( echo "jq not installed"; cp /tmp/dna.json /etc/chef/dna.json )
{
CINC_CMD="cinc-client --local-mode --config /etc/chef/client.rb --log_level info --logfile /var/log/chef-client.log --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json"
pushd /etc/chef &&
Expand Down
16 changes: 16 additions & 0 deletions cli/src/pcluster/templates/cluster_stack.py
Original file line number Diff line number Diff line change
Expand Up @@ -1592,6 +1592,22 @@ def _get_launch_templates_config(self):
}
}

# LoginPools record Name (for DescribeLaunchTemplateVersions lookup) and LogicalId
# (filename suffix for the shared dna.json, must match the login node's own
# launch_template_id). Id/Version are omitted to avoid a CloudFormation circular dependency:
# head node LT metadata -> LoginNodes nested stack -> head node.
if self.config.login_nodes and self.config.login_nodes.pools:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: we need this basic info about login nodes because the script share_dna.py in the cookbook (see aws/aws-parallelcluster-cookbook#3188) needs it to retrieve the launch template user data and extract from it the dna.json body.

Copy link
Copy Markdown
Contributor

@himani2411 himani2411 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: Can you update the comment to mention CloudFormation deadlock or CloudFormation circular dependency rather than CloudFormation cycle? Ty!

Copy link
Copy Markdown
Contributor

@himani2411 himani2411 Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that we do not want to add the deadlock; can you add the code snippet of where this deadlock happens? Why is the deadlock happening for LoginNode Stack and not for ComputeFleet?

If I am not wrong its because of https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/templates/cluster_stack.py#L520-L521. Do we know why this is needed for LN and not ComputeFleet; from my understanding we need this dependency because of head_eni=self._head_eni, which is also needed in ComputeFleet stack, so why an explicit dependency is added only for LN Stack?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both compute and login nodes, when started, require the head node to be running.
The difference between login nodes and compute nodes is when they are launched and this difference motivates the different dependency.

  • Login nodes are launched as soon as the login nodes ASG is created. So when the ASG is created, the head node must be running. As such, the actual dependency is: ASG depends on Head Node. However, since ASG lives in nested stack, you cannot establish such cross stack dependencies (there may be hacky tricks, but definitely out of scope for this PR). The result is that the entire login nodes nested stack must depend on the head node.
  • Compute nodes are launched by the head node. The compute fleet nested stack creates the launch templates, that are then used by the head node to launch the compute nodes. So, in this case, the head node depends on the compute fleet stack.

lt_config["LoginPools"] = {
pool.name: {
"LaunchTemplate": {
"Name": f"{self._stack_name}-{pool.name}",
"Version": "$Latest",
"LogicalId": f"LoginNodeLaunchTemplate{create_hash_suffix(pool.name)}",
}
}
for pool in self.config.login_nodes.pools
}

return lt_config

# -- Conditions -------------------------------------------------------------------------------------------------- #
Expand Down
187 changes: 57 additions & 130 deletions cli/src/pcluster/templates/login_nodes_stack.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,8 +36,6 @@
from pcluster.utils import (
get_attr,
get_http_tokens_setting,
get_resource_name_from_resource_arn,
get_service_endpoint,
is_feature_supported,
)

Expand Down Expand Up @@ -129,79 +127,7 @@ def _add_login_nodes_pool_launch_template(self):
ds_config = self._config.directory_service
ds_generate_keys = str(ds_config.generate_ssh_keys_for_users).lower() if ds_config else "false"

if self._pool.instance_profile:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: we are removing this profile because it was declared just to be consumed by cfn-init. Since we removed cfn-init, there is no reason to keep it

instance_profile_name = get_resource_name_from_resource_arn(self._pool.instance_profile)
instance_role_name = (
AWSApi.instance()
.iam.get_instance_profile(instance_profile_name)
.get("InstanceProfile")
.get("Roles")[0]
.get("RoleName")
)
elif self._pool.instance_role:
instance_role_name = get_resource_name_from_resource_arn(self._pool.instance_role)
else:
instance_role_name = self._instance_role.ref

launch_template_id = f"LoginNodeLaunchTemplate{create_hash_suffix(self._pool.name)}"
launch_template = ec2.CfnLaunchTemplate(
Stack.of(self),
launch_template_id,
launch_template_name=f"{self.stack_name}-{self._pool.name}",
launch_template_data=ec2.CfnLaunchTemplate.LaunchTemplateDataProperty(
block_device_mappings=self._launch_template_builder.get_block_device_mappings(
self._pool.local_storage.root_volume,
AWSApi.instance().ec2.describe_image(self._config.login_nodes_ami[self._pool.name]).device_name,
),
image_id=self._config.login_nodes_ami[self._pool.name],
instance_type=self._pool.instance_type,
metadata_options=ec2.CfnLaunchTemplate.MetadataOptionsProperty(
http_tokens=get_http_tokens_setting(self._config.imds.imds_support)
),
iam_instance_profile=ec2.CfnLaunchTemplate.IamInstanceProfileProperty(name=self._instance_profile),
user_data=Fn.base64(
Fn.sub(
get_user_data_content("../resources/login_node/user_data.sh"),
{
**{
"Timeout": str(
get_attr(
self._config,
"dev_settings.timeouts.compute_node_bootstrap_timeout",
NODE_BOOTSTRAP_TIMEOUT,
)
),
"AutoScalingGroupName": f"{self._login_nodes_stack_id}-AutoScalingGroup",
"LaunchingLifecycleHookName": (
f"{self._login_nodes_stack_id}-LoginNodesLaunchingLifecycleHook"
),
"LaunchTemplateResourceId": launch_template_id,
"CloudFormationUrl": get_service_endpoint("cloudformation", self._config.region),
"CfnInitRole": instance_role_name,
},
**get_common_user_data_env(self._pool, self._config),
},
)
),
network_interfaces=login_nodes_pool_lt_nw_interface,
tag_specifications=[
ec2.CfnLaunchTemplate.TagSpecificationProperty(
resource_type="instance",
tags=get_default_instance_tags(
self.stack_name, self._config, self._pool, "LoginNode", self._shared_storage_infos
)
+ [CfnTag(key=PCLUSTER_LOGIN_NODES_POOL_NAME_TAG, value=self._pool.name)]
+ get_custom_tags(self._config),
),
ec2.CfnLaunchTemplate.TagSpecificationProperty(
resource_type="volume",
tags=get_default_volume_tags(self.stack_name, "LoginNode")
+ [CfnTag(key=PCLUSTER_LOGIN_NODES_POOL_NAME_TAG, value=self._pool.name)]
+ get_custom_tags(self._config),
),
],
),
)

dna_json = json.dumps(
{
Expand Down Expand Up @@ -299,65 +225,66 @@ def _add_login_nodes_pool_launch_template(self):
"launch_template_id": launch_template_id,
}
},
indent=4,
indent=None, # Keep indent as None for compact sizing and proper parsing in user_data.sh
)

cfn_init = {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the reviewer: we can entirely remove this because we removed dependency to cfn-init.

"configSets": {
"deployFiles": ["deployConfigFiles"],
"update": ["deployConfigFiles", "chefUpdate"],
},
"deployConfigFiles": {
"files": {
# A nosec comment is appended to the following line in order to disable the B108 check.
# The file is needed by the product
# [B108:hardcoded_tmp_directory] Probable insecure usage of temp file/directory.
"/tmp/dna.json": { # nosec B108
"content": dna_json,
"mode": "000644",
"owner": "root",
"group": "root",
"encoding": "plain",
},
# A nosec comment is appended to the following line in order to disable the B108 check.
# The file is needed by the product
# [B108:hardcoded_tmp_directory] Probable insecure usage of temp file/directory.
"/tmp/extra.json": { # nosec B108
"mode": "000644",
"owner": "root",
"group": "root",
"content": self._config.extra_chef_attributes,
},
},
"commands": {
"mkdir": {"command": "mkdir -p /etc/chef/ohai/hints"},
"touch": {"command": "touch /etc/chef/ohai/hints/ec2.json"},
"jq": {
"command": (
'jq -s ".[0] * .[1]" /tmp/dna.json /tmp/extra.json > /etc/chef/dna.json '
'|| ( echo "jq not installed"; cp /tmp/dna.json /etc/chef/dna.json )'
launch_template = ec2.CfnLaunchTemplate(
Stack.of(self),
launch_template_id,
launch_template_name=f"{self.stack_name}-{self._pool.name}",
launch_template_data=ec2.CfnLaunchTemplate.LaunchTemplateDataProperty(
block_device_mappings=self._launch_template_builder.get_block_device_mappings(
self._pool.local_storage.root_volume,
AWSApi.instance().ec2.describe_image(self._config.login_nodes_ami[self._pool.name]).device_name,
),
image_id=self._config.login_nodes_ami[self._pool.name],
instance_type=self._pool.instance_type,
metadata_options=ec2.CfnLaunchTemplate.MetadataOptionsProperty(
http_tokens=get_http_tokens_setting(self._config.imds.imds_support)
),
iam_instance_profile=ec2.CfnLaunchTemplate.IamInstanceProfileProperty(name=self._instance_profile),
user_data=Fn.base64(
Fn.sub(
get_user_data_content("../resources/login_node/user_data.sh"),
{
**{
"Timeout": str(
get_attr(
self._config,
"dev_settings.timeouts.compute_node_bootstrap_timeout",
NODE_BOOTSTRAP_TIMEOUT,
)
),
"AutoScalingGroupName": f"{self._login_nodes_stack_id}-AutoScalingGroup",
"LaunchingLifecycleHookName": (
f"{self._login_nodes_stack_id}-LoginNodesLaunchingLifecycleHook"
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note To Self: we removed below from LT because those were needed for Cfn-init command that we were running from user_data.sh; and since we are removing its usage we do not need it in LT.

"LaunchTemplateResourceId": launch_template_id,
                                "CloudFormationUrl": get_service_endpoint("cloudformation", self._config.region),
                                "CfnInitRole": instance_role_name,

"DnaJson": dna_json,
"ExtraJson": self._config.extra_chef_attributes,
},
**get_common_user_data_env(self._pool, self._config),
},
)
),
network_interfaces=login_nodes_pool_lt_nw_interface,
tag_specifications=[
ec2.CfnLaunchTemplate.TagSpecificationProperty(
resource_type="instance",
tags=get_default_instance_tags(
self.stack_name, self._config, self._pool, "LoginNode", self._shared_storage_infos
)
},
},
},
"chefUpdate": {
"commands": {
"chef": {
"command": (
". /etc/parallelcluster/pcluster_cookbook_environment.sh; "
"cinc-client --local-mode --config /etc/chef/client.rb --log_level info"
" --logfile /var/log/chef-client.log --force-formatter --no-color"
" --chef-zero-port 8889 --json-attributes /etc/chef/dna.json"
" --override-runlist aws-parallelcluster-entrypoints::update &&"
" /opt/parallelcluster/scripts/fetch_and_run -postupdate"
),
"cwd": "/etc/chef",
}
}
},
}

launch_template.add_metadata("AWS::CloudFormation::Init", cfn_init)
+ [CfnTag(key=PCLUSTER_LOGIN_NODES_POOL_NAME_TAG, value=self._pool.name)]
+ get_custom_tags(self._config),
),
ec2.CfnLaunchTemplate.TagSpecificationProperty(
resource_type="volume",
tags=get_default_volume_tags(self.stack_name, "LoginNode")
+ [CfnTag(key=PCLUSTER_LOGIN_NODES_POOL_NAME_TAG, value=self._pool.name)]
+ get_custom_tags(self._config),
),
],
),
)

return launch_template

Expand Down
31 changes: 0 additions & 31 deletions cli/src/pcluster/validators/dev_settings_validators.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,6 @@
from pcluster.validators.utils import dig, is_boolean_string, str_to_bool

EXTRA_CHEF_ATTRIBUTES_PATH = "DevSettings/Cookbook/ExtraChefAttributes"
ATTR_IN_PLACE_UPDATE_ON_FLEET_ENABLED = "in_place_update_on_fleet_enabled"
ATTR_RECONFIGURE_TIMEOUT = "cluster.slurm.reconfigure_timeout"
ATTR_CLUSTER_READINESS_CHECK_ENABLED = "cluster.cluster_readiness_check_enabled"
ATTR_CLUSTER_READINESS_CHECK_IGNORE_FAILURE = "cluster.cluster_readiness_check_ignore_failure"
Expand All @@ -35,40 +34,10 @@ def _validate(self, extra_chef_attributes: str = None):
return

attrs = json.loads(extra_chef_attributes)
self._validate_in_place_update_on_fleet_enabled(attrs)
self._validate_slurm_reconfigure_timeout(attrs)
self._validate_cluster_readiness_check_enabled(attrs)
self._validate_cluster_readiness_check_ignore_failure(attrs)

def _validate_in_place_update_on_fleet_enabled(self, extra_chef_attributes: dict = None):
"""Validate attribute cluster.in_place_update_on_fleet_enabled.

It returns an error if the attribute is set to a non-boolean value.
It returns a warning if the in-place update is disabled.

Args:
extra_chef_attributes: Dictionary of Chef attributes to validate.
"""
in_place_update_on_fleet_enabled = dig(extra_chef_attributes, "cluster", ATTR_IN_PLACE_UPDATE_ON_FLEET_ENABLED)

if in_place_update_on_fleet_enabled is None:
return

if not is_boolean_string(str(in_place_update_on_fleet_enabled)):
self._add_failure(
f"Invalid value in {EXTRA_CHEF_ATTRIBUTES_PATH}: "
f"attribute '{ATTR_IN_PLACE_UPDATE_ON_FLEET_ENABLED}' must be a boolean value.",
FailureLevel.ERROR,
)
return

if str_to_bool(str(in_place_update_on_fleet_enabled)) is False:
self._add_failure(
"When in-place updates are disabled, cluster updates are applied "
"by replacing compute and login nodes according to the selected QueueUpdateStrategy.",
FailureLevel.WARNING,
)

def _validate_cluster_readiness_check_enabled(self, extra_chef_attributes: dict = None):
"""Validate attribute cluster.cluster_readiness_check_enabled.

Expand Down
Loading
Loading