[integ-tests] Fix assert_no_errors_in_logs to skip ICE-caused "setting nodes to DOWN" errors #7423

Merged
hanwen-cluster merged 1 commit into aws:develop from hanwen-cluster:testworkflowdevelop
Jun 5, 2026
@hanwen-cluster
Contributor

Description of changes

When skip_ice=True, assert_no_errors_in_logs filtered log lines by matching EC2 capacity message fragments (e.g. "InsufficientInstanceCapacity") against each line. This caught the root-cause ERRORs, but missed the downstream ERROR:

  ERROR - Failed to launch following nodes, setting nodes to DOWN: [...]

That line carries no capacity keyword or error code, so it was never filtered. The EC2 error code appears only on the companion INFO line written by _handle_failed_nodes ("(Code:InsufficientInstanceCapacity)Failure when resuming nodes..."). As a result, tests that call assert_no_errors_in_logs with skip_ice=True could still fail intermittently whenever an insufficient-capacity error (ICE) struck on the dynamic-node resume path.
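The gap can be reproduced with a minimal sketch of per-line fragment matching. This is not the test suite's actual code: `ICE_FRAGMENTS`, `filter_by_fragment`, and the first sample log line are hypothetical stand-ins, and only the second log line is quoted from the description above.

```python
# Hypothetical fragment list; the description names only this one fragment.
ICE_FRAGMENTS = ["InsufficientInstanceCapacity"]


def filter_by_fragment(error_lines, fragments=ICE_FRAGMENTS):
    """Keep the error lines that contain none of the capacity fragments."""
    return [
        line
        for line in error_lines
        if not any(fragment in line for fragment in fragments)
    ]


log_lines = [
    # Illustrative root-cause line (format assumed, not taken from real logs).
    "ERROR - illustrative root cause mentioning InsufficientInstanceCapacity",
    # Downstream line as quoted in the description; it carries no fragment.
    "ERROR - Failed to launch following nodes, setting nodes to DOWN: [...]",
]

# The root-cause line is dropped, but the downstream line survives the filter
# and would still trip the assertion.
remaining = filter_by_fragment(log_lines)
```

Because the filter looks at one line at a time, it has no way to tell that the surviving line was caused by the capacity error reported elsewhere in the same file.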

This change pairs the untagged "setting nodes to DOWN" ERROR with the "(Code:<error_code>)" reasons found in the same log file. When skip_ice is set, the line is ignored only if at least one code is present and every code in that file is a known ICE code; otherwise the line is kept so genuine non-ICE launch failures still fail the assertion.

The ICE code set mirrors SlurmNode.EC2_ICE_ERROR_CODES from aws-parallelcluster-node plus LimitedInstanceCapacity (used for partial all-or-nothing/best-effort launches).
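The pairing rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `ICE_CODES` contains only the two codes named in this description (the real set mirrors SlurmNode.EC2_ICE_ERROR_CODES and has more entries), and `CODE_PATTERN`, `down_errors_are_ice_only`, `filter_errors`, and `HypotheticalNonIceCode` are hypothetical names.

```python
import re

# Assumption: only the two codes named in the description are listed here.
ICE_CODES = {"InsufficientInstanceCapacity", "LimitedInstanceCapacity"}

# Matches the "(Code:<error_code>)" reason prefix and captures the code.
CODE_PATTERN = re.compile(r"\(Code:([^)]+)\)")
DOWN_MARKER = "setting nodes to DOWN"


def down_errors_are_ice_only(log_text, ice_codes=ICE_CODES):
    """True if the file has at least one code and every code is an ICE code."""
    codes = set(CODE_PATTERN.findall(log_text))
    return bool(codes) and codes <= ice_codes


def filter_errors(log_text, skip_ice):
    """Return ERROR lines, dropping ICE-caused 'setting nodes to DOWN' lines."""
    error_lines = [line for line in log_text.splitlines() if "ERROR" in line]
    if skip_ice and down_errors_are_ice_only(log_text):
        error_lines = [line for line in error_lines if DOWN_MARKER not in line]
    return error_lines


DOWN_LINE = "ERROR - Failed to launch following nodes, setting nodes to DOWN: [...]"
ice_log = (
    "INFO - (Code:InsufficientInstanceCapacity)Failure when resuming nodes...\n"
    + DOWN_LINE
)
mixed_log = "INFO - (Code:HypotheticalNonIceCode)Failure...\n" + ice_log

ice_result = filter_errors(ice_log, skip_ice=True)      # DOWN line ignored
mixed_result = filter_errors(mixed_log, skip_ice=True)  # DOWN line kept
bare_result = filter_errors(DOWN_LINE, skip_ice=True)   # no codes: kept
```

Requiring at least one code and treating any non-ICE code as disqualifying errs on the side of keeping the line, so a genuine launch failure in the same file still fails the assertion.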

Tests

  • Tests will run after PR is merged

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

hanwen-cluster requested review from a team as code owners June 5, 2026 15:47
hanwen-cluster added the skip-changelog-update (disables the check that enforces changelog updates in PRs) and cherry-pick:release-3.15 labels Jun 5, 2026
hanwen-cluster merged commit 87d0fa1 into aws:develop Jun 5, 2026
26 of 27 checks passed
2 participants