Skip to content

DAOS-19008 control: Erase formatting after failed format --replace#18446

Merged
daltonbohning merged 9 commits into
masterfrom
tanabarr/control-fmtreplace-rank-erase
Jun 26, 2026
Merged

DAOS-19008 control: Erase formatting after failed format --replace#18446
daltonbohning merged 9 commits into
masterfrom
tanabarr/control-fmtreplace-rank-erase

Conversation

@tanabarr

@tanabarr tanabarr commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Prevent a scenario where an engine may rejoin with an unintended new
rank following a failed dmg storage format --replace operation.

If the join-replace fails, the storage remains formatted and the
engine is left in a state where it is treated as a fresh instance and
will join the system with a new rank ID when restarted. Avoid this
inadvertent creation of a new rank by removing the engine's superblock
immediately after join-replace failure. Then the engine will require
an explicit format command before being able to join the system.

Also simplify and deduplicate server unit test setup code

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr requested review from a team as code owners June 5, 2026 14:11
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

Ticket title is 'Aurora daos_user: PMEM Device should Unmount and revert the --replace operation fully if it fails'
Status is 'In Review'
Labels: 'ALCF'
https://daosio.atlassian.net/browse/DAOS-19008

@daosbuild3

Copy link
Copy Markdown
Collaborator

@daosbuild3

Copy link
Copy Markdown
Collaborator

Comment thread src/control/cmd/dmg/storage.go Outdated
cmd.Debugf("Invoking SystemErase to clean up after failed format operation")

eraseReq := &control.SystemEraseReq{}
eraseResp, err := control.SystemErase(ctx, cmd.ctlInvoker, eraseReq)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work... SystemErase doesn't allow you to choose ranks or nodes.

I think you'll need to handle this from the daos_server that owns the engine. If the engine fails to join, and it's a replace operation, blow the storage away. The failure that triggered this request was happening at the join stage.

If the format itself fails, I don't think there's any risk of the engine coming up. If there's a partial failure, it's not a bad idea to clean up, but I think that would have to happen from the server side, too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right , this needs a rework

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch 3 times, most recently from d8cf409 to 6785032 Compare June 8, 2026 20:46
@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/4/testReport/

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18446/4/execution/node/1394/log

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 6785032 to 309615f Compare June 9, 2026 11:34
@daosbuild3

Copy link
Copy Markdown
Collaborator

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 309615f to a593fcb Compare June 9, 2026 13:55
@daosbuild3

Copy link
Copy Markdown
Collaborator

@tanabarr tanabarr changed the title Erase formatting after failed format --replace DAOS-19008 control: Erase formatting after failed format --replace Jun 9, 2026
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch 2 times, most recently from 21e8fb7 to 862f9d5 Compare June 9, 2026 15:57
@daosbuild3

Copy link
Copy Markdown
Collaborator

tanabarr added 3 commits June 10, 2026 12:25
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 862f9d5 to 550d853 Compare June 12, 2026 20:06
}
if len(storageCfg.Tiers) != 0 {
if err := ei.WriteSuperblock(); err != nil {
//storage.FormatControlMetadata([]uint{0}); err != nil {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented code removed in follow-up PR #18127

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in this PR because of repush

@tanabarr tanabarr requested review from kjacque, knard38 and mjmac June 15, 2026 11:15

@mjmac mjmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally fine to me... Only potential blocker is the removal of the guard against allowing an AdminExcluded rank to join. Not sure if that was intentional as a convenience or an oversight, but IMO it should be a requirement for the admin to explicitly unexclude a rank in order to avoid accidents.

t.Error("unexpected unmount call")
}

if !tc.expSBCreated {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test case that sets this to be true, so the remainder of the test is dead code. Maybe copy/paste from another test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the test cases that used it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines -201 to -203
if cm.State == MemberStateAdminExcluded {
return nil, ErrJoinAdminExcluded(cm.UUID, cm.Rank)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this intentional? An admin-excluded rank should have to be explicitly admin-unexcluded before it can rejoin, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only allowing during dmg storage format --replace (in joinReplace()), this change is the missing piece from https://daosio.atlassian.net/browse/DAOS-18472 which unfortunately only contained a partial fix because I had the rest (the bit you are referring to) in a working directory which didn't get committed! The change will not affect the normal join case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I read over the various tickets. I understand the intent behind this change, which is to interpret the admin's action of running a storage format --replace as an implicit clearing of the AdminExcluded state on a rank. I agree that usually this will be correct, and that it seems like a nice convenience.

However, I am concerned that this policy change weakens AdminExcluded from being a state with clearly-defined entry and exit criteria to being a bit more wishy-washy. It sets a precedent that the state can be automatically cleared based on criteria that may not be obvious without reading code, and in my mind this erodes the value of having an AdminExcluded state in addition to the programmatically-toggled Excluded state.

The correct response here, IMO, is not to weaken AdminExcluded in order to work around an admin error or UX deficiency, it's to figure out how to make the correct order of operations (admin excludes rank, repairs rank, unexcludes rank, rejoins rank) clearer to the admin.

If there is consensus that there is not strong value in having an explicit AdminExcluded state, then IMO it would be better to just remove it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree that this weakens the position of the state but given I'm unlikely to get a consensus I will revert this change. You are correct that this was a workaround for admin error scenarios but in my mind it strengthens the UX rather than weakening it. Check replaced.

Comment thread src/control/server/instance_storage.go Outdated
storageProv := ei.GetStorage()

// Get SCM config to access mount point and class
scmCfg, err := storageProv.GetScmConfig()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scmCfg isn't really used for anything here, as far as I can tell. GetScmConfig() will return a non-nil error if there's no SCM config (or at least it should), so checking for a nil scmCfg seems like an anti pattern.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cruft removed from stale implementation

"github.com/daos-stack/daos/src/control/lib/ranklist"
"github.com/daos-stack/daos/src/control/logging"
"github.com/daos-stack/daos/src/control/provider/system"
sysprov "github.com/daos-stack/daos/src/control/provider/system"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate import of the provider/system package.

@tanabarr tanabarr Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed "system" import because of ambiguity with the package of the same name

tanabarr added 2 commits June 16, 2026 14:07
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…treplace-rank-erase

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested a review from mjmac June 16, 2026 15:07
mjmac
mjmac previously approved these changes Jun 16, 2026

@mjmac mjmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm open to the idea of having a discussion about implicit clear of AdminExcluded, or the removal of the state entirely if we don't think it's actually useful. But I think that should be a separate conversation and patch rather than sneaking it in as part of something else.

@tanabarr tanabarr self-assigned this Jun 17, 2026
@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Jun 17, 2026

@knard38 knard38 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the note of the section "Storage Format Replace" of the file docs/admin/administration.md is not correct.
According to the discussion with @mjmac, the note at line 1154 says --replace can be used on AdminExcluded ranks — but checking membership.go:201 the ErrJoinAdminExcluded guard is still there and this PR doesn't change it, so that note appears to be incorrect.

Comment thread src/control/server/instance_storage.go Outdated
Comment thread src/control/server/util_test.go Outdated
Comment thread src/control/server/instance_storage_test.go
tanabarr added 2 commits June 18, 2026 12:35
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested a review from a team as a code owner June 18, 2026 12:55
…treplace-rank-erase

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from knard38 and mjmac June 18, 2026 13:01
@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/14/testReport/

@tanabarr tanabarr requested a review from a team June 23, 2026 09:58
@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jun 23, 2026
@tanabarr

Copy link
Copy Markdown
Contributor Author

CI failed on known issue dfuse test_dfuse_daos_build_wb failure

@daltonbohning

Copy link
Copy Markdown
Contributor

CI failed on known issue dfuse test_dfuse_daos_build_wb failure

This test should already be fixed in master, including in the commit that this PR is based on. It looks to have failed with a timeout on curl. It might just be network flakiness, but can you please re-run this to make sure this PR is not introducing a new issue?

@daltonbohning daltonbohning removed the request for review from a team June 23, 2026 14:22
@tanabarr

tanabarr commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

CI failed on known issue dfuse test_dfuse_daos_build_wb failure

This test should already be fixed in master, including in the commit that this PR is based on. It looks to have failed with a timeout on curl. It might just be network flakiness, but can you please re-run this to make sure this PR is not introducing a new issue?

https://jenkins.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-18446/15

fyi, same test is also failing here: https://jenkins.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-18486/8/tests

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/15/testReport/

@tanabarr tanabarr requested a review from a team June 26, 2026 09:46
@daltonbohning daltonbohning removed the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jun 26, 2026
@daltonbohning daltonbohning merged commit 42cabc9 into master Jun 26, 2026
47 checks passed
@daltonbohning daltonbohning deleted the tanabarr/control-fmtreplace-rank-erase branch June 26, 2026 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

control-plane work on the management infrastructure of the DAOS Control Plane

Development

Successfully merging this pull request may close these issues.

6 participants