[Bug]: wf_tf_deploy force-unlock fail-hook aborts on SOPS-encrypted secrets #385

@sterling-bambee

Description

Prior Search

  • I have already searched this project's issues to determine if a bug report has already been made.

What happened?

When wf_tf_deploy fails in a way that leaves a Terraform state lock in DynamoDB, the force-unlock fail-hook cannot release it. The hook dies during pf config get because SOPS decryption of *.secrets.yaml fails while looking up an AWS profile that exists only on developer laptops.

force-unlock.sh Step 3 calls pf config get --directory $TF_APPLY_DIR, which makes terragrunt evaluate locals, which in turn pulls in SOPS-encrypted YAML. SOPS reads sops.aws_profile from the file's metadata header (set by whichever developer last encrypted the file, e.g. production-superuser) and looks for that profile in the runner's $AWS_CONFIG_FILE, which contains only [profile ci]. SOPS aborts before it ever talks to KMS. The KMS ARNs in the error are a red herring; the actionable line is "could not load AWS config: failed to get shared config | profile, <name>".

deploy.sh solves this with pf wf sops-set-profile . ci at Step 5, rewriting sops.aws_profile in every *.secrets.yaml to ci before any terragrunt evaluation. The equivalent step is missing from force-unlock.sh.

This is a regression from dcfa7211 (the bash→TS CLI refactor). Pre-refactor, force-unlock.sh Step 3 called pf-get-terragrunt-variables, which did not trigger SOPS evaluation. Its replacement, pf config get, does. deploy.sh already had the sops-set-profile step pre-refactor (renamed in place from pf-sops-set-profile); force-unlock.sh never needed one until the refactor, and was not given one afterwards.

The fix is one line: insert pf wf sops-set-profile . ci between Step 2 (the AWS_CONFIG_FILE write) and Step 3 (pf config get).
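
A minimal sketch of the proposed ordering, not the actual contents of force-unlock.sh. The function name prepare_unlock is hypothetical and the Step 2 body is omitted. Only the two pf invocations are taken from this report.

```shell
# Hypothetical wrapper illustrating the order of operations in force-unlock.sh.
prepare_unlock() {
  # Step 2 (existing): write AWS_CONFIG_FILE with [profile ci]. Omitted here.

  # Proposed new step: point every *.secrets.yaml at the ci profile
  # before anything causes terragrunt to evaluate locals.
  pf wf sops-set-profile . ci || return 1

  # Step 3 (existing): this triggers SOPS decryption via terragrunt.
  pf config get --directory "$TF_APPLY_DIR"
}
```

Returning early when sops-set-profile fails keeps Step 3 from running against files that still name a developer profile.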

Steps to Reproduce

  1. Set up wf_tf_deploy in a consumer repo whose *.secrets.yaml files were last encrypted by a developer (so sops.aws_profile in the file metadata is something other than ci, e.g. production-superuser).
  2. Submit a workflow that will fail in a way that leaves a TF state lock, e.g. a terragrunt apply --all whose underlying module is broken or where Vault is unreachable mid-apply.
  3. Observe the force-unlock fail-hook firing automatically.
  4. Watch it abort during pf config get with a SOPS / KMS error, leaving the lock in DynamoDB.
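
To confirm the precondition in step 1, something like the following could list which files name a non-ci profile. This is a rough sketch: the function name list_foreign_sops_profiles is hypothetical, and it assumes aws_profile appears as a plain "aws_profile: <name>" line in each file's SOPS metadata.

```shell
# List *.secrets.yaml files whose SOPS metadata names an AWS profile
# other than the expected one (default: ci).
list_foreign_sops_profiles() {
  root="${1:-.}"
  expected="${2:-ci}"
  find "$root" -type f -name '*.secrets.yaml' | while IFS= read -r file; do
    # Extract the value of every aws_profile key, stripping any quotes.
    sed -n 's/^[[:space:]-]*aws_profile:[[:space:]]*//p' "$file" |
      tr -d "\"'" |
      while IFS= read -r profile; do
        # Empty values mean no profile is pinned, so skip them.
        if [ -n "$profile" ] && [ "$profile" != "$expected" ]; then
          printf '%s: %s\n' "$file" "$profile"
        fi
      done
  done
}
```

Running list_foreign_sops_profiles . ci from the repo root prints one "path: profile" line per mismatch and prints nothing when every file already uses ci.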

Relevant log output

arn:aws:kms:us-west-2:<account>:key/mrk-<key-id>||<dev-profile-name>: FAILED
    - | could not load AWS config: failed to get shared config
      | profile, <dev-profile-name>

  Error: could not decrypt sops file

Labels: bug
