Skip to content

Enable soft delete for workspace backup Recovery Services vaults and fix core public IP lifecycle issues#4907

Open
Copilot wants to merge 15 commits into
mainfrom
copilot/fix-recovery-service-vault-deployment
Open

Enable soft delete for workspace backup Recovery Services vaults and fix core public IP lifecycle issues#4907
Copilot wants to merge 15 commits into
mainfrom
copilot/fix-recovery-service-vault-deployment

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 21, 2026

What is being addressed

Workspace creation with backup enabled can fail when the workspace backup Recovery Services Vault is created with soft delete disabled. Azure now requires soft delete to be enabled for new vaults, so attempts to create the vault with soft_delete_enabled = false can be rejected and leave the workspace deployment stuck.

This change also needs to preserve current workspace delete behavior so backup-related resources do not block cleanup. Workspace destroy can also fail with BMSUserErrorContainerNotUndeleted when the backup container is in a transitional state during parallel destroy of azurerm_backup_protected_file_share.

Additionally, Azure automatically sets ip_tags on public IPs (e.g. "FirstPartyUsage": "/Unprivileged"), which causes Terraform to force replacement of the firewall transit public IP and the application gateway public IP on subsequent applies. This cascades into unnecessary recreation of dependent resources such as private DNS zones.

How is this addressed

  • Recovery Services Vault configuration

    • Keep the workspace backup vault on azurerm_recovery_services_vault
    • Remove the deprecated soft_delete_enabled argument (removed in AzureRM v4.64.0+); soft delete is now enabled by default on the platform
    • Preserve the existing Zone Redundant storage configuration
    • Preserve the existing CMK configuration on the AzureRM vault resource
  • Delete behavior

    • Configure the workspace base AzureRM provider recovery_service features to purge protected items from the vault on destroy
    • Set vm_backup_stop_protection_and_retain_data_on_destroy = false so workspace cleanup continues to remove backup data rather than retaining it
    • Add terraform_data.wait_for_backup_container_ready resource in storage.tf with a when = destroy provisioner that polls az backup container show ... --query properties.registrationStatus every 30 seconds for up to 10 minutes, exiting when the status returns Registered or NotFound
    • wait_for_backup_container_ready depends on azurerm_backup_protected_file_share.file_share so that on destroy Terraform's reversed dependency ordering causes the poller to run first (before the file share unprotect), preventing BMSUserErrorContainerNotUndeleted failures
    • Use lowercase storage in the backup container name (StorageContainer;storage;<rg>;<account>) to match the actual Azure container naming convention and ensure az backup container show finds the container rather than returning NotFound immediately
  • Core infrastructure — public IP lifecycle fixes

    • Add ip_tags to lifecycle { ignore_changes = [...] } on azurerm_public_ip.fwtransit in core/terraform/firewall/firewall.tf to prevent forced replacement of the firewall transit public IP
    • Add ip_tags to lifecycle { ignore_changes = [...] } on azurerm_public_ip.appgwpip in core/terraform/appgateway/appgateway.tf to prevent forced replacement of the application gateway public IP and the cascading recreation of dependent DNS zones
  • Provider/version updates

    • Update the workspace base AzureRM provider pin from 4.57.0 to 4.73.0 (latest)
    • Update the workspace base Terraform lock file to match the pinned AzureRM provider version
    • Bump core/version.txt to 0.16.17 to satisfy CI version-change enforcement for the core Terraform changes
  • Template/versioning

    • Bump tre-workspace-base to 2.8.5
    • Update CHANGELOG.md to describe the AzureRM soft-delete-based fix, public IP lifecycle changes, and backup container destroy polling

Copilot AI changed the title [WIP] Fix workspace deployment failure on Recovery Service Vault Use AzAPI for workspace backup Recovery Services vault creation May 21, 2026
Copilot AI requested a review from marrobi May 21, 2026 15:23
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 21, 2026

@copilot no, dont use azapi:

Enabling soft delete does not prevent you from deleting the Backup Vault itself, as long as there are no protected (or soft‑deleted) backup items still in it.
What's actually happening
Azure now requires soft delete to be enabled on Microsoft.DataProtection/BackupVaults — the API rejects vaults created with soft_delete = "Off". That's why your azurerm_data_protection_backup_vault deploy is failing. The supported values are:
• AlwaysOn – soft delete enabled and immutable (cannot be turned off). ⚠️ This makes the vault very hard to remove.
• On – soft delete enabled (default) but can be disabled later. ✅ Safe choice.
• Off – no longer accepted by the platform for new vaults.

So set it explicitly to on. But consider side effects and need to maintain current functionality. Also ensure the azureRm provider is the latest release (pinned).

Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 21, 2026

@copilot no, dont use azapi:

Enabling soft delete does not prevent you from deleting the Backup Vault itself, as long as there are no protected (or soft‑deleted) backup items still in it....

Addressed in 3d11658. The vault is back on AzureRM, soft_delete_enabled is set to true, the workspace base AzureRM pin/lock were updated to 4.58.0, and the provider now enables recovery_service.purge_protected_items_from_vault_on_destroy with non-retaining VM cleanup so workspace deletes keep the expected behavior.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkpoint-api.hashicorp.com
    • Triggering command: /tmp/terraform-bin/terraform /tmp/terraform-bin/terraform fmt -check -recursive (dns block)
    • Triggering command: /tmp/terraform-bin/terraform /tmp/terraform-bin/terraform -chdir=templates/workspaces/base/terraform init -backend=false -upgrade (dns block)
    • Triggering command: /tmp/terraform-bin/terraform /tmp/terraform-bin/terraform -chdir=templates/workspaces/base/terraform validate (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title Use AzAPI for workspace backup Recovery Services vault creation Enable soft delete for workspace backup Recovery Services vaults May 21, 2026
Comment thread templates/workspaces/base/terraform/providers.tf Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the workspace base Terraform template to comply with Azure’s requirement that soft delete is enabled on newly created Recovery Services Vaults, while preserving current workspace deletion/cleanup behavior by ensuring protected items are purged on destroy.

Changes:

  • Enable soft_delete_enabled = true for the workspace backup azurerm_recovery_services_vault.
  • Configure AzureRM provider recovery_service features to purge protected items and avoid retaining VM backup data on destroy.
  • Bump the workspace base AzureRM provider pin/lockfile to 4.58.0 and update the template version + changelog entry.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
templates/workspaces/base/terraform/providers.tf Bumps AzureRM to 4.58.0 and configures provider recovery services destroy behavior.
templates/workspaces/base/terraform/backup/backup.tf Enables soft delete on the workspace backup Recovery Services Vault.
templates/workspaces/base/terraform/.terraform.lock.hcl Updates AzureRM provider lock to 4.58.0 and corresponding hashes.
templates/workspaces/base/porter.yaml Bumps tre-workspace-base bundle version to 2.8.4.
CHANGELOG.md Documents the fix under BUG FIXES.
Files not reviewed (1)
  • templates/workspaces/base/terraform/.terraform.lock.hcl: Language not supported

Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 21, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkpoint-api.hashicorp.com
    • Triggering command: ./terraform ./terraform version (dns block)
    • Triggering command: /tmp/terraform /tmp/terraform providers lock -platform=linux_amd64 -platform=linux_arm64 -platform=darwin_amd64 -platform=darwin_arm64 -platform=windows_amd64 hashicorp/azurerm (dns block)
    • Triggering command: /tmp/terraform /tmp/terraform fmt -check -recursive (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI requested a review from marrobi May 21, 2026 18:02
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 21, 2026

/test-extended 43c865c

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26244326090 (with refid 14403c22)

(in response to this comment from @marrobi)

Copilot AI changed the title Enable soft delete for workspace backup Recovery Services vaults Enable soft delete for workspace backup Recovery Services vaults and fix core public IP lifecycle issues May 22, 2026
Copilot AI requested a review from marrobi May 22, 2026 21:24
@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 22, 2026

/test-extended 713eedd

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26312937429 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 23, 2026

@copilot

Workspace destroy fails with BMSUserErrorContainerNotUndeleted when the AzureRM provider deletes azurerm_backup_protected_file_share.file_share, because the backup container is in a transitional state during parallel destroy. Fix: in templates/workspaces/base/terraform/storage.tf, add a terraform_data "wait_for_backup_container_ready" resource (mirror the style of the existing wait_for_backup_cleanup block) with a when = destroy local-exec provisioner that polls az backup container show ... --query properties.registrationStatus every 30s for up to 10 min until it returns Registered or NotFound, then exits. Add this resource to the depends_on list of azurerm_backup_protected_file_share.file_share so the poll runs at the right point in the destroy order.

…rorContainerNotUndeleted on destroy

Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/6e829c09-2452-4387-8616-b45ad3753752

Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 23, 2026

@copilot

Workspace destroy fails with BMSUserErrorContainerNotUndeleted when the AzureRM provider deletes azurerm_backup_protected_file_share.file_share, because the backup container is in a tran...

Added terraform_data.wait_for_backup_container_ready in storage.tf (commit ce441e0). It polls az backup container show ... --query properties.registrationStatus every 30s for up to 10 minutes, exiting when the status is Registered or NotFound. The resource is added to depends_on of azurerm_backup_protected_file_share.file_share so the poll runs at the right point in the destroy order (after file share backup is unprotected, before the container is deregistered). Also bumped tre-workspace-base to 2.8.5 and updated the CHANGELOG.

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 23, 2026

/test-extended ce441e0

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26328665194 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

/test-extended ce441e0

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26452518221 (with refid 14403c22)

(in response to this comment from @marrobi)

1 similar comment
@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26452518221 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

@copilot

Fix not working — destroy ordering issue (and possible container name casing)

Looking at the failing e2e run #26452518221 (which ran against this PR's head), the same BMSUserErrorContainerNotUndeleted is still occurring. I think the poll isn't actually running before the file share delete:

1. Destroy ordering is reversed. Currently:

resource "azurerm_backup_protected_file_share" "file_share" {
  depends_on = [ terraform_data.wait_for_backup_container_ready ]
}

On destroy, Terraform reverses dependencies — so file_share is destroyed first, then wait_for_backup_container_ready's when = destroy provisioner runs after the failed delete. The poll needs to run before the file share delete.

Swap the dependency direction:

resource "terraform_data" "wait_for_backup_container_ready" {
  # ...
  depends_on = [
    azurerm_backup_container_storage_account.storage_account,
    azurerm_backup_protected_file_share.file_share,   # ← key change
  ]
}

resource "azurerm_backup_protected_file_share" "file_share" {
  depends_on = [
    azurerm_backup_container_storage_account.storage_account,
    azapi_resource.shared_storage,
    azurerm_private_endpoint.stgfilepe,
    # remove terraform_data.wait_for_backup_container_ready
  ]
}

…file_share not vice versa

Agent-Logs-Url: https://github.com/microsoft/AzureTRE/sessions/9233b422-0f60-414c-af10-9fe74ed1f471

Co-authored-by: marrobi <17089773+marrobi@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 26, 2026

@copilot

Fix not working — destroy ordering issue (and possible container name casing)...

Fixed the dependency direction in commit 2834a83. wait_for_backup_container_ready now depends on azurerm_backup_protected_file_share.file_share instead of the other way around. On destroy, Terraform reverses the dependency graph, so wait_for_backup_container_ready is destroyed first (running the when = destroy poller), and only then is file_share deleted — ensuring the container is in a ready state before the unprotect operation.

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

/test-extended 2834a83

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26467689287 (with refid 14403c22)

(in response to this comment from @marrobi)

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

@copilot The error message in the latest run confirms the real container name uses lowercase storage:

Protection Container Name: "StorageContainer;storage;rg-***-ws-6e1c;stgws6e1c"

…but the script has capital Storage:

container_name="StorageContainer;Storage;${self.input.resource_group_name};${self.input.storage_account_name}"

So az backup container show returns NotFound immediately, the poll exits as a no-op, and the file_share delete proceeds straight into the race. Change Storagestorage in storage.tf

@marrobi
Copy link
Copy Markdown
Member

marrobi commented May 26, 2026

/test-extended 6cf6d9b

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

2 similar comments
@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

@github-actions
Copy link
Copy Markdown

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/26476846339 (with refid 14403c22)

(in response to this comment from @marrobi)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workspace deployment fails on Recovery Service Vault - TF Provider issue.

4 participants