Skip to content

fix(k8s): delete auto-created PVCs on sandbox deletion when opted in#880

Open
xfgong wants to merge 1 commit into
alibaba:mainfrom
xfgong:fix/k8s-pvc-managed-cleanup
Open

fix(k8s): delete auto-created PVCs on sandbox deletion when opted in#880
xfgong wants to merge 1 commit into
alibaba:mainfrom
xfgong:fix/k8s-pvc-managed-cleanup

Conversation

@xfgong
Copy link
Copy Markdown
Contributor

@xfgong xfgong commented May 13, 2026

Summary

Follow-up to #661, addressing the post-merge review by @yinsyim in #661 (comment).

The original PR assumed StorageClass.reclaimPolicy would clean up auto-created PVCs, but reclaimPolicy only governs what happens to the PV after the PVC is deleted — it does not delete the PVC itself when the sandbox/workload terminates:

-> delete pvc -> delete pv -> CSI cleanup

…not:

-> delete deployment/pod -> delete pvc

So auto-created PVCs were leaking on every sandbox deletion. This PR fixes that by mirroring the existing Docker volume-managed-by cleanup pattern for Kubernetes, gated on the same deleteOnSandboxTermination PVC field (default false, opt-in) so semantics are uniform across runtimes.

Changes

  • Server (services/k8s/):

    • client.py: add list_pvcs and delete_pvc (404 swallowed on delete).
    • kubernetes_service.py:
      • _ensure_pvc_volumes stamps opensandbox.io/volume-managed-by=server and opensandbox.io/id=<sandbox_id> labels on auto-created PVCs, only when deleteOnSandboxTermination=true. Pre-existing and opt-out PVCs are left unlabeled.
      • delete_sandbox calls a new _cleanup_managed_pvcs helper that lists PVCs by that selector and deletes them. Cleanup runs only when the workload is actually gone (success or 404), never on transient errors. List/delete failures are tolerated (best-effort) so they never block the API response.
  • Spec + schema docs: specs/sandbox-lifecycle.yml and server/opensandbox_server/api/schema.py previously stated deleteOnSandboxTermination was "Docker only … no effect on Kubernetes PVCs, whose lifecycle is managed by the StorageClass reclaim policy" — the exact assumption @yinsyim disproved. Updated to describe the correct cross-backend semantics; the reclaim-policy reference is kept but reframed (the PVC delete triggers it).

  • SDKs: regenerated Python lifecycle (api/lifecycle/models/pvc.py) and JS (api/lifecycle.ts) from the spec. Hand-edited the hand-written Python sandbox model, C# model, and Kotlin model docstrings.

  • Helm RBAC (kubernetes/charts/opensandbox-server/templates/server.yaml): added delete and list verbs on persistentvolumeclaims. The cleanup degrades to a logged warning on 403, but the stock chart should be correct by default.

  • Tests: cover label stamping for both opt-in and opt-out paths, cleanup on success / 404 sweep / no-cleanup on transient error, best-effort tolerance of list_pvcs failure, and no-touch for pre-existing PVCs.

Compatibility

Backward-compatible. deleteOnSandboxTermination defaults to false, so existing PVC-using sandboxes behave exactly as before unless callers explicitly opt in. The added RBAC verbs are additive.

Test plan

  • cd server && uv run pytest tests/k8s — 399 passed
  • cd server && uv run ruff check on changed files — clean
  • cd sdks/sandbox/python && uv run python scripts/generate_api.py — regenerated cleanly
  • cd sdks/sandbox/javascript && pnpm run gen:api — regenerated cleanly
  • Manual K8s e2e (Helm chart + dynamic provisioner): create sandbox with a PVC volume where createIfNotExists=true, deleteOnSandboxTermination=true, delete sandbox, verify PVC is deleted and PV is reclaimed per StorageClass policy
  • Manual K8s e2e: same volume but deleteOnSandboxTermination=false — PVC must persist after sandbox deletion
  • Manual K8s e2e: pre-existing PVC (no managed-by label) is untouched on sandbox deletion

Notes

  • The label selector is opensandbox.io/volume-managed-by=server,opensandbox.io/id=<sandbox-id>. Both must match — extra safety against accidentally sweeping a user-managed PVC that happens to share our sandbox ID.
  • Cleanup runs after workload deletion so the kubelet has released the kubernetes.io/pvc-protection finalizer. If pods are still terminating, K8s puts the PVC into Terminating and reaps it once pods are gone — also safe.
  • Sharing the same opted-in PVC across multiple sandboxes is discouraged (first deletion will sweep it). Callers wanting to share storage should pre-create the PVC out-of-band, which leaves it unlabeled and untouched by the cleanup path. Same semantics as Docker auto-created volumes.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e3206bb19

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread specs/sandbox-lifecycle.yml
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 1e3206b to 2be810e Compare May 13, 2026 07:54
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 13, 2026

Thanks for the review. Force-pushed 2be810ee addressing all four P2 comments:

P2-1 (kubernetes_service.py L455): Clean up managed PVCs when creation fails

create_sandbox now sets a managed_pvcs_may_exist flag before calling _ensure_pvc_volumes and clears it only on the success path, with a finally clause invoking _cleanup_managed_pvcs(sandbox_id). This covers:

  • partial failure inside _ensure_pvc_volumes (some PVCs created, others not)
  • create_workload raising
  • _wait_for_sandbox_ready raising

_cleanup_managed_pvcs is scoped to PVCs labeled with this sandbox_id, so it can't touch unrelated state. New tests cover all three failure paths plus the no-cleanup-on-success case.

P2-3 (kubernetes_service.py L352): Cover TTL expiry cleanup for managed PVCs

You're right — the delete_sandbox API is bypassed when the BatchSandbox controller deletes an expired CR via r.Delete(ctx, batchSbx). Rather than wire the controller into the server's label semantics, I went with K8s-native ownerReferences: after create_workload succeeds, the server patches each auto-created PVC with ownerReferences pointing at the just-created CR. K8s garbage collection then deletes the PVC whenever the CR goes away — TTL expiry, delete_sandbox API, or any other cascade.

The label-based sweep in delete_sandbox remains as a fallback for cases where the ownerRef patch fails (e.g. transient API errors). Both workload providers (BatchSandboxProvider, AgentSandboxProvider) now return apiVersion + kind alongside name + uid so the ownerReference can be constructed.

blockOwnerDeletion=false so a stuck PVC delete never stalls the CR delete. End-to-end verified on a real cluster: deleting the BatchSandbox CR directly (simulating TTL) garbage-collects the labeled PVC; opt-out PVCs have no ownerReferences and survive any CR delete.

P2-4 (kubernetes_service.py L352): Reject conflicting cleanup flags for repeated PVC claims

Added an up-front pass at the top of _ensure_pvc_volumes that rejects 400 when the same claimName appears in multiple Volume entries with inconsistent createIfNotExists or deleteOnSandboxTermination values. Validation runs before any side effects. Mounting the same claim at multiple paths is still allowed when flags match.

P2-2 (spec L1510): Regenerate public spec docs

Ran pnpm docs:specdocs/public/api/spec-inline.js is in the squashed commit and now reflects the corrected deleteOnSandboxTermination description (no more "Docker only / Kubernetes PVCs unaffected" wording).


Verification:

  • uv run pytest tests/k8s → 409 passed (added 11 new tests for the fixes above)
  • uv run ruff check → clean
  • Cluster e2e in a test namespace: opt-in PVC gets both labels + ownerReferences; delete_sandbox API cleans via label sweep; direct CR delete cleans via K8s GC; opt-out PVC survives any CR delete.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2be810eee9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kubernetes/charts/opensandbox-server/templates/server.yaml Outdated
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 2be810e to 2226a30 Compare May 13, 2026 09:59
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 13, 2026

@chatgpt-codex-connector good catch on the missing patch verb. Force-pushed 2226a301 adding patch to the persistentvolumeclaims verbs in kubernetes/charts/opensandbox-server/templates/server.yaml, so the stock chart now permits the ownerReference attach (and not just create/get/list/delete). The runtime path already degrades to a logged warning on 403, but the chart should be correct by default.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create an environment for this repo.

@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 13, 2026

The Codex auto-response above is just the setup prompt for repo-side Codex environments — no action needed from me on it. The latest revision (2226a301) addresses all five Codex P2 findings across the two review passes; no outstanding bot comments remain.

Ready for human review. Could a maintainer apply the recommended labels (component/k8s, component/server, documentation, sdks, bug)? I don't have label-edit permission on the upstream repo.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2226a3016f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 2226a30 to 6fc5641 Compare May 14, 2026 04:27
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 14, 2026

Force-pushed 6fc56418 addressing the new P2 finding from the latest Codex pass.

P2 (kubernetes_service.py L656): Avoid deleting PVCs when workload rollback fails

Good catch. The previous finally swept managed PVCs unconditionally whenever managed_pvcs_may_exist was True, which meant the case where _wait_for_sandbox_ready raised and the inner rollback delete_workload also raised (transient API error, missing delete RBAC, etc.) left a live BatchSandbox/Sandbox CR with its pods, but its PVCs got reaped — exactly the inconsistency delete_sandbox explicitly avoids.

Fix: track workload_left_alive. It flips True the moment create_workload returns, and back to False on the success path or after the rollback delete_workload returns without raising. The finally now skips the label sweep when workload_left_alive is still True and logs a warning explaining that the previously-attached ownerReference will let K8s GC reclaim the PVCs whenever the CR is eventually removed (TTL expiry, manual delete, or a delete_sandbox retry). The CR-creation-failed and _ensure_pvc_volumes-partial-failure paths still trigger the sweep, since workload_left_alive was never set in those cases.

New test test_pvc_cleanup_skipped_when_rollback_delete_workload_fails covers the regression; existing failure-cleanup and success-no-cleanup tests still pass unchanged.

Other P2 (specs/sandbox-lifecycle.yml L1510): Regenerate public spec docs

This is the same regen finding from the prior pass — docs/public/api/spec-inline.js is already updated in the squashed commit and now reads "the volume is automatically removed when the sandbox is deleted ... For Kubernetes, the resulting PVC delete triggers the bound PV's reclaim policy" with no remaining "Docker only / Kubernetes PVCs unaffected" wording. Codex appears to have re-flagged it; no further action.


Verification:

  • uv run pytest tests/k8s → 410 passed (added 1 new test for the rollback skip path)
  • uv run ruff check → clean

@hittyt hittyt added bug Something isn't working component/k8s For kubernetes runtime labels May 17, 2026
@hittyt
Copy link
Copy Markdown
Collaborator

hittyt commented May 17, 2026

The failing K8s mini E2E looks unrelated to this PR. It was caused by the K8s metadata patch issue fixed in #899, which has now been merged.

Could you please merge/rebase latest main and rerun CI? Thanks.

@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 6fc5641 to 4c6a7bd Compare May 18, 2026 01:36
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

@hittyt thanks for the heads-up on #899. Rebased onto latest main (now 4c6a7bd0); clean rebase, no conflicts. uv run pytest tests/k8s → 442 passed locally. CI should pick up the new push automatically.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c6a7bd0ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py
Comment thread server/opensandbox_server/services/k8s/client.py
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 4c6a7bd to c137fe4 Compare May 18, 2026 03:23
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

Force-pushed c137fe49 addressing the three new P2 findings from the latest Codex pass.

P2 (client.py L324): Send a merge patch when attaching PVC owners

Critical catch — verified by inspect.getsource(CoreV1Api.patch_namespaced_persistent_volume_claim_with_http_info): the kubernetes-client select_header_content_type list is ['application/json-patch+json', 'application/merge-patch+json', 'application/strategic-merge-patch+json', ...], so without an explicit override the merge-shaped body {"metadata": {"ownerReferences": [...]}} is sent as JSON Patch and rejected. The _attach_pvc_owner_references swallow path was masking this in prod, so TTL/controller-driven CR deletion was still leaking opted-in PVCs even though the label sweep on delete_sandbox worked.

Fix: K8sClient.patch_pvc now pins _content_type="application/strategic-merge-patch+json". Strategic merge is required (vs. plain merge) so ownerReferences merge by key rather than replacing the whole list. New unit test in test_k8s_client.py asserts the kwarg is passed through to the underlying API.

(My earlier cluster smoke test claimed ownerReferences were attached — I'd verified the label sweep, not the ownerReference, on the namespace I tested. Mea culpa; this would have leaked on real TTL deletion until now.)

P2 (kubernetes_service.py L570): Track partial workload creation before PVC cleanup

You're right that BatchSandboxProvider.create_workload writes the CR before the imagePullSecret, so a secret-creation failure whose rollback delete_workload also fails leaves the CR alive — but my flip-after-return logic never sets workload_left_alive=True on that path, so the finally sweeps PVCs that the live CR still references.

Fix: set workload_left_alive = True before calling create_workload. Any raise from there is treated as "CR may exist", and the PVC sweep is skipped. The trade-off is that a clean failure (no CR was actually created) also loses automatic cleanup — but delete_sandbox on the same id will 404 and the existing 404-branch then triggers the label sweep, so PVCs are still recoverable via user retry. Updated test test_pvc_cleanup_skipped_when_create_workload_raises (renamed from _runs_) covers the new behavior.

P2 (kubernetes_service.py L822): Avoid deleting managed PVCs that other sandboxes mounted

Real cross-sandbox bug. If A creates claim=data with deleteOnSandboxTermination=true and B later mounts the same name, B was silently treated as "PVC pre-exists, reuse" — no labels on B, A's labels intact — and A's delete_sandbox (or A's CR TTL via ownerReference GC) would yank the volume out from under B.

Fix: in _ensure_pvc_volumes, when an existing PVC carries volume-managed-by=server with id=<other_sandbox_id>, reject with 409 INVALID_PARAMETER and a message instructing the user to pre-create a shared PVC outside the auto-managed path. Same-sandbox idempotent retries (matching id) and unlabeled user-managed PVCs are still allowed. Two new tests cover both branches.

Other P2 (specs/sandbox-lifecycle.yml L1519): Regenerate public spec docs

Same finding that came back in the previous two passes — docs/public/api/spec-inline.js is regenerated in this branch's squashed commit. grep deleteOnSandboxTermination docs/public/api/spec-inline.js shows the new wording ("When true, the volume is automatically removed when the sandbox is deleted …"); the stale "Docker only / Kubernetes PVCs unaffected" text is gone. Codex appears to be flagging from the spec edit without inspecting the regenerated artifact in the same commit.


Verification:

  • uv run pytest tests/k8s → 445 passed (+3 new)
  • uv run ruff check → clean

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c137fe495c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from c137fe4 to 54c1f3f Compare May 18, 2026 04:40
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

Force-pushed 54c1f3fc addressing the three new P2 findings.

P2 (kubernetes_service.py L599): Clean up PVCs after pre-CR workload failures

Real regression from the previous pessimistic flip. BatchSandboxProvider.create_workload rejects poolRef + volumes (and poolRef + platform) by raising ValueError before touching the CR, but with workload_left_alive = True set unconditionally before the call, the finally would then skip the label sweep and leak the just-created PVCs.

Fix: validate the poolRef + volumes incompatibility in the outer service before _ensure_pvc_volumes runs, so the 400 is raised with zero side effects. New test test_pool_ref_with_volumes_rejected_before_pvc_creation asserts no PVC create or workload create is attempted. Kept the pessimistic flag for cases where the provider does touch the CR partway (the original imagePullSecret rationale still holds).

P2 (kubernetes_service.py L853 / L378): Reject managed PVCs even when not auto-creating

Right — the previous revision only checked ownership inside the createIfNotExists branch, so createIfNotExists=false mounting a PVC labeled by another sandbox bypassed the guard. Refactored _ensure_pvc_volumes into:

  1. Pre-pass over every vol.pvc (regardless of createIfNotExists): get_pvc once, cache the result, and reject 409 via the new _reject_pvc_owned_by_other_sandbox helper if labeled by a different sandbox.
  2. Create loop: only iterates create-if-not-exists volumes, reads the cache instead of re-fetching, so RBAC permission errors still bail safely from the pre-pass.

New test test_existing_pvc_owned_by_other_is_rejected_even_without_create_if_not_exists covers the previously-unguarded branch.

P2 (kubernetes_service.py L435): Re-check concurrently-created PVC ownership

Good race. If two sandboxes race on the same claimName, both see absent in their pre-pass, sandbox A wins create_pvc and labels it for A, sandbox B gets 409 and previously just logged and continued — mounting A's labeled PVC, then losing the storage when A's cleanup fires.

Fix: on 409 from create_pvc, re-fetch the PVC and re-run _reject_pvc_owned_by_other_sandbox. If the racer left it labeled for another sandbox, B now gets a 409 response instead of silently mounting hostile storage. If the post-race fetch itself fails (rare network blip), log and proceed without the re-check — same behavior as before, but at least the common path is protected. New test test_create_race_409_re_checks_ownership_and_rejects_other_winner covers it.

Repeats (already addressed in c137fe49):

  • client.py:330 content-type — fixed via explicit _content_type="application/strategic-merge-patch+json"
  • kubernetes_service.py:602 workload_left_alive after partial create — fixed by setting the flag before the call
  • kubernetes_service.py:853 managed-by-other check — was inside create branch in the previous revision; this revision moves it to the pre-pass so it covers all PVC mounts
  • specs/sandbox-lifecycle.yml:1519 spec regen — docs/public/api/spec-inline.js is updated in the squashed commit; grep confirms the new wording is present and the old "Docker only / Kubernetes PVCs unaffected" wording is gone

Verification:

  • uv run pytest tests/k8s → 448 passed (+3 new)
  • uv run ruff check → clean

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54c1f3fc21

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 54c1f3f to 9699382 Compare May 18, 2026 05:23
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

Force-pushed 9699382b addressing the three new P2 findings.

P2 (kubernetes_service.py L660): Clean up PVCs after provider preflight failures

Right — the pessimistic pre-call flag was too broad. Other providers have preflight rejections too (e.g. AgentSandboxProvider on windows platform), all by convention raising ValueError before touching any CR. Replaced the single pre-call flag set with an explicit try/except around create_workload:

  • ValueError → preflight rejection by convention, no CR. Re-raise; flag stays False so the finally sweeps the labeled PVCs.
  • Other Exception → CR may exist with partial state. Attempt rollback delete_workload: on success, flag stays False and PVCs are swept; on rollback failure, set the flag True so the sweep is skipped (matching delete_sandbox's non-404 semantics).
  • Success → flag becomes True, so subsequent _wait_for_sandbox_ready failures still require rollback before sweeping (existing inner try/except path unchanged).

Three new tests cover: rollback success → sweep; rollback fail → skip; ValueError preflight → sweep with no rollback.

P2 (kubernetes_service.py L766): Don't rely on ownerReferences before they are attached

Good catch — when create_workload raises after creating the CR, _attach_pvc_owner_references never runs, so the "K8s GC will eventually reclaim" promise is wrong. The combined fix for #1 above already addresses the common path (rollback + sweep). For the case where rollback also fails, updated the warning log to drop the misleading ownerReference reassurance and accurately point at delete_sandbox retry as the recovery path.

P2 (kubernetes_service.py L442): Fail closed when post-race PVC ownership cannot be checked

Agreed. Changed the 409 race path: if the post-race get_pvc re-fetch itself fails (transient API error, 403, etc.), raise 503 SERVICE_UNAVAILABLE instead of logging-and-proceeding. The previous "log and continue" trusted the worst-case scenario; now we fail loud and the caller can retry. New test test_create_race_409_fails_closed_when_re_fetch_fails.

Repeats (already addressed in prior revisions, Codex re-flagging):

  • specs/sandbox-lifecycle.yml:1519 spec regen — docs/public/api/spec-inline.js is regenerated in this branch; grep deleteOnSandboxTermination shows the new wording, the "Docker only" line is gone. Codex appears to flag from the spec edit without checking the regenerated artifact.
  • kubernetes_service.py:663 workload_left_alive partial-create — this revision restructures into explicit try/except with rollback-then-sweep, which is the proper fix the comment was asking for.
  • kubernetes_service.py:914 managed-by-other check — the previous revision moved this into the pre-pass at L359-378 which iterates vol.pvc is None (covering createIfNotExists=false mounts too); the L914 location is the now-redundant guard's old call site.
  • client.py:330 content-type — fixed in c137fe49 via explicit application/strategic-merge-patch+json.
  • kubernetes_service.py:660/446 previous wordings — covered by the new fixes above.

Verification:

  • uv run pytest tests/k8s → 451 passed (+3 net new)
  • uv run ruff check → clean

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9699382b4d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
Comment thread server/opensandbox_server/services/k8s/kubernetes_service.py Outdated
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from 9699382 to 3dce8b3 Compare May 18, 2026 05:37
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

Force-pushed 3dce8b3f addressing the three new P2 findings.

P2 (kubernetes_service.py L713): Treat not-found rollbacks as safe for PVC cleanup

Good catch on the BatchSandboxProvider self-rollback flow. Imported _is_not_found_error and changed the create_workload rollback-failure branch: when the rollback raises a not-found error (e.g. the provider's own imagePullSecret cleanup already deleted the CR before re-raising), treat it as a successful delete — CR is gone, workload_left_alive stays False, PVCs get swept. Only genuine non-404 failures keep the defensive skip behavior. New test test_pvc_cleanup_runs_when_rollback_reports_not_found.

P2 (kubernetes_service.py L773): Sweep PVCs when readiness cleanup finds no workload

Same pattern in the inner _wait_for_sandbox_ready handler. Updated the rollback except to detect 404 via _is_not_found_error and flip workload_left_alive = False, mirroring the create-time fix. Concrete path covered: short sandbox timeout < sandbox_create_timeout_seconds → controller TTL-deletes the CR while we're polling → our rollback gets not-found → PVCs are correctly reaped. New test test_pvc_cleanup_runs_when_wait_for_ready_rollback_reports_not_found.

P2 (kubernetes_service.py L458): Retry PVC creation when the race winner disappears

Right — a None re-fetch after 409 isn't "PVC exists for us to use", it's "winner created and then deleted it". Continuing would let the workload reference a missing claim. Changed the racer is None branch to raise 503 so the caller can retry; their retry's pre-pass sees the absent PVC and creates it cleanly. New test test_create_race_409_fails_closed_when_winner_pvc_already_gone.

Repeats (already addressed in prior revisions):

  • specs/sandbox-lifecycle.yml:1519 spec regen — docs/public/api/spec-inline.js is in the squashed commit with the new wording; the old "Docker only" line is gone.
  • kubernetes_service.py:954 managed-by-other — covered by the pre-pass at the top of _ensure_pvc_volumes which iterates every vol.pvc regardless of createIfNotExists.
  • client.py:330 content-type — pinned to application/strategic-merge-patch+json in c137fe49.
  • kubernetes_service.py:458 re-fetch on 409 — present since 54c1f3fc; this revision adds the missing fail-closed branches for None and fetch-failure.

Verification:

  • uv run pytest tests/k8s → 454 passed (+3 net new)
  • uv run ruff check → clean

@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 18, 2026

Thanks — confirms the 7 rounds of fixes converged. Ready for human review at commit 3dce8b3f. Could a maintainer apply the recommended labels (component/k8s, component/server, documentation, sdks, bug)? I don't have label-edit permission upstream.

@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch 2 times, most recently from fdc5a0a to ccbfa9c Compare May 26, 2026 07:01
Follow-up to alibaba#661, addressing the post-merge review by @yinsyim
(alibaba#661 (comment)).

The original PR assumed StorageClass.reclaimPolicy would clean up
auto-created PVCs, but reclaimPolicy only governs what happens to the PV
*after* the PVC is deleted — it does not delete the PVC when the
sandbox/workload terminates. Auto-created PVCs were leaking on every
sandbox deletion.

This PR fixes that by:

- Gating cleanup on the existing deleteOnSandboxTermination PVC field
  (default false, opt-in), uniform with Docker semantics. Opt-in PVCs are
  labeled with `opensandbox.io/volume-managed-by=server` and
  `opensandbox.io/id=<sandbox_id>`. Pre-existing PVCs and opt-out PVCs are
  never deleted by the server.

- Attaching ownerReferences from each opt-in PVC to the just-created
  workload CR after create_workload succeeds. Kubernetes garbage
  collection then deletes the PVC whenever the CR goes away — by the
  delete_sandbox API, by TTL expiry handled in the controller, or by any
  cascade. The label-based sweep in delete_sandbox remains as a fallback.

- Sweeping auto-created PVCs when sandbox creation fails before
  returning. Previously create_workload / _wait_for_sandbox_ready
  failures left labeled PVCs orphaned because the caller has no sandbox
  id to call delete_sandbox with.

- Rejecting 400 when the same claim_name appears in multiple Volume
  entries with inconsistent createIfNotExists / deleteOnSandboxTermination
  flags (first-wins previously caused either silent leaks or silent
  deletions).

Server changes:
- services/k8s/client.py: add list_pvcs, delete_pvc, patch_pvc.
- services/k8s/kubernetes_service.py:
  - _ensure_pvc_volumes: label only on opt-in; reject conflicting flags;
    return the list of newly auto-created managed PVCs.
  - new _attach_pvc_owner_references: best-effort patch after
    create_workload, gated on workload_info exposing apiVersion/kind/uid.
  - new _cleanup_managed_pvcs: list-by-selector + delete; best-effort.
  - delete_sandbox: call cleanup after workload deletion succeeds or 404s,
    never on transient errors.
  - create_sandbox: finally-sweep on any failure before return.
- services/k8s/batchsandbox_provider.py and agent_sandbox_provider.py:
  return apiVersion and kind alongside name and uid so the
  ownerReference can be constructed.

Spec, docs, SDKs, RBAC:
- specs/sandbox-lifecycle.yml + server/api/schema.py: replace the stale
  "Docker only / no effect on K8s / StorageClass reclaim policy handles
  it" description with correct cross-backend semantics. The reclaim-policy
  reference remains but is reframed (the PVC delete triggers it).
- docs/public/api/spec-inline.js: regenerated via `pnpm docs:spec` so
  the published API docs reflect the new description.
- SDKs: regenerated Python lifecycle and JS lifecycle from the spec;
  hand-edited Python sandbox model, C# model, and Kotlin model
  docstrings.
- kubernetes/charts/opensandbox-server: grant `delete` and `list` on
  persistentvolumeclaims so the cleanup path runs under the stock chart.

Tests:
- Label stamping for opt-in and opt-out paths.
- Cleanup on success, on 404 sweep, no-cleanup on transient error,
  best-effort tolerance of list_pvcs failure, no-touch for pre-existing
  PVCs.
- Conflict validation rejects mismatched flags and allows matching ones.
- create_sandbox finally-sweep covers create_workload and
  wait_for_ready failures.
- _attach_pvc_owner_references patches each managed PVC with the right
  body, is best-effort on failure, and skips when owner info is
  incomplete.

End-to-end verified on a real cluster:
- opt-in PVC carries labels and ownerReferences; delete_sandbox API
  cleans it via the label sweep; direct CR delete (simulating TTL) is
  cleaned via K8s GC.
- opt-out PVC has no labels or ownerReferences and survives any CR
  delete.
- Pre-existing PVCs are untouched.
@xfgong xfgong force-pushed the fix/k8s-pvc-managed-cleanup branch from ccbfa9c to 0008077 Compare May 26, 2026 07:36
@xfgong
Copy link
Copy Markdown
Contributor Author

xfgong commented May 29, 2026

I’ve tested this fix in my environment for a while, and it looks ready to merge. Could someone help review it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working component/k8s For kubernetes runtime

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants