fix(k8s): delete auto-created PVCs on sandbox deletion when opted in#880
fix(k8s): delete auto-created PVCs on sandbox deletion when opted in#880xfgong wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1e3206bb19
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
1e3206b to
2be810e
Compare
|
Thanks for the review. Force-pushed P2-1 (kubernetes_service.py L455): Clean up managed PVCs when creation fails
P2-3 (kubernetes_service.py L352): Cover TTL expiry cleanup for managed PVCs You're right — the The label-based sweep in
P2-4 (kubernetes_service.py L352): Reject conflicting cleanup flags for repeated PVC claims Added an up-front pass at the top of P2-2 (spec L1510): Regenerate public spec docs Ran Verification:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2be810eee9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2be810e to
2226a30
Compare
|
@chatgpt-codex-connector good catch on the missing |
|
To use Codex here, create an environment for this repo. |
|
The Codex auto-response above is just the setup prompt for repo-side Codex environments — no action needed from me on it. The latest revision ( Ready for human review. Could a maintainer apply the recommended labels ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2226a3016f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2226a30 to
6fc5641
Compare
|
Force-pushed P2 (kubernetes_service.py L656): Avoid deleting PVCs when workload rollback fails Good catch. The previous Fix: track New test Other P2 (specs/sandbox-lifecycle.yml L1510): Regenerate public spec docs This is the same regen finding from the prior pass — Verification:
|
|
The failing K8s mini E2E looks unrelated to this PR. It was caused by the K8s metadata patch issue fixed in #899, which has now been merged. Could you please merge/rebase latest |
6fc5641 to
4c6a7bd
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4c6a7bd0ab
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
4c6a7bd to
c137fe4
Compare
|
Force-pushed P2 (client.py L324): Send a merge patch when attaching PVC owners Critical catch — verified by Fix: (My earlier cluster smoke test claimed ownerReferences were attached — I'd verified the label sweep, not the ownerReference, on the namespace I tested. Mea culpa; this would have leaked on real TTL deletion until now.) P2 (kubernetes_service.py L570): Track partial workload creation before PVC cleanup You're right that Fix: set P2 (kubernetes_service.py L822): Avoid deleting managed PVCs that other sandboxes mounted Real cross-sandbox bug. If A creates Fix: in Other P2 (specs/sandbox-lifecycle.yml L1519): Regenerate public spec docs Same finding that came back in the previous two passes — Verification:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c137fe495c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
c137fe4 to
54c1f3f
Compare
|
Force-pushed P2 (kubernetes_service.py L599): Clean up PVCs after pre-CR workload failures Real regression from the previous pessimistic flip. Fix: validate the P2 (kubernetes_service.py L853 / L378): Reject managed PVCs even when not auto-creating Right — the previous revision only checked ownership inside the
New test P2 (kubernetes_service.py L435): Re-check concurrently-created PVC ownership Good race. If two sandboxes race on the same Fix: on 409 from Repeats (already addressed in
Verification:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54c1f3fc21
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
54c1f3f to
9699382
Compare
|
Force-pushed P2 (kubernetes_service.py L660): Clean up PVCs after provider preflight failures Right — the pessimistic pre-call flag was too broad. Other providers have preflight rejections too (e.g.
Three new tests cover: rollback success → sweep; rollback fail → skip; ValueError preflight → sweep with no rollback. P2 (kubernetes_service.py L766): Don't rely on ownerReferences before they are attached Good catch — when P2 (kubernetes_service.py L442): Fail closed when post-race PVC ownership cannot be checked Agreed. Changed the 409 race path: if the post-race Repeats (already addressed in prior revisions, Codex re-flagging):
Verification:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9699382b4d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
9699382 to
3dce8b3
Compare
|
Force-pushed P2 (kubernetes_service.py L713): Treat not-found rollbacks as safe for PVC cleanup Good catch on the BatchSandboxProvider self-rollback flow. Imported P2 (kubernetes_service.py L773): Sweep PVCs when readiness cleanup finds no workload Same pattern in the inner P2 (kubernetes_service.py L458): Retry PVC creation when the race winner disappears Right — a Repeats (already addressed in prior revisions):
Verification:
|
|
@codex review |
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Thanks — confirms the 7 rounds of fixes converged. Ready for human review at commit |
fdc5a0a to
ccbfa9c
Compare
Follow-up to alibaba#661, addressing the post-merge review by @yinsyim (alibaba#661 (comment)). The original PR assumed StorageClass.reclaimPolicy would clean up auto-created PVCs, but reclaimPolicy only governs what happens to the PV *after* the PVC is deleted — it does not delete the PVC when the sandbox/workload terminates. Auto-created PVCs were leaking on every sandbox deletion. This PR fixes that by: - Gating cleanup on the existing deleteOnSandboxTermination PVC field (default false, opt-in), uniform with Docker semantics. Opt-in PVCs are labeled with `opensandbox.io/volume-managed-by=server` and `opensandbox.io/id=<sandbox_id>`. Pre-existing PVCs and opt-out PVCs are never deleted by the server. - Attaching ownerReferences from each opt-in PVC to the just-created workload CR after create_workload succeeds. Kubernetes garbage collection then deletes the PVC whenever the CR goes away — by the delete_sandbox API, by TTL expiry handled in the controller, or by any cascade. The label-based sweep in delete_sandbox remains as a fallback. - Sweeping auto-created PVCs when sandbox creation fails before returning. Previously create_workload / _wait_for_sandbox_ready failures left labeled PVCs orphaned because the caller has no sandbox id to call delete_sandbox with. - Rejecting 400 when the same claim_name appears in multiple Volume entries with inconsistent createIfNotExists / deleteOnSandboxTermination flags (first-wins previously caused either silent leaks or silent deletions). Server changes: - services/k8s/client.py: add list_pvcs, delete_pvc, patch_pvc. - services/k8s/kubernetes_service.py: - _ensure_pvc_volumes: label only on opt-in; reject conflicting flags; return the list of newly auto-created managed PVCs. - new _attach_pvc_owner_references: best-effort patch after create_workload, gated on workload_info exposing apiVersion/kind/uid. - new _cleanup_managed_pvcs: list-by-selector + delete; best-effort. - delete_sandbox: call cleanup after workload deletion succeeds or 404s, never on transient errors. - create_sandbox: finally-sweep on any failure before return. - services/k8s/batchsandbox_provider.py and agent_sandbox_provider.py: return apiVersion and kind alongside name and uid so the ownerReference can be constructed. Spec, docs, SDKs, RBAC: - specs/sandbox-lifecycle.yml + server/api/schema.py: replace the stale "Docker only / no effect on K8s / StorageClass reclaim policy handles it" description with correct cross-backend semantics. The reclaim-policy reference remains but is reframed (the PVC delete triggers it). - docs/public/api/spec-inline.js: regenerated via `pnpm docs:spec` so the published API docs reflect the new description. - SDKs: regenerated Python lifecycle and JS lifecycle from the spec; hand-edited Python sandbox model, C# model, and Kotlin model docstrings. - kubernetes/charts/opensandbox-server: grant `delete` and `list` on persistentvolumeclaims so the cleanup path runs under the stock chart. Tests: - Label stamping for opt-in and opt-out paths. - Cleanup on success, on 404 sweep, no-cleanup on transient error, best-effort tolerance of list_pvcs failure, no-touch for pre-existing PVCs. - Conflict validation rejects mismatched flags and allows matching ones. - create_sandbox finally-sweep covers create_workload and wait_for_ready failures. - _attach_pvc_owner_references patches each managed PVC with the right body, is best-effort on failure, and skips when owner info is incomplete. End-to-end verified on a real cluster: - opt-in PVC carries labels and ownerReferences; delete_sandbox API cleans it via the label sweep; direct CR delete (simulating TTL) is cleaned via K8s GC. - opt-out PVC has no labels or ownerReferences and survives any CR delete. - Pre-existing PVCs are untouched.
ccbfa9c to
0008077
Compare
|
I’ve tested this fix in my environment for a while, and it looks ready to merge. Could someone help review it? |
Summary
Follow-up to #661, addressing the post-merge review by @yinsyim in #661 (comment).
The original PR assumed
StorageClass.reclaimPolicywould clean up auto-created PVCs, butreclaimPolicyonly governs what happens to the PV after the PVC is deleted — it does not delete the PVC itself when the sandbox/workload terminates:…not:
So auto-created PVCs were leaking on every sandbox deletion. This PR fixes that by mirroring the existing Docker
volume-managed-bycleanup pattern for Kubernetes, gated on the samedeleteOnSandboxTerminationPVC field (defaultfalse, opt-in) so semantics are uniform across runtimes.Changes
Server (
services/k8s/):client.py: addlist_pvcsanddelete_pvc(404 swallowed on delete).kubernetes_service.py:_ensure_pvc_volumesstampsopensandbox.io/volume-managed-by=serverandopensandbox.io/id=<sandbox_id>labels on auto-created PVCs, only whendeleteOnSandboxTermination=true. Pre-existing and opt-out PVCs are left unlabeled.delete_sandboxcalls a new_cleanup_managed_pvcshelper that lists PVCs by that selector and deletes them. Cleanup runs only when the workload is actually gone (success or 404), never on transient errors. List/delete failures are tolerated (best-effort) so they never block the API response.Spec + schema docs:
specs/sandbox-lifecycle.ymlandserver/opensandbox_server/api/schema.pypreviously stateddeleteOnSandboxTerminationwas "Docker only … no effect on Kubernetes PVCs, whose lifecycle is managed by the StorageClass reclaim policy" — the exact assumption @yinsyim disproved. Updated to describe the correct cross-backend semantics; the reclaim-policy reference is kept but reframed (the PVC delete triggers it).SDKs: regenerated Python lifecycle (
api/lifecycle/models/pvc.py) and JS (api/lifecycle.ts) from the spec. Hand-edited the hand-written Python sandbox model, C# model, and Kotlin model docstrings.Helm RBAC (
kubernetes/charts/opensandbox-server/templates/server.yaml): addeddeleteandlistverbs onpersistentvolumeclaims. The cleanup degrades to a logged warning on 403, but the stock chart should be correct by default.Tests: cover label stamping for both opt-in and opt-out paths, cleanup on success / 404 sweep / no-cleanup on transient error, best-effort tolerance of
list_pvcsfailure, and no-touch for pre-existing PVCs.Compatibility
Backward-compatible.
deleteOnSandboxTerminationdefaults tofalse, so existing PVC-using sandboxes behave exactly as before unless callers explicitly opt in. The added RBAC verbs are additive.Test plan
cd server && uv run pytest tests/k8s— 399 passedcd server && uv run ruff checkon changed files — cleancd sdks/sandbox/python && uv run python scripts/generate_api.py— regenerated cleanlycd sdks/sandbox/javascript && pnpm run gen:api— regenerated cleanlycreateIfNotExists=true, deleteOnSandboxTermination=true, delete sandbox, verify PVC is deleted and PV is reclaimed per StorageClass policydeleteOnSandboxTermination=false— PVC must persist after sandbox deletionmanaged-bylabel) is untouched on sandbox deletionNotes
opensandbox.io/volume-managed-by=server,opensandbox.io/id=<sandbox-id>. Both must match — extra safety against accidentally sweeping a user-managed PVC that happens to share our sandbox ID.kubernetes.io/pvc-protectionfinalizer. If pods are still terminating, K8s puts the PVC intoTerminatingand reaps it once pods are gone — also safe.