Skip to content

fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted#953

Open
longsuizhi wants to merge 3 commits into
alibaba:mainfrom
longsuizhi:fix/pool-rebind-deleted-pod
Open

fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted#953
longsuizhi wants to merge 3 commits into
alibaba:mainfrom
longsuizhi:fix/pool-rebind-deleted-pod

Conversation

@longsuizhi
Copy link
Copy Markdown
Contributor

@longsuizhi longsuizhi commented May 28, 2026

Fixes #954

问题描述

在 Pool 模式下,当已分配给 BatchSandbox 的 Pod 被外部删除(手动 delete、节点驱逐、OOM Kill 等)后,BatchSandbox 不会从 Pool 中重新获取新 Pod,导致 sandbox 永久不可用。

根因

allocator.gogetSandboxRequest 通过 alloc-status 注解获取已分配 Pod 列表计算:

supplement = replicas - len(allocated)

Pod 被删除后注解未清理,len(allocated) 不变,supplement = 0,永远不会触发重新分配。

修复方案

getSandboxRequest 中增加存活检测:

  1. 将 Pool 的活跃 Pod 集合(仅 Running+Ready)传入 getSandboxRequest
  2. 对比 alloc-status 中的 Pod 与实际存活 Pod,过滤出已删除的 Pod
  3. 已删除的 Pod 从有效分配中排除(使 supplement > 0 触发重新分配)
  4. 已删除的 Pod 加入 ToRelease 队列清理分配记录
  5. Released 且已不存在的 Pod 不再占用 slot,避免阻塞后续重新分配
  6. Terminal 状态的 Pod(Failed/Evicted)不计入活跃集合,同样触发重新分配

测试结果

所有单测通过(含新增 3 个用例):

--- PASS: TestSchedule/allocate_normally_-_2_pods_for_2_sandboxes
--- PASS: TestSchedule/skip_non-running_pods
--- PASS: TestSchedule/partial_allocated_-_allocate_remaining
--- PASS: TestSchedule/with_release_-_pods_to_release
--- PASS: TestSchedule/partial_release_-_only_unreleased_pods_in_ToRelease
--- PASS: TestSchedule/not_enough_pods_-_PodSupplement_>_0
--- PASS: TestSchedule/skip_already_allocated_pod_-_other_sandbox_gets_supplement
--- PASS: TestSchedule/terminating_sandbox_-_queue_unreleased_pods_for_release,_no_supplement
--- PASS: TestSchedule/dead_pod_triggers_re-allocation_-_allocated_pod_deleted_externally (新增)
--- PASS: TestSchedule/released_pod_does_not_block_second_re-allocation (新增)
--- PASS: TestSchedule/terminal_pod_(Failed)_triggers_re-allocation (新增)
--- PASS: TestSchedule/orphan_sandbox_-_pods_in_store_but_sandbox_no_longer_in_spec
PASS
ok  github.com/alibaba/OpenSandbox/sandbox-k8s/internal/controller  1.173s

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 28, 2026

CLA assistant check
All committers have signed the CLA.

@longsuizhi longsuizhi changed the title fix: Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod fix(allocator): Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod May 28, 2026
@longsuizhi longsuizhi changed the title fix(allocator): Pool 模式下 Pod 被删除后 BatchSandbox 不会重新分配新 Pod fix(allocator): BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted May 28, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e567cc496f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4b8e95d88

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
@github-actions
Copy link
Copy Markdown
Contributor

⚠️ This PR has no labels. Please add one based on the changes.

Changed directories: kubernetes.

📋 Recommended labels (based on changed files):

  • component/k8s ⬅️

Other available labels:

  • bug - Something isn't working
  • dependencies - Pull requests that update a dependency file
  • documentation - Improvements or additions to documentation
  • feature - New feature or request
  • packages - Changes for package, image and configuration

💡 Tip: Use feature for new functionality or improvements, bug for fixes.

cc @longsuizhi

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c9e7537fe

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go Outdated
Comment thread kubernetes/internal/controller/allocator.go Outdated
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch 2 times, most recently from 53be88c to 56f2ed0 Compare May 28, 2026 08:18
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56f2ed04a4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
Comment thread kubernetes/internal/controller/allocator.go Outdated
Comment thread kubernetes/internal/controller/allocator.go Outdated
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch from 56f2ed0 to ecdd9ea Compare May 28, 2026 08:33
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecdd9eadef

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go Outdated
Comment thread kubernetes/internal/controller/allocator.go
Comment thread kubernetes/internal/controller/allocator.go Outdated
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch from ecdd9ea to 20956e7 Compare May 28, 2026 09:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20956e73ce

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
当 Pool 中已分配给 BatchSandbox 的 Pod 被外部删除时,alloc-status 注解中仍保留
已删除 Pod 的名称,导致 supplement 计算为 0,无法触发重新分配。

本次修复在 getSandboxRequest 中增加了存活检测:将已删除的 Pod 从有效分配中
排除并加入 ToRelease 队列,使 supplement > 0 从而触发 Pool 重新分配新 Pod。
…ods from live allocation

- P1: Released pods no longer count toward liveAllocated, preventing
  stale released entries from blocking subsequent re-allocations.
- P2: Only Running+Ready pods are added to livePodSet, so terminal
  pods (Failed/Evicted) that still have their object present also
  trigger re-allocation.
@longsuizhi longsuizhi force-pushed the fix/pool-rebind-deleted-pod branch from 20956e7 to 46d6222 Compare June 1, 2026 09:33
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46d6222c71

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread kubernetes/internal/controller/allocator.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: BatchSandbox does not rebind a new Pod from Pool after allocated Pod is deleted

2 participants