Parakeet: auto attention + 1h duration guard + BF16 recast + VRAM leak tests #8486
beastoin wants to merge 10 commits
- Track model dtype after BF16 conversion to recast after attention mode switches (prevents dtype mismatch causing OOM on long audio)
- Recast to BF16 after change_attention_model in local mode init and in _switch_attention for both directions
- Lower PARAKEET_AUTO_ATTN_THRESHOLD default from 600s to 300s for safer batch VRAM budget under concurrent requests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add soundfile as primary format detector (handles FLAC, OGG, etc.)
- Fail closed on unprobeable uploads when duration guard is enabled (return inf → rejected with 413 before reaching GPU)
- Move AUDIO_DURATION.observe after 413 guard and add isfinite check to prevent inf values from poisoning the histogram
- Add _duration_limit_detail helper for human-readable 413 messages
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
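The fail-closed behavior in this commit can be illustrated with a minimal sketch. It is not the PR's code: `probe_duration`, `check_duration`, and `DurationTooLong` are hypothetical names, the standard-library `wave` module stands in for soundfile, and a plain list stands in for the AUDIO_DURATION histogram.

```python
import io
import math
import wave


class DurationTooLong(Exception):
    """Hypothetical stand-in for an HTTP 413 response."""


def probe_duration(data: bytes) -> float:
    """Return the duration in seconds, or inf when the upload cannot be probed."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            return wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError, ZeroDivisionError):
        # Fail closed: an unknown duration is treated as infinitely long.
        return math.inf


def check_duration(duration: float, max_duration: float, observed: list) -> None:
    """Reject over-limit audio first, then record only finite durations."""
    # Assumption: max_duration <= 0 means the guard is disabled.
    if max_duration > 0 and duration > max_duration:
        raise DurationTooLong(f"audio exceeds the {max_duration:.0f}s limit")
    if math.isfinite(duration):
        observed.append(duration)  # stand-in for AUDIO_DURATION.observe(duration)
```

Ordering matters here: because the rejection happens before the observation, an unprobeable upload never reaches the metric, and the `isfinite` check covers the case where the guard is disabled.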
Tests for BF16 dtype recast after local mode init and attention mode switching in both directions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- FLAC duration detection via soundfile
- Unprobeable audio rejection on v1 and v2 endpoints
- Verify inf values don't poison AUDIO_DURATION histogram
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ATTENTION_MODE=full (safe, torch.compile compatible)
- MAX_FILE_DURATION=600 (reject files >10min to prevent OOM)
- AUTO_ATTN_THRESHOLD=300 and LOCAL_ATTN_CONTEXT=128,128 configured but inactive while mode=full
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 5 files
Confidence score: 5/5
- In `backend/tests/unit/test_parakeet_gpu_worker.py`, leaving `torch_mod.cuda.is_bf16_supported.return_value = True` on the shared `sys.modules["torch"]` mock can leak state into later tests and cause order-dependent, flaky unit-test outcomes. This is a test-isolation problem rather than a product behavior regression. Reset or scope the mock (e.g., fixture teardown or a context manager) before merging.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/tests/unit/test_parakeet_gpu_worker.py">
<violation number="1" location="backend/tests/unit/test_parakeet_gpu_worker.py:599">
P3: `torch_mod.cuda.is_bf16_supported.return_value` is set to `True` and never restored, leaking test state into subsequent tests via the shared `sys.modules["torch"]` mock.</violation>
</file>
torch_mod = sys.modules["torch"]
orig_avail = torch_mod.cuda.is_available.return_value
torch_mod.cuda.is_available.return_value = True
torch_mod.cuda.is_bf16_supported.return_value = True
P3: torch_mod.cuda.is_bf16_supported.return_value is set to True and never restored, leaking test state into subsequent tests via the shared sys.modules["torch"] mock.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/test_parakeet_gpu_worker.py, line 599:
<comment>`torch_mod.cuda.is_bf16_supported.return_value` is set to `True` and never restored, leaking test state into subsequent tests via the shared `sys.modules["torch"]` mock.</comment>
<file context>
@@ -586,6 +586,37 @@ def test_local_mode_calls_change_attention_on_load(self):
+ torch_mod = sys.modules["torch"]
+ orig_avail = torch_mod.cuda.is_available.return_value
+ torch_mod.cuda.is_available.return_value = True
+ torch_mod.cuda.is_bf16_supported.return_value = True
+
+ with patch.dict(
</file context>
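One way to address this finding is to scope the override so the original value is always restored, even if the test body raises. The sketch below is an assumption about how that could look, not the fix that landed: `override_return_value` is a hypothetical helper, and a fresh `MagicMock` stands in for the shared `sys.modules["torch"]` mock.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock


@contextmanager
def override_return_value(mock_fn, value):
    """Temporarily set mock_fn.return_value and restore the original on exit."""
    original = mock_fn.return_value
    mock_fn.return_value = value
    try:
        yield mock_fn
    finally:
        # Runs on success and on exceptions, so no state leaks to later tests.
        mock_fn.return_value = original


if __name__ == "__main__":
    torch_mod = MagicMock()  # stand-in for the shared torch mock
    torch_mod.cuda.is_bf16_supported.return_value = False
    with override_return_value(torch_mod.cuda.is_bf16_supported, True):
        print(torch_mod.cuda.is_bf16_supported())
    print(torch_mod.cuda.is_bf16_supported())
```

The same save-and-restore pattern is already used in the test for `is_available` via `orig_avail`; wrapping it in a context manager simply makes it harder to forget the restore step.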
Monitors GPU memory via nvidia-smi during concurrent requests with varying audio durations (30s-300s). Gates on peak VRAM staying below configurable threshold (default 85%). Would have caught the Phase 2 OOM from PR #8428 where disabling torch.compile tripled VRAM under batch load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
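A rough sketch of how VRAM could be sampled through nvidia-smi and compared against the 85% gate follows. The function names (`parse_gpu_memory`, `query_gpu_memory`, `within_budget`) are hypothetical and the test file's actual implementation may differ; only the first GPU line is read, which is an assumption for single-GPU pods.

```python
import subprocess
from typing import Optional, Tuple

# Standard nvidia-smi query flags; values are reported in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(output: str) -> Optional[Tuple[int, int]]:
    """Parse 'used, total' (MiB) from the first GPU line; None if unparseable."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    try:
        used, total = (int(part.strip()) for part in lines[0].split(","))
    except ValueError:
        return None
    return used, total


def query_gpu_memory() -> Optional[Tuple[int, int]]:
    """Run nvidia-smi once; None when the tool is missing or fails."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, timeout=5, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_gpu_memory(result.stdout)


def within_budget(used_mib: int, total_mib: int, threshold: float = 0.85) -> bool:
    """True when used VRAM stays below the given fraction of total VRAM."""
    return total_mib > 0 and used_mib / total_mib < threshold
```

Keeping the parsing separate from the subprocess call lets the gate logic be unit-tested on machines without a GPU.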
Standalone script that sends concurrent requests with increasing audio durations and monitors VRAM via nvidia-smi. Identifies the exact duration threshold where OOM occurs for a given config.
Usage: PARAKEET_URL=http://localhost:8080 python reproduce_parakeet_oom.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/scripts/reproduce_parakeet_oom.py">
<violation number="1" location="backend/scripts/reproduce_parakeet_oom.py:183">
P3: Unused variable `done_count` — incremented but never read</violation>
</file>
done_count = 0
for f in as_completed(futures):
    done_count += 1
P3: Unused variable done_count — incremented but never read
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/scripts/reproduce_parakeet_oom.py, line 183:
<comment>Unused variable `done_count` — incremented but never read</comment>
<file context>
@@ -0,0 +1,251 @@
+
+ done_count = 0
+ for f in as_completed(futures):
+ done_count += 1
+ cur_used, _ = get_gpu_memory()
+ if cur_used and cur_used > peak_used:
</file context>
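Two simple ways to resolve this finding are to delete the counter or to put it to use. The sketch below takes the second route and returns the completion count alongside the peak; `run_and_track_peak` and `sample_used_mib` are hypothetical names, and the real script's `get_gpu_memory()` is replaced by an injected sampler so the example runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional, Tuple


def run_and_track_peak(
    tasks: Iterable[Callable[[], object]],
    sample_used_mib: Callable[[], Optional[int]],
) -> Tuple[int, int]:
    """Run tasks concurrently; return (completed_count, peak_used_mib)."""
    peak_used = 0
    completed = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(task) for task in tasks]
        # enumerate() supplies the counter, so it is read rather than just incremented.
        for completed, future in enumerate(as_completed(futures), start=1):
            future.result()  # surface any exception raised by the task
            cur_used = sample_used_mib()
            if cur_used and cur_used > peak_used:
                peak_used = cur_used
    return completed, peak_used
```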
Git-on-my-level left a comment
Thanks for splitting out the safer Parakeet recovery path from the reverted OOM work. I reviewed the diff and this looks directionally good: the production chart keeps attention mode on full, adds a 600s pre-GPU duration guard, handles non-WAV duration probing with soundfile, fail-closes unknown-duration uploads while the guard is enabled, and avoids poisoning the duration histogram with non-finite values.
I’m leaving this as a positive signal rather than a formal approval because this changes the production Parakeet Helm config / runtime guardrails, so it should get human maintainer verification before merge.
Before merging, I’d like a maintainer to verify the deployment-specific pieces from the PR checklist:
- pod starts cleanly with the new Parakeet environment variables,
- uploads longer than 600s are rejected before GPU enqueue in the deployed service,
- normal ~30–60s traffic still transcribes successfully under the production config,
- VRAM stays healthy after rollout.
No blocking code issue found from my static review. The CI checks shown on the PR are passing.
kodjima33 left a comment
Backend: Parakeet duration guard + BF16 recast fix (beastoin) — approve only (large, deploy verification for maintainer).
ATTENTION_MODE=auto dynamically switches between full and local attention based on audio duration (threshold 300s). torch.compile is skipped in auto mode — our 20-min sustained load test on the exact prod image (a5a4262) showed no VRAM leak (slope +5.7 MiB/min, zero OOMs). 1h duration cap (3600s) provides safety for long files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
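The per-request decision that auto mode makes can be sketched as a small pure function. This is an illustration under assumptions, not the worker's code: `select_attention_mode` is a hypothetical name, and whether audio exactly at the threshold uses full or local attention is a guess (the sketch treats it as full).

```python
def select_attention_mode(
    duration_s: float,
    configured_mode: str = "auto",
    threshold_s: float = 300.0,
) -> str:
    """Return the attention mode to use for one request."""
    # Fixed modes ("full" or "local") are passed through unchanged.
    if configured_mode != "auto":
        return configured_mode
    # Assumption: strictly longer than the threshold switches to local attention.
    return "local" if duration_s > threshold_s else "full"
```

For example, with the 300s threshold a 45s clip would stay on full attention and a 20-minute file would use local attention.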
Standalone script that runs prolonged load against a parakeet pod and detects monotonic VRAM accumulation via linear regression slope. Four gates: peak VRAM < 85%, slope < 50 MiB/min, zero OOMs, recovery within 20% of baseline. Supports nvidia-smi (direct GPU) and process RSS (/metrics endpoint) monitoring modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
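The four gates can be sketched with an ordinary least-squares slope over (time, VRAM) samples. The names `vram_slope_mib_per_min` and `evaluate_gates` are hypothetical, and "recovery within 20% of baseline" is interpreted here as final usage at most 1.2 times the baseline, which is an assumption about the script's definition.

```python
from typing import Dict, Sequence


def vram_slope_mib_per_min(times_s: Sequence[float], used_mib: Sequence[float]) -> float:
    """Least-squares slope of VRAM usage over time, in MiB per minute."""
    n = len(times_s)
    if n < 2 or n != len(used_mib):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times_s) / n
    mean_v = sum(used_mib) / n
    sxx = sum((t - mean_t) ** 2 for t in times_s)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times_s, used_mib))
    return (sxy / sxx) * 60.0  # convert MiB/s to MiB/min


def evaluate_gates(
    times_s: Sequence[float],
    used_mib: Sequence[float],
    total_mib: float,
    baseline_mib: float,
    final_mib: float,
    oom_count: int,
) -> Dict[str, bool]:
    """Evaluate the four pass/fail gates described in the commit message."""
    return {
        "peak_below_85pct": max(used_mib) < 0.85 * total_mib,
        "slope_below_50_mib_per_min": vram_slope_mib_per_min(times_s, used_mib) < 50.0,
        "zero_ooms": oom_count == 0,
        # Assumption: "within 20% of baseline" means final <= 1.2 * baseline.
        "recovered_near_baseline": final_mib <= 1.20 * baseline_mib,
    }
```

A regression slope is less sensitive to a single noisy sample than comparing only the first and last readings, which is presumably why it suits detecting slow monotonic growth.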
Adds sustained load class that runs 5-min (configurable) test at 30 req/min and checks VRAM slope via linear regression. Catches the type of monotonic leak that caused the Phase 2 OOM — burst tests miss it because VRAM releases between batches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3 issues found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/tests/container/test_parakeet_vram_stress.py">
<violation number="1" location="backend/tests/container/test_parakeet_vram_stress.py:437">
P2: pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.</violation>
</file>
<file name="backend/charts/parakeet/prod_omi_parakeet_values.yaml">
<violation number="1" location="backend/charts/parakeet/prod_omi_parakeet_values.yaml:108">
P1: PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."</violation>
<violation number="2" location="backend/charts/parakeet/prod_omi_parakeet_values.yaml:114">
P1: PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.</violation>
</file>
- name: PARAKEET_LOCAL_ATTN_CONTEXT
  value: "128,128"
- name: PARAKEET_MAX_FILE_DURATION
  value: "3600"
P1: PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/charts/parakeet/prod_omi_parakeet_values.yaml, line 114:
<comment>PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.</comment>
<file context>
@@ -105,13 +105,13 @@ env:
value: "128,128"
- name: PARAKEET_MAX_FILE_DURATION
- value: "600"
+ value: "3600"
- name: HOSTED_SPEAKER_EMBEDDING_API_URL
value: "http://prod-omi-diarizer.prod-omi-backend.svc.cluster.local:8080"
</file context>
Suggested change:
- value: "3600"
+ value: "600"
- name: PARAKEET_CUDA_GRAPHS
  value: "true"
- name: PARAKEET_ATTENTION_MODE
  value: "auto"
P1: PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/charts/parakeet/prod_omi_parakeet_values.yaml, line 108:
<comment>PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."</comment>
<file context>
@@ -105,13 +105,13 @@ env:
value: "true"
- name: PARAKEET_ATTENTION_MODE
- value: "full"
+ value: "auto"
- name: PARAKEET_AUTO_ATTN_THRESHOLD
value: "300"
</file context>
Suggested change:
- value: "auto"
+ value: "full"
except KeyboardInterrupt:
    pass

pool.shutdown(wait=False)
P2: pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/container/test_parakeet_vram_stress.py, line 437:
<comment>pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.</comment>
<file context>
@@ -354,6 +354,127 @@ def test_no_oom_at_production_pattern(self, gpu_available):
+ except KeyboardInterrupt:
+ pass
+
+ pool.shutdown(wait=False)
+ time.sleep(10)
+ monitor.stop()
</file context>
Suggested change:
- pool.shutdown(wait=False)
+ pool.shutdown(wait=True)
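The effect of the suggested change can be shown with a small stand-alone sketch. `fake_request` and `run_and_tally` are hypothetical stand-ins for the stress test's request function and tally logic; the point is only that `shutdown(wait=True)` blocks until every submitted task has finished, so no fixed sleep is needed before counting results.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence


def fake_request(delay_s: float) -> str:
    """Stand-in for one transcription request."""
    time.sleep(delay_s)
    return "ok"


def run_and_tally(delays_s: Sequence[float]) -> int:
    """Submit all requests, wait for them, and return the success count."""
    pool = ThreadPoolExecutor(max_workers=4)
    futures = [pool.submit(fake_request, d) for d in delays_s]
    # Blocks until all in-flight and queued tasks have completed.
    pool.shutdown(wait=True)
    # Every future is done here, so the tally cannot miss late finishers.
    return sum(1 for f in futures if f.exception() is None)
```

With `wait=False` followed by a fixed sleep, any request slower than the sleep would still be running when the tallies are read, which is the gap the review comment describes.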
Summary
Re-land of the attention mode system from reverted PR #8428, with auto attention enabled and a 1-hour duration guard. OOM reproduction confirmed that the VRAM leak is NOT in the parakeet /v2/transcribe code; the prod OOM was caused by something in the full pipeline (WebSocket /v3/stream, diarizer, concurrent sessions).
What changed from PR #8428
Production config
OOM reproduction evidence
Built the exact prod OOM image (commit a5a4262, ATTENTION_MODE=auto) on dev and ran 20-min sustained load:
Test plan
- gcp_parakeet.yml and verify pod starts cleanly
- MAX_FILE_DURATION=3600 rejects oversized uploads with 413
Replaces reverted #8428 and closed #8484.
by AI for @beastoin