Parakeet: auto attention + 1h duration guard + BF16 recast + VRAM leak tests by beastoin · Pull Request #8486 · BasedHardware/omi

beastoin · 2026-06-28T04:46:01Z

Summary

Re-land of attention mode system from reverted PR #8428, with auto attention enabled and 1-hour duration guard. OOM reproduction confirmed the VRAM leak is NOT in parakeet /v2/transcribe code — the prod OOM was caused by something in the full pipeline (WebSocket /v3/stream, diarizer, concurrent sessions).

Auto attention mode — dynamically switches between full and local attention based on audio duration (threshold 300s). torch.compile is skipped in auto mode
Duration guard — rejects files >3600s (1 hour) before they reach the GPU
BF16 dtype tracking — fixes dtype mismatch after attention mode switch
Soundfile detection — handles FLAC, OGG, and other non-WAV formats in the pre-batch duration check
Fail-closed on unprobeable uploads — returns 413 instead of sending unknown-duration files to GPU
Metrics poisoning fix — inf values from unprobeable files no longer corrupt the AUDIO_DURATION histogram
Sustained VRAM leak test — new test script + container test class that detects monotonic VRAM accumulation via linear regression slope analysis

What changed from PR #8428

Feature	PR #8428	This PR
Auto attention switching	Included (caused OOMs)	Enabled — OOM root cause confirmed outside parakeet
Duration guard	MAX_FILE_DURATION=0 (disabled)	MAX_FILE_DURATION=3600 (1 hour)
BF16 recast fix	Included	Included (unchanged)
Soundfile detection	Included	Included (unchanged)
Metrics poisoning fix	Included	Included (unchanged)
torch.compile	Disabled by auto mode	Disabled by auto mode (same)
VRAM leak detection	None	New sustained leak test + script

Production config

PARAKEET_ATTENTION_MODE: "auto"         # dynamic full/local switching
PARAKEET_MAX_FILE_DURATION: "3600"      # reject >1hr files
PARAKEET_AUTO_ATTN_THRESHOLD: "300"     # switch to local at 5min
PARAKEET_LOCAL_ATTN_CONTEXT: "128,128"  # local attention window
PARAKEET_TORCH_COMPILE: "true"          # configured but skipped by auto mode
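
As a rough illustration of how these settings could drive per-request mode selection, here is a minimal sketch. Only the environment variable names and values come from the config above; `AttentionConfig`, `load_attention_config`, `pick_attention`, the fallback mode of "full", and the strict `>` comparison at the threshold are assumptions made for the example, not the PR's actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    mode: str             # "full", "local", or "auto"
    threshold_s: float    # in auto mode, durations above this use local attention
    local_context: tuple  # (left, right) local attention window


def load_attention_config(env=os.environ) -> AttentionConfig:
    """Read the PARAKEET_* settings from a mapping (defaults are assumptions)."""
    left, right = env.get("PARAKEET_LOCAL_ATTN_CONTEXT", "128,128").split(",")
    return AttentionConfig(
        mode=env.get("PARAKEET_ATTENTION_MODE", "full"),
        threshold_s=float(env.get("PARAKEET_AUTO_ATTN_THRESHOLD", "300")),
        local_context=(int(left), int(right)),
    )


def pick_attention(cfg: AttentionConfig, duration_s: float) -> str:
    """Return the attention mode to use for one file of the given duration."""
    if cfg.mode != "auto":
        return cfg.mode  # fixed modes ignore the duration
    return "local" if duration_s > cfg.threshold_s else "full"
```

With the production values, a 99 s file would stay on full attention and a 20-minute file would switch to local attention.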

OOM reproduction evidence

Built the exact prod OOM image (commit a5a4262, ATTENTION_MODE=auto) on dev and ran 20-min sustained load:

1200 requests at 60 req/min with real LibriSpeech audio (avg 99s)
VRAM slope: +5.7 MiB/min (prod OOM was ~740 MiB/min)
Zero OOMs, peak VRAM well within limits
Conclusion: leak is NOT in parakeet /v2/transcribe — safe to enable auto mode

Test plan

89 gpu_worker unit tests pass (including attention switching, BF16, duration guard)
30 endpoint unit tests pass (including duration guard 413 tests)
Sustained VRAM leak test passes on exact prod image (a5a4262)
VRAM stress test infrastructure added (burst + sustained)
Deploy with gcp_parakeet.yml and verify pod starts cleanly
Verify auto attention switching logs for files >300s
Verify MAX_FILE_DURATION=3600 rejects oversized uploads with 413
Monitor VRAM + OOM rate for 1 hour post-deploy

Replaces reverted #8428 and closed #8484.

by AI for @beastoin

- Track model dtype after BF16 conversion to recast after attention mode switches (prevents dtype mismatch causing OOM on long audio)
- Recast to BF16 after change_attention_model in local mode init and in _switch_attention for both directions
- Lower PARAKEET_AUTO_ATTN_THRESHOLD default from 600s to 300s for safer batch VRAM budget under concurrent requests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
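
The dtype tracking described in this commit can be pictured with a small stand-in. This is a sketch under stated assumptions: `StubModel` is a hypothetical placeholder that does not use torch or NeMo, its `change_attention_model(mode)` signature is simplified for illustration, and the behaviour that a switch leaves the model in default precision is assumed from the commit message rather than verified.

```python
class StubModel:
    """Hypothetical stand-in for the ASR model; not the real NeMo/torch API."""

    def __init__(self):
        self.dtype = "float32"
        self.attention = "full"

    def to(self, dtype):
        self.dtype = dtype
        return self

    def change_attention_model(self, mode):
        # Assumed behaviour motivating the fix: after a switch the model
        # is no longer uniformly in the previously converted dtype.
        self.attention = mode
        self.dtype = "float32"


class AttentionSwitcher:
    """Remembers the converted dtype and reapplies it after every switch."""

    def __init__(self, model):
        self.model = model
        self.tracked_dtype = None  # set once the model has been converted

    def convert(self, dtype):
        self.model.to(dtype)
        self.tracked_dtype = dtype

    def switch(self, mode):
        if self.model.attention == mode:
            return  # nothing to do
        self.model.change_attention_model(mode)
        if self.tracked_dtype is not None:
            self.model.to(self.tracked_dtype)  # recast in both directions
```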

- Add soundfile as primary format detector (handles FLAC, OGG, etc.)
- Fail closed on unprobeable uploads when duration guard is enabled (return inf → rejected with 413 before reaching GPU)
- Move AUDIO_DURATION.observe after 413 guard and add isfinite check to prevent inf values from poisoning the histogram
- Add _duration_limit_detail helper for human-readable 413 messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
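
The ordering described here (probe, reject, then record the metric only for finite values) can be sketched as follows. This is an illustration, not the PR's code: the stdlib `wave` module stands in for soundfile so the example stays self-contained, and `probe_duration_s`, `check_upload`, and the `observe` callback are hypothetical names.

```python
import io
import math
import wave


def probe_duration_s(data: bytes) -> float:
    """Return the duration in seconds, or inf when the upload cannot be probed."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            return wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError, ZeroDivisionError):
        return math.inf  # unknown duration: let the guard decide


def check_upload(data: bytes, max_duration_s: float, observe) -> int:
    """Return an HTTP-style status; max_duration_s == 0 disables the guard."""
    duration = probe_duration_s(data)
    if max_duration_s > 0 and duration > max_duration_s:
        return 413  # fail closed: inf exceeds any finite limit
    if math.isfinite(duration):
        observe(duration)  # only finite values reach the histogram
    return 200
```

Because `inf` compares greater than any finite limit, an unprobeable upload is rejected whenever the guard is enabled, and the `isfinite` check keeps it out of the histogram even when the guard is disabled.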

Tests for BF16 dtype recast after local mode init and attention mode switching in both directions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- FLAC duration detection via soundfile
- Unprobeable audio rejection on v1 and v2 endpoints
- Verify inf values don't poison AUDIO_DURATION histogram

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ATTENTION_MODE=full (safe, torch.compile compatible)
MAX_FILE_DURATION=600 (reject files >10min to prevent OOM)
AUTO_ATTN_THRESHOLD=300 and LOCAL_ATTN_CONTEXT=128,128 configured but inactive while mode=full.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 5 files

Confidence score: 5/5

In backend/tests/unit/test_parakeet_gpu_worker.py, leaving torch_mod.cuda.is_bf16_supported.return_value = True on the shared sys.modules["torch"] mock can leak state into later tests and cause order-dependent/flaky unit-test outcomes rather than product behavior regressions—reset or scope the mock (e.g., fixture teardown/context manager) before merging.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/unit/test_parakeet_gpu_worker.py">

<violation number="1" location="backend/tests/unit/test_parakeet_gpu_worker.py:599">
P3: `torch_mod.cuda.is_bf16_supported.return_value` is set to `True` and never restored, leaking test state into subsequent tests via the shared `sys.modules["torch"]` mock.</violation>
</file>


cubic-dev-ai · 2026-06-28T04:49:52Z

+        torch_mod = sys.modules["torch"]
+        orig_avail = torch_mod.cuda.is_available.return_value
+        torch_mod.cuda.is_available.return_value = True
+        torch_mod.cuda.is_bf16_supported.return_value = True


P3: torch_mod.cuda.is_bf16_supported.return_value is set to True and never restored, leaking test state into subsequent tests via the shared sys.modules["torch"] mock.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/test_parakeet_gpu_worker.py, line 599: <comment>`torch_mod.cuda.is_bf16_supported.return_value` is set to `True` and never restored, leaking test state into subsequent tests via the shared `sys.modules["torch"]` mock.</comment> <file context> @@ -586,6 +586,37 @@ def test_local_mode_calls_change_attention_on_load(self): + torch_mod = sys.modules["torch"] + orig_avail = torch_mod.cuda.is_available.return_value + torch_mod.cuda.is_available.return_value = True + torch_mod.cuda.is_bf16_supported.return_value = True + + with patch.dict( </file context>
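
One way to scope the mock state, as the review suggests, is a small context manager that saves and restores both return values. This is a sketch only: `cuda_flags` is a hypothetical helper, and in the example a fresh `MagicMock` stands in for the shared `sys.modules["torch"]` mock.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock


@contextmanager
def cuda_flags(torch_mod, *, available, bf16):
    """Temporarily set the CUDA capability flags on a mocked torch module."""
    cuda = torch_mod.cuda
    saved = (
        cuda.is_available.return_value,
        cuda.is_bf16_supported.return_value,
    )
    cuda.is_available.return_value = available
    cuda.is_bf16_supported.return_value = bf16
    try:
        yield torch_mod
    finally:
        # Restore both values even if the test body raises.
        (
            cuda.is_available.return_value,
            cuda.is_bf16_supported.return_value,
        ) = saved


# Example: a MagicMock standing in for the shared torch mock.
torch_mod = MagicMock(name="torch")
torch_mod.cuda.is_available.return_value = False
torch_mod.cuda.is_bf16_supported.return_value = False

with cuda_flags(torch_mod, available=True, bf16=True):
    inside = torch_mod.cuda.is_bf16_supported()
after = torch_mod.cuda.is_bf16_supported()
```

The same save-and-restore could equally live in a fixture teardown; the point is that later tests see the original values.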

Monitors GPU memory via nvidia-smi during concurrent requests with varying audio durations (30s-300s). Gates on peak VRAM staying below configurable threshold (default 85%). Would have caught the Phase 2 OOM from PR #8428 where disabling torch.compile tripled VRAM under batch load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Standalone script that sends concurrent requests with increasing audio durations and monitors VRAM via nvidia-smi. Identifies the exact duration threshold where OOM occurs for a given config. Usage: PARAKEET_URL=http://localhost:8080 python reproduce_parakeet_oom.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/scripts/reproduce_parakeet_oom.py">

<violation number="1" location="backend/scripts/reproduce_parakeet_oom.py:183">
P3: Unused variable `done_count` — incremented but never read</violation>
</file>


cubic-dev-ai · 2026-06-28T04:55:43Z

+
+            done_count = 0
+            for f in as_completed(futures):
+                done_count += 1


P3: Unused variable done_count — incremented but never read

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/scripts/reproduce_parakeet_oom.py, line 183: <comment>Unused variable `done_count` — incremented but never read</comment> <file context> @@ -0,0 +1,251 @@ + + done_count = 0 + for f in as_completed(futures): + done_count += 1 + cur_used, _ = get_gpu_memory() + if cur_used and cur_used > peak_used: </file context>

Git-on-my-level

Thanks for splitting out the safer Parakeet recovery path from the reverted OOM work. I reviewed the diff and this looks directionally good: the production chart keeps attention mode on full, adds a 600s pre-GPU duration guard, handles non-WAV duration probing with soundfile, fail-closes unknown-duration uploads while the guard is enabled, and avoids poisoning the duration histogram with non-finite values.

I’m leaving this as a positive signal rather than a formal approval because this changes the production Parakeet Helm config / runtime guardrails, so it should get human maintainer verification before merge.

Before merging, I’d like a maintainer to verify the deployment-specific pieces from the PR checklist:

pod starts cleanly with the new Parakeet environment variables,
600s uploads are rejected before GPU enqueue in the deployed service,
normal ~30–60s traffic still transcribes successfully under the production config,
VRAM stays healthy after rollout.

No blocking code issue found from my static review. The CI checks shown on the PR are passing.

kodjima33

Backend: Parakeet duration guard + BF16 recast fix (beastoin) — approve only (large, deploy verification for maintainer).

ATTENTION_MODE=auto dynamically switches between full and local attention based on audio duration (threshold 300s). torch.compile is skipped in auto mode — our 20-min sustained load test on the exact prod image (a5a4262) showed no VRAM leak (slope +5.7 MiB/min, zero OOMs). 1h duration cap (3600s) provides safety for long files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Standalone script that runs prolonged load against a parakeet pod and detects monotonic VRAM accumulation via linear regression slope. Four gates: peak VRAM < 85%, slope < 50 MiB/min, zero OOMs, recovery within 20% of baseline. Supports nvidia-smi (direct GPU) and process RSS (/metrics endpoint) monitoring modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
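
The slope check and the four gates described above could look roughly like this. It is a sketch under assumptions: the function names and argument layout are hypothetical, samples are taken to be `(seconds, MiB)` pairs, and "recovery within 20% of baseline" is interpreted as post-load VRAM being at most 1.2 times the baseline.

```python
def vram_slope_mib_per_min(samples):
    """Least-squares slope of (seconds, MiB) samples, in MiB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return sxy / sxx * 60.0  # per-second slope converted to per-minute


def check_gates(samples, total_mib, baseline_mib, recovered_mib, oom_count,
                max_peak_frac=0.85, max_slope=50.0, recovery_tol=0.20):
    """Evaluate the four gates; every value must be True for the run to pass."""
    peak = max(v for _, v in samples)
    return {
        "peak": peak < max_peak_frac * total_mib,
        "slope": vram_slope_mib_per_min(samples) < max_slope,
        "oom": oom_count == 0,
        "recovery": recovered_mib <= baseline_mib * (1 + recovery_tol),
    }
```

A run growing at about 5.7 MiB/min would pass the slope gate with the default 50 MiB/min limit, while one growing at about 740 MiB/min would fail it.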

Adds sustained load class that runs 5-min (configurable) test at 30 req/min and checks VRAM slope via linear regression. Catches the type of monotonic leak that caused the Phase 2 OOM — burst tests miss it because VRAM releases between batches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cubic-dev-ai

3 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/container/test_parakeet_vram_stress.py">

<violation number="1" location="backend/tests/container/test_parakeet_vram_stress.py:437">
P2: pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.</violation>
</file>

<file name="backend/charts/parakeet/prod_omi_parakeet_values.yaml">

<violation number="1" location="backend/charts/parakeet/prod_omi_parakeet_values.yaml:108">
P1: PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."</violation>

<violation number="2" location="backend/charts/parakeet/prod_omi_parakeet_values.yaml:114">
P1: PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.</violation>
</file>


cubic-dev-ai · 2026-06-29T01:33:22Z

+  - name: PARAKEET_LOCAL_ATTN_CONTEXT
+    value: "128,128"
+  - name: PARAKEET_MAX_FILE_DURATION
+    value: "3600"


P1: PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/charts/parakeet/prod_omi_parakeet_values.yaml, line 114: <comment>PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.</comment> <file context> @@ -105,13 +105,13 @@ env: value: "128,128" - name: PARAKEET_MAX_FILE_DURATION - value: "600" + value: "3600" - name: HOSTED_SPEAKER_EMBEDDING_API_URL value: "http://prod-omi-diarizer.prod-omi-backend.svc.cluster.local:8080" </file context>

Suggested change

-    value: "3600"
+    value: "600"

cubic-dev-ai · 2026-06-29T01:33:22Z

  - name: PARAKEET_CUDA_GRAPHS
    value: "true"
+  - name: PARAKEET_ATTENTION_MODE
+    value: "auto"


P1: PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/charts/parakeet/prod_omi_parakeet_values.yaml, line 108: <comment>PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."</comment> <file context> @@ -105,13 +105,13 @@ env: value: "true" - name: PARAKEET_ATTENTION_MODE - value: "full" + value: "auto" - name: PARAKEET_AUTO_ATTN_THRESHOLD value: "300" </file context>

Suggested change

-    value: "auto"
+    value: "full"

cubic-dev-ai · 2026-06-29T01:33:22Z

+        except KeyboardInterrupt:
+            pass
+
+        pool.shutdown(wait=False)


P2: pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/container/test_parakeet_vram_stress.py, line 437: <comment>pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.</comment> <file context> @@ -354,6 +354,127 @@ def test_no_oom_at_production_pattern(self, gpu_available): + except KeyboardInterrupt: + pass + + pool.shutdown(wait=False) + time.sleep(10) + monitor.stop() </file context>

Suggested change

-        pool.shutdown(wait=False)
+        pool.shutdown(wait=True)
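
To illustrate why waiting matters for the tallies, here is a minimal sketch of a load loop that only counts results after every in-flight task has finished. `run_load` and the `send` callable are hypothetical; the real test also monitors VRAM, which is omitted here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, n_requests, workers=4):
    """Submit n_requests calls to send(i) and tally them after all finish."""
    tallies = {"ok": 0, "fail": 0}
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(send, i) for i in range(n_requests)]
    # wait=True blocks until every submitted task has completed, so the
    # tallies below cannot miss requests that were still in flight.
    pool.shutdown(wait=True)
    for fut in futures:
        if fut.exception() is None and fut.result():
            tallies["ok"] += 1
        else:
            tallies["fail"] += 1
    return tallies


def fake_send(i):
    """Stand-in for an HTTP request: even-numbered requests succeed."""
    time.sleep(0.01)
    return i % 2 == 0
```

With `wait=False` followed by a fixed sleep, any request slower than the sleep would be missing from the counts; blocking on shutdown removes that timing dependency.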

beastoin and others added 5 commits June 28, 2026 04:45

Add BF16 recast and attention switch tests

bfe256a


Add FLAC detection, unprobeable rejection, and metrics poisoning tests

2f1626e


cubic-dev-ai Bot reviewed Jun 28, 2026


beastoin and others added 2 commits June 28, 2026 04:51

cubic-dev-ai Bot reviewed Jun 28, 2026


Git-on-my-level added labels needs-maintainer-review (Needs a human maintainer to review/approve, e.g. stacked, product, or architecture judgment) and workflow-review (Needs maintainer review for workflow, automation, hooks, or CI behavior) Jun 28, 2026

Git-on-my-level reviewed Jun 28, 2026


kodjima33 approved these changes Jun 28, 2026


beastoin and others added 3 commits June 29, 2026 01:29

beastoin changed the title from "Parakeet: duration guard (600s) + BF16 recast fix" to "Parakeet: auto attention + 1h duration guard + BF16 recast + VRAM leak tests" Jun 29, 2026

cubic-dev-ai Bot reviewed Jun 29, 2026


beastoin wants to merge 10 commits into main from fix/parakeet-duration-guard-bf16-clean

3 participants
