Skip to content

Parakeet: auto attention + 1h duration guard + BF16 recast + VRAM leak tests#8486

Open
beastoin wants to merge 10 commits into
mainfrom
fix/parakeet-duration-guard-bf16-clean
Open

Parakeet: auto attention + 1h duration guard + BF16 recast + VRAM leak tests#8486
beastoin wants to merge 10 commits into
mainfrom
fix/parakeet-duration-guard-bf16-clean

Conversation

@beastoin

@beastoin beastoin commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

Re-land of attention mode system from reverted PR #8428, with auto attention enabled and 1-hour duration guard. OOM reproduction confirmed the VRAM leak is NOT in parakeet /v2/transcribe code — the prod OOM was caused by something in the full pipeline (WebSocket /v3/stream, diarizer, concurrent sessions).

  • Auto attention mode — dynamically switches between full and local attention based on audio duration (threshold 300s). torch.compile is skipped in auto mode
  • Duration guard — rejects files >3600s (1 hour) before they reach the GPU
  • BF16 dtype tracking — fixes dtype mismatch after attention mode switch
  • Soundfile detection — handles FLAC, OGG, and other non-WAV formats in the pre-batch duration check
  • Fail-closed on unprobeable uploads — returns 413 instead of sending unknown-duration files to GPU
  • Metrics poisoning fix — inf values from unprobeable files no longer corrupt the AUDIO_DURATION histogram
  • Sustained VRAM leak test — new test script + container test class that detects monotonic VRAM accumulation via linear regression slope analysis

What changed from PR #8428

Feature PR #8428 This PR
Auto attention switching Included (caused OOMs) Enabled — OOM root cause confirmed outside parakeet
Duration guard MAX_FILE_DURATION=0 (disabled) MAX_FILE_DURATION=3600 (1 hour)
BF16 recast fix Included Included (unchanged)
Soundfile detection Included Included (unchanged)
Metrics poisoning fix Included Included (unchanged)
torch.compile Disabled by auto mode Disabled by auto mode (same)
VRAM leak detection None New sustained leak test + script

Production config

PARAKEET_ATTENTION_MODE: "auto"         # dynamic full/local switching
PARAKEET_MAX_FILE_DURATION: "3600"      # reject >1hr files
PARAKEET_AUTO_ATTN_THRESHOLD: "300"     # switch to local at 5min
PARAKEET_LOCAL_ATTN_CONTEXT: "128,128"  # local attention window
PARAKEET_TORCH_COMPILE: "true"          # configured but skipped by auto mode

OOM reproduction evidence

Built the exact prod OOM image (commit a5a4262, ATTENTION_MODE=auto) on dev and ran 20-min sustained load:

  • 1200 requests at 60 req/min with real LibriSpeech audio (avg 99s)
  • VRAM slope: +5.7 MiB/min (prod OOM was ~740 MiB/min)
  • Zero OOMs, peak VRAM well within limits
  • Conclusion: leak is NOT in parakeet /v2/transcribe — safe to enable auto mode

Test plan

  • 89 gpu_worker unit tests pass (including attention switching, BF16, duration guard)
  • 30 endpoint unit tests pass (including duration guard 413 tests)
  • Sustained VRAM leak test passes on exact prod image (a5a4262)
  • VRAM stress test infrastructure added (burst + sustained)
  • Deploy with gcp_parakeet.yml and verify pod starts cleanly
  • Verify auto attention switching logs for files >300s
  • Verify MAX_FILE_DURATION=3600 rejects oversized uploads with 413
  • Monitor VRAM + OOM rate for 1 hour post-deploy

Replaces reverted #8428 and closed #8484.

by AI for @beastoin

beastoin and others added 5 commits June 28, 2026 04:45
- Track model dtype after BF16 conversion to recast after attention
  mode switches (prevents dtype mismatch causing OOM on long audio)
- Recast to BF16 after change_attention_model in local mode init
  and in _switch_attention for both directions
- Lower PARAKEET_AUTO_ATTN_THRESHOLD default from 600s to 300s for
  safer batch VRAM budget under concurrent requests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add soundfile as primary format detector (handles FLAC, OGG, etc.)
- Fail closed on unprobeable uploads when duration guard is enabled
  (return inf → rejected with 413 before reaching GPU)
- Move AUDIO_DURATION.observe after 413 guard and add isfinite check
  to prevent inf values from poisoning the histogram
- Add _duration_limit_detail helper for human-readable 413 messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for BF16 dtype recast after local mode init and attention
mode switching in both directions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- FLAC duration detection via soundfile
- Unprobeable audio rejection on v1 and v2 endpoints
- Verify inf values don't poison AUDIO_DURATION histogram

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ATTENTION_MODE=full (safe, torch.compile compatible)
MAX_FILE_DURATION=600 (reject files >10min to prevent OOM)
AUTO_ATTN_THRESHOLD=300 and LOCAL_ATTN_CONTEXT=128,128 configured
but inactive while mode=full.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files

Confidence score: 5/5

  • In backend/tests/unit/test_parakeet_gpu_worker.py, leaving torch_mod.cuda.is_bf16_supported.return_value = True on the shared sys.modules["torch"] mock can leak state into later tests and cause order-dependent/flaky unit-test outcomes rather than product behavior regressions—reset or scope the mock (e.g., fixture teardown/context manager) before merging.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/unit/test_parakeet_gpu_worker.py">

<violation number="1" location="backend/tests/unit/test_parakeet_gpu_worker.py:599">
P3: `torch_mod.cuda.is_bf16_supported.return_value` is set to `True` and never restored, leaking test state into subsequent tests via the shared `sys.modules["torch"]` mock.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

torch_mod = sys.modules["torch"]
orig_avail = torch_mod.cuda.is_available.return_value
torch_mod.cuda.is_available.return_value = True
torch_mod.cuda.is_bf16_supported.return_value = True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: torch_mod.cuda.is_bf16_supported.return_value is set to True and never restored, leaking test state into subsequent tests via the shared sys.modules["torch"] mock.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/unit/test_parakeet_gpu_worker.py, line 599:

<comment>`torch_mod.cuda.is_bf16_supported.return_value` is set to `True` and never restored, leaking test state into subsequent tests via the shared `sys.modules["torch"]` mock.</comment>

<file context>
@@ -586,6 +586,37 @@ def test_local_mode_calls_change_attention_on_load(self):
+        torch_mod = sys.modules["torch"]
+        orig_avail = torch_mod.cuda.is_available.return_value
+        torch_mod.cuda.is_available.return_value = True
+        torch_mod.cuda.is_bf16_supported.return_value = True
+
+        with patch.dict(
</file context>

beastoin and others added 2 commits June 28, 2026 04:51
Monitors GPU memory via nvidia-smi during concurrent requests with
varying audio durations (30s-300s).  Gates on peak VRAM staying
below configurable threshold (default 85%).

Would have caught the Phase 2 OOM from PR #8428 where disabling
torch.compile tripled VRAM under batch load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standalone script that sends concurrent requests with increasing
audio durations and monitors VRAM via nvidia-smi.  Identifies the
exact duration threshold where OOM occurs for a given config.

Usage: PARAKEET_URL=http://localhost:8080 python reproduce_parakeet_oom.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/scripts/reproduce_parakeet_oom.py">

<violation number="1" location="backend/scripts/reproduce_parakeet_oom.py:183">
P3: Unused variable `done_count` — incremented but never read</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic


done_count = 0
for f in as_completed(futures):
done_count += 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Unused variable done_count — incremented but never read

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/scripts/reproduce_parakeet_oom.py, line 183:

<comment>Unused variable `done_count` — incremented but never read</comment>

<file context>
@@ -0,0 +1,251 @@
+
+            done_count = 0
+            for f in as_completed(futures):
+                done_count += 1
+                cur_used, _ = get_gpu_memory()
+                if cur_used and cur_used > peak_used:
</file context>

@Git-on-my-level Git-on-my-level added needs-maintainer-review Needs a human maintainer to review/approve (e.g. stacked, product, or architecture judgment) workflow-review Needs maintainer review for workflow, automation, hooks, or CI behavior labels Jun 28, 2026

@Git-on-my-level Git-on-my-level left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting out the safer Parakeet recovery path from the reverted OOM work. I reviewed the diff and this looks directionally good: the production chart keeps attention mode on full, adds a 600s pre-GPU duration guard, handles non-WAV duration probing with soundfile, fail-closes unknown-duration uploads while the guard is enabled, and avoids poisoning the duration histogram with non-finite values.

I’m leaving this as a positive signal rather than a formal approval because this changes the production Parakeet Helm config / runtime guardrails, so it should get human maintainer verification before merge.

Before merging, I’d like a maintainer to verify the deployment-specific pieces from the PR checklist:

  • pod starts cleanly with the new Parakeet environment variables,
  • 600s uploads are rejected before GPU enqueue in the deployed service,

  • normal ~30–60s traffic still transcribes successfully under the production config,
  • VRAM stays healthy after rollout.

No blocking code issue found from my static review. The CI checks shown on the PR are passing.

@kodjima33 kodjima33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backend: Parakeet duration guard + BF16 recast fix (beastoin) — approve only (large, deploy verification for maintainer).

beastoin and others added 3 commits June 29, 2026 01:29
ATTENTION_MODE=auto dynamically switches between full and local
attention based on audio duration (threshold 300s). torch.compile
is skipped in auto mode — our 20-min sustained load test on the
exact prod image (a5a4262) showed no VRAM leak (slope +5.7 MiB/min,
zero OOMs). 1h duration cap (3600s) provides safety for long files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standalone script that runs prolonged load against a parakeet pod
and detects monotonic VRAM accumulation via linear regression slope.
Four gates: peak VRAM < 85%, slope < 50 MiB/min, zero OOMs,
recovery within 20% of baseline. Supports nvidia-smi (direct GPU)
and process RSS (/metrics endpoint) monitoring modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds sustained load class that runs 5-min (configurable) test at
30 req/min and checks VRAM slope via linear regression. Catches
the type of monotonic leak that caused the Phase 2 OOM — burst
tests miss it because VRAM releases between batches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin beastoin changed the title Parakeet: duration guard (600s) + BF16 recast fix Parakeet: auto attention + 1h duration guard + BF16 recast + VRAM leak tests Jun 29, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/container/test_parakeet_vram_stress.py">

<violation number="1" location="backend/tests/container/test_parakeet_vram_stress.py:437">
P2: pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.</violation>
</file>

<file name="backend/charts/parakeet/prod_omi_parakeet_values.yaml">

<violation number="1" location="backend/charts/parakeet/prod_omi_parakeet_values.yaml:108">
P1: PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."</violation>

<violation number="2" location="backend/charts/parakeet/prod_omi_parakeet_values.yaml:114">
P1: PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

- name: PARAKEET_LOCAL_ATTN_CONTEXT
value: "128,128"
- name: PARAKEET_MAX_FILE_DURATION
value: "3600"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/charts/parakeet/prod_omi_parakeet_values.yaml, line 114:

<comment>PARAKEET_MAX_FILE_DURATION set to "3600" (1h) but PR description states prod guard is "600" (10 min). 1h limit defeats the pre-GPU duration guard's OOM protection purpose.</comment>

<file context>
@@ -105,13 +105,13 @@ env:
     value: "128,128"
   - name: PARAKEET_MAX_FILE_DURATION
-    value: "600"
+    value: "3600"
   - name: HOSTED_SPEAKER_EMBEDDING_API_URL
     value: "http://prod-omi-diarizer.prod-omi-backend.svc.cluster.local:8080"
</file context>
Suggested change
value: "3600"
value: "600"

- name: PARAKEET_CUDA_GRAPHS
value: "true"
- name: PARAKEET_ATTENTION_MODE
value: "auto"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/charts/parakeet/prod_omi_parakeet_values.yaml, line 108:

<comment>PARAKEET_ATTENTION_MODE set to "auto" but PR description states prod config should be "full" (torch.compile compatible). Auto mode was excluded due to OOM risk and described as "not enabled in production."</comment>

<file context>
@@ -105,13 +105,13 @@ env:
     value: "true"
   - name: PARAKEET_ATTENTION_MODE
-    value: "full"
+    value: "auto"
   - name: PARAKEET_AUTO_ATTN_THRESHOLD
     value: "300"
</file context>
Suggested change
value: "auto"
value: "full"

except KeyboardInterrupt:
pass

pool.shutdown(wait=False)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/tests/container/test_parakeet_vram_stress.py, line 437:

<comment>pool.shutdown(wait=False) does not wait for in-flight tasks; time.sleep(10) is not guaranteed to be enough for all tasks to complete, so OOM count and success/fail tallies may be incomplete.</comment>

<file context>
@@ -354,6 +354,127 @@ def test_no_oom_at_production_pattern(self, gpu_available):
+        except KeyboardInterrupt:
+            pass
+
+        pool.shutdown(wait=False)
+        time.sleep(10)
+        monitor.stop()
</file context>
Suggested change
pool.shutdown(wait=False)
pool.shutdown(wait=True)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-maintainer-review Needs a human maintainer to review/approve (e.g. stacked, product, or architecture judgment) workflow-review Needs maintainer review for workflow, automation, hooks, or CI behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants