diff --git a/.agents/skills/deepgram-python-audio-intelligence/SKILL.md b/.agents/skills/deepgram-python-audio-intelligence/SKILL.md
index 655ae50f..a674f142 100644
--- a/.agents/skills/deepgram-python-audio-intelligence/SKILL.md
+++ b/.agents/skills/deepgram-python-audio-intelligence/SKILL.md
@@ -7,16 +7,11 @@ description: Use when writing or reviewing Python code in this repo that calls D

 Analytics overlays applied to `/v1/listen` transcription: summarize, topics, intents, sentiment, language detection, diarization, redaction, entities. Same endpoint / same client methods as STT — enable features via params.

-## When to use this product
-
-- You have **audio** (file, URL, or live stream) and want analytics alongside the transcript.
-- REST is the primary path — most analytics are REST-only.
-
 **Use a different skill when:**

-- You want a pure transcript with no analytics → `deepgram-python-speech-to-text`.
-- Your input is already transcribed text → `deepgram-python-text-intelligence` (`/v1/read`).
-- You need conversational turn-taking → `deepgram-python-conversational-stt`.
-- You need a full interactive agent → `deepgram-python-voice-agent`.
+- Pure transcript with no analytics → `deepgram-python-speech-to-text`.
+- Input is already transcribed text → `deepgram-python-text-intelligence` (`/v1/read`).
+- Conversational turn-taking → `deepgram-python-conversational-stt`.
+- Full interactive agent → `deepgram-python-voice-agent`.

 ## Feature availability: REST vs WSS
@@ -55,13 +50,13 @@ response = client.listen.v1.media.transcribe_url(
     model="nova-3",
     smart_format=True,
     punctuate=True,
-    diarize=True,           # speaker separation
-    summarize="v2",         # "v2" for the current model; True also accepted on /v1/listen
+    diarize=True,
+    summarize="v2",
     topics=True,
     intents=True,
     sentiment=True,
     detect_language=True,
-    redact=["pci", "pii"],  # or Sequence[str]
+    redact=["pci", "pii"],
     language="en-US",
 )

@@ -98,44 +93,23 @@ response = client.listen.v1.media.transcribe_file(

 ## Quick start — diarization with word-level timings

-Enable speaker separation and word-level timestamps in a single request, then iterate the per-word objects to build a speaker-labelled transcript with timing.
-
 ```python
 response = client.listen.v1.media.transcribe_url(
     url="https://dpgr.am/spacewalk.wav",
     model="nova-3",
-    diarize=True,       # tag each word with a speaker id
-    smart_format=True,  # punctuated_word for cleaner output
+    diarize=True,
+    smart_format=True,
     punctuate=True,
 )

 words = response.results.channels[0].alternatives[0].words or []
-
-# Per-word: speaker, timestamps, confidence
-for w in words:
-    speaker = getattr(w, "speaker", None)
-    text = w.punctuated_word or w.word
-    print(f"[speaker {speaker}] {text} ({w.start:.2f}s–{w.end:.2f}s, conf={w.confidence:.2f})")
-
-# Group consecutive words by speaker into utterances
 from itertools import groupby
 for speaker, group in groupby(words, key=lambda w: getattr(w, "speaker", None)):
     text = " ".join((w.punctuated_word or w.word) for w in group)
     print(f"Speaker {speaker}: {text}")
 ```

-Per-word fields available on each entry:
-
-| Field | Type | Description |
-|---|---|---|
-| `word` | `str` | Lowercase token |
-| `punctuated_word` | `str \| None` | Token with smart-formatted casing/punctuation (when `smart_format=True`) |
-| `start`, `end` | `float` | Audio timestamps in seconds |
-| `confidence` | `float` | 0.0–1.0 confidence |
-| `speaker` | `int \| None` | Speaker id (when `diarize=True`); `None` if diarization disabled |
-| `speaker_confidence` | `float \| None` | Speaker-id confidence |
-
-For a higher-level breakdown, set `utterances=True` to get pre-grouped speaker turns at `response.results.utterances`. Set `paragraphs=True` for a `paragraphs` view organised by speaker turn boundaries.
+Each word object has: `word`, `punctuated_word`, `start`/`end` (float seconds), `confidence`, `speaker` (int, when `diarize=True`), `speaker_confidence`. For pre-grouped speaker turns use `utterances=True` (`response.results.utterances`) or `paragraphs=True`.

 ## Quick start — WSS subset (diarize / redact / entities only)

@@ -151,27 +125,32 @@ with client.listen.v1.connect(model="nova-3", diarize=True, redact=["pii"]) as c
     conn.send_finalize()
 ```

+## Validation & recovery
+
+After transcription, verify analytics fields are populated:
+
+```python
+r = response.results
+if r.summary is None and summarize_was_requested:  # summarize_was_requested: your own request-tracking flag
+    # Feature silently ignored -- likely passed on WSS (REST-only).
+    # Recovery: re-run via REST instead of WSS.
+    response = client.listen.v1.media.transcribe_url(url=..., summarize="v2", ...)
+```
+
+For `redact`, confirm redacted markers appear in the transcript (e.g., search for `[REDACTED]`). A missing marker means an encoding mismatch or an unsupported redact value.
+
 ## Key parameters

 `summarize`, `topics`, `intents`, `sentiment`, `detect_language`, `diarize`, `redact`, `custom_topic`, `custom_topic_mode`, `custom_intent`, `custom_intent_mode`, `detect_entities`, plus all the standard STT params (`model`, `language`, `encoding`, `sample_rate`, ...).

-`redact` is typed as `Optional[str]` in the current generated SDK (`src/deepgram/listen/v1/media/client.py`). Pass a single redaction mode such as `"pci"`, `"pii"`, `"numbers"`, or `"phi"`. Multi-mode redaction at the transport level is supported by sending `redact` as a repeated query parameter — check `src/deepgram/types/listen_v1redact.py` for the current type and fall back to raw query-param construction (or multiple calls) if you need several modes. The earlier `Union[str, Sequence[str]]` override is no longer carried in `.fernignore`.
+`redact` is typed as `Optional[str]` in the generated SDK. Pass a single mode (`"pci"`, `"pii"`, `"numbers"`, `"phi"`). For multi-mode, use repeated query params or multiple calls -- see `src/deepgram/types/listen_v1redact.py`.

 ## API reference (layered)

-1. **In-repo reference**: `reference.md` — "Listen V1 Media" (REST params include all analytics flags), "Listen V1 Connect" (WSS-supported subset).
-2. **OpenAPI (REST)**: https://developers.deepgram.com/openapi.yaml
-3. **AsyncAPI (WSS)**: https://developers.deepgram.com/asyncapi.yaml
-4. **Context7**: library ID `/llmstxt/developers_deepgram_llms_txt`.
-5. **Product docs**:
-   - https://developers.deepgram.com/docs/stt-intelligence-feature-overview
-   - https://developers.deepgram.com/docs/summarization
-   - https://developers.deepgram.com/docs/topic-detection
-   - https://developers.deepgram.com/docs/intent-recognition
-   - https://developers.deepgram.com/docs/sentiment-analysis
-   - https://developers.deepgram.com/docs/language-detection
-   - https://developers.deepgram.com/docs/redaction
-   - https://developers.deepgram.com/docs/diarization
+1. **In-repo reference**: `reference.md` -- "Listen V1 Media" (REST), "Listen V1 Connect" (WSS subset).
+2. **OpenAPI / AsyncAPI**: https://developers.deepgram.com/openapi.yaml, https://developers.deepgram.com/asyncapi.yaml
+3. **Context7**: library ID `/llmstxt/developers_deepgram_llms_txt`.
+4. **Product docs**: https://developers.deepgram.com/docs/stt-intelligence-feature-overview (overview); per-feature pages at `/docs/summarization`, `/docs/topic-detection`, `/docs/intent-recognition`, `/docs/sentiment-analysis`, `/docs/language-detection`, `/docs/redaction`, `/docs/diarization`.

 ## Gotchas
@@ -195,12 +174,4 @@ with client.listen.v1.connect(model="nova-3", diarize=True, redact=["pii"]) as c

 - `deepgram-python-conversational-stt` — Flux for turn-taking
 - `deepgram-python-voice-agent` — interactive assistants

-## Central product skills
-
-For cross-language Deepgram product knowledge — the consolidated API reference, documentation finder, focused runnable recipes, third-party integration examples, and MCP setup — install the central skills:
-
-```bash
-npx skills add deepgram/skills
-```
-
-This SDK ships language-idiomatic code skills; `deepgram/skills` ships cross-language product knowledge (see `api`, `docs`, `recipes`, `examples`, `starters`, `setup-mcp`).
+For cross-language Deepgram product knowledge, install the central skills: `npx skills add deepgram/skills`.
diff --git a/.agents/skills/deepgram-python-conversational-stt/SKILL.md b/.agents/skills/deepgram-python-conversational-stt/SKILL.md
index 888f1259..aadd8e67 100644
--- a/.agents/skills/deepgram-python-conversational-stt/SKILL.md
+++ b/.agents/skills/deepgram-python-conversational-stt/SKILL.md
@@ -7,16 +7,10 @@ description: Use when writing or reviewing Python code in this repo that calls D

 Turn-aware streaming STT at `/v2/listen` — optimized for conversational audio (end-of-turn detection, eager EOT, barge-in scenarios).

-## When to use this product
-
-- You're building a **conversational UI** and need explicit turn boundaries.
-- You want **Flux models** (optimized for human-to-human or human-to-agent conversation).
-- You want lower latency turn signals than v1 utterance_end.
-
 **Use a different skill when:**

-- You want general-purpose transcription (captions, batch, non-conversational) → `deepgram-python-speech-to-text`.
-- You want a full interactive agent (STT + LLM + TTS) → `deepgram-python-voice-agent`.
-- You want analytics (summarize/sentiment) → `deepgram-python-audio-intelligence`.
+- General-purpose transcription (captions, batch, non-conversational) → `deepgram-python-speech-to-text`.
+- Full interactive agent (STT + LLM + TTS) → `deepgram-python-voice-agent`.
+- Analytics (summarize/sentiment) → `deepgram-python-audio-intelligence`.

 ## Authentication

@@ -74,6 +68,26 @@ with client.listen.v2.connect(
     conn.start_listening()
 ```

+## Error recovery
+
+On `ListenV2FatalError`, the connection is terminal -- open a new one. For transient disconnects (`EventType.CLOSE` without a prior fatal), reconnect with exponential backoff:
+
+```python
+import time
+
+def run_with_reconnect(max_retries=5):
+    for attempt in range(max_retries):
+        try:
+            with client.listen.v2.connect(model="flux-general-en", encoding="linear16", sample_rate="16000") as conn:
+                # ... register handlers, send audio ...
+                conn.start_listening()
+            break  # clean exit
+        except Exception as e:
+            wait = min(2 ** attempt, 30)
+            print(f"Disconnected ({e}), retrying in {wait}s...")
+            time.sleep(wait)
+```
+
 ## Key parameters

 | Param | Notes |
@@ -143,12 +157,4 @@ async with client.listen.v2.connect(model="flux-general-en", ...) as conn:

 - `deepgram-python-speech-to-text` — v1 general-purpose STT (REST + WSS)
 - `deepgram-python-voice-agent` — full interactive assistant

-## Central product skills
-
-For cross-language Deepgram product knowledge — the consolidated API reference, documentation finder, focused runnable recipes, third-party integration examples, and MCP setup — install the central skills:
-
-```bash
-npx skills add deepgram/skills
-```
-
-This SDK ships language-idiomatic code skills; `deepgram/skills` ships cross-language product knowledge (see `api`, `docs`, `recipes`, `examples`, `starters`, `setup-mcp`).
+For cross-language Deepgram product knowledge, install the central skills: `npx skills add deepgram/skills`.
diff --git a/.agents/skills/deepgram-python-management-api/SKILL.md b/.agents/skills/deepgram-python-management-api/SKILL.md
index 31f42752..b58a03d0 100644
--- a/.agents/skills/deepgram-python-management-api/SKILL.md
+++ b/.agents/skills/deepgram-python-management-api/SKILL.md
@@ -7,18 +7,9 @@ description: Use when writing or reviewing Python code in this repo that calls D

 Administrative REST endpoints at `api.deepgram.com/v1/projects`, `/v1/models`, and reusable agent configuration storage. Project-scoped resources live under `client.manage.v1.projects.*` (keys, members, members.invites, usage, billing, models, requests). Global models at `client.manage.v1.models`. Think-model discovery at `client.agent.v1.settings.think.models`. Reusable agent configs at `client.voice_agent.configurations.*`.

-## When to use this product
-
-- **Discover / pin models**: `client.manage.v1.models.list()` returns the active STT/TTS set.
-- **Project admin**: list/get/update/delete/leave projects.
-- **API key lifecycle**: list/create/delete project keys.
-- **Member + invite management**: add/remove members, manage roles, send/revoke invites.
-- **Usage + billing**: query request volume, balances.
-- **Reusable Voice Agent configs**: persist the **`agent` block** of a Settings message on the server, reference by `agent_id`. The stored blob is the `agent` object only (listen / think / speak providers + prompt), not the full `AgentV1Settings`.
-
 **Use a different skill when:**

-- You want to actually talk to an agent → `deepgram-python-voice-agent`.
-- You want to transcribe or synthesize → STT/TTS skills.
+- Running an agent interactively → `deepgram-python-voice-agent`.
+- Transcribing or synthesizing → STT/TTS skills.

 ## Authentication

@@ -85,33 +76,24 @@ See `examples/51-55` for each sub-module.

 ## Quick start — Voice Agent configurations

+**Important:** The stored config is the `agent` block only (listen/think/speak providers + prompt) as a JSON string, NOT the full `AgentV1Settings`. Top-level fields like `audio` go in the live Settings message at connect time. The returned `agent_id` replaces the inline `agent` object in future Settings messages. Configs are immutable -- create a new one to change behavior; only metadata is mutable.
+
 ```python
-# List reusable configs
+import json
 configs = client.voice_agent.configurations.list(project_id=pid)

-# Create: `config` is a JSON string of the `agent` BLOCK ONLY — not the full
-# Settings message. Do NOT include top-level Settings fields like `audio`;
-# those are sent at connect-time in the live Settings message. The stored
-# `agent_id` later replaces the inline `agent` object in a Settings message.
-import json
 config_json = json.dumps({
     "listen": {"provider": {"type": "deepgram", "model": "nova-3"}},
     "think": {"provider": {"type": "open_ai", "model": "gpt-4o-mini"}, "prompt": "..."},
     "speak": {"provider": {"type": "deepgram", "model": "aura-2-asteria-en"}},
 })
 created = client.voice_agent.configurations.create(
-    project_id=pid,
-    config=config_json,
-    metadata={"label": "support-en"},
+    project_id=pid, config=config_json, metadata={"label": "support-en"},
 )
 print(created.agent_id)

-# Update metadata (immutable config body — create a new one to change behavior)
 client.voice_agent.configurations.update(project_id=pid, agent_id=created.agent_id, metadata={"label": "v2"})
-
-# Get / delete
 one = client.voice_agent.configurations.get(project_id=pid, agent_id=created.agent_id)
-# client.voice_agent.configurations.delete(project_id=pid, agent_id=...)
 ```

 Think-provider model discovery (which LLMs Agent supports):
@@ -140,18 +122,28 @@ projects = await client.manage.v1.projects.list()

 - https://developers.deepgram.com/reference/voice-agent/agent-configurations/create-agent-configuration
 - https://developers.deepgram.com/reference/voice-agent/think-models

+## Destructive operation guard
+
+Delete operations (projects, keys, agent configs) are **irreversible**. Always verify the resource before deleting:
+
+```python
+# Confirm before deleting a key
+keys = client.manage.v1.projects.keys.list(project_id=pid)
+target = next((k for k in keys.api_keys if k.api_key_id == kid), None)
+assert target is not None, f"Key {kid} not found"
+print(f"Deleting key: {target.comment}")
+client.manage.v1.projects.keys.delete(project_id=pid, key_id=kid)
+```
+
 ## Gotchas

 1. **`Token` auth, not `Bearer`.**
-2. **Project-scoped resources are nested under `.projects.*`.** There is no top-level `client.manage.v1.keys` / `.members` / `.invites` / `.usage` / `.billing`. Use `client.manage.v1.projects.keys`, `...projects.members`, `...projects.members.invites`, `...projects.usage`, `...projects.billing.balances`, and `...projects.requests` for request logs. The only top-level `client.manage.v1.*` namespaces are `projects` and `models`.
-3. **Think-model discovery is on the Agent client**, not Manage: `client.agent.v1.settings.think.models.list()`. There is no `client.manage.v1.agent.*`.
-4. **Agent config body is a JSON STRING on create**, not a nested object. Pass `config=json.dumps(...)`.
-5. **Agent config is the `agent` block only**, not the full Settings message. Do not include top-level fields like `audio` — those go in the live Settings message at connect time.
-6. **Agent configs are immutable** — you cannot edit the config body. Create a new one to change behavior. Only metadata is mutable.
-7. **Use `include_outdated=True`** on `models.list()` when pinning older models.
-8. **Delete is irreversible.** Wire tests typically comment out destructive calls.
-9. **Project-scoped vs global models**: `client.manage.v1.models.list()` returns all; `client.manage.v1.projects.models.list(project_id=...)` returns what the project can access.
-10. **Returned agent configs are uninterpolated** — raw stored JSON string. Parse before use.
+2. **Project-scoped resources are nested under `.projects.*`.** No top-level `client.manage.v1.keys` etc. Use `client.manage.v1.projects.keys`, `...projects.members`, `...projects.members.invites`, `...projects.usage`, `...projects.billing.balances`, `...projects.requests`.
+3. **Think-model discovery is on the Agent client**, not Manage: `client.agent.v1.settings.think.models.list()`.
+4. **Agent config body is a JSON STRING on create**: pass `config=json.dumps(...)`. See the Voice Agent configurations section above for full details.
+5. **Use `include_outdated=True`** on `models.list()` when pinning older models.
+6. **Project-scoped vs global models**: `client.manage.v1.models.list()` returns all; `client.manage.v1.projects.models.list(project_id=...)` returns what the project can access.
+7. **Returned agent configs are uninterpolated** -- raw stored JSON string. Parse before use.

 ## Example files in this repo

@@ -170,12 +162,4 @@ projects = await client.manage.v1.projects.list()

 - `deepgram-python-voice-agent` — run an agent (use a config created here)

-## Central product skills
-
-For cross-language Deepgram product knowledge — the consolidated API reference, documentation finder, focused runnable recipes, third-party integration examples, and MCP setup — install the central skills:
-
-```bash
-npx skills add deepgram/skills
-```
-
-This SDK ships language-idiomatic code skills; `deepgram/skills` ships cross-language product knowledge (see `api`, `docs`, `recipes`, `examples`, `starters`, `setup-mcp`).
+For cross-language Deepgram product knowledge, install the central skills: `npx skills add deepgram/skills`.
diff --git a/.agents/skills/deepgram-python-speech-to-text/SKILL.md b/.agents/skills/deepgram-python-speech-to-text/SKILL.md
index 17e25853..a4978f54 100644
--- a/.agents/skills/deepgram-python-speech-to-text/SKILL.md
+++ b/.agents/skills/deepgram-python-speech-to-text/SKILL.md
@@ -7,15 +7,10 @@ description: Use when writing or reviewing Python code in this repo that calls D

 Basic transcription (ASR) for prerecorded audio (REST) or live audio (WebSocket) via `/v1/listen`.

-## When to use this product
-
-- **REST (`transcribe_url` / `transcribe_file`)** — one-shot transcription of a complete file or URL. Use for batch jobs, captioning pipelines, offline analysis.
-- **WebSocket (`listen.v1.connect`)** — continuous streaming transcription. Use for live captions, real-time microphone input, phone audio.
-
 **Use a different skill when:**

-- You want summaries, sentiment, topics, intents, diarization, or redaction on the audio → `deepgram-python-audio-intelligence` (same endpoint, different params).
-- You need turn-taking / end-of-turn events → `deepgram-python-conversational-stt` (v2 / Flux).
-- You need a full-duplex interactive assistant (STT + LLM + TTS + function calls) → `deepgram-python-voice-agent`.
+- Summaries, sentiment, topics, intents, diarization, or redaction on audio → `deepgram-python-audio-intelligence` (same endpoint, different params).
+- Turn-taking / end-of-turn events → `deepgram-python-conversational-stt` (v2 / Flux).
+- Full-duplex interactive assistant (STT + LLM + TTS + function calls) → `deepgram-python-voice-agent`.

 ## Authentication

@@ -60,8 +55,6 @@ response = client.listen.v1.media.transcribe_file(

 ## Quick start — WebSocket (live streaming with interim results)

-Live transcription emits **interim** (partial) and **final** results. Pass `interim_results=True` and switch on `is_final` to display partial text in real time, then overwrite it with the final transcript when the speaker pauses.
-
 ```python
 import threading
 from deepgram.core.events import EventType
 from deepgram.listen.v1.types import (

 with client.listen.v1.connect(
     model="nova-3",
-    interim_results=True,   # ← emit partial results while user is still speaking
-    utterance_end_ms=1000,  # silence (ms) before server emits UtteranceEnd
-    vad_events=True,        # SpeechStarted events
+    interim_results=True,
+    utterance_end_ms=1000,
+    vad_events=True,
     smart_format=True,
 ) as conn:
-    # Mutable container so the on_message closure can update state without `global`
     state = {"last_interim_len": 0}

     def on_message(m):
@@ -86,19 +78,17 @@ with client.listen.v1.connect(
             if not transcript:
                 return
             if m.is_final:
-                # Final segment: overwrite the running interim line, newline if utterance ended
                 pad = " " * max(0, state["last_interim_len"] - len(transcript))
                 end = "\n" if m.speech_final else ""
                 print(f"\r{transcript}{pad}", end=end, flush=True)
                 state["last_interim_len"] = 0
             else:
-                # Interim: keep overwriting the same console line as the user speaks
                 print(f"\r{transcript}", end="", flush=True)
                 state["last_interim_len"] = len(transcript)
         elif isinstance(m, ListenV1UtteranceEnd):
-            print()  # newline; UtteranceEnd fires after final results when audio goes silent
+            print()
         elif isinstance(m, ListenV1SpeechStarted):
-            pass  # optional: reset UI when a new utterance begins
+            pass

     conn.on(EventType.OPEN, lambda _: print("connected"))
     conn.on(EventType.MESSAGE, on_message)
@@ -116,14 +106,15 @@ with client.listen.v1.connect(

 ### Interim vs. final flag semantics

-- **`is_final = False`** — interim hypothesis. Will be revised. Display in a non-committal style (lighter colour, italic) and overwrite when the next message arrives.
-- **`is_final = True`, `speech_final = False`** — confirmed segment, but the speaker is still talking. Append to the transcript; another final will follow.
-- **`is_final = True`, `speech_final = True`** — confirmed segment AND the utterance ended (silence detected). Commit the line and start a new one.
-- **`from_finalize = True`** — this final was triggered by your explicit `send_finalize()` call (vs natural endpointing). Useful to distinguish "I asked for a flush" from "the speaker paused".
+| `is_final` | `speech_final` | Meaning |
+|---|---|---|
+| `False` | -- | Interim hypothesis; will be revised |
+| `True` | `False` | Confirmed segment; speaker still talking |
+| `True` | `True` | Confirmed segment; utterance ended (silence) |

-Send `send_finalize()` to force the server to emit final results immediately (e.g. user clicks "stop"). Send `send_close_stream()` after `send_finalize` to terminate cleanly.
+`from_finalize=True` means the final was triggered by `send_finalize()` (vs natural endpointing). Call `send_finalize()` to flush, then `send_close_stream()` to terminate. WSS message types: `deepgram.listen.v1.types`.

-WSS message types live under `deepgram.listen.v1.types`.
+**WebSocket error recovery:** If the connection fails with 401, verify the auth scheme (Gotcha #1). If the transcript is empty, verify `encoding`/`sample_rate` match the audio (Gotcha #2). On unexpected close, check the `EventType.ERROR` payload and reconnect with exponential backoff.

 ## Async equivalents

@@ -140,56 +131,14 @@ async with client.listen.v1.connect(model="nova-3") as conn:

 ## Async / deferred result patterns

-There are **two distinct** notions of "async" — don't confuse them.
-
-### 1. Python `async/await` (sync-style, immediate result)
-
-`AsyncDeepgramClient` returns `Awaitable[]`. The result is delivered when you `await`, not later. Use this when integrating with FastAPI, aiohttp, or any asyncio app.
-
-```python
-import asyncio
-from deepgram import AsyncDeepgramClient
-
-client = AsyncDeepgramClient()
-
-async def transcribe(url: str) -> str:
-    response = await client.listen.v1.media.transcribe_url(
-        url=url,
-        model="nova-3",
-        smart_format=True,
-    )
-    # `response` is the FULL transcription — no polling, no callback, just await.
-    return response.results.channels[0].alternatives[0].transcript
-
-text = asyncio.run(transcribe("https://dpgr.am/spacewalk.wav"))
-```
-
-### 2. Deferred via callback URL (webhook, results posted later)
-
-Pass `callback="https://your.app/webhook"` and the request **returns immediately** with a `request_id`. Deepgram processes the audio in the background and POSTs the final result to your webhook URL. There is **no polling endpoint** — your server must be reachable to receive the result.
-
-```python
-response = client.listen.v1.media.transcribe_url(
-    url="https://dpgr.am/spacewalk.wav",
-    callback="https://your.app/deepgram-webhook",
-    callback_method="POST",  # or "PUT"
-    model="nova-3",
-    smart_format=True,
-)
-print(f"Accepted; tracking id: {response.request_id}")
-# response is a "listen accepted" — NOT the transcript. Wait for your webhook.
-```
-
-The webhook receives the same JSON body you would have received from a synchronous `transcribe_url` call. Use this for very long files or when you don't want the request hanging open.
-
 | Pattern | Returns | When to use |
 |---|---|---|
-| `client.listen.v1.media.transcribe_url(...)` | full transcription synchronously | files up to ~10 min; HTTP timeout-bound |
-| `await AsyncDeepgramClient().listen.v1.media.transcribe_url(...)` | full transcription, non-blocking | inside asyncio apps |
-| `transcribe_url(..., callback="https://...")` | `{request_id}` immediately, transcription POSTs to webhook later | very long files; no long-lived HTTP connection |
+| `client.listen.v1.media.transcribe_url(...)` | full transcription synchronously | files up to ~10 min |
+| `await AsyncDeepgramClient().listen.v1.media.transcribe_url(...)` | full transcription, non-blocking | asyncio apps (FastAPI, aiohttp) |
+| `transcribe_url(..., callback="https://...")` | `{request_id}` immediately; result POSTs to webhook | very long files; no polling endpoint exists |
 | `client.listen.v1.connect(...)` (WebSocket) | streaming events as audio is sent | live audio (mic, telephony) |

-See `examples/12-transcription-prerecorded-callback.py` for a working callback example.
+For the callback pattern, pass `callback="https://your.app/webhook"` and optionally `callback_method="POST"`. The response contains only `request_id` -- the full transcription JSON is POSTed to your webhook. See `examples/12-transcription-prerecorded-callback.py`.
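+
+A minimal receiver sketch for that webhook (assuming FastAPI, mentioned in the table above; the route path is illustrative). The POSTed body has the same shape as a synchronous transcription response:
+
+```python
+from fastapi import FastAPI, Request
+
+app = FastAPI()
+
+@app.post("/deepgram-webhook")
+async def deepgram_webhook(request: Request):
+    payload = await request.json()  # same JSON a synchronous transcribe_url call returns
+    transcript = payload["results"]["channels"][0]["alternatives"][0]["transcript"]
+    print(transcript[:120])
+    return {"ok": True}
+```
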
 ## Key parameters

@@ -225,12 +174,4 @@ See `examples/12-transcription-prerecorded-callback.py` for a working callback example.

 - `tests/wire/test_listen_v1_media.py` — wire-level fixtures
 - `tests/manual/listen/v1/connect/main.py` — live WSS connection test

-## Central product skills
-
-For cross-language Deepgram product knowledge — the consolidated API reference, documentation finder, focused runnable recipes, third-party integration examples, and MCP setup — install the central skills:
-
-```bash
-npx skills add deepgram/skills
-```
-
-This SDK ships language-idiomatic code skills; `deepgram/skills` ships cross-language product knowledge (see `api`, `docs`, `recipes`, `examples`, `starters`, `setup-mcp`).
+For cross-language Deepgram product knowledge, install the central skills: `npx skills add deepgram/skills`.
diff --git a/.agents/skills/deepgram-python-voice-agent/SKILL.md b/.agents/skills/deepgram-python-voice-agent/SKILL.md
index bdcf64cb..2e17fb0c 100644
--- a/.agents/skills/deepgram-python-voice-agent/SKILL.md
+++ b/.agents/skills/deepgram-python-voice-agent/SKILL.md
@@ -7,17 +7,11 @@ description: Use when writing or reviewing Python code in this repo that builds

 Full-duplex voice agent runtime: STT + LLM (think) + TTS + function calling over a single WebSocket at `agent.deepgram.com/v1/agent/converse`.

-## When to use this product
-
-- You want an **interactive voice assistant**: user speaks, agent thinks, agent speaks, interruptions allowed.
-- You want **function / tool calling** triggered by the conversation.
-- You want Deepgram to host the orchestration (vs wiring STT + LLM + TTS yourself).
-
 **Use a different skill when:**

 - One-way transcription → `deepgram-python-speech-to-text` or `deepgram-python-conversational-stt`.
 - One-way synthesis → `deepgram-python-text-to-speech`.
 - Analytics on finished audio → `deepgram-python-audio-intelligence`.
-- Managing reusable agent configs (persisted on the server) → `deepgram-python-management-api`.
+- Managing reusable agent configs → `deepgram-python-management-api`.

 ## Authentication

@@ -100,168 +94,33 @@ with client.agent.v1.connect() as agent:

 ## Event types (server → client)

-- `Welcome` — connection acknowledged
-- `SettingsApplied` — your `Settings` accepted
-- `ConversationText` — text of a turn (with `role`: `user` or `assistant`)
-- `UserStartedSpeaking` — VAD detected user
-- `AgentThinking` — LLM is working
-- `FunctionCallRequest` — tool/function call initiated by the model
-- `AgentStartedSpeaking` — TTS starting
-- Binary frames — audio chunks
-- `AgentAudioDone` — TTS finished for this turn
-- `Warning`, `Error`
-
-## Client messages
-
-- Initial `Settings` (send first)
-- `Media` (binary audio frames in declared encoding/sample_rate)
-- `KeepAlive` (on long sessions)
-- Prompt / think / speak update messages (change mid-session)
-- User / assistant text injection
-- Function call response (reply to `FunctionCallRequest`)
+`Welcome`, `SettingsApplied`, `ConversationText` (with `role`), `UserStartedSpeaking`, `AgentThinking`, `FunctionCallRequest`, `AgentStartedSpeaking`, binary audio frames, `AgentAudioDone`, `Warning`, `Error`.
+
+Client → server: initial `Settings` (send first), binary `Media` frames, `KeepAlive`, prompt/think/speak updates, user/assistant text injection, function call responses.

 ## Reusable agent configurations

-You can persist the **`agent` block** of a Settings message server-side and reuse it by `agent_id`. `client.voice_agent.configurations.create` stores a JSON string representing the `agent` object only (listen / think / speak providers + prompt) — NOT the full `AgentV1Settings` payload. Do not send top-level Settings fields like `audio` to that API; those still go in the live Settings message at connect time. The returned `agent_id` replaces the inline `agent` object in future Settings messages. Managed via `client.voice_agent.configurations.*` — see `deepgram-python-management-api`.
+Persist the `agent` block server-side via `client.voice_agent.configurations.*` and reference it by `agent_id` in future Settings messages. See `deepgram-python-management-api` for CRUD operations.

 ## Dynamic mid-session adjustment

-You can change agent behavior **without disconnecting** by sending control messages on the live socket. Each method is available on the agent connection object (`agent` in the quick-start) for both sync and async clients.
+Change agent behavior without disconnecting via control messages. Key methods on the agent connection object:

-```python
-from deepgram.agent.v1.types import (
-    AgentV1UpdatePrompt,
-    AgentV1UpdateSpeak,
-    AgentV1UpdateSpeakSpeak,   # type alias accepting SpeakSettingsV1 or list
-    AgentV1UpdateThink,
-    AgentV1UpdateThinkThink,   # type alias accepting ThinkSettingsV1 or list
-    AgentV1InjectAgentMessage,
-    AgentV1InjectUserMessage,
-    AgentV1KeepAlive,
-)
-from deepgram.types.speak_settings_v1 import SpeakSettingsV1
-from deepgram.types.speak_settings_v1provider import SpeakSettingsV1Provider_Deepgram
-from deepgram.types.think_settings_v1 import ThinkSettingsV1
-from deepgram.types.think_settings_v1provider import ThinkSettingsV1Provider_OpenAi
+| Method | Type | Server reply | Notes |
+|---|---|---|---|
+| `send_update_prompt(AgentV1UpdatePrompt(prompt="..."))` | `AgentV1UpdatePrompt` | `PromptUpdated` | Swap system prompt |
+| `send_update_speak(AgentV1UpdateSpeak(speak=SpeakSettingsV1(...)))` | `AgentV1UpdateSpeak` | `SpeakUpdated` | Swap TTS voice/model |
+| `send_update_think(AgentV1UpdateThink(think=ThinkSettingsV1(...)))` | `AgentV1UpdateThink` | `ThinkUpdated` | Swap LLM provider/model |
+| `send_inject_agent_message(AgentV1InjectAgentMessage(message="..."))` | `AgentV1InjectAgentMessage` | — | Force agent to speak |
+| `send_inject_user_message(AgentV1InjectUserMessage(content="..."))` | `AgentV1InjectUserMessage` | may `InjectionRefused` | Inject text as user; retry after `AgentAudioDone` if refused |
+| `send_keep_alive()` | `AgentV1KeepAlive` | — | Idle keep-alive (every ~5s) |

-# 1. Swap the LLM system prompt mid-conversation (e.g. escalate to a different persona)
-agent.send_update_prompt(
-    AgentV1UpdatePrompt(prompt="You are now in expert escalation mode. Be precise and concise.")
-)
-# Server replies with a `PromptUpdated` event when the new prompt is in effect.
-
-# 2. Swap the TTS voice without reconnecting (e.g. switch language or persona)
-agent.send_update_speak(
-    AgentV1UpdateSpeak(
-        speak=SpeakSettingsV1(
-            provider=SpeakSettingsV1Provider_Deepgram(
-                type="deepgram", model="aura-2-luna-en",
-            ),
-        ),
-    )
-)
-# Server replies with a `SpeakUpdated` event.
-
-# 3. Swap the LLM provider/model (e.g. cheaper model for follow-ups)
-agent.send_update_think(
-    AgentV1UpdateThink(
-        think=ThinkSettingsV1(
-            provider=ThinkSettingsV1Provider_OpenAi(
-                type="open_ai", model="gpt-4o-mini", temperature=0.3,
-            ),
-            prompt="You are a helpful assistant. Keep replies brief.",
-        ),
-    )
-)
-# Server replies with a `ThinkUpdated` event.
-
-# 4. Force the agent to say something specific (without waiting for user audio)
-agent.send_inject_agent_message(
-    AgentV1InjectAgentMessage(message="Quick reminder: your call is being recorded.")
-)
-# Useful for proactive prompts, status updates, or scripted segues.
-
-# 5. Inject a user message (e.g. text input from a chat sidebar alongside voice)
-agent.send_inject_user_message(
-    AgentV1InjectUserMessage(content="Schedule a follow-up for next Tuesday at 2pm.")
-)
-# Server may reply with `InjectionRefused` if the agent is mid-utterance — retry after `AgentAudioDone`.
-
-# 6. Idle-period keep-alive (no payload required; the SDK fills in the type literal)
-agent.send_keep_alive(AgentV1KeepAlive())
-# Or simply: agent.send_keep_alive() — the message arg is optional.
-```
-
-Async client equivalents are identical but `await`-prefixed:
-
-```python
-await agent.send_update_prompt(AgentV1UpdatePrompt(prompt="..."))
-await agent.send_inject_agent_message(AgentV1InjectAgentMessage(message="..."))
-```
+Import types from `deepgram.agent.v1.types` and provider types from `deepgram.types.*`. Async equivalents are `await`-prefixed.

 ## Stream lifecycle & recovery

-Continuous voice agents need explicit handling for idle periods, stream pauses, and reconnects.
-
-**Pause / idle (no audio for several seconds):** stop calling `send_media`, but emit a `KeepAlive` every ~5 seconds. Without it, the server closes the socket at ~10 seconds of idle.
-
-```python
-import threading, time
-
-stop = threading.Event()
-
-def keepalive_loop():
-    while not stop.is_set():
-        if stop.wait(5):
-            return
-        try:
-            agent.send_keep_alive()
-        except Exception:
-            return  # socket closed; outer loop will reconnect
-
-threading.Thread(target=keepalive_loop, daemon=True).start()
-```
-
-**Resume after pause:** just call `send_media` again. No control message is required — the agent picks up VAD on the next chunk.
-
-**Reconnect after disconnect (preserve conversation context):** `Settings` cannot be re-sent on the same closed socket; open a new connection and resend the same `Settings`. To carry conversation history forward, include it in the new `Settings.agent.context.messages` so the LLM resumes with prior turns:
-
-```python
-from deepgram.agent.v1.types import (
-    AgentV1SettingsAgentContext,
-    AgentV1SettingsAgentContextMessagesItem,
-    AgentV1SettingsAgentContextMessagesItemContent,
-    AgentV1SettingsAgentContextMessagesItemContentRole,
-)
-
-# Build the new Settings with the captured prior turns
-context = AgentV1SettingsAgentContext(
-    messages=[
-        AgentV1SettingsAgentContextMessagesItem(
-            content=AgentV1SettingsAgentContextMessagesItemContent(
-                role=AgentV1SettingsAgentContextMessagesItemContentRole.USER,
-                content="Hi, I'd like to schedule a meeting.",
-            ),
-        ),
-        AgentV1SettingsAgentContextMessagesItem(
-            content=AgentV1SettingsAgentContextMessagesItemContent(
-                role=AgentV1SettingsAgentContextMessagesItemContentRole.ASSISTANT,
-                content="Sure — what day works best?",
-            ),
-        ),
-    ],
-)
-new_settings = settings.model_copy(update={"agent": settings.agent.model_copy(update={"context": context})})
-
-# Open a fresh connection and replay
-with client.agent.v1.connect() as agent2:
-    agent2.send_settings(new_settings)
-    # ... same handlers + audio loop as before
-```
-
-The server emits a `History` message on connect when the SDK has captured prior turns; in Python you receive this as an `AgentV1History` object (wire `type` literal: `"History"`). Persist these turns in your application so a reconnect can rebuild `context.messages`.
-
-**Detect disconnects:** the `EventType.CLOSE` handler fires before the `with` block exits. Catch it and trigger your reconnect logic from there. Check `EventType.ERROR` payloads for cause (network drop vs server-initiated close vs warning).
+- **Pause / idle:** Stop sending audio but emit `send_keep_alive()` every ~5s (see the sketch after the gotchas below). Server closes at ~10s idle without keepalive.
+- **Resume:** Just call `send_media` again -- no control message needed.
+- **Reconnect (preserve context):** Open a new connection, resend `Settings` with prior turns in `Settings.agent.context.messages` (types: `AgentV1SettingsAgentContext`, `AgentV1SettingsAgentContextMessagesItem`, `AgentV1SettingsAgentContextMessagesItemContent`, `AgentV1SettingsAgentContextMessagesItemContentRole`). Use `settings.model_copy(update={"agent": settings.agent.model_copy(update={"context": context})})`.
- **Detect disconnects:** `EventType.CLOSE` fires before the `with` block exits. Check `EventType.ERROR` payloads for cause. The server emits an `AgentV1History` object (`"History"` wire type) on connect -- persist these turns for reconnect context.

 ## API reference (layered)

@@ -276,26 +135,16 @@ The server emits a `History` message on connect when the SDK has captured prior

 ## Gotchas

-1. **Pick the right auth scheme for the credential type.** API keys use `Authorization: Token <api-key>`. Temporary / access tokens (created via `client.auth.v1.tokens.grant()` or an equivalent server) use `Authorization: Bearer <token>`. The custom `DeepgramClient` in this repo accepts an `access_token` parameter and installs a Bearer override for all HTTP + WebSocket calls — see `src/deepgram/client.py`.
-2. **Base URL is `agent.deepgram.com`, not `api.deepgram.com`.**
-3. **Send `Settings` IMMEDIATELY after connect** — no audio before settings are applied.
-4. **Listen/speak encoding + sample_rate must match** both your input audio and your playback path.
-5. **Keepalive on long idle sessions**, otherwise the server closes.
-6. **Function call responses are synchronous to the turn** — reply promptly.
-7. **Provider types are tagged unions** (`ThinkSettingsV1Provider_OpenAi`, `SpeakSettingsV1Provider_Deepgram`, ...). Pick the right union variant; don't pass raw dicts.
-8. **`socket_client.py` is temporarily frozen** (see `.fernignore` → `src/deepgram/agent/v1/socket_client.py`) and currently carries `_sanitize_numeric_types` plus the `construct_type` / broad-catch fixes — needed for unknown WS message shapes. Expected to be unfrozen during a future Fern regen and re-compared.
+1. **Auth:** API keys use `Token <api-key>`. Access tokens (from `client.auth.v1.tokens.grant()`) use `Bearer <token>`. The SDK's `access_token=` param installs the override -- see `src/deepgram/client.py`.
+2. **Base URL is `agent.deepgram.com`**, not `api.deepgram.com`.
+3. **Send `Settings` IMMEDIATELY after connect** -- no audio before settings are applied.
+4. **Listen/speak encoding + sample_rate must match** your input audio and playback path.
+5. **Function call responses are synchronous to the turn** -- reply promptly or the agent stalls.
+6. **Provider types are tagged unions** (`ThinkSettingsV1Provider_OpenAi`, `SpeakSettingsV1Provider_Deepgram`, ...). Pick the right variant; don't pass raw dicts.
+7. **`socket_client.py` is temporarily frozen** (`.fernignore` → `src/deepgram/agent/v1/socket_client.py`) with `_sanitize_numeric_types` + `construct_type` fixes for unknown WS message shapes.
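+
+A minimal idle-keepalive sketch for the pause/idle case (assumes the live `agent` connection from the quick start; the 5-second tick follows the lifecycle notes above):
+
+```python
+import threading
+
+stop = threading.Event()
+
+def keepalive_loop():
+    # Tick every ~5s while idle; call stop.set() when audio resumes or the session ends.
+    while not stop.wait(5):
+        try:
+            agent.send_keep_alive()
+        except Exception:
+            return  # socket closed -- run your reconnect path
+
+threading.Thread(target=keepalive_loop, daemon=True).start()
+```
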
 ## Example files in this repo

 - `examples/30-voice-agent.py`
 - `tests/manual/agent/v1/connect/main.py` — live connection test

-## Central product skills
-
-For cross-language Deepgram product knowledge — the consolidated API reference, documentation finder, focused runnable recipes, third-party integration examples, and MCP setup — install the central skills:
-
-```bash
-npx skills add deepgram/skills
-```
-
-This SDK ships language-idiomatic code skills; `deepgram/skills` ships cross-language product knowledge (see `api`, `docs`, `recipes`, `examples`, `starters`, `setup-mcp`).
+For cross-language Deepgram product knowledge, install the central skills: `npx skills add deepgram/skills`.