Publish JSON schemas; speaker fallback on join gap (v0.2.2) #7
Merged
Add JSON Schemas for CombinedTranscript and BatchSummary under docs/schemas/. They are generated from the Pydantic source-of-truth via scripts/generate_schemas.py in serialization mode. A CI drift check keeps the committed artifacts in lockstep with the models. The schemas are attached as assets on each GitHub release so consumers without access to the Python package have an authoritative artifact to validate against.

In _build_segments, fall back to the enclosing segment's speaker when WhisperX's word-level diarization join misses (boundary-overlap gap on short trailing words). The Pydantic contract is unchanged (word-level speaker is still nullable), but in practice the field is now populated whenever segment-level assignment is confident.

Refresh the README banner image. Bump version to 0.2.2.
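The fallback described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in src/transcriber.py: the function name `apply_speaker_fallback` is hypothetical, and the dict shape (a segment with a `speaker` key and a `words` list whose entries may carry their own `speaker`) is an assumption about the aligned output being processed.

```python
from typing import Any, Optional


def apply_speaker_fallback(segment: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the segment's words with missing speakers filled from the segment.

    Hypothetical helper: the real _build_segments may be structured differently.
    """
    segment_speaker: Optional[str] = segment.get("speaker")
    words: list[dict[str, Any]] = []
    for word in segment.get("words", []):
        filled = dict(word)  # shallow copy so the input is not mutated
        if filled.get("speaker") is None:
            # The word-level join missed: inherit the enclosing segment's speaker.
            # If the segment has no speaker either, the value stays None, which
            # is still valid because the field remains nullable.
            filled["speaker"] = segment_speaker
        words.append(filled)
    return words
```

A word that already has a speaker keeps it; only missing or null values are filled, and a segment without a speaker leaves the word's value as None.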
Pull request overview
Publishes versioned JSON Schema artifacts for the service’s combined transcript and batch summary documents, adds CI enforcement to prevent schema/model drift, and improves transcript word-level speaker assignment by falling back to the enclosing segment’s speaker when WhisperX’s word-level diarization join misses.
Changes:
- Add scripts/generate_schemas.py plus committed docs/schemas/*.json artifacts and documentation for consumers.
- Add make generate-schemas / make check-schemas and a dedicated CI job to enforce drift checks; attach schemas to GitHub releases.
- Fix _build_segments to populate missing word-level speaker values from the segment-level speaker (with new unit tests).
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| uv.lock | Bumps locked project version to 0.2.2. |
| pyproject.toml | Bumps package version to 0.2.2. |
| CHANGELOG.md | Adds 0.2.2 release notes covering schema publishing + speaker fallback. |
| src/transcriber.py | Implements word speaker fallback to segment speaker in _build_segments. |
| tests/test_transcriber.py | Adds unit tests for word speaker fallback behavior. |
| scripts/generate_schemas.py | Generates JSON schemas from Pydantic models into docs/schemas/. |
| tests/test_schemas_drift.py | Adds pytest drift check to ensure committed schemas match model output. |
| Makefile | Adds generate-schemas / check-schemas targets and includes schema drift in all-checks. |
| docs/service.md | Documents schema validation and pinned raw GitHub URLs. |
| docs/schemas/combined-transcript-v1.json | Adds generated CombinedTranscript schema artifact. |
| docs/schemas/batch-summary-v1.json | Adds generated BatchSummary schema artifact. |
| .github/workflows/ci.yml | Adds “Schema Drift Check” CI job. |
| .github/workflows/release.yml | Includes docs/schemas/*.json as GitHub release assets. |
git diff --exit-code only sees changes to tracked files. A new SCHEMA_TARGETS entry whose generated file was never committed, or a git rm of an existing schema, would slip through the previous check. Switch to git status --porcelain so modifications, deletions, and untracked files all trigger the failure. Raised by Copilot on PR #7.
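To illustrate why the porcelain output catches all three cases, here is a minimal sketch of such a check written in Python. The names `changed_paths` and `schema_drift` are hypothetical, and the repository's actual check may live in the Makefile or CI workflow instead; the only real interface used is `git status --porcelain`, whose v1 format prefixes each path with a two-character status code and a space.

```python
import subprocess


def changed_paths(porcelain_output: str) -> list[str]:
    """Extract the path portion of each `git status --porcelain` (v1) entry."""
    paths: list[str] = []
    for line in porcelain_output.splitlines():
        if not line.strip():
            continue
        # Entries look like "XY path": two status characters, a space, the path.
        # " M" is modified, " D" is deleted, "??" is untracked.
        paths.append(line[3:])
    return paths


def schema_drift(schema_dir: str = "docs/schemas") -> list[str]:
    """Return paths under schema_dir that differ from the committed state.

    Assumes it runs inside a git work tree after the schemas were regenerated.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", schema_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    return changed_paths(result.stdout)
```

A CI step built on this would fail whenever `schema_drift()` returns a non-empty list, so an uncommitted new schema (reported as untracked) fails the job just like a modified one.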
Summary
- Add JSON Schemas for CombinedTranscript and BatchSummary under docs/schemas/, generated from the Pydantic source-of-truth via scripts/generate_schemas.py (serialization mode). A CI drift check (new Schema Drift Check job in ci.yml) keeps the committed artifacts in lockstep with the models. The release workflow attaches them as GitHub release assets so consumers without access to the Python package have an authoritative artifact to validate against.
- Fix speaker: null emission on word-level entries in combined transcripts: when WhisperX's word-level diarization join misses (boundary-overlap gap on short trailing words), _build_segments now falls back to the enclosing segment's speaker. The Pydantic contract is unchanged (word-level speaker is still nullable), but the field is now populated whenever segment-level assignment is confident. Closes the consumer-side validation pain reported against v0.2.0 single-speaker transcripts.
- Bump version to 0.2.2.

Consumers fetch the schema for a pinned AR version via raw.githubusercontent.com/LunarCommand/audio-refinery/v0.2.2/docs/schemas/combined-transcript-v1.json; the major-version suffix only changes when the document shape breaks (e.g. v0.3.0 alignment lands as -v2.json). See docs/service.md → "Validating against the schema" for the consumption pattern.

Test plan
- Tests pass (make test: 377 passed)
- Schema drift check passes (make check-schemas)
- Lint and type checks pass (make lint, make type-check)
- TestBuildSegments cases cover both fallback-applies and fallback-respects-None paths
- docs/schemas/combined-transcript-v1.json spot-checked: WordSegment.speaker is anyOf: [string, null] and not in required
- Tag v0.2.2 pushed; release workflow attaches docs/schemas/*.json as release assets alongside dist/*
- raw.githubusercontent.com URL pinned to the tag
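The pinned-URL convention and the WordSegment.speaker spot-check from this description could be expressed on the consumer side roughly as below. Both function names are hypothetical, and the assumption that nested models appear under `$defs` reflects how Pydantic v2 typically lays out generated schemas, not something confirmed by this PR.

```python
from typing import Any


def pinned_schema_url(tag: str, name: str = "combined-transcript", major: int = 1) -> str:
    """Build the raw GitHub URL for a schema at a given release tag.

    Hypothetical helper; the path layout follows the URL quoted in the description.
    """
    return (
        "https://raw.githubusercontent.com/LunarCommand/audio-refinery/"
        f"{tag}/docs/schemas/{name}-v{major}.json"
    )


def speaker_is_optional_nullable(schema: dict[str, Any]) -> bool:
    """Check that WordSegment.speaker is anyOf [string, null] and not required.

    Assumes the WordSegment definition is stored under "$defs".
    """
    word = schema["$defs"]["WordSegment"]
    speaker = word["properties"]["speaker"]
    types = {option.get("type") for option in speaker.get("anyOf", [])}
    return types == {"string", "null"} and "speaker" not in word.get("required", [])
```

Pinning the tag in the URL means a consumer keeps validating against the schema that shipped with the release they integrated, rather than whatever is on the default branch.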