Publish JSON schemas; speaker fallback on join gap (v0.2.2)#7

Merged
chris-colinsky merged 2 commits into main from release/v0.2.2
May 23, 2026

Conversation

chris-colinsky (Member) commented May 23, 2026

Summary

  • Publishes JSON Schemas for CombinedTranscript and BatchSummary under docs/schemas/, generated from the Pydantic source-of-truth via scripts/generate_schemas.py (serialization mode). A CI drift check (new Schema Drift Check job in ci.yml) keeps the committed artifacts in lockstep with the models. The release workflow attaches them as GitHub release assets so consumers without access to the Python package have an authoritative artifact to validate against.
  • Fixes a speaker: null emission on word-level entries in combined transcripts: when WhisperX's word-level diarization join misses (boundary-overlap gap on short trailing words), _build_segments now falls back to the enclosing segment's speaker. The Pydantic contract is unchanged — word-level speaker is still nullable — but the field is now populated whenever segment-level assignment is confident. Closes the consumer-side validation pain reported against v0.2.0 single-speaker transcripts.
  • Refreshes the README banner image and bumps the package to 0.2.2.

Consumers fetch the schema for a pinned audio-refinery (AR) version via raw.githubusercontent.com/LunarCommand/audio-refinery/v0.2.2/docs/schemas/combined-transcript-v1.json. The major-version suffix changes only when the document shape breaks (e.g. v0.3.0 alignment lands as -v2.json). See docs/service.md → "Validating against the schema" for the consumption pattern.
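
A minimal sketch of the pinned-URL pattern described above, using only the standard library. The helper names `schema_url` and `fetch_schema` are hypothetical and not part of the package; only the URL layout comes from this description, and the `https://` scheme is assumed.

```python
import json
from urllib.request import urlopen

# URL layout taken from the PR description; the helpers below are illustrative.
RAW_BASE = "https://raw.githubusercontent.com/LunarCommand/audio-refinery"


def schema_url(tag: str, name: str = "combined-transcript", major: int = 1) -> str:
    """Build the raw URL for a schema pinned to a release tag (hypothetical helper)."""
    return f"{RAW_BASE}/{tag}/docs/schemas/{name}-v{major}.json"


def fetch_schema(tag: str, name: str = "combined-transcript", major: int = 1) -> dict:
    """Download and parse the pinned schema. Requires network access."""
    with urlopen(schema_url(tag, name, major), timeout=10) as resp:
        return json.load(resp)
```

Pinning to the tag rather than to `main` means the schema a consumer validates against cannot change underneath them. The parsed dict can then be handed to whichever JSON Schema validator the consumer already uses.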

Test plan

  • Unit tests pass locally (make test — 377 passed)
  • Schema drift check passes locally (make check-schemas)
  • Lint and type-check pass (make lint, make type-check)
  • New TestBuildSegments cases cover both fallback-applies and fallback-respects-None paths
  • docs/schemas/combined-transcript-v1.json spot-checked: WordSegment.speaker is anyOf: [string, null] and not in required
  • CI green on all four jobs (test, lint, type-check, schemas)
  • After merge: tag v0.2.2 pushed; release workflow attaches docs/schemas/*.json as release assets alongside dist/*
  • After release: schema is fetchable from the raw.githubusercontent.com URL pinned to the tag
  • After release: downstream consumer can validate a real AR transcript against the published schema
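
The spot-check on `WordSegment.speaker` listed above could be automated along these lines. This is a sketch under assumptions: it presumes nested models sit under `$defs` (the layout Pydantic v2 normally emits), and the `example` fragment is hand-written for illustration, not copied from the published schema.

```python
def speaker_is_optional_and_nullable(schema: dict, model: str = "WordSegment") -> bool:
    """True if `speaker` allows string or null and is not listed as required."""
    # Assumption: nested model definitions live under "$defs".
    definition = schema.get("$defs", {}).get(model, {})
    speaker = definition.get("properties", {}).get("speaker", {})
    allowed = {branch.get("type") for branch in speaker.get("anyOf", [])}
    not_required = "speaker" not in definition.get("required", [])
    return allowed == {"string", "null"} and not_required


# Hand-written fragment for illustration only.
example = {
    "$defs": {
        "WordSegment": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "speaker": {
                    "anyOf": [{"type": "string"}, {"type": "null"}],
                    "default": None,
                },
            },
            "required": ["word"],
        }
    }
}

print(speaker_is_optional_and_nullable(example))
```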

Add JSON Schemas for CombinedTranscript and BatchSummary under
docs/schemas/. Generated from the Pydantic source-of-truth via
scripts/generate_schemas.py in serialization mode. A CI drift check
keeps the committed artifacts in lockstep with the models. Attached
as assets on each GitHub release so consumers without access to the
Python package have an authoritative artifact to validate against.
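
A rough sketch of what a drift check of this kind does, not the project's actual script or test. It takes already-generated schema dicts (in the real project these presumably come from the Pydantic models in serialization mode) and compares them with the committed files. The names `render` and `find_drift` and the JSON formatting choices are assumptions.

```python
import json
from pathlib import Path


def render(schema: dict) -> str:
    """Serialize deterministically so a byte-for-byte comparison is meaningful."""
    return json.dumps(schema, indent=2, sort_keys=True) + "\n"


def find_drift(generated: dict[str, dict], schema_dir: Path) -> list[str]:
    """Return drift messages; an empty list means committed files are in sync."""
    problems = []
    for filename, schema in sorted(generated.items()):
        path = schema_dir / filename
        if not path.exists():
            problems.append(f"missing: {filename}")
        elif path.read_text(encoding="utf-8") != render(schema):
            problems.append(f"stale: {filename}")
    # Committed files that no model generates any more also count as drift.
    for path in sorted(schema_dir.glob("*.json")):
        if path.name not in generated:
            problems.append(f"orphaned: {path.name}")
    return problems
```

A CI job would fail the build whenever the returned list is non-empty.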

In _build_segments, fall back to the enclosing segment's speaker when
WhisperX's word-level diarization join misses (boundary-overlap gap
on short trailing words). The Pydantic contract is unchanged --
word-level speaker is still nullable -- but in practice the field is
now populated whenever segment-level assignment is confident.
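
The fallback can be illustrated with plain dicts. This is a sketch, not the body of _build_segments: the function name `fill_word_speakers` is hypothetical, and the `speaker` / `words` keys are assumed to follow a WhisperX-style segment layout.

```python
from typing import Optional


def fill_word_speakers(segment: dict) -> dict:
    """Return a copy of `segment` whose words inherit the segment speaker when theirs is missing."""
    segment_speaker: Optional[str] = segment.get("speaker")
    words = []
    for word in segment.get("words", []):
        word = dict(word)  # copy so the caller's data is left untouched
        if word.get("speaker") is None:
            # The word-level join missed; inherit from the enclosing segment.
            # If the segment has no speaker either, the value stays None.
            word["speaker"] = segment_speaker
        words.append(word)
    return {**segment, "words": words}
```

A word that already carries its own speaker is never overwritten, and a segment without a confident speaker leaves its words as None, which is why the field has to remain nullable in the schema.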

Refresh the README banner image. Bump version to 0.2.2.
Copilot AI review requested due to automatic review settings May 23, 2026 03:41

Copilot AI left a comment

Pull request overview

Publishes versioned JSON Schema artifacts for the service’s combined transcript and batch summary documents, adds CI enforcement to prevent schema/model drift, and improves transcript word-level speaker assignment by falling back to the enclosing segment’s speaker when WhisperX’s word-level diarization join misses.

Changes:

  • Add scripts/generate_schemas.py plus committed docs/schemas/*.json artifacts and documentation for consumers.
  • Add make generate-schemas / make check-schemas and a dedicated CI job to enforce drift checks; attach schemas to GitHub releases.
  • Fix _build_segments to populate missing word-level speaker values from the segment-level speaker (with new unit tests).

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| uv.lock | Bumps locked project version to 0.2.2. |
| pyproject.toml | Bumps package version to 0.2.2. |
| CHANGELOG.md | Adds 0.2.2 release notes covering schema publishing + speaker fallback. |
| src/transcriber.py | Implements word speaker fallback to segment speaker in _build_segments. |
| tests/test_transcriber.py | Adds unit tests for word speaker fallback behavior. |
| scripts/generate_schemas.py | Generates JSON schemas from Pydantic models into docs/schemas/. |
| tests/test_schemas_drift.py | Adds pytest drift check to ensure committed schemas match model output. |
| Makefile | Adds generate-schemas / check-schemas targets and includes schema drift in all-checks. |
| docs/service.md | Documents schema validation and pinned raw GitHub URLs. |
| docs/schemas/combined-transcript-v1.json | Adds generated CombinedTranscript schema artifact. |
| docs/schemas/batch-summary-v1.json | Adds generated BatchSummary schema artifact. |
| .github/workflows/ci.yml | Adds “Schema Drift Check” CI job. |
| .github/workflows/release.yml | Includes docs/schemas/*.json as GitHub release assets. |

Comment thread: Makefile (outdated)
git diff --exit-code only sees tracked changes. A new SCHEMA_TARGETS
entry whose generated file was never committed, or a git rm of an
existing schema, would slip through the previous check. Switch to
git status --porcelain so modifications, deletions, and untracked
files all trigger the failure.

Raised by Copilot on PR #7.
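
To illustrate why the porcelain output catches more than a plain diff, here is a small sketch that lists every path `git status --porcelain` reports. It is not the Makefile recipe; the function names are hypothetical, the default directory is taken from the description, and the parser is simplified (it ignores rename arrows and quoted paths).

```python
import subprocess


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output."""
    # Each line is "XY <path>": two status characters, a space, then the path.
    return [line[3:] for line in porcelain.splitlines() if line.strip()]


def schemas_dirty(schema_dir: str = "docs/schemas") -> list[str]:
    """List modified, deleted, or untracked files under schema_dir. Needs a git checkout."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", schema_dir],
        check=True,
        capture_output=True,
        text=True,
    )
    return changed_paths(result.stdout)


# Sample output: a modification, a staged deletion (git rm), and an untracked file.
sample = (
    " M docs/schemas/combined-transcript-v1.json\n"
    "D  docs/schemas/batch-summary-v1.json\n"
    "?? docs/schemas/new-model-v1.json\n"
)
print(changed_paths(sample))
```

Of the three sample lines, a bare `git diff --exit-code` would flag only the first: the staged deletion and the untracked file do not appear in a working-tree diff.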

chris-colinsky merged commit 3a97d8c into main on May 23, 2026 (8 checks passed)
chris-colinsky deleted the release/v0.2.2 branch on May 23, 2026 03:52