Built a production-ready AI voice platform that lets creators, businesses, and developers generate realistic speech audio at scale: fast, clean, and API-ready.
- Multi-voice generation with controllable parameters (temperature, top-p, top-k, repetition penalty)
- Organization-based voice management (system voices + per-org custom voices)
- Low-latency audio generation and playback (streamed audio via API)
- Developer-friendly APIs (Next.js Route Handlers + tRPC), plus a typed client to the inference service (OpenAPI)
- Audio storage + delivery via Cloudflare R2 (S3-compatible) with signed URLs
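The controllable parameters listed above might be normalized before a request reaches the inference service. The following is a minimal sketch, not the project's code: `SamplingParams`, `normalizeParams`, and every default and range are hypothetical values chosen for illustration.

```typescript
// Sampling parameters named in the feature list. The defaults and ranges
// below are illustrative assumptions, not values taken from the project.
interface SamplingParams {
  temperature: number;       // > 0; higher values give more varied output
  topP: number;              // nucleus sampling threshold, in (0, 1]
  topK: number;              // positive integer: candidate tokens kept per step
  repetitionPenalty: number; // >= 1; 1 disables the penalty
}

const DEFAULTS: SamplingParams = {
  temperature: 0.8,
  topP: 0.95,
  topK: 50,
  repetitionPenalty: 1.2,
};

// Use the fallback when the value is missing or not a finite number.
function pick(value: number | undefined, fallback: number): number {
  return typeof value === "number" && isFinite(value) ? value : fallback;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Merge caller overrides with defaults and clamp each value into range.
function normalizeParams(overrides: Partial<SamplingParams> = {}): SamplingParams {
  return {
    temperature: clamp(pick(overrides.temperature, DEFAULTS.temperature), 0.05, 2),
    topP: clamp(pick(overrides.topP, DEFAULTS.topP), 0.01, 1),
    topK: Math.round(clamp(pick(overrides.topK, DEFAULTS.topK), 1, 1000)),
    repetitionPenalty: clamp(pick(overrides.repetitionPenalty, DEFAULTS.repetitionPenalty), 1, 2),
  };
}
```

Clamping rather than rejecting keeps the UI sliders forgiving; an API route could instead reject out-of-range values with a validation error.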
- Next.js (App Router) + TypeScript + React
- Prisma (PostgreSQL via @prisma/adapter-pg + pg)
- REST API (Next Route Handlers) + tRPC
- Python FastAPI service on Modal (GPU A10G) for TTS inference
- OpenAPI typegen + typed client (openapi-typescript + openapi-fetch)
- Auth & Org (Clerk)
- Object Storage (Cloudflare R2 via AWS S3 SDK + signed URLs)
- UI (Tailwind CSS + shadcn/ui + Radix)
- Validation (Zod)
- Data fetching/state (TanStack React Query)
flowchart LR
subgraph Client
U[User Browser]
UI[Dashboard UI]
U --> UI
end
subgraph App[Next.js App Router]
API[API Routes and tRPC]
Auth[Clerk Auth and Org]
UI --> API
API --> Auth
end
subgraph Data[Data Layer]
DB[(Postgres)]
R2[Cloudflare R2]
end
subgraph Inference[Inference Layer]
M[Modal FastAPI Service]
Model[Chatterbox TTS Model]
M --> Model
end
Auth -->|user session and orgId| API
API -->|Prisma read write| DB
%% Voice upload and management
API -->|upload voice audio| R2
API -->|save voice metadata| DB
%% TTS generation
API -->|POST generate| M
M -->|read voice prompt audio| R2
M -->|return wav bytes| API
API -->|store generation audio| R2
API -->|save generation record| DB
%% Playback
UI -->|GET api audio generationId| API
API -->|signed url| R2
R2 -->|wav stream| UI
- Voice upload: UI -> API -> R2, then API -> DB
- Generate TTS: UI -> tRPC -> Modal, then API stores WAV to R2 and metadata to DB
- Playback: UI requests /api/audio/:generationId, API returns a streamed WAV from R2
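The generate flow above can be sketched with the three collaborators injected as interfaces. This is a hedged illustration only: `InferenceClient`, `ObjectStore`, `GenerationRepo`, `generateSpeech`, and the object key layout are hypothetical names and shapes, not the project's actual code.

```typescript
// Hypothetical ports for the collaborators in the generate flow.
// Names and shapes are assumptions for illustration only.
interface InferenceClient {
  // Resolves to WAV bytes returned by the inference service.
  generate(input: { text: string; voiceKey: string }): Promise<Uint8Array>;
}
interface ObjectStore {
  put(key: string, body: Uint8Array, contentType: string): Promise<void>;
}
interface GenerationRepo {
  create(record: { id: string; orgId: string; text: string; audioKey: string }): Promise<void>;
}

interface GenerateDeps {
  inference: InferenceClient;
  store: ObjectStore;
  repo: GenerationRepo;
  newId: () => string;
}

// Generate TTS: call inference, store the WAV, then save the metadata record.
async function generateSpeech(
  deps: GenerateDeps,
  input: { orgId: string; text: string; voiceKey: string },
): Promise<{ id: string; audioUrl: string }> {
  const id = deps.newId();
  const wav = await deps.inference.generate({ text: input.text, voiceKey: input.voiceKey });

  // Assumed key layout; the real bucket layout is not specified in this README.
  const audioKey = `generations/${input.orgId}/${id}.wav`;
  await deps.store.put(audioKey, wav, "audio/wav");

  // Write the DB row last so a record never points at a missing object.
  await deps.repo.create({ id, orgId: input.orgId, text: input.text, audioKey });

  // Playback goes through the API route rather than exposing the R2 key.
  return { id, audioUrl: `/api/audio/${id}` };
}
```

Injecting the collaborators keeps the ordering (inference, then storage, then database) testable with in-memory fakes, without a GPU or a bucket.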
- Node.js (recommended: latest LTS)
- Postgres database
- Cloudflare R2 credentials + bucket
- Clerk application keys
- A running Chatterbox inference API (Modal) or your own compatible endpoint
git clone https://github.com/Ryanakml/Voxify.git
cd Voxify
npm install

Create a .env in the project root. Minimum variables used by the code:
# Database
DATABASE_URL=postgresql://USER:PASSWORD@HOST:PORT/DB
# Cloudflare R2 (S3-compatible)
R2_ACCOUNT_ID=...
R2_ACCESS_KEY_ID=...
R2_SECRET_ACCESS_KEY=...
R2_BUCKET_NAME=...
# Chatterbox inference API (Modal FastAPI)
CHATTERBOX_API_URL=https://your-modal-app.modal.run
CHATTERBOX_API_KEY=...
# Clerk (typical Next.js + Clerk env vars)
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=...
CLERK_SECRET_KEY=...

npx prisma migrate dev
npm run seed

The sync:api script fetches ${CHATTERBOX_API_URL}/openapi.json and regenerates the typed client types:
npm run sync:api

npm run dev

The inference service lives in chatterbox_tts.py and is designed to run on Modal with an A10G GPU.
- It expects an API key via the x-api-key header
- It reads voice prompt audio from a Cloudflare R2 bucket mounted into Modal
If you change the inference API base URL, update CHATTERBOX_API_URL and rerun npm run sync:api.
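As a rough sketch of how a caller might combine CHATTERBOX_API_URL, CHATTERBOX_API_KEY, and the x-api-key header, the helpers below read the two variables and build a request description. `loadInferenceConfig` and `buildGenerateRequest` are hypothetical names, and the `/generate` path and body fields are assumptions; the service's openapi.json is the authority on the real contract.

```typescript
interface InferenceConfig {
  baseUrl: string;
  apiKey: string;
}

// Read the two inference variables from an env-like object (e.g. process.env)
// and fail fast with a clear message when one is missing.
function loadInferenceConfig(env: Record<string, string | undefined>): InferenceConfig {
  const baseUrl = env["CHATTERBOX_API_URL"];
  const apiKey = env["CHATTERBOX_API_KEY"];
  if (!baseUrl) throw new Error("CHATTERBOX_API_URL is not set");
  if (!apiKey) throw new Error("CHATTERBOX_API_KEY is not set");
  // Strip trailing slashes so path joining stays predictable.
  return { baseUrl: baseUrl.replace(/\/+$/, ""), apiKey };
}

// Build a POST request for the inference service.
// ASSUMPTION: the "/generate" path and the body fields are hypothetical;
// check the service's openapi.json for the real contract.
function buildGenerateRequest(
  config: InferenceConfig,
  body: { text: string; voiceKey: string },
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `${config.baseUrl}/generate`,
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": config.apiKey, // header named in this README
      },
      body: JSON.stringify(body),
    },
  };
}
```

On the server, the result could be passed along as `fetch(req.url, req.init)`. In the actual project the typed openapi-fetch client generated by sync:api would likely replace a hand-built request like this one.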
