Replace the background of a talking-head MP4 with a still image — no green
screen required. Uses Robust Video Matting
(RVM) via ONNX Runtime to segment the person, then composites them over the
chosen background and re-encodes with ffmpeg.
The project is split into:
- `backend/`: FastAPI service that runs RVM matting (ONNX, CPU) frame by frame and pipes the composited frames into `ffmpeg` for H.264/AAC encoding.
- `frontend/`: React + Vite + TypeScript UI.
- Python 3.9+
- Node.js 20+
- `ffmpeg` and `ffprobe` available on `PATH` (e.g. `brew install ffmpeg` on macOS, `apt install ffmpeg` on Debian/Ubuntu).
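
One way to confirm that both tools are discoverable before starting the backend is a small standard-library check like the one below. This is only a sketch and not part of the project; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("ffmpeg and ffprobe are available.")
```
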
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=. uvicorn app.main:app --reload --port 8000

cd frontend
npm install
npm run dev

The Vite dev server proxies /api to http://localhost:8000.

./start.sh

(Assumes the backend's .venv already exists.)

docker compose up --build

The image installs ffmpeg, builds the frontend into static/ and serves the
API on port 8000.
- `POST /api/probe`: multipart `file` (mp4/mov) → `{ "duration_seconds", "width", "height" }`.
- `POST /api/build`: multipart:
  - `video_file`: the source MP4 (talking-head; any background).
  - `background_file`: a still image (PNG/JPEG/WEBP) to use as the new background.
  - `matting_model` (optional, default `mobilenetv3`): `mobilenetv3` (Fast, ~14 MB) or `resnet50` (Quality, ~100 MB).
  - `downsample_ratio` (optional, default `0.25`, range `0.05`–`1.0`): internal inference scale. RVM suggests ~`0.25` for 1080p, ~`0.375` for 720p, `1.0` for ≤512 px. Lower = faster, less edge detail.
  - `output_name` (optional): output filename.

  Returns the generated `video/mp4`.
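
The `downsample_ratio` suggestions roughly correspond to shrinking the frame so that its longer side is about 512 px during inference. The sketch below encodes that heuristic and clamps the result to the accepted range. `suggest_downsample_ratio` is a hypothetical helper, not part of this API, and the 512 px target is an assumption chosen to approximate the suggested values rather than reproduce them exactly.

```python
def suggest_downsample_ratio(width, height, target_long_side=512,
                             lo=0.05, hi=1.0):
    """Pick a ratio that brings the longer side to about `target_long_side` px.

    The result is clamped to [lo, hi], matching the documented 0.05-1.0 range.
    """
    long_side = max(width, height)
    if long_side <= 0:
        raise ValueError("width and height must be positive")
    ratio = target_long_side / long_side
    return max(lo, min(hi, ratio))


if __name__ == "__main__":
    # 1080p lands near the suggested ~0.25, 720p near ~0.375, small inputs at 1.0.
    for w, h in [(1920, 1080), (1280, 720), (512, 288)]:
        print(f"{w}x{h}: {suggest_downsample_ratio(w, h):.3f}")
```

The values from `/api/probe` (`width`, `height`) could feed this helper directly before calling `/api/build`.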
Each frame is run through the RVM ONNX model (CPU), which produces a
foreground RGB and an alpha matte. The frame is composited over the
background image (scaled + center-cropped to the video's resolution) and
streamed as raw BGR24 into a single ffmpeg process that mixes in the
original audio and encodes H.264/AAC.
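
The "scaled + center-cropped" step and the alpha blend can be illustrated with plain arithmetic. The sketch below is an assumption about how such a step is typically done, not the project's actual code; `cover_fit` and `blend_channel` are hypothetical names. The background is scaled until it covers the video frame in both dimensions, the overflow is cropped equally from both sides, and each output channel is the standard blend `alpha * foreground + (1 - alpha) * background`.

```python
def cover_fit(bg_w, bg_h, video_w, video_h):
    """Return (scaled_w, scaled_h, crop_x, crop_y) for a cover-style fit.

    The background is scaled uniformly so it is at least as large as the
    video in both dimensions, then cropped around its center.
    """
    scale = max(video_w / bg_w, video_h / bg_h)
    # max() guards against rounding leaving the image one pixel too small.
    scaled_w = max(video_w, round(bg_w * scale))
    scaled_h = max(video_h, round(bg_h * scale))
    crop_x = (scaled_w - video_w) // 2
    crop_y = (scaled_h - video_h) // 2
    return scaled_w, scaled_h, crop_x, crop_y


def blend_channel(foreground, background, alpha):
    """Alpha-composite one channel value; alpha is in [0, 1]."""
    return alpha * foreground + (1.0 - alpha) * background


if __name__ == "__main__":
    # A 4000x3000 still behind a 1920x1080 video:
    print(cover_fit(4000, 3000, 1920, 1080))  # (1920, 1440, 0, 180)
    print(blend_channel(200, 100, 0.5))       # 150.0
```
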
The selected model file is downloaded on first use into
~/.cache/mp4_bg_swap/models/ and reused afterwards (override the cache
location with the MP4_BG_SWAP_CACHE environment variable).
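
The cache lookup described above could be resolved along these lines. This is a sketch under one assumption: that `MP4_BG_SWAP_CACHE`, when set, replaces the whole models directory rather than just a parent folder. `model_cache_dir` is a hypothetical name.

```python
import os
from pathlib import Path


def model_cache_dir(environ=None):
    """Return the directory where downloaded model files would be stored."""
    environ = os.environ if environ is None else environ
    override = environ.get("MP4_BG_SWAP_CACHE")
    if override:
        # Assumption: the variable points directly at the directory to use.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "mp4_bg_swap" / "models"


if __name__ == "__main__":
    print(model_cache_dir())
```
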
Performance: CPU inference is slow. On an Apple-silicon Mac, expect roughly 1 fps for a 1080p source with
`mobilenetv3` at `downsample_ratio=0.25`; `resnet50` is several times slower. Pick Fast for iteration, Quality for the final render.
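
To get a feel for what roughly 1 fps means in practice, total render time scales with the number of frames, not with the clip length alone. The helper below is a back-of-the-envelope sketch; `estimate_render_seconds` is a hypothetical name and the default of 1 processed frame per second is simply the rough figure quoted above.

```python
def estimate_render_seconds(duration_seconds, source_fps, processing_fps=1.0):
    """Estimate wall-clock render time from clip length and frame rate."""
    total_frames = duration_seconds * source_fps
    return total_frames / processing_fps


if __name__ == "__main__":
    # A 60-second clip at 30 fps is 1800 frames, so about 30 minutes at 1 fps.
    seconds = estimate_render_seconds(60, 30)
    print(f"{seconds / 60:.0f} minutes")
```
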
- Same resolution and frame rate as the input video, H.264 (`yuv420p`), AAC audio at 192 kbps.
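
An ffmpeg invocation that matches this output spec might be assembled as follows. This is a sketch, not the backend's actual command: `build_ffmpeg_args` is a hypothetical helper, and the choices of `libx264` as the H.264 encoder and of an optional audio mapping (`1:a:0?`, which tolerates sources without audio) are assumptions. Raw BGR24 frames are read from stdin as input 0, and the original file supplies the audio as input 1.

```python
def build_ffmpeg_args(width, height, fps, source_path, output_path):
    """Build an argv list for encoding raw BGR24 frames piped on stdin."""
    return [
        "ffmpeg", "-y",
        # Input 0: raw frames from stdin, described explicitly.
        "-f", "rawvideo",
        "-pixel_format", "bgr24",
        "-video_size", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", "-",
        # Input 1: the original video, used only for its audio track.
        "-i", source_path,
        "-map", "0:v:0",
        "-map", "1:a:0?",
        # Output: H.264 in yuv420p, AAC at 192 kbps.
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-b:a", "192k",
        output_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_args(1920, 1080, 30, "input.mp4", "output.mp4")))
```

The resulting list can be passed to `subprocess.Popen(..., stdin=subprocess.PIPE)`, with each composited frame written to the process's stdin as `width * height * 3` bytes.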