MP4 Background Swap

Replace the background of a talking-head MP4 with a still image — no green screen required. Uses Robust Video Matting (RVM) via ONNX Runtime to segment the person, then composites them over the chosen background and re-encodes with ffmpeg.

The project is split into:

backend/ — FastAPI service that runs RVM matting (ONNX, CPU) frame-by-frame and pipes the composited frames into ffmpeg for H.264/AAC encoding.
frontend/ — React + Vite + TypeScript UI.

Requirements

Python 3.9+
Node.js 20+
ffmpeg and ffprobe available on PATH (e.g. brew install ffmpeg on macOS, apt install ffmpeg on Debian/Ubuntu).
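
A quick way to confirm both tools are reachable before starting the backend is sketched below. The helper name `missing_tools` is illustrative and not part of this repository; it only relies on the standard library's `shutil.which`.

```python
import shutil
from typing import Iterable, List

# Tools the README lists as required on PATH.
REQUIRED_TOOLS = ("ffmpeg", "ffprobe")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

Calling `missing_tools()` returns an empty list when both binaries are installed, and otherwise the names that still need installing.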

Local development

Backend

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=. uvicorn app.main:app --reload --port 8000

Frontend

cd frontend
npm install
npm run dev

The Vite dev server proxies /api to http://localhost:8000.

Both at once

./start.sh

(Assumes the backend's .venv already exists.)

Docker

docker compose up --build

The image installs ffmpeg, builds the frontend into static/ and serves the API on port 8000.

API

POST /api/probe — multipart file (mp4/mov) → { "duration_seconds", "width", "height" }.
POST /api/build — multipart:
- video_file: the source MP4 (talking-head; any background).
- background_file: a still image (PNG/JPEG/WEBP) to use as the new background.
- matting_model (optional, default mobilenetv3): mobilenetv3 (Fast, ~14 MB) or resnet50 (Quality, ~100 MB).
- downsample_ratio (optional, default 0.25, range 0.05–1.0): internal inference scale. RVM suggests ~0.25 for 1080p, ~0.375 for 720p, 1.0 for ≤512 px. Lower = faster, less edge detail.
- output_name (optional): output filename.

The response body is the generated video (video/mp4).
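
A minimal client sketch for /api/build, assuming the backend is running on http://localhost:8000 and that the third-party `requests` package is installed. The function names are hypothetical, and the size thresholds in `suggest_downsample_ratio` are an assumption that maps the suggested ratios above onto the frame's longest side.

```python
from typing import Dict, Optional

ALLOWED_MODELS = ("mobilenetv3", "resnet50")


def suggest_downsample_ratio(width: int, height: int) -> float:
    """Heuristic from the ratios quoted above; the thresholds are assumptions."""
    longest = max(width, height)
    if longest <= 512:
        return 1.0
    if longest <= 1280:  # roughly 720p
        return 0.375
    return 0.25  # 1080p and larger


def build_form_fields(
    matting_model: str = "mobilenetv3",
    downsample_ratio: float = 0.25,
    output_name: Optional[str] = None,
) -> Dict[str, str]:
    """Validate the optional parameters and return them as form fields."""
    if matting_model not in ALLOWED_MODELS:
        raise ValueError(f"matting_model must be one of {ALLOWED_MODELS}")
    if not 0.05 <= downsample_ratio <= 1.0:
        raise ValueError("downsample_ratio must be between 0.05 and 1.0")
    fields = {
        "matting_model": matting_model,
        "downsample_ratio": str(downsample_ratio),
    }
    if output_name:
        fields["output_name"] = output_name
    return fields


def swap_background(video_path: str, background_path: str, out_path: str,
                    base_url: str = "http://localhost:8000", **options) -> None:
    """POST both files to /api/build and save the returned MP4."""
    import requests  # third-party; imported lazily so the helpers above work without it

    fields = build_form_fields(**options)
    with open(video_path, "rb") as video, open(background_path, "rb") as background:
        response = requests.post(
            f"{base_url}/api/build",
            data=fields,
            files={"video_file": video, "background_file": background},
            timeout=None,  # CPU rendering can take many minutes
        )
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)
```

For example, `swap_background("talk.mp4", "office.jpg", "out.mp4", matting_model="resnet50")` would request a Quality render at the default ratio.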

How it works

Each frame is run through the RVM ONNX model (CPU), which produces a foreground RGB image and an alpha matte. The result is alpha-composited over the background image (scaled and center-cropped to the video's resolution), and the composited frame is streamed as raw BGR24 into a single ffmpeg process that mixes in the original audio and encodes H.264/AAC.
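
The scale-and-center-crop step and the per-pixel blend can be illustrated with a small pure-Python sketch. This is not the backend's actual code: `cover_crop` and `blend_pixel` are hypothetical names, and the blend shown is the standard alpha-compositing formula (foreground times alpha plus background times one minus alpha), which a real implementation would typically apply to whole arrays at once.

```python
from typing import Sequence, Tuple

Size = Tuple[int, int]
Box = Tuple[int, int, int, int]


def cover_crop(bg_w: int, bg_h: int, video_w: int, video_h: int) -> Tuple[Size, Box]:
    """Return (scaled_size, crop_box) so the background fully covers the frame.

    Integer arithmetic avoids float rounding when picking the limiting side.
    crop_box is (left, top, right, bottom) inside the scaled image.
    """
    if video_w * bg_h >= video_h * bg_w:
        # Width is the limiting side: match widths, let height overflow.
        scaled_w = video_w
        scaled_h = -(-bg_h * video_w // bg_w)  # ceiling division
    else:
        # Height is the limiting side: match heights, let width overflow.
        scaled_h = video_h
        scaled_w = -(-bg_w * video_h // bg_h)
    left = (scaled_w - video_w) // 2
    top = (scaled_h - video_h) // 2
    return (scaled_w, scaled_h), (left, top, left + video_w, top + video_h)


def blend_pixel(fg: Sequence[int], bg: Sequence[int], alpha: float) -> Tuple[int, ...]:
    """Standard alpha blend of one RGB pixel; alpha is in [0.0, 1.0]."""
    return tuple(round(f * alpha + b * (1.0 - alpha)) for f, b in zip(fg, bg))
```

With a 4000x3000 background and a 1920x1080 video, the background is scaled to 1920x1440 and 180 rows are trimmed from the top and bottom.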

The selected model file is downloaded on first use into ~/.cache/mp4_bg_swap/models/ and reused afterwards (override the cache location with the MP4_BG_SWAP_CACHE environment variable).
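
The download-once behaviour could look roughly like the sketch below. It assumes that MP4_BG_SWAP_CACHE replaces the ~/.cache/mp4_bg_swap root and that the models/ subfolder is kept underneath it; the README does not spell this out. The function names and the `download` callback are hypothetical.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def model_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the model directory, honouring MP4_BG_SWAP_CACHE if set.

    Assumption: the variable points at the cache root, not at models/ itself.
    """
    env = os.environ if env is None else env
    override = env.get("MP4_BG_SWAP_CACHE")
    if override:
        root = Path(override).expanduser()
    else:
        root = Path.home() / ".cache" / "mp4_bg_swap"
    return root / "models"


def ensure_model(filename: str, download: Callable[[Path], None],
                 env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cached model path, calling `download(target)` only if it is missing."""
    target = model_cache_dir(env) / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)
    return target
```

Pointing MP4_BG_SWAP_CACHE at a mounted volume is one way to keep the models across container rebuilds.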

Performance: CPU inference is slow. On an Apple-silicon Mac, expect roughly 1 fps for a 1080p source with mobilenetv3 at downsample_ratio=0.25; resnet50 is several times slower. Pick Fast for iteration, Quality for the final render.
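
To get a feel for what roughly 1 fps means in practice, the arithmetic can be sketched as follows. The throughput figure is the README's rough estimate; the clip length and frame rate in the example are illustrative assumptions, and encoding overhead is ignored.

```python
def estimate_render_minutes(duration_seconds: float, source_fps: float,
                            throughput_fps: float = 1.0) -> float:
    """Estimate wall-clock minutes: total frames divided by frames processed per second."""
    if throughput_fps <= 0:
        raise ValueError("throughput_fps must be positive")
    total_frames = duration_seconds * source_fps
    return total_frames / throughput_fps / 60.0
```

A 60-second clip at 30 fps has 1800 frames, so `estimate_render_minutes(60, 30)` gives 30.0 minutes at 1 fps.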

Output

Same resolution and frame rate as the input video, H.264 (yuv420p), AAC audio at 192 kbps.
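
One way these settings could be expressed as an ffmpeg invocation is sketched below. The exact flags the backend uses are not shown in this README, so treat this as an assumption: raw BGR24 frames arrive on stdin, audio is taken from the original file if it has any, and the ffmpeg build is assumed to include libx264. The function name is hypothetical.

```python
from typing import List


def build_ffmpeg_command(width: int, height: int, fps: float,
                         source_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg argv that encodes piped BGR24 frames plus the source audio."""
    return [
        "ffmpeg", "-y",
        # Input 0: raw composited frames read from stdin.
        "-f", "rawvideo", "-pix_fmt", "bgr24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",
        # Input 1: the original video, used only for its audio track.
        "-i", source_path,
        "-map", "0:v:0",
        "-map", "1:a:0?",  # trailing "?" makes the audio stream optional
        # Output: H.264 (yuv420p) video and AAC audio at 192 kbps.
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-b:a", "192k",
        "-shortest",
        output_path,
    ]
```

The list can be passed to `subprocess.Popen(cmd, stdin=subprocess.PIPE)`, after which each composited frame is written to the process's stdin as width x height x 3 bytes.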
