Guided background removal — tell remove_background what to keep.
Standard background removal (RMBG) guesses what the "foreground" is. Sometimes it's ambiguous (a living room — is the sofa the foreground? the coffee table? the rug?). Sometimes it keeps too much or too little. Guided Remove Background lets the user guide the process with natural language prompts — add items RMBG missed, remove items it kept, or narrow to specific objects. A VLM + SAM pipeline interprets user intent and precisely adjusts the RMBG baseline, then an agentic judge verifies the result and self-corrects if needed.
The pipeline has four layers, each with a single responsibility:
- RMBG (Bria RMBG-2.0) — produces a baseline foreground mask with sub-pixel alpha edges. This is the starting point: the "obvious foreground."
- VLM (Vision-Language Model) — reads the user's prompt and classifies their intent into a mode (`add`, `remove`, `narrow`, or `add_remove`) plus a list of target descriptions for SAM to find.
- SAM (Segment Anything Model) — a dumb segmentation tool. Given text descriptions, it returns object masks. It has no concept of modes — it doesn't know or care what we do with its masks afterward.
- Judge (VLM-based, Claude Opus) — an agentic verification loop that evaluates the final result against the user's prompt. Uses chain-of-thought reasoning: first inventories every object visible in the result, then compares against the prompt. If it finds issues (missing items, unwanted items, mask artifacts), it triggers automated corrections (re-run SAM with better prompts, fill holes) for up to 2 retries. Duplicate detection breaks hallucination loops early. If corrections make things worse, it reverts to the pre-judge result.
The mode logic lives entirely in our pipeline, which combines the RMBG baseline with SAM masks differently depending on the VLM's classification:

    User prompt ──► VLM classifies intent ──► SAM segments targets ──► Pipeline combines masks ──► Alpha blend ──► Judge verifies
                              │                        │                          │                                      │
                       mode + targets          per-object masks            RMBG ∪/−/∩ SAM                      pass / retry / revert
                                                (mode-agnostic)             based on mode

| Step | What Happens |
|---|---|
| 1. RMBG baseline | Bria RMBG-2.0 produces the default foreground mask with sub-pixel alpha |
| 2. VLM decomposition | A vision-language model sees the image + RMBG overlay, classifies user intent into a mode + target list |
| 3. SAM segmentation | SAM 3.1 segments each target object by text description (mode-agnostic) |
| 4. Mode-specific mask | Pipeline combines RMBG and SAM masks using the mode's formula |
| 5. Alpha blend | RMBG's precise alpha where available; feathered edges for SAM-only zones |
| 6. Agentic judge | VLM evaluates the final image, triggers corrections if needed, reverts if corrections fail |
The core principle: the obvious foreground (RMBG baseline) is preserved by default. The user's prompt guides modifications to that baseline.
- ADD — Keep everything RMBG found, plus extra items: `final = RMBG ∪ SAM_targets`. Example: "with the surfboard on the wall"
- REMOVE — Keep everything RMBG found, minus specific items: `final = RMBG − SAM_targets`. Example: "without the dog"
- NARROW — Keep only the named items from RMBG's foreground: `final = RMBG ∩ SAM_targets`. Reserved for explicit "only/just" statements or when the user names the main subject directly. Example: "only the chef and the pot"
- ADD_REMOVE — Mixed intent, applies both: `final = (RMBG ∪ add_targets) − remove_targets`. Example: "add the surfboard, remove the far chairs"
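The four formulas map directly onto boolean array operations. The following is a minimal sketch with NumPy, assuming every mask is a boolean array of the same shape; `combine_masks` and its argument names are illustrative and are not taken from the package. It shows only the set logic, without the morphological post-processing that the pipeline applies on top.

```python
import numpy as np

def combine_masks(mode, rmbg, targets, remove_targets=None):
    """Combine the RMBG baseline with SAM target masks (boolean arrays).

    Hypothetical helper: `targets` is the union of SAM masks for the prompt.
    In "add_remove" mode it holds the masks to add, and `remove_targets`
    holds the masks to drop.
    """
    if mode == "add":
        return rmbg | targets                      # RMBG ∪ SAM_targets
    if mode == "remove":
        return rmbg & ~targets                     # RMBG − SAM_targets
    if mode == "narrow":
        return rmbg & targets                      # RMBG ∩ SAM_targets
    if mode == "add_remove":
        if remove_targets is None:
            raise ValueError("add_remove needs remove_targets")
        return (rmbg | targets) & ~remove_targets  # (RMBG ∪ add) − remove
    raise ValueError(f"unknown mode: {mode!r}")

# Toy 1x4 "image": RMBG kept pixels 0-1, SAM found pixels 1-2.
rmbg = np.array([[True, True, False, False]])
sam = np.array([[False, True, True, False]])
far = np.array([[True, False, False, False]])

added = combine_masks("add", rmbg, sam)
removed = combine_masks("remove", rmbg, sam)
narrowed = combine_masks("narrow", rmbg, sam)
mixed = combine_masks("add_remove", rmbg, sam, far)
```

Keeping the mode switch in one place mirrors the design described above: SAM only produces masks, and the pipeline alone decides how they are combined.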
Each step in detail, including the post-processing we apply:
Step 1 — RMBG baseline (1 API call)
Bria RMBG-2.0 returns an RGBA image with sub-pixel alpha edges. We threshold at alpha > 128 to get a binary rmbg_mask. This mask is the starting point for everything.
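The thresholding step can be sketched in a few lines of NumPy. This assumes the RGBA result has already been decoded into an `H x W x 4` `uint8` array; the function name `alpha_to_mask` is hypothetical.

```python
import numpy as np

def alpha_to_mask(rgba, threshold=128):
    """Return a boolean foreground mask from an H x W x 4 uint8 RGBA array."""
    # Channel 3 is alpha; strictly greater than the threshold counts as foreground.
    return rgba[..., 3] > threshold

# Toy 2x2 image whose alpha values straddle the threshold.
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[..., 3] = [[0, 128], [129, 255]]
rmbg_mask = alpha_to_mask(rgba)
```

Note that a pixel with alpha exactly 128 falls outside the mask, because the comparison is strict.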
Step 2 — VLM decomposition (1 vision LLM call + 1 text-only LLM call for NARROW mode)
We send the VLM two images: the original photo and a green overlay showing what RMBG considers foreground. This lets the VLM see the baseline before deciding how to modify it. The VLM returns a mode + target list. For NARROW mode, a second text-only LLM call validates each target against the user's prompt and drops any the VLM hallucinated.
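To make the shape of this step concrete, here is a sketch of parsing a decomposition and filtering NARROW targets. The JSON field names (`mode`, `targets`, `remove_targets`), the `Decomposition` class, and both function names are assumptions for illustration; the actual response format of the package is not shown in this document. The validation call is modelled as a caller-supplied function, with a simple substring check standing in for the text-only LLM.

```python
import json
from dataclasses import dataclass, field

MODES = {"add", "remove", "narrow", "add_remove"}

@dataclass
class Decomposition:
    mode: str
    targets: list = field(default_factory=list)
    remove_targets: list = field(default_factory=list)

def parse_decomposition(raw):
    """Parse a (hypothetical) JSON reply from the VLM and check the mode."""
    data = json.loads(raw)
    mode = str(data.get("mode", "")).lower()
    if mode not in MODES:
        raise ValueError(f"unexpected mode: {mode!r}")
    return Decomposition(
        mode=mode,
        targets=list(data.get("targets", [])),
        remove_targets=list(data.get("remove_targets", [])),
    )

def validate_narrow_targets(targets, prompt, is_supported):
    """Keep only targets that the validator says the prompt really asks for."""
    return [t for t in targets if is_supported(t, prompt)]

prompt = "only the chef and the pot"
reply = '{"mode": "NARROW", "targets": ["chef", "pot", "plate"]}'
decomposition = parse_decomposition(reply)

# Stand-in validator: a substring check instead of a second LLM call.
kept = validate_narrow_targets(
    decomposition.targets, prompt, lambda target, text: target in text
)
```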
Step 3 — SAM segmentation (N API calls, parallelized)
Each target description is sent to SAM 3.1 as a separate text prompt, in parallel (up to 4 workers). SAM returns a binary mask + confidence score per target. For ADD mode, if SAM finds nothing, we retry with simplified 2-3 word descriptions.
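The fan-out with a retry can be sketched with the standard library's thread pool. Everything here is illustrative: `segment_all`, `simplify`, and the stub `fake_segment` are hypothetical names, the real SAM client is replaced by a caller-supplied function, and keeping the last three words is only a crude stand-in for however the pipeline actually simplifies a description.

```python
from concurrent.futures import ThreadPoolExecutor

def simplify(description, max_words=3):
    """Crude stand-in for prompt simplification: keep the last few words."""
    return " ".join(description.split()[-max_words:])

def segment_all(targets, segment, mode, max_workers=4):
    """Run segment(description) for every target in parallel.

    `segment` returns (mask, score), or None when nothing is found.
    In "add" mode a miss is retried once with a simplified description.
    """
    def run(description):
        result = segment(description)
        if result is None and mode == "add":
            result = segment(simplify(description))
        return description, result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with targets.
        return dict(pool.map(run, targets))

def fake_segment(description):
    """Stub segmenter: only 'understands' descriptions of three words or fewer."""
    if len(description.split()) > 3:
        return None
    return (f"mask<{description}>", 0.9)

targets = ["the tall wooden spiral staircase", "surfboard"]
found_add = segment_all(targets, fake_segment, mode="add")
found_remove = segment_all(targets, fake_segment, mode="remove")
```

Threads suit this step because each call spends its time waiting on a remote API rather than computing locally.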
Step 4 — Mode-specific mask combination (local, no APIs)
This is where the mode formulas are applied, with morphological post-processing:
- ADD: Dilate SAM masks (3 iterations), union with RMBG, then `binary_fill_holes` to close gaps at the seam between RMBG and SAM regions.
- REMOVE: Erode RMBG mask (8+ iterations) to find the "core." Per-target: dilate SAM mask, then check core protection — if a target is < 5% of foreground and > 90% inside the core, it's skipped (it's part of the main subject, not a removable item). Subtract surviving targets from RMBG.
- NARROW: Per-target: dilate SAM mask, compute RMBG coverage. If > 15% overlap (target is an RMBG foreground object): dilate the intersection, fill holes, clamp back to RMBG boundary. If < 15% (target is in RMBG's background): use dilated SAM mask directly.
- ADD_REMOVE: Apply ADD logic for add targets, REMOVE logic (with core protection) for remove targets, combine, then `binary_fill_holes`.
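Two of the rules above, the ADD combination and the REMOVE core-protection check, can be sketched with `scipy.ndimage`. The function names and default arguments are hypothetical; the iteration counts and the 5% / 90% thresholds come from the description above, and the sketch assumes the check runs on the mask passed in (whether the pipeline checks the raw or the dilated SAM mask is not stated here).

```python
import numpy as np
from scipy import ndimage

def combine_add(rmbg, sam_masks, dilate_iters=3):
    """ADD: dilate each SAM mask, union with RMBG, then fill enclosed holes."""
    combined = rmbg.copy()
    for mask in sam_masks:
        combined |= ndimage.binary_dilation(mask, iterations=dilate_iters)
    return ndimage.binary_fill_holes(combined)

def is_core_protected(rmbg, target, erode_iters=8,
                      max_area_frac=0.05, min_inside_frac=0.90):
    """REMOVE guard: True when a target is small and sits deep inside RMBG."""
    core = ndimage.binary_erosion(rmbg, iterations=erode_iters)
    target_area = int(target.sum())
    if target_area == 0:
        return False
    area_frac = target_area / max(int(rmbg.sum()), 1)
    inside_frac = int((target & core).sum()) / target_area
    return bool(area_frac < max_area_frac and inside_frac > min_inside_frac)

# ADD demo: two regions separated by a 2-pixel seam.
left = np.zeros((60, 60), dtype=bool)
left[10:50, 10:30] = True
right = np.zeros_like(left)
right[10:50, 32:50] = True
merged = combine_add(left, [right])

# REMOVE demo: a 40x40 subject with two small candidate targets.
subject = np.zeros((60, 60), dtype=bool)
subject[10:50, 10:50] = True
button = np.zeros_like(subject)
button[28:32, 28:32] = True   # small and deep inside the subject
handle = np.zeros_like(subject)
handle[10:14, 10:14] = True   # small but at the subject's edge
```

In the demo, `button` would be skipped as part of the main subject, while `handle` would still be subtracted.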
Step 5 — Alpha channel construction (local)
Where the final mask overlaps RMBG, we use RMBG's original sub-pixel alpha (best edge quality). Where the mask extends beyond RMBG (SAM-only zones), we build feathered alpha using distance transforms + Gaussian blur. For ADD/ADD_REMOVE modes, we bridge pixel-thin seam gaps between RMBG and SAM zones.
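One way to build such an alpha channel is sketched below. The feather width, the blur sigma, and the name `build_alpha` are assumptions chosen for illustration, and seam bridging is left out; only the general recipe (RMBG alpha where available, distance transform plus Gaussian blur elsewhere) follows the description above.

```python
import numpy as np
from scipy import ndimage

def build_alpha(final_mask, rmbg_mask, rmbg_alpha, feather_px=3.0, sigma=1.0):
    """Build a uint8 alpha channel for the final mask.

    final_mask, rmbg_mask: boolean arrays; rmbg_alpha: uint8 alpha from RMBG.
    """
    # Distance in pixels from each mask pixel to the nearest background pixel.
    dist = ndimage.distance_transform_edt(final_mask)
    # Ramp from 0 at the mask edge to 1 at feather_px pixels inside, then smooth.
    feathered = np.clip(dist / feather_px, 0.0, 1.0)
    feathered = ndimage.gaussian_filter(feathered, sigma=sigma)
    feathered = np.where(final_mask, feathered, 0.0)
    alpha = np.round(feathered * 255).astype(np.uint8)
    # Inside RMBG's foreground, prefer RMBG's own sub-pixel alpha.
    use_rmbg = final_mask & rmbg_mask
    alpha[use_rmbg] = rmbg_alpha[use_rmbg]
    return alpha

final_mask = np.zeros((20, 20), dtype=bool)
final_mask[5:15, 5:15] = True
rmbg_mask = np.zeros_like(final_mask)
rmbg_mask[5:15, 5:10] = True          # left half came from RMBG
rmbg_alpha = np.full((20, 20), 200, dtype=np.uint8)

alpha = build_alpha(final_mask, rmbg_mask, rmbg_alpha)
```

In the demo, the left half of the square keeps RMBG's alpha untouched, while the SAM-only right half fades out toward its outer edge.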
Step 6 — Agentic judge (1-3 vision LLM calls, Claude Opus)
The judge receives the original image, the final result (composited on a white background as lossless PNG for visual clarity), and the user prompt. It uses chain-of-thought reasoning: first it inventories every object visible in the result image, then compares that inventory against the user's prompt. This two-step process reduces hallucination — the VLM must ground its judgment in what it actually sees rather than what it expects from the original photo.
The judge looks for three specific problems:
- Missing items — a discrete object the user explicitly asked for is completely absent.
- Unwanted items — an object the user asked to remove (or didn't ask to keep in NARROW mode) is still fully present.
- Mask artifacts — large holes inside foreground objects.
If issues are found, the judge suggests fixes (re-sam with a better prompt, fill-holes). The pipeline applies the fix and re-evaluates, up to 2 retries. Safety guards prevent over-correction: a mask floor (20% of original size) stops excessive removal, shrink-mask/expand-mask fixes are disabled, and duplicate detection auto-passes if the judge returns the same issue twice (hallucination loop). If all retries fail, the pipeline reverts to the pre-judge result rather than returning a broken correction.
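The control flow of the retry loop can be sketched independently of any model. In this sketch `run_judge_loop` and its callback parameters are hypothetical, and one detail is an assumption: when a fix would push the mask below the 20% floor, the sketch stops and reverts, since the text does not say exactly what the pipeline does at that point.

```python
def run_judge_loop(result, judge, apply_fix, mask_area,
                   max_retries=2, floor_frac=0.20):
    """Return (final_result, verdict), verdict being "pass" or "reverted".

    judge(result)             -> list of issue strings (empty means pass)
    apply_fix(result, issues) -> corrected result
    mask_area(result)         -> number of foreground pixels
    """
    original = result
    floor = floor_frac * mask_area(original)
    seen = set()
    current = result
    for attempt in range(max_retries + 1):   # 1 initial verdict + up to 2 retries
        issues = tuple(judge(current))
        if not issues:
            return current, "pass"
        if issues in seen:
            # Same complaint twice: treat it as a hallucination loop and accept.
            return current, "pass"
        seen.add(issues)
        if attempt == max_retries:
            break                            # out of retries
        candidate = apply_fix(current, issues)
        if mask_area(candidate) < floor:
            break                            # assumption: floor violation reverts
        current = candidate
    return original, "reverted"

# Demo with integers standing in for masks (the value is the mask area).
verdicts = iter([["missing: surfboard"], []])
fixed, verdict = run_judge_loop(
    100,
    judge=lambda r: next(verdicts),
    apply_fix=lambda r, issues: r + 10,
    mask_area=lambda r: r,
)
```

The loop calls the judge at most three times, which matches the 1-3 judge calls counted in the cost estimate.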
Cost per image: 1 RMBG call + 1 vision LLM call + 0-1 validation LLM call + N SAM calls + 1-3 judge LLM calls (Opus). Total: 4-9 API calls when N is between 1 and 3; each additional SAM target adds one call.
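The tally can be reproduced with a small helper. `api_calls` is a hypothetical function written for this example; it counts only the calls listed above and does not account for any extra SAM calls that a judge-triggered fix might make.

```python
def api_calls(n_targets, narrow=False, judge_calls=1):
    """Count API calls for one image from the per-step figures."""
    if not 1 <= judge_calls <= 3:
        raise ValueError("judge_calls must be between 1 and 3")
    rmbg = 1
    vlm = 1
    validation = 1 if narrow else 0   # extra text-only call in NARROW mode
    return rmbg + vlm + validation + n_targets + judge_calls

cheapest = api_calls(1)                               # one target, judge passes at once
costliest = api_calls(3, narrow=True, judge_calls=3)  # three targets, two retries
```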
# Narrow: keep only the chef, stove, and pots (drop plates, countertop, etc.)
uv run guided-remove-background \
--image cooking_scene.jpg \
--prompts "the chef with the stove and pots" \
--output result.png
# Remove: keep the person, drop the dog
uv run guided-remove-background \
--image person_dog.jpg \
--prompts "the person walking without the dog" \
--output person_only.png
# Add: keep everything RMBG finds + add the staircase
uv run guided-remove-background \
--image living_room.jpg \
--prompts "all the furniture including the staircase" \
--output full_room.png
# Add+Remove: add the surfboard, remove the far chairs
uv run guided-remove-background \
--image cafe_interior.jpg \
--prompts "add the surfboard, remove the far chairs" \
--output adjusted.png
# Plain RMBG baseline (no guidance)
uv run guided-remove-background \
--image living_room.jpg \
--prompts "anything" \
--output baseline.png \
--mode rmbg-only

git clone https://github.com/Bria-AI/guided-remove-background.git
cd guided-remove-background
cp .env.example .env # add your API keys
make benchmark # install → fetch images → run 58 cases → open dashboard

The dashboard opens at http://localhost:8899/live.html — browse every case with full pipeline step visualization, VLM reasoning, SAM scores, judge verdicts, and interactive feedback.
The playground at http://localhost:8899/playground.html lets you iterate on prompts interactively — pick an image, type a prompt, and see the full pipeline result with step-by-step visualization in a chat interface.
make setup # install dependencies
make images # download 15 test images from Pexels
make run # run all 58 benchmark cases
make serve # start dashboard at localhost:8899
make grade # auto-grade results with VLM (optional)
make export # export results as a single shareable HTML file
make help # show all available commands

cd guided-remove-background
uv sync

Create a .env file (or copy .env.example) with your API keys:
BRIA_API_KEY=... # Bria.ai — background removal
FAL_KEY=... # Fal.ai — SAM 3.1 segmentation
ANTHROPIC_API_KEY=... # VLM decomposition + judge + grading
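Before a long benchmark run it can help to confirm that the keys are actually set. The check below is a sketch that is not part of the package; `missing_keys` is a hypothetical name, and it assumes the three keys listed above are all required and already loaded into the process environment.

```python
import os

REQUIRED_KEYS = ("BRIA_API_KEY", "FAL_KEY", "ANTHROPIC_API_KEY")

def missing_keys(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Example with an explicit mapping instead of the real environment.
example_missing = missing_keys({"BRIA_API_KEY": "set", "FAL_KEY": ""})
```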
58 test cases across 15 images, covering two kinds of scene:
- Ambiguous foreground — scenes with no clear single subject (interiors, table settings, workspaces). The user's guidance defines the foreground.
- Adjustable foreground — scenes with a clear default subject (a person, a group), but the user wants to adjust scope (add the yoga mat, remove the dog, keep only the laptop).
Each case has a scenario type (include, exclude, narrow) and a user prompt.
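A case list of this shape is easy to inspect with the standard `csv` module. The sketch below assumes the columns are named `image`, `scenario`, and `prompt`; the three sample rows are made up for illustration from prompts used elsewhere in this README and are not the contents of the real `cases.csv`.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; the real file has 58 cases.
SAMPLE = """\
image,scenario,prompt
living_room.jpg,include,all the furniture including the staircase
person_dog.jpg,exclude,the person walking without the dog
cooking_scene.jpg,narrow,only the chef and the pot
"""

def load_cases(handle):
    """Read benchmark cases from an open CSV file into a list of dicts."""
    return list(csv.DictReader(handle))

cases = load_cases(io.StringIO(SAMPLE))
by_scenario = Counter(case["scenario"] for case in cases)
```

With a real file, `load_cases(open(path, newline=""))` would work the same way, where `path` points at the CSV.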
The dashboard shows the full pipeline for each case: original image, RMBG baseline, VLM mode + targets, SAM masks with confidence scores, combined mask, alpha refinement, judge verdict (pass/fail/reverted), and the final result. Click any step card to expand it for a detailed view. Each case has like/dislike buttons with comment support for collecting feedback.
guided-remove-background/
src/guided_remove_background/ # Core package
__init__.py # Version, MODES
cli.py # CLI entry point
remove_bg.py # Four-mode orchestrator + judge integration
clients/
http_utils.py # Shared HTTP retry, env helpers
bria_rmbg.py # Bria RMBG-2.0 API client
fal_sam.py # SAM 3.1 via Fal.ai client
vlm_decompose.py # VLM prompt decomposition (mode + targets)
vlm_judge.py # Agentic judge — visual verification + self-correction
processing/
debug.py # Step recorder (saves intermediate visuals)
output.py # Save result PNG + preview JPG
mask_cleanup.py # Morphological mask cleanup
edge_band.py # Edge-band refinement
sanity.py # Sanity guards (RMBG/SAM agreement, bloat)
benchmark/ # Benchmark suite
data/
cases.csv # 58 test cases (image, scenario, prompt)
catalog.py # Image URL catalog (15 curated images)
fetch_images.py # Download benchmark images
runner.py # Batch runner with step recording
feedback_server.py # HTTP server with feedback API
live.html # Live dashboard with pipeline step visualization
playground.html # Interactive prompt playground
export_html.py # Export results as shareable single-file HTML
grader/
prompt.py # VLM grading prompt
providers.py # Anthropic + OpenAI grading
run_grader.py # Grading orchestration
| Key | Service | Purpose |
|---|---|---|
| `BRIA_API_KEY` | Bria.ai | Background removal (RMBG-2.0) |
| `FAL_KEY` | Fal.ai | SAM 3.1 segmentation |
| `ANTHROPIC_API_KEY` | Anthropic | VLM decomposition + judge + grading |
| `OPENAI_API_KEY` | OpenAI | VLM decomposition + grading (alternative) |