SafeVL is a visual-prompting driving-safety reasoning framework. It uses Grounding DINO + SAM2 to detect and track objects across dashcam frames, then a fine-tuned Qwen2.5-VL-7B to produce 4-step chain-of-thought reasoning and a calibrated safe / unsafe verdict.
conda create -n safevl python=3.10 -y
conda activate safevl
pip install -r requirements.txt
PyTorch must be ≥ 2.6 with CUDA support (tested with torch==2.6.0+cu124 on a CUDA 12.9 driver). If your torch was built against a different CUDA version, reinstall it to match:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/IDEA-Research/Grounded-SAM-2.git
cd Grounded-SAM-2
pip install -e .
cd ..
Download sam2_hiera_large.pt (or sam2.1_hiera_large.pt) from the SAM2 release page and place it under ./checkpoints/.
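The requirements above (PyTorch ≥ 2.6 with CUDA, and a SAM2 checkpoint under ./checkpoints/) can be checked before the first run. The following is a minimal sketch, not part of the SafeVL codebase: the function names `parse_version`, `torch_is_new_enough`, `find_sam2_checkpoint`, and `check_environment` are hypothetical helpers introduced here for illustration.

```python
from __future__ import annotations

import re
from pathlib import Path

MIN_TORCH = (2, 6)  # minimum PyTorch version stated in this README
SAM2_CHECKPOINTS = ("sam2_hiera_large.pt", "sam2.1_hiera_large.pt")


def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_new_enough(version: str, minimum: tuple[int, int] = MIN_TORCH) -> bool:
    """Compare numerically, so that '2.10' correctly ranks above '2.6'."""
    return parse_version(version) >= minimum


def find_sam2_checkpoint(checkpoint_dir: str = "./checkpoints") -> Path | None:
    """Return the first SAM2 checkpoint found in checkpoint_dir, or None."""
    for name in SAM2_CHECKPOINTS:
        candidate = Path(checkpoint_dir) / name
        if candidate.is_file():
            return candidate
    return None


def check_environment(checkpoint_dir: str = "./checkpoints") -> list[str]:
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = []
    try:
        import torch  # imported lazily so the helpers above work without it
    except ImportError:
        problems.append("torch is not installed")
    else:
        if not torch_is_new_enough(torch.__version__):
            problems.append(f"torch {torch.__version__} is older than 2.6")
        if not torch.cuda.is_available():
            problems.append("torch cannot see a CUDA device")
    if find_sam2_checkpoint(checkpoint_dir) is None:
        problems.append(f"no SAM2 checkpoint found in {checkpoint_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Versions are compared as integer tuples rather than strings because a plain string comparison would rank "2.10" below "2.6".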
The fine-tuned VLM is on HuggingFace and will download automatically the first time the pipeline runs:
👉 gray311/SafeVL (~16 GB)
To pre-download:
huggingface-cli download gray311/SafeVL --local-dir ./checkpoints/SafeVL
Open quickstart.ipynb, which walks you through:
- Sanity-checking your environment
- Initializing SafeVLPipeline (loads SAM2 + Grounding-DINO + Qwen2.5-VL once)
- Running on a clip (video file, frame list, or frame directory)
- Inspecting the 4-step reasoning and the verdict probabilities
- Visualizing the SoM-annotated frames
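The three clip forms listed above (video file, frame list, frame directory) can be told apart before anything is handed to the pipeline. The sketch below is an illustration only and does not reflect how SafeVLPipeline handles its inputs: `describe_clip` is a hypothetical helper, and the accepted file suffixes are assumptions.

```python
from __future__ import annotations

from pathlib import Path
from typing import Sequence, Union

VIDEO_SUFFIXES = {".mp4", ".avi", ".mov", ".mkv"}  # assumed set of video types
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}         # assumed set of frame types

ClipInput = Union[str, Path, Sequence[Union[str, Path]]]


def describe_clip(clip: ClipInput) -> tuple[str, list[Path]]:
    """Return (kind, paths), where kind is 'video', 'frame_dir', or 'frame_list'."""
    # A single str/Path is either a frame directory or a video file.
    if isinstance(clip, (str, Path)):
        path = Path(clip)
        if path.is_dir():
            # Lexicographic sort: frame names should be zero-padded
            # (0001.jpg, 0002.jpg, ...) to come out in temporal order.
            frames = sorted(
                p for p in path.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
            if not frames:
                raise ValueError(f"no image frames found in {path}")
            return "frame_dir", frames
        if path.suffix.lower() in VIDEO_SUFFIXES:
            return "video", [path]
        raise ValueError(f"not a directory or a known video type: {path}")

    # Anything else is treated as an explicit, already ordered list of frames.
    frames = [Path(p) for p in clip]
    if not frames:
        raise ValueError("empty frame list")
    return "frame_list", frames
```

For example, `describe_clip("clip.mp4")` yields the kind `"video"`, while a list of frame paths is returned unchanged in the order given, with the kind `"frame_list"`.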
Launch the notebook with:
jupyter notebook quickstart.ipynb
- SAM2: Segment Anything Model 2
- Grounded-SAM-2: Grounding DINO + SAM2 packaging
- Qwen2.5-VL: Vision-Language Model backbone