Korean Version | Model Evaluation | Try Demo
- Install & Requirements
- SetUp
- DeepFake Video BenchMark Datasets - Overview of the Celeb-DF-v2, FF++, and KoDF datasets used for training.
- Data Preparation - Efficient face detection and landmark extraction pipeline using YOLOv8.
- Model Architecture - Detailed look at our hybrid CNN-ViT (MS-EffViT & MS-EffGCViT) designs.
- Model Zoo - Comparison of model variants, parameter counts, and computational complexity (FLOPs).
- Training - Step-by-step training scripts with Google Colab and W&B experiment tracking.
- Model Evaluation - Benchmarking results.
- Model Usage - How to integrate DeepGuard models into your own Python code or via timm.
- Predict Image & Video - Simple inference examples for detecting deepfakes in images and videos.
- Authors
- Reference
- License
To install requirements:

    pip install -r requirements.txt

Clone the repository and move into it:

    git clone https://github.com/HanMoonSub/DeepGuard.git
    cd DeepGuard

To evaluate the generalization and robustness of our deepfake detection model, we utilize three large-scale, widely recognized benchmark datasets. Each dataset presents unique challenges and covers different types of forgery methods.
| Dataset | Real Videos | Fake Videos | Year | Participants | Description (Paper Title) | Details |
|---|---|---|---|---|---|---|
| Celeb-DF-v2 | 890 | 5,639 | 2019 | 59 | A Large-scale Challenging Dataset for DeepFake Forensics | Readme |
| FaceForensics++ | 1,000 | 6,000 | 2019 | 1,000 | Learning to Detect Manipulated Facial Images | Readme |
| KoDF | 62,166 | 175,776 | 2020 | 400 | Large-Scale Korean Deepfake Detection Dataset | Readme |
Our preprocessing pipeline is designed to efficiently extract facial features from videos and prepare them for high-accuracy deepfake detection.
To maximize preprocessing efficiency, face detection is performed only on original (real) videos. Since manipulated videos in the benchmark datasets share the same spatial coordinates as their sources, the resulting bounding boxes are reused for the corresponding deepfake versions.
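
The saving from detecting faces only in original videos can be estimated from the video counts in the dataset table above. The sketch below is a rough back-of-the-envelope calculation, not the project's own measurement: it assumes every fake video reuses boxes from a real source video and that all videos cost about the same to process. The names `DATASETS` and `detection_savings` are illustrative.

```python
# (real, fake) video counts copied from the benchmark dataset table above.
DATASETS = {
    "Celeb-DF-v2": (890, 5639),
    "FaceForensics++": (1000, 6000),
    "KoDF": (62166, 175776),
}


def detection_savings(real: int, fake: int) -> float:
    """Fraction of videos that skip face detection when only real videos are scanned.

    Assumption: per-video detection cost is roughly equal for real and fake videos.
    """
    return fake / (real + fake)


savings = {name: detection_savings(r, f) for name, (r, f) in DATASETS.items()}

for name, share in savings.items():
    print(f"{name}: {share:.1%} of videos skip face detection")
```

Under these assumptions the share of skipped videos comes out at roughly 86% for Celeb-DF-v2 and FaceForensics++ and about 74% for KoDF.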
Efficiency Optimizations
- Lightweight Model: Uses yolov8n-face for high-speed inference without sacrificing accuracy.
- Targeted Processing: By detecting faces only in original videos, the total detection workload is reduced by approximately 80%.
- Dynamic Rescaling: To maintain consistent inference speed across different resolutions, frames are automatically resized based on their dimensions:
| Frame Size (Longest Side) | Scale Factor | Action |
|---|---|---|
| < 300px | 2.0 | Upscale |
| 300px - 700px | 1.0 | Keep original size |
| 700px - 1500px | 0.5 | Downscale |
| > 1500px | 0.33 | Downscale |
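
The rescaling table can be read as a simple lookup on the longest frame side. The following is a minimal sketch of that lookup, not the repository's actual code: the function names are hypothetical, and the handling of the exact boundary values (300, 700, 1500) is an assumption, since the table does not say which bucket they belong to.

```python
def scale_factor(width: int, height: int) -> float:
    """Pick the resize factor from the longest frame side (thresholds from the table).

    Assumption: boundary values fall into the larger-size bucket.
    """
    longest = max(width, height)
    if longest < 300:
        return 2.0   # small frames are upscaled
    if longest < 700:
        return 1.0   # medium frames are left as they are
    if longest < 1500:
        return 0.5   # large frames are halved
    return 0.33      # very large frames are reduced to about a third


def rescaled_size(width: int, height: int) -> tuple[int, int]:
    """Frame size that would be passed to the face detector."""
    s = scale_factor(width, height)
    return max(1, round(width * s)), max(1, round(height * s))


def box_to_original(box, scale):
    """Map an (x1, y1, x2, y2) box found on the resized frame back to source pixels."""
    return tuple(coord / scale for coord in box)
```

For example, `rescaled_size(1280, 720)` gives `(640, 360)`, and a box detected on that resized frame is mapped back with `box_to_original(box, 0.5)`.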
This module extracts face crops from both original and deepfake videos using the bounding boxes generated in the previous step. It also performs landmark detection to facilitate advanced augmentations such as landmark-based Cutout.
Key Features
- Dynamic Margin with Jitter: Adds a configurable margin around the face. The margin_jitter parameter introduces random variance to the crop size, making the model more robust to different face scales.
- Landmark Localization: Detects 5 primary facial landmarks (eyes, nose, mouth corners) and saves them as .npy files.
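
To make the margin-with-jitter idea concrete, here is a minimal sketch of how a detected box might be expanded before cropping. It is an illustration under assumptions, not the project's implementation: the function name `expand_box` is hypothetical, and the jitter is assumed to be a uniform offset added to the margin ratio (the README only states that `margin_jitter` adds random variance to the crop size).

```python
import random


def expand_box(box, frame_w, frame_h, margin_ratio=0.2, margin_jitter=0.0, rng=None):
    """Grow an (x1, y1, x2, y2) face box by a margin, clipped to the frame.

    Assumption: the effective ratio is margin_ratio plus a uniform draw from
    [-margin_jitter, +margin_jitter], floored at zero.
    """
    rng = rng or random
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ratio = max(0.0, margin_ratio + rng.uniform(-margin_jitter, margin_jitter))
    dx, dy = w * ratio, h * ratio
    return (
        max(0, round(x1 - dx)),
        max(0, round(y1 - dy)),
        min(frame_w, round(x2 + dx)),
        min(frame_h, round(y2 + dy)),
    )


# Deterministic crop (no jitter) and a jittered crop with a seeded generator.
fixed = expand_box((100, 100, 200, 200), 640, 480, margin_ratio=0.2)
jittered = expand_box((100, 100, 200, 200), 640, 480, 0.2, 0.1, rng=random.Random(0))
```

With jitter disabled the crop is reproducible, which suits validation and inference; enabling it during training varies how much context surrounds each face.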

    DATA_ROOT/
    ├── crops/
    │   └── {video_id}/
    │       ├── 12.png
    │       └── ...
    ├── landmarks/
    │   └── {video_id}/
    │       ├── 12.npy
    │       └── ...
    └── train_frame_metadata.csv

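
Given the output layout above, a crop and its landmark file can be paired by video ID and file stem. The helper below is a small sketch under assumptions: the function names are hypothetical, and file stems such as `12` are assumed to be integer frame indices.

```python
from pathlib import Path


def frame_paths(data_root, video_id, frame_idx):
    """Return the (crop, landmarks) paths for one frame of one video."""
    root = Path(data_root)
    crop = root / "crops" / video_id / f"{frame_idx}.png"
    landmarks = root / "landmarks" / video_id / f"{frame_idx}.npy"
    return crop, landmarks


def paired_frames(data_root, video_id):
    """List frame indices that have both a crop and a landmark file on disk."""
    crop_dir = Path(data_root) / "crops" / video_id
    if not crop_dir.is_dir():
        return []
    indices = []
    for crop in crop_dir.glob("*.png"):
        # Assumption: every crop file is named "<integer frame index>.png".
        _, landmarks = frame_paths(data_root, video_id, crop.stem)
        if landmarks.is_file():
            indices.append(int(crop.stem))
    return sorted(indices)
```

Filtering on both files keeps a data loader from failing on frames where landmark detection produced no output.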
Click the links below to view the specific preprocessing details for each dataset:
Multi Scale Efficient Global Context Vision Transformer (MS-EffGCViT) is a multi-scale hybrid architecture that combines CNN-driven spatial inductive bias with hierarchical attention mechanisms to identify both subtle (local) and macro (global) artifacts for robust deepfake forensics.
- Model Architecture: MS-EffViT - Multi Scale Efficient Vision Transformer
- Advanced Architecture: MS-EffGCViT - Multi Scale Efficient Global Context Vision Transformer
We use two distinct types of self-attention to capture both long-range and short-range information across feature maps.
- Local Window Attention: This module efficiently captures local textures and precise spatial details while maintaining linear computational complexity relative to the image size.
- Global Window Attention: Unlike Swin Transformer, this module uses global queries that interact with local window keys and values. This allows each local region to incorporate global context, effectively capturing long-range dependencies and providing a comprehensive understanding of the entire spatial structure.
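
The linear-complexity claim for window attention can be checked with a short count of query-key pairs. This is a worked illustration only: the 14 x 14 and 28 x 28 feature-map sizes and the window size of 7 are assumed values chosen for the example, not taken from the model configuration.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


small_full = full_attention_pairs(14, 14)          # 38416
large_full = full_attention_pairs(28, 28)          # 614656
small_win = window_attention_pairs(14, 14, 7)      # 9604
large_win = window_attention_pairs(28, 28, 7)      # 38416
```

Quadrupling the number of tokens (14 x 14 to 28 x 28) multiplies the full-attention count by 16 but the windowed count by only 4, which is the linear scaling described above. Global queries restore long-range context without giving up this property, because each window still attends over only its own keys and values.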
| Model | Resolution | Total Params (M) | Backbone (M) | L-ViT (M) | H-ViT (M) | FLOPs (G) | Model Config |
|---|---|---|---|---|---|---|---|
| ms_eff_gcvit_b0 | 224 x 224 | 8.7 | 3.6 (41.4%) | 1.7 (19.5%) | 3.3 (37.9%) | 0.87 | spec |
| ms_eff_gcvit_b5 | 384 x 384 | 50.3 | 27.3 (54.3%) | 6.6 (13.1%) | 16.1 (32.0%) | 13.64 | spec |
We provide training scripts for both ms_eff_vit and ms_eff_gcvit. We recommend using Google Colab for free GPU access and Weights & Biases (W&B) for experiment tracking.
- ms_eff_vit_b0: Celeb-DF-v2 | FaceForensics++ | KoDF
- ms_eff_vit_b5: Celeb-DF-v2 | FaceForensics++ | KoDF
- ms_eff_gcvit_b0: Celeb-DF-v2 | FaceForensics++ | KoDF
- ms_eff_gcvit_b5: Celeb-DF-v2 | FaceForensics++ | KoDF

    # Use train_eff_gcvit instead of train_eff_vit for the ms_eff_gcvit models.
    # --model-ver: ms_eff_vit_b0, ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
    # --dataset: ff++, celeb_df_v2, kodf
    # --seed: fixed for reproducibility
    # --wandb-api-key: replace with your own API key
    !python -m train_eff_vit \
        --root-dir DATA_ROOT \
        --model-ver "ms_eff_vit_b5" \
        --dataset "ff++" \
        --seed 2025 \
        --wandb-api-key "your-api-key"

    # --model-name: ms_eff_vit_b0, ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
    # --model-dataset: ff++, celeb_df_v2, kodf
    !python -m inference.predict_video \
        --root-dir DATA_ROOT \
        --margin-ratio 0.2 \
        --conf-thres 0.5 \
        --min-face-ratio 0.01 \
        --model-name "ms_eff_gcvit_b0" \
        --model-dataset "kodf" \
        --num-frames 20 \
        --tta-hflip 0.0 \
        --agg-mode "conf"

Celeb-DF-v2 Pretrained Models
| Model Variant | Test@Acc | Test@Auc | Test@log_loss | Download | Train Config |
|---|---|---|---|---|---|
| ms_eff_gcvit_b0 | 0.9842 | 0.9965 | 0.0283 | model | recipe |
| ms_eff_gcvit_b5 | 0.9981 | 0.9984 | 0.0089 | model | recipe |
FaceForensics++ Pretrained Models
| Model Variant | Test@Acc | Test@Auc | Test@log_loss | Download | Train Config |
|---|---|---|---|---|---|
| ms_eff_gcvit_b0 | 0.9808 | 0.9969 | 0.0637 | model | recipe |
| ms_eff_gcvit_b5 | 0.9850 | 0.9974 | 0.0492 | model | recipe |
KoDF Pretrained Models
| Model Variant | Test@Acc | Test@Auc | Test@log_loss | Download | Train Config |
|---|---|---|---|---|---|
| ms_eff_gcvit_b0 | 0.9655 | 0.9792 | 0.1237 | model | recipe |
| ms_eff_gcvit_b5 | 0.9850 | 0.9974 | 0.0492 | model | recipe |
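
For readers less familiar with the three metrics reported in the tables above, the following pure-Python sketch shows how accuracy, ROC AUC, and log loss are commonly computed from predicted fake probabilities. It is a generic illustration, not the project's evaluation code: the label convention (1 = fake), the 0.5 decision threshold, and the function names are assumptions.

```python
import math


def accuracy(labels, probs, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def roc_auc(labels, probs):
    """Probability that a random fake scores higher than a random real (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]             # assumed convention: 1 = fake, 0 = real
probs = [0.1, 0.4, 0.35, 0.8]     # toy predicted fake probabilities
```

On this toy input, accuracy and AUC are both 0.75 and the log loss is about 0.472. Accuracy depends on the chosen threshold, while AUC and log loss use the raw probabilities, which is why the three columns can rank models differently.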
Quick Start
You can load the models directly via the DeepGuard package or through the timm interface.
Available Datasets: celeb_df_v2, ff++, kodf
Installation

    pip install -U git+https://github.com/HanMoonSub/DeepGuard.git

Option A: Direct Import (via DeepGuard)

    from deepguard import ms_eff_gcvit_b0, ms_eff_gcvit_b5

    model = ms_eff_gcvit_b0(pretrained=True, dataset="celeb_df_v2")
    model = ms_eff_gcvit_b5(pretrained=True, dataset="ff++")

Option B: Using timm Interface (via timm)

    import timm
    import deepguard  # imported so that the DeepGuard model names are available to timm

    model = timm.create_model("ms_eff_gcvit_b0", pretrained=True, dataset="ff++")
    model = timm.create_model("ms_eff_gcvit_b5", pretrained=True, dataset="kodf")

Predict Image

    from inference import ImagePredictor

    # Initialize the predictor
    predictor = ImagePredictor(
        margin_ratio=0.2,            # Margin ratio around the detected face crop
        conf_thres=0.5,              # Confidence threshold for face detection
        min_face_ratio=0.01,         # Minimum face-to-frame size ratio to process
        model_name="ms_eff_vit_b0",  # ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
        dataset="celeb_df_v2",       # ff++, kodf
    )

    # Run inference
    result = predictor.predict_img(
        img_path="path/to/image.jpg",
        tta_hflip=0.0,               # Horizontal flip for test-time augmentation
    )
    print(f"Deepfake Probability: {result:.4f}")

Predict Video

    from inference import VideoPredictor

    # Initialize the predictor
    predictor = VideoPredictor(
        margin_ratio=0.2,            # Margin ratio around the detected face crop
        conf_thres=0.5,              # Confidence threshold for face detection
        min_face_ratio=0.01,         # Minimum face-to-frame size ratio to process
        model_name="ms_eff_vit_b0",  # ms_eff_vit_b5, ms_eff_gcvit_b0, ms_eff_gcvit_b5
        dataset="celeb_df_v2",       # ff++, kodf
    )

    # Run inference
    result = predictor.predict_video(
        video_path="path/to/video.mp4",
        num_frames=20,               # Number of frames to sample per video
        agg_mode="conf",             # Aggregation method: 'conf', 'mean', 'vote'
        tta_hflip=0.0,               # Horizontal flip for test-time augmentation
    )
    print(f"Deepfake Probability: {result:.4f}")

This project was developed as a Senior Graduation Project by the Department of Software at Chungbuk National University (CBNU), Republic of Korea.
- 한문섭: Data & Backend Engineering (Data Preprocessing Pipeline, DB Schema Design) - hanmoon3054@gmail.com
- 이예솔: UI/UX & Frontend Engineering (UI/UX Design, User Dashboard, Model Visualization) - yesol4138@chungbuk.ac.kr
- 서윤제: AI Engineering (AI Model Architecture, Inference API Design, Model Serving) - seoyunje2001@gmail.com
- facenet-pytorch - Pretrained Face Detection (MTCNN) and Recognition (InceptionResNet) Models by Tim Esler
- face-cutout - Face Cutout Library by Sowmen
- Celeb-DF++ - Celeb-DF++ Dataset by OUC-VAS Group
- DeeperForensics-1.0 - DeeperForensics-1.0 Dataset by Endless Sora
- Deepfake Detection - Detection of Video Deepfake using ResNext and LSTM by Abhijith Jadhav
- deepfake-detection-project-v4 - Multiple Deep Learning Models by Ameen Caslam
- Awesome-Deepfake-Detection - A curated list of tools, papers and code by Daisy Zhang
This project is licensed under the terms of the MIT license.
