AUDDT is a benchmark toolkit for audio deepfake detection. The landscape of audio deepfake detection is fragmented across numerous datasets, each with its own data format and evaluation protocol. AUDDT addresses this by providing a unified platform for benchmarking pretrained models against a wide variety of public datasets. We update it regularly to include more recent datasets. Please see below for current coverage.
The current version includes 33 datasets.
- Supported Datasets
- Update log
- Installation
- Benchmarking Your Detector
- Example Output
- Contributing
- Limitations
- Disclaimer
- Citation
- License
The full list of 30+ supported datasets is maintained in a public Google Sheet for easy viewing and filtering.
➡️ View Full Dataset List on Google Sheets
We are actively developing AUDDT. See below for the latest updates.
- 2026-05-20
- Added multi-GPU inference support
- Fixed metadata labels for a few datasets
- 2026-05-18
- Added deepfake audio event datasets: FakeSound, VCapAV, and EnvSDD.
- Added more subgroups for easy selective benchmarking
- 2025-09-19
- Birth of AUDDT
- Added 28 datasets to the benchmark
- Added an exemplar baseline model
- Clone the repository:
  git clone https://github.com/MUSAELab/AUDDT.git
  cd AUDDT
- Create a virtual environment and install dependencies:
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
All commands below assume you are in the project root (AUDDT/) with your environment activated.
Set the data root in download/config.sh (default: data/). Then run the corresponding download script for each dataset you need:
bash download/get_XXX.sh # e.g., bash download/get_fakesound.sh
By default, data is placed under:
data/
  DATASET_X/
    raw/ # compressed archives downloaded from source
    processed/ # extracted audio files
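Before moving on, it can help to confirm which datasets actually made it to disk. The following is a minimal sketch, not part of AUDDT: it assumes the layout shown above (one folder per dataset under the data root, each with raw/ and processed/), and the helper name dataset_status is hypothetical.

```python
from pathlib import Path


def dataset_status(data_root: str) -> dict:
    """Report, per dataset folder, whether raw/ and processed/ contain any files.

    Assumes the data/DATASET_X/{raw,processed}/ layout described in this README.
    """
    status = {}
    root = Path(data_root)
    if not root.is_dir():
        # Nothing downloaded yet (or the data root points elsewhere).
        return status
    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        entry = {}
        for sub in ("raw", "processed"):
            sub_dir = dataset_dir / sub
            # True only if at least one regular file exists somewhere below sub_dir.
            entry[sub] = sub_dir.is_dir() and any(
                p.is_file() for p in sub_dir.rglob("*")
            )
        status[dataset_dir.name] = entry
    return status


if __name__ == "__main__":
    for name, flags in dataset_status("data").items():
        print(name, flags)
```

A dataset whose processed entry is False has probably not been extracted yet, so it is not ready for manifest generation.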
Generate CSV manifest files for all configured datasets:
python preprocessing/prep_all_datasets.py --config preprocessing/dataset_list.yaml
Enable or disable individual datasets by uncommenting or commenting out their entries in preprocessing/dataset_list.yaml. Each enabled dataset must already be downloaded before this step.
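Once manifests exist, a quick look at the class balance can catch problems early, such as a dataset that contains only one class. This sketch is an illustration only: the manifest column layout is not documented here, so the column name label is an assumption, as is the helper name label_counts.

```python
import csv
from collections import Counter


def label_counts(manifest_path: str, label_column: str = "label") -> Counter:
    """Count manifest rows per label value.

    The default column name "label" is an assumption; pass the real
    column name if the generated manifests use a different one.
    """
    with open(manifest_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(f"column {label_column!r} not found in {manifest_path}")
        return Counter(row[label_column] for row in reader)
```

A manifest that reports a single label value cannot yield threshold-free metrics such as EER or AUC, because those need both classes.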
Place your model script in models/ and edit benchmark/evaluate_setup.yaml:
model:
  path: models/detector_wrapper.py
  class_name: AudioDeepfakeDetector
  checkpoint: models/Best_LA_model_for_DF.pth
  device: 'cuda:0'
  model_args:
    raw_model_path: models/baseline_model.py
    raw_model_class_name: Model
    raw_model_args:
      args: null
      model_device: 'cuda:0'
Comparing against the exemplar baseline?
baseline_model.py depends on fairseq, which requires Python 3.10 (incompatible with 3.12+). Set up a dedicated environment first:
conda create -n auddt python=3.10 -y
conda activate auddt
pip install -r requirements.txt
The XLSR-300M backbone weights also need to be present in the project root:
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
Verify everything loads before running full evaluation:
python benchmark/smoke_test.py
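The path and class_name fields in evaluate_setup.yaml suggest that the detector class is located by file path and name at run time. How AUDDT does this internally is not shown here; the sketch below illustrates one common way to do it with the standard library's importlib, and the helper name load_class is hypothetical. It can be handy for checking that your own model script imports cleanly before a full run.

```python
import importlib.util
from pathlib import Path


def load_class(script_path: str, class_name: str):
    """Import a Python file by path and return the named class defined in it.

    This is an illustrative loader, not AUDDT's own implementation.
    """
    path = Path(script_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {script_path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code, so import errors surface here.
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Example (run from the project root, with the values from the config above):
# detector_cls = load_class("models/detector_wrapper.py", "AudioDeepfakeDetector")
```

If the file imports but the class name is misspelled, getattr raises AttributeError, which separates a broken script from a wrong class_name entry.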
By default, evaluation runs on all datasets whose manifests exist. To evaluate on a specific subset, define a named group in benchmark/dataset_group.yaml:
audio-events:
  - name: FakeSound
    manifest_path: fakesound/processed/manifest_fakesound.csv
  - name: EnvSDD
    manifest_path: envssd/processed/manifest_envssd.csv
  - name: VCapAV-dev1
    manifest_path: vcapav/processed/dev1.csv
  - name: VCapAV-dev2
    manifest_path: vcapav/processed/dev2.csv
  - name: VCapAV-dev3
    manifest_path: vcapav/processed/dev3.csv
Then set group_name accordingly in benchmark/evaluate_setup.yaml:
data:
  group_name: audio-events
  groups_config_path: benchmark/dataset_group.yaml
Run the evaluation:
python -m benchmark.evaluate --config benchmark/evaluate_setup.yaml
Results are written to results/. If latex_output_path is set in the config, a .tex table is also generated automatically.
You may need to adjust batch_size in evaluate_setup.yaml based on available GPU memory.
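To see in advance which entries of a group can be evaluated, you can check which manifest files are present. This is a rough sketch under two assumptions that are not confirmed by the toolkit: that manifest_path is relative to the data root, and that a group is a list of name/manifest_path entries as in the YAML above. The helper name existing_manifests is hypothetical.

```python
from pathlib import Path


def existing_manifests(group: list, data_root: str = "data") -> list:
    """Return (name, manifest path) pairs for entries whose manifest file exists.

    Assumes each manifest_path is relative to data_root.
    """
    found = []
    for entry in group:
        manifest = Path(data_root) / entry["manifest_path"]
        if manifest.is_file():
            found.append((entry["name"], manifest))
    return found


# Two entries copied from the audio-events group shown earlier.
audio_events = [
    {"name": "FakeSound", "manifest_path": "fakesound/processed/manifest_fakesound.csv"},
    {"name": "EnvSDD", "manifest_path": "envssd/processed/manifest_envssd.csv"},
]

if __name__ == "__main__":
    for name, manifest in existing_manifests(audio_events):
        print(f"{name}: {manifest}")
```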
Running the exemplar AASIST-style baseline (XLSR-300M + RawNet2 + GAT, trained on ASVspoof LA) on the audio-events group produces the following. Note that near-random EER on audio-event datasets is expected — this model was trained on speech deepfakes and does not generalize to out-of-domain audio events.
The LaTeX table is saved to results/examplar_table.tex:
% Required packages: \usepackage{booktabs}
\begin{table*}[htbp]
\centering
\caption{Evaluation Results}
\label{tab:results}
\begin{tabular}{lrrrrrrrrrrr}
\toprule
& EER (\%) & AUC & Acc (\%) & TPR (\%) & TNR (\%) & Pre (\%) & F1 & TP & TN & FP & FN \\
\midrule
FakeSound & N/A & N/A & N/A & N/A & N/A & N/A & N/A & N/A & N/A & N/A & N/A \\
EnvSDD & 42.70 & 0.5902 & 32.70 & 25.63 & 82.02 & 90.87 & 0.3999 & 637 & 292 & 64 & 1848 \\
VCapAV-dev1 & 57.79 & 0.4495 & 37.36 & 30.74 & 57.23 & 68.32 & 0.4240 & 3480 & 2160 & 1614 & 7842 \\
VCapAV-dev2 & 51.46 & 0.5060 & 44.45 & 38.31 & 56.73 & 63.91 & 0.4791 & 2892 & 2141 & 1633 & 4656 \\
VCapAV-dev3 & 54.80 & 0.4691 & 37.67 & 33.86 & 56.73 & 79.65 & 0.4752 & 6390 & 2141 & 1633 & 12480 \\
\midrule
\textbf{Average} & 51.69 & 0.5037 & 38.05 & 32.14 & 63.18 & 75.69 & 0.4445 & 13399 & 6734 & 4944 & 26826 \\
\bottomrule
\end{tabular}
\end{table*}
FakeSound contains only spoof samples (no bonafide), so EER and AUC are reported as N/A; accuracy reflects spoof detection rate only.
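The rate columns in the table follow from the four confusion counts. The sketch below uses the standard definitions and is not taken from the AUDDT source; the function name confusion_metrics is hypothetical, and which class the toolkit treats as positive is not stated here. Ratios with a zero denominator are returned as None, which is what happens to class-dependent rates when a dataset has only one class.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive rate metrics from confusion counts; undefined ratios become None."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "tpr": ratio(tp, tp + fn),        # true positive rate (recall)
        "tnr": ratio(tn, tn + fp),        # true negative rate
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Counts taken from the EnvSDD row of the table above.
envsdd = confusion_metrics(tp=637, tn=292, fp=64, fn=1848)

if __name__ == "__main__":
    for key, value in envsdd.items():
        print(key, None if value is None else round(value, 4))
```

With the EnvSDD counts, the results agree with the table to the displayed precision (Acc 32.70, TPR 25.63, TNR 82.02, Pre 90.87, F1 0.3999). With no negative samples at all (tn = fp = 0), tnr comes back as None.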
We will keep expanding the benchmark coverage. To suggest a dataset, please open an issue and point us to its source link and paper.
- The FakeSound dataset included in this benchmark currently contains only spoofed audio samples. Obtaining the corresponding bona fide (real) data requires additional manual steps. We plan to resolve this limitation and provide the complete dataset in future updates.
@misc{zhu2025auddtaudiounifieddeepfake,
title={AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit},
author={Yi Zhu and Heitor R. Guimarães and Arthur Pimentel and Tiago Falk},
year={2025},
eprint={2509.21597},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2509.21597}
}
This project is licensed for academic and research use only.
Commercial use is strictly prohibited without prior written permission.
See the full LICENSE file for details.
For transparency, we do not include proprietary datasets or datasets from unknown sources. We also encourage users to watch for potential training/test overlap; for example, datasets such as ASVspoof2019 and ASVspoof5 are widely used as training sets. Results obtained with this toolkit should be used for research purposes only, not for commercial advertising.
