Official PyTorch implementation of our paper:
How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
Arda Senocak*, Sooyoung Park*, Tae-Hyun Oh, Joon Son Chung (* Equal Contribution)
CVPR 2026
This repository contains the implementation of SyntheticSSL. Our code is based on the Audio-Grounded Contrastive Learning (ACL) framework, adapted for synthetic data experiments.
- Python = 3.12
- PyTorch = 2.5.1
- transformers = 4.46.3
$ conda install python=3.12
$ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
$ pip install transformers==4.46.3
$ pip install tensorboard opencv-python scikit-learn
Important Note: All audio samples must be converted to 16kHz. For detailed instructions, refer to the readme in each dataset-specific directory.
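As an illustration of the 16kHz requirement, the following is a minimal sketch that builds one ffmpeg command per audio file. It is not part of this repository: the function names, the *.wav pattern, and the directory layout are assumptions made for the example, and it presumes ffmpeg is installed. The dataset-specific readmes remain the reference for the actual preprocessing.

```python
from pathlib import Path

TARGET_SR = 16_000  # sampling rate required by the note above


def ffmpeg_resample_cmd(src: Path, dst: Path, sample_rate: int = TARGET_SR) -> list[str]:
    """Build an ffmpeg command that rewrites `src` at `sample_rate` Hz (hypothetical helper)."""
    # -y overwrites an existing output file, -ar sets the output audio sampling rate.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate), str(dst)]


def plan_conversion(in_dir: Path, out_dir: Path, pattern: str = "*.wav") -> list[list[str]]:
    """Return one command per matching file under `in_dir`, mirroring its layout in `out_dir`."""
    commands = []
    for src in sorted(in_dir.rglob(pattern)):
        dst = out_dir / src.relative_to(in_dir)
        commands.append(ffmpeg_resample_cmd(src, dst))
    return commands
```

Each returned command can be passed to subprocess.run(cmd, check=True) once the output directory for that file exists.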
- Dataset
Download the pretrained model (audio backbone) into the pretrain folder:
- BEATs: https://github.com/microsoft/unilm/tree/master/beats
  - BEATs_iter3_plus_AS2M_finedtuned_on_AS2M_cpt2.pt
- Check the .sh files and set $ export CUDA_VISIBLE_DEVICES="**" according to your hardware setup.
- Make sure that --model_name corresponds to the configuration file located at ./config/model/{-model_name}.yaml.
- Model files (.pth) will be saved in the directory {--save_path}/Train_record/{-model_name}_{-exp_name}/.
- Review the configuration settings in ./config/train/{-train_config}.yaml to ensure they match your training requirements.
- Choose one of the following methods to initiate training:
$ sh SingleGPU_Experiment.sh # For single GPU setup
$ sh Distributed_Experiment.sh # For multi-GPU setup (DDP)
- Before testing, review the .sh file and set the $ export CUDA_VISIBLE_DEVICES="**" environment variable according to your hardware configuration.
- Ensure that the --model_name parameter corresponds to the configuration file located at ./config/model/{-model_name}.yaml.
- Model files (.pth) located at {--save_path}/{-model_name}_{-exp_name}/Param_{-epochs}.pth will be used for testing.
- The --epochs parameter can accept either an integer or a list of integers (e.g., 1, 2, 3).
- If --epochs is left unspecified (null), the default model file {--save_path}/Train_record/{-model_name}_{-exp_name}/Param_best.pth will be used for testing.
$ sh Test_PTModels
Important Note: After downloading the Param_best.pth file, move it to the directory {--save_path}/{-model_name}_{-exp_name}/ before use.
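The checkpoint selection rules above can be sketched as a small path resolver. This is only an illustration that mirrors the paths as written in this README, not the repository's actual loading code; the function name resolve_checkpoints and the example values are hypothetical.

```python
from pathlib import Path


def resolve_checkpoints(save_path, model_name, exp_name, epochs=None) -> list[Path]:
    """Return the checkpoint paths implied by the testing notes (hypothetical helper)."""
    run_name = f"{model_name}_{exp_name}"
    if epochs is None:
        # No epoch given: fall back to the best checkpoint under Train_record.
        return [Path(save_path) / "Train_record" / run_name / "Param_best.pth"]
    if isinstance(epochs, int):
        # A single integer is treated as a one-element list.
        epochs = [epochs]
    return [Path(save_path) / run_name / f"Param_{epoch}.pth" for epoch in epochs]
```

For example, resolve_checkpoints("out", "ACL", "exp1", [1, 2]) yields the two paths out/ACL_exp1/Param_1.pth and out/ACL_exp1/Param_2.pth, where "out", "ACL", and "exp1" are placeholder values.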
- (Syn1_Image, Syn1_Audio): [Link]
- All images and audio from Synthetic VGGSound Clone1
- (Syn1_Image, Real_Audio): [Link]
- All images from Synthetic VGGSound Clone1
- 3x scale (Table 3 (C)) [Link]
- (Real_Image, Real_Audio) $\cup$ (Syn1_Image, Real_Audio) $\cup$ (Syn2_Image, Real_Audio):
- (Real_Image, Real_Audio)
If you use this project, please cite it as:
@inproceedings{senocak2026howfar,
title={How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?},
author={Senocak, Arda and Park, Sooyoung and Oh, Tae-Hyun and Chung, Joon Son},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026},
}