Toward Human-like Audio-Visual Intelligence of Omni-MLLMs — ICML 2026
📄 Paper · 🌐 Project Page · 💻 Code · 🤗 Dataset
AVI-Bench evaluates how well Omni-Multimodal Large Language Models (Omni-MLLMs) such as Gemini, GPT-4o, Qwen-Omni, and Baichuan-Omni handle joint audio-visual reasoning. It organises evaluation around the human cognitive process — Perception → Understanding → Reasoning — and adds the Primitive Sensation (PriSe) extension to test generalisation to low-semantic, unfamiliar audio-visual inputs.
This repository contains:
- the 14-task / 5,864-sample benchmark dataset layout reported in the paper,
- a unified inference + answer-formatting + evaluation pipeline,
- a reference adapter for any OpenAI-compatible API (Gemini / GPT-4o / Claude / self-hosted),
- a four-level AVI taxonomy for principled cross-stage comparison.

Contents:

- Highlights
- Benchmark Overview
- Headline Results
- Repository Layout
- Dataset Format
- Quickstart
- Pipeline Details
- Adding a New Model
- Citation
- Usage & Restrictions
- License

Highlights:

- Cognitively Inspired Framework. Three cognitively grounded stages (Perception, Understanding, Reasoning) plus the Primitive Sensation (PriSe) extension for unified, interpretable evaluation of Omni-MLLMs.
- Primitive Sensation Extension. A novel suite probing models' robustness on naive, low-semantic stimuli (synthetic 2D/3D shapes with controlled sounds); it directly tests generalisation beyond common training distributions.
- Four-Level Intelligence Taxonomy. Beyond raw accuracy, scores can be aggregated as Task-, Modality-, Stage-, and Domain-Adaptive intelligence, enabling fine-grained comparison.
- Compact but rich. 5,864 curated samples across 14 tasks — larger than most existing audio-visual benchmarks (WorldSense 3,172; AV-Odyssey 4,555) yet small enough to keep evaluation fast and reasoning-focused.
- Reproducible pipeline. One-command inference, LLM-based answer refinement, and a metric suite covering accuracy, mIoU, retrieval R@k, FENSE captioning, and counting RMSE.
| Stage | Tasks | Samples | What it measures |
|---|---|---|---|
| Perception | AMIC, VMIC, AVL, AVM | 1,494 | Detection and recognition of fundamental audio/visual entities; cross-modal local + global alignment |
| Understanding | VAR, AVR, AVC | 808 | Integration of multimodal context (cross-modal retrieval, narrative captioning) |
| Reasoning | AVH, VAH, AVQA, AVLG | 1,472 | Higher-order inference; hallucination robustness; cross-modal grounding |
| Primitive Sensation | ASQA, VSQA, AVSQA | 2,090 | Sensitivity to low-semantic stimuli (texture / colour / spatial / temporal change) |
| Total | 14 tasks | 5,864 | |
| Task | Full name | Cognitive stage | Metric |
|---|---|---|---|
| AMIC | Audio Multi-instance Classification | Perception | Semantic F1 + counting RMSE |
| VMIC | Visual Multi-instance Classification | Perception | Semantic recall + counting RMSE |
| AVL | Audio-Visual Localization | Perception | 0.7·mIoU + 0.3·Instance score |
| AVM | Audio-Visual Matching | Perception | Full-match accuracy |
| VAR | Visual-reference Audio Retrieval | Understanding | (R@1+R@3)/2 + F1 / 2 |
| AVR | Audio-reference Visual Retrieval | Understanding | same as VAR |
| AVC | Audio-Visual Captioning | Understanding | FENSE (Zhou et al., ICASSP'22) |
| AVH | Audio-reference Visual Hallucination | Reasoning | Full-match accuracy |
| VAH | Visual-reference Audio Hallucination | Reasoning | Full-match accuracy |
| AVQA | Audio-Visual Question Answering | Reasoning | Full-match accuracy |
| AVLG | Audio-Visual Language Grounding | Reasoning | per-frame mIoU |
| ASQA | Audio Sensation QA | Sensation | Full-match accuracy + double-confirm |
| VSQA | Visual Sensation QA | Sensation | Full-match accuracy + double-confirm |
| AVSQA | Audio-Visual Sensation QA | Sensation | AVSQA-specific matching + double-confirm |
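As a rough illustration of the AVL metric in the table above, the sketch below computes the IoU of two boxes and applies the 0.7 / 0.3 weighting. The function names are hypothetical, the box layout `[x_min, y_min, x_max, y_max]` is taken from the AVL refine format, and the definition of the instance score is not given here, so it is treated as an opaque input in [0, 1].

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection-over-union of two boxes given as [x_min, y_min, x_max, y_max]."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def avl_score(miou: float, instance_score: float) -> float:
    """Weighted AVL score as listed in the metric table: 0.7*mIoU + 0.3*instance."""
    return 0.7 * miou + 0.3 * instance_score
```

How mIoU is averaged over categories and samples is decided by eval/eval.py; this sketch only shows the per-pair IoU and the final weighting.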
Most existing benchmarks evaluate semantically rich audio-visual content (music, speech, real-world scenes), which overlaps heavily with model pre-training. AVI-Bench-PriSe instead measures whether Omni-MLLMs exhibit primitive sensation: detecting variations in colour, volume, shape, area, or temporal order on synthetic, low-semantic stimuli that lie outside the typical training distribution.
Each PriSe task includes paired pre-question / formal-question entries — a double-confirm mechanism that zeros the formal score whenever the model fails the basic "Can you hear / see anything?" pre-check, guarding against models that hallucinate confident answers without actually attending to the input.
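The double-confirm rule can be sketched as follows. This is a minimal illustration, assuming each sample reduces to two booleans (pre-check passed, formal answer correct); the function names are hypothetical and the real evaluator may structure this differently.

```python
from typing import Iterable, Tuple


def double_confirm_score(pre_check_passed: bool, formal_correct: bool) -> float:
    """Score one PriSe sample: a failed pre-check zeroes the formal score."""
    if not pre_check_passed:
        return 0.0
    return 1.0 if formal_correct else 0.0


def task_accuracy(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Mean double-confirmed score over (pre_check_passed, formal_correct) pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(double_confirm_score(p, f) for p, f in pairs) / len(pairs)
```

With this rule, a model that answers the formal question correctly but fails the pre-check receives no credit for that sample.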
| Level | Name | Definition |
|---|---|---|
| L1 | Task-Adaptive | Per-task average score |
| L2 | Modality-Adaptive | Balance between audio-dominant and visual-dominant tasks |
| L3 | Stage-Adaptive | Whether reasoning is grounded in its perceptual / conceptual prerequisites |
| L4 | Domain-Adaptive | Performance gap between familiar and unfamiliar (Sensation) domains |
L1–L4 yield interpretable diagnostic axes beyond raw accuracy.
| Model | Params | Perception | Understand | Reasoning | Sensation | Overall |
|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | – | 54.58 | 68.97 | 69.06 | 36.22 | 57.21 |
| Gemini-2.5-Flash | – | 45.97 | 43.79 | 63.70 | 30.63 | 46.02 |
| Gemini-2.0-Flash | – | 44.27 | 42.11 | 64.03 | 29.48 | 44.97 |
| Qwen2.5-Omni | 7B | 42.81 | 39.68 | 58.26 | 24.59 | 41.33 |
| GPT-4o | – | 40.45 | 48.60 | 56.87 | 16.81 | 40.68 |
| Human (subset) | – | – | – | – | 90+ | 92.6 |
| Model | Params | L1 Task | L2 Modality | L3 Stage | L4 Domain |
|---|---|---|---|---|---|
| Gemini-2.5-Pro | – | 64.20 | 62.80 | 57.08 | 32.97 |
| Gemini-2.5-Flash | – | 51.15 | 48.58 | 40.47 | 27.72 |
| Gemini-2.0-Flash | – | 50.14 | 49.21 | 39.79 | 27.12 |
| Qwen-Omni-Turbo | 7B | 46.50 | 45.15 | 37.70 | 26.13 |
| Qwen2.5-Omni | 7B | 46.92 | 45.93 | 37.61 | 25.89 |
| Baichuan-Omni | 7B | 37.35 | 35.80 | 30.18 | 24.10 |
| GPT-4o | – | 48.64 | 47.19 | 41.93 | 0.55 |
Key observations:
- Even the top model, Gemini-2.5-Pro, reaches only 57.2 overall — far below the human baseline (~92.6). AVI-Bench remains an open frontier.
- A consistent modality imbalance is observed: most models excel on visual-dominant tasks but lag on audio-dominant ones.
- Primitive Sensation is the weakest stage across the board, indicating poor generalisation to low-semantic, unfamiliar inputs.
- The four-level taxonomy amplifies failures hidden by raw averages: e.g. GPT-4o (L1 48.64, L3 41.93) collapses to L4 = 0.55 due to near-zero performance on audio-only sensation tasks (its cascaded audio path fails on out-of-distribution stimuli).
AVIBench/
├── README.md
├── run.py # unified inference entry
├── run_all.sh # wrapper to run all tasks
├── test_one_task.py # smoke test single task
├── test_all_tasks.py # smoke test all tasks (5 samples each)
│
├── scripts/ # prompt assembly + dataset loader
├── models/ # model adapters
│ └── gemini/run.py # OpenAI-compatible API client
├── auto_format/ # LLM-based answer refining
└── eval/ # FENSE-aware evaluation
    ├── eval.py
    ├── level_metrics/
    ├── user_outputs/{model}/tasks/*.json
    ├── user_outputs_refined/{model}/tasks/*.json
    └── eval_outputs/{model}.json
The benchmark data lives separately (e.g. AVIBench_data_release/levels/) — see Dataset Format.
levels/
├── perception/
│ ├── AMIC/ ├── VMIC/ ├── AVL/ └── AVM/
├── understand/
│ ├── VAR/ ├── AVR/ └── AVC/
├── reasoning/
│ ├── AVH/ ├── VAH/ ├── AVQA/ └── AVLG/
└── sensation/
    ├── ASQA/ ├── VSQA_I/ ├── VSQA_V/ └── AVSQA/
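Each task folder contains data.json (a list of sample entries) and an input/ directory with the media files. A typical entry has the fields id, task, subtask, input, and output; an AVQA sample is reproduced in full at the end of this README.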
prompt and options are pseudo keys (e.g. "prompt_amic") resolved at runtime against scripts/definitions.py, so data.json stays compact.
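The pseudo-key lookup can be pictured roughly like this. It is only a sketch: the `DEFINITIONS` dict, its placeholder values, and `resolve_question` are hypothetical stand-ins, and the actual layout of scripts/definitions.py may differ.

```python
from typing import Any, Dict, Mapping

# Hypothetical stand-in for scripts/definitions.py; the values are placeholders,
# not the benchmark's real prompt text or option lists.
DEFINITIONS: Dict[str, Any] = {
    "prompt_avqa": "<task-level prompt text for AVQA>",
    "avqa_options_list_is": ["yes", "no"],
}


def resolve_question(question: Mapping[str, Any], definitions: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of a question object with pseudo keys replaced by their definitions."""
    resolved = dict(question)  # shallow copy; the input entry is left untouched
    for field in ("prompt", "options"):
        key = resolved.get(field)
        # Only string keys that are actually defined get substituted.
        if isinstance(key, str) and key in definitions:
            resolved[field] = definitions[key]
    return resolved
```

Keeping the long prompt strings in one place means a wording change touches a single definition rather than every entry in data.json.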
conda create -n avibench python=3.11 -y
conda activate avibench
pip install -r requirements.txt  # openai, nltk, aac-metrics, sentence-transformers, ...

Point at the unpacked dataset and configure your API gateway:

export DATA_ROOT=/path/to/AVIBench_data_release/levels
export OPENAI_API_KEY=... # your OpenAI-compatible gateway key
export OPENAI_BASE_URL=...       # your OpenAI-compatible gateway endpoint

python test_all_tasks.py         # 5 samples per task; finishes in minutes

# (a) Inference: produces eval/user_outputs/<model>/tasks/*.json
MODEL_PATH=gemini-2.5-pro MODEL_LABEL=gemini-2.5-pro \
DATA_ROOT=$DATA_ROOT bash run_all.sh
# (b) Refine: produces eval/user_outputs_refined/<model>/tasks/*.json
cd auto_format && python run.py && cd ..
# (c) Evaluate: produces eval/eval_outputs/<model>.json
cd eval && python eval.py --models gemini-2.5-pro

run.py is the unified entry point. Each task runs as an independent process, so tasks parallelise trivially across processes.

| Flag | Default | Purpose |
|---|---|---|
| `--model_path` | `gemini-2.5-pro` | Model identifier passed to the API |
| `--model_label` | `gemini-2.5-pro` | Logical model name (used in output paths) |
| `--tasks` | all tasks | Subset of tasks to run |
| `--data_root` | `./data/levels` | Path to dataset root |
| `--output_dir` | `./eval/user_outputs` | Where to write `<model_label>/tasks/*.json` |
| `--concurrency` | `1` | Concurrent requests per task |
| `--n_samples` | `None` | Cap samples per task (for quick tests) |

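Because each task is an independent process, a small launcher can fan tasks out in parallel. The sketch below is illustrative only: `build_command` and `run_tasks` are hypothetical helpers, and it assumes `--tasks` accepts a single task name per invocation (the exact value format is not documented here). run_all.sh remains the supported wrapper.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def build_command(task: str, model_path: str, model_label: str, data_root: str) -> List[str]:
    """Assemble one run.py invocation using the flags from the table above."""
    return [
        sys.executable, "run.py",
        "--model_path", model_path,
        "--model_label", model_label,
        "--tasks", task,  # assumption: a single task name is a valid value
        "--data_root", data_root,
    ]


def run_tasks(tasks: List[str], model_path: str, model_label: str,
              data_root: str, max_procs: int = 4) -> Dict[str, int]:
    """Run one process per task, at most max_procs at a time; return exit codes by task."""
    commands = [build_command(t, model_path, model_label, data_root) for t in tasks]
    with ThreadPoolExecutor(max_workers=max_procs) as pool:
        # Each worker thread just blocks on its child process.
        codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
    return dict(zip(tasks, codes))
```

Threads are sufficient here because the heavy work happens in the child processes; a non-zero exit code in the returned dict marks a task worth re-running.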
Each prediction is written to the slot matching its sample's position in the dataset array, and runs are resumable: re-running the same command skips slots that already hold valid predictions.
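The resume behaviour can be sketched as below. This is an illustration under stated assumptions, not the repository's code: it assumes the output file is a JSON list with one slot per sample, that a "valid" prediction is a non-empty string, and all function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Callable, List, Optional, Sequence


def is_valid(pred: Any) -> bool:
    """Assumed validity rule: a non-empty string counts as a finished prediction."""
    return isinstance(pred, str) and pred.strip() != ""


def load_slots(path: Path, n_samples: int) -> List[Optional[str]]:
    """Load existing predictions, or start fresh if the file is missing or mismatched."""
    if path.exists():
        slots = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(slots, list) and len(slots) == n_samples:
            return slots
    return [None] * n_samples


def run_resumable(samples: Sequence[Any], path: Path,
                  infer: Callable[[Any], str]) -> List[Optional[str]]:
    """Fill only the empty or invalid slots, saving after every new prediction."""
    slots = load_slots(path, len(samples))
    for i, sample in enumerate(samples):
        if is_valid(slots[i]):
            continue  # already answered in an earlier run
        slots[i] = infer(sample)
        path.write_text(json.dumps(slots, ensure_ascii=False), encoding="utf-8")
    return slots
```

Indexing by array position keeps predictions aligned with data.json even when requests complete out of order or a run is interrupted.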
Raw model outputs are free-form text. A second LLM ("refine") normalises them per task:
| Task family | Refine output format |
|---|---|
| QA (ASQA, VSQA, AVSQA, AVQA) | Stripped final answer (**option** or number) |
| Multi-instance counting (AMIC, VMIC) | {"class": "count"} dict |
| Retrieval (AVR, VAR) | Index list [0, 3, 7] |
| Localization (AVL) | {"category_id": "[x_min, y_min, x_max, y_max]"} dict |
| Video grounding (AVLG) | {"frame_0": [x,y,w,h], ..., "frame_9": ...} |
| No refine needed (AVM, AVC, AVH, VAH) | Passed through |
Configure via env vars: OPENAI_API_KEY, OPENAI_BASE_URL, REFINE_MODEL (default gemini-2.5-flash), REFINE_CONCURRENCY (default 8).
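One way to read these settings in Python is shown below. The environment variable names and defaults come from the line above; the `RefineConfig` dataclass and `load_refine_config` are hypothetical and only illustrate the lookup.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RefineConfig:
    api_key: str
    base_url: str
    model: str
    concurrency: int


def load_refine_config(env: Optional[Mapping[str, str]] = None) -> RefineConfig:
    """Read refine settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return RefineConfig(
        api_key=env.get("OPENAI_API_KEY", ""),
        base_url=env.get("OPENAI_BASE_URL", ""),
        model=env.get("REFINE_MODEL", "gemini-2.5-flash"),
        concurrency=int(env.get("REFINE_CONCURRENCY", "8")),
    )
```

Passing a plain dict as `env` makes the defaults easy to check without touching the real environment.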
cd eval
python eval.py # eval all models under user_outputs_refined/
python eval.py --models gemini-2.5-pro   # eval one specific model

Output: eval/eval_outputs/<model_label>.json with per-task and stage-aggregated scores.

{
"perception": { "amic": {...}, "vmic": {...}, "avl": {...}, "avm": 0.768 },
"understand": { "var": {...}, "avr": {...}, "avc": {"fense": 0.56, ...} },
"reasoning": { "avh": 0.86, "vah": 0.80, "avqa": 0.40, "avlg": 0.35 },
"sensation": { "asqa": 0.29, "avsqa": 0.16, "vsqa": 0.32 }
}

For FENSE (AVC), the evaluator downloads two HuggingFace models on first use: sentence-transformers/paraphrase-TinyBERT-L6-v2 and bert-base-uncased. Set HF_HUB_OFFLINE=1 once they are cached.
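Since the evaluation output mixes scalar scores with nested metric dicts, flattening it can make comparisons across models easier. The helper below is a hypothetical convenience, not part of the repository; it assumes only the nesting visible in the sample output above.

```python
from typing import Any, Dict, Mapping


def flatten_scores(report: Mapping[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested score dicts into dotted keys, e.g. 'understand.avc.fense'."""
    flat: Dict[str, float] = {}
    for key, value in report.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, Mapping):
            flat.update(flatten_scores(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
        # Non-numeric leaves are skipped.
    return flat
```

In practice the input would be the parsed contents of eval/eval_outputs/<model_label>.json, for example via `json.load`.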
1. Create `models/<your_model>/run.py` exposing `set_model()` and `get_response()` (see `models/gemini/run.py` as a reference).
2. Select it via `MODEL_BACKEND=<your_model>`:

    MODEL_BACKEND=your_model \
    MODEL_PATH=your-model-id MODEL_LABEL=your-model \
    bash run_all.sh

The refine + eval pipeline is model-agnostic — your results land at eval/eval_outputs/your-model.json without any extra plumbing.
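A stub adapter might look like the sketch below. Only the two function names come from this README; the parameters, return type, and module-level state are assumptions made for illustration, so the real signatures should be checked against models/gemini/run.py before use.

```python
from typing import Optional, Sequence

# Module-level state holding the configured model identifier (assumed design).
_MODEL_ID: Optional[str] = None


def set_model(model_path: str) -> None:
    """Remember which model to call; a real adapter would load weights or build a client."""
    global _MODEL_ID
    _MODEL_ID = model_path


def get_response(prompt: str, media_paths: Sequence[str] = ()) -> str:
    """Return the model's free-form text answer for one sample (placeholder logic)."""
    if _MODEL_ID is None:
        raise RuntimeError("call set_model() before get_response()")
    # Placeholder: a real adapter would send the prompt and media to the model here.
    return f"[{_MODEL_ID}] received {len(media_paths)} media file(s): {prompt}"
```

Returning plain text keeps the adapter compatible with the refine step, which normalises free-form outputs per task.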
If you use AVI-Bench in your research, please cite:
@inproceedings{wang2026avibench,
title = {AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs},
author = {Wang, Yaoting and Zhang, Ziyi and Tu, Wenming and Xu, Shaoxuan and Du, Wenjie and Liang, Cheng and Wang, Weijun and Li, Yuanchao and Li, Guangyao and Fei, Hao and Li, Yuanchun and Ding, Henghui and Liu, Yunxin},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026}
}
⚠️ AVI-Bench is released for academic evaluation only.
The dataset is provided under the AVI-Bench Data Use Policy v1.0 (CC BY-ND 4.0 with an Anti-Training Addendum). By downloading, accessing, or otherwise using the dataset you accept the full policy. Commercial use for evaluation and benchmarking is permitted; training is not.
Permitted. Evaluating, benchmarking, probing, or red-teaming pre-existing models; reproducing results from the paper; academic teaching and qualitative analysis; methodology research about evaluation itself.
Prohibited.
- Using the dataset, in whole or in part, to train, fine-tune, distil, align, or otherwise update any machine-learning model, including LLMs, VLMs, audio-language models, omni-modal foundation models, diffusion models, and any subsequent model class.
- Constructing training data through paraphrasing, translation, augmentation, synthetic generation, or LLM-assisted relabelling of AVI-Bench content.
- Bulk redistribution, mirroring, or rehosting outside the official Hugging Face repository.
- Automated scraping, crawling, or batch downloading of the project page or dataset.
To support contamination audits, the project page and HF dataset card declare AI-training opt-outs via robots.txt, HTML <meta name="robots" content="noai, noimageai">, and a gated HF dataset that requires acceptance of the policy before download. Search engines and AI-search retrieval agents remain free to index this page so that researchers can discover and cite the benchmark.
If any bundled source content infringes your rights, please open a GitHub Issue titled "Removal request"; we will respond within 14 calendar days.
AVI-Bench uses a dual-licence model:
| Component | Licence | File |
|---|---|---|
| Code (Python / shell / HTML / CSS / LaTeX / config) | MIT | LICENSE |
| Dataset (annotations, splits, JSON, processed media on HF) | AVI-Bench Data Use Policy v1.0 (CC BY-ND 4.0 + Anti-Training Addendum) | DATA_USE_POLICY.md |
The dataset bundles content from several public sources (MusicAVQA, AV-Caps, AVHBench, …) — each upstream source retains its original licence, which continues to apply to the underlying media. The Anti-Training Addendum governs the AVI-Bench annotations, organisation, and processing layered on top. See DATA_USE_POLICY.md §6 and the HF dataset card for per-source attribution.


Example data.json entry (an AVQA sample):

{
  "id": "00000",
  "task": "AVQA",
  "subtask": null,
  "input": {
    "question": {
      "prompt": "prompt_avqa",
      "text": "Answer the question based on the given audio and video. Question: Is there a voiceover?",
      "options": "avqa_options_list_is"
    },
    "video": "./input/videos/00000.mp4",
    "audio_list": null,
    "image_list": null
  },
  "output": { "question_answer": "yes" }
}