MiuLab/AudioDecision
MedVoiceBias

MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam, Yun-Nung Chen — National Taiwan University, 2025


Abstract

As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient's voice characteristics rather than medical evidence.


Dataset

The MedVoiceBias dataset contains 170 DDXPlus clinical cases synthesized into 36 voice profiles (age × gender × emotion) using Sesame CSM-1B. It is hosted on HuggingFace and loaded automatically by all evaluation scripts — no manual download required.

Dataset: theblackcat102/MedVoiceBias
Each split corresponds to one voice profile (e.g. "old_female", "young_male", "expresso_happy")
Fields: qid, PATIENT_PROFILE, PATHOLOGY, audio, whisper_v3

Voice profile categories:

| Category | Profiles |
|---|---|
| Age (CommonVoice) | young_female, young_male, old_female, old_male |
| Emotion (Expresso) | expresso_happy, expresso_laughing, expresso_sad, expresso_confused, expresso_enunciated, expresso_whisper |
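
The split names follow the patterns shown above and can be enumerated programmatically. A minimal sketch, assuming the `{age}_{gender}` / `expresso_{emotion}` naming from the examples (this lists the profiles named in this README, not all 36); `load_profile` uses the Hugging Face `datasets` library:

```python
from itertools import product

# Split names following the patterns shown above (an assumption based on
# the README's examples, not an exhaustive list of all 36 profiles).
AGE_GENDER_SPLITS = [f"{age}_{gender}"
                     for age, gender in product(("young", "old"), ("female", "male"))]
EMOTION_SPLITS = [f"expresso_{e}"
                  for e in ("happy", "laughing", "sad", "confused", "enunciated", "whisper")]
LISTED_SPLITS = AGE_GENDER_SPLITS + EMOTION_SPLITS

def load_profile(split_name: str):
    """Load one voice-profile split from the Hugging Face Hub."""
    from datasets import load_dataset  # pip install datasets
    return load_dataset("theblackcat102/MedVoiceBias", split=split_name)
```

For example, `load_profile("old_female")` returns the split with the `qid`, `PATIENT_PROFILE`, `PATHOLOGY`, `audio`, and `whisper_v3` fields.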

Installation

pip install -r requirements.txt

Additional setup for local models:

# DeSTA2.5
pip install desta

# Qwen2.5-Omni
pip install qwen-omni-utils

API keys required (set as environment variables):

export GEMINI_API_KEY=...
export GCP_PROJECT_NAME=...   # for Gemini via Vertex AI
export OAI_AUDIO_KEY=...      # for GPT-4o-mini
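
Before launching API-backed evaluations, it can help to verify these variables up front. A hypothetical helper (not part of the repository) mirroring the export list above:

```python
import os

# Environment variables required per model family, mirroring the README's
# export list. Illustrative only — not part of the repository.
REQUIRED_KEYS = {
    "gemini": ("GEMINI_API_KEY", "GCP_PROJECT_NAME"),
    "openai": ("OAI_AUDIO_KEY",),
}

def missing_api_keys(series: str) -> list:
    """Return names of unset environment variables for the given series.

    Local model families (qwen_omni, voxtral, desta_2_5) need no keys.
    """
    return [k for k in REQUIRED_KEYS.get(series, ()) if not os.environ.get(k)]
```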

Evaluation

All scripts are run as Python modules from the AudioDecision/ directory.

Step 0 (Prerequisite): Verify demographic detection ability

Confirms that a model can perceive age, gender, and emotion from audio before running bias experiments (Table 2).

python -m eval.eval_cohort_detection \
    --model_name gemini-2.5-flash \
    --profile_name old_female

Step 1: Text baseline

# Direct Answer (DA)
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da

# Chain-of-Thought (CoT)
python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode cot

Step 2: Text+Profile (inject GT demographic description)

python -m eval.eval_surgery_text \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da \
    --with_profile

Step 3: ASR column (Whisper-v3 transcripts from dataset)

python -m eval.eval_surgery_asr \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --mode da

Step 4: Audio evaluation (main experiment)

# DA with audio
python -m eval.eval_surgery_audio_da \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --use_audio

# CoT with audio
python -m eval.eval_surgery_audio_cot \
    --model_name gemini-2.5-flash \
    --profile_name old_female \
    --use_audio

Run all models and profiles

bash scripts/eval_all_models.sh

Reproducing Paper Tables

bash scripts/run_analysis.sh

Or individually:

python analysis/report_demographic_bias.py   # Tables 3 & 4: age/gender bias

Results are written to logging/ as .jsonl files, one record per evaluated case.
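
These .jsonl logs can be aggregated directly. A minimal sketch of computing a surgery-recommendation rate; the record schema (an "answer" field holding a yes/no value) is an assumption, as the actual field names are defined by the eval scripts:

```python
import json

def recommendation_rate(jsonl_path: str) -> float:
    """Fraction of records whose (assumed) "answer" field is "yes"."""
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        return 0.0
    yes = sum(1 for r in records if str(r.get("answer", "")).lower() == "yes")
    return yes / len(records)
```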


Supported Models

| --model_name | --series | Type |
|---|---|---|
| gpt-4o-mini | openai | API |
| gemini-2.0-flash | gemini | API |
| gemini-2.5-flash | gemini | API |
| Qwen/Qwen2.5-Omni-3B | qwen_omni | Local |
| Qwen/Qwen2.5-Omni-7B | qwen_omni | Local |
| mistralai/Voxtral-Mini-3B-2507 | voxtral | Local (vLLM) |
| DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B | desta_2_5 | Local |
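
Each --model_name maps to the backend series that the get_llm() factory in llms/utils.py dispatches on. A sketch of that dispatch, with the mapping taken from the table above (the function itself is illustrative, not the repository's actual code):

```python
# Model-name → series mapping, copied from the table above.
MODEL_SERIES = {
    "gpt-4o-mini": "openai",
    "gemini-2.0-flash": "gemini",
    "gemini-2.5-flash": "gemini",
    "Qwen/Qwen2.5-Omni-3B": "qwen_omni",
    "Qwen/Qwen2.5-Omni-7B": "qwen_omni",
    "mistralai/Voxtral-Mini-3B-2507": "voxtral",
    "DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B": "desta_2_5",
}

def resolve_series(model_name: str) -> str:
    """Return the backend series for a supported model name."""
    try:
        return MODEL_SERIES[model_name]
    except KeyError:
        raise ValueError(f"Unsupported model: {model_name}")
```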

Repository Structure

AudioDecision/
├── llms/                        # Model interfaces
│   ├── utils.py                 # get_llm() factory + retry logic
│   ├── gemini.py                # Google Gemini (Vertex AI)
│   ├── oai.py                   # OpenAI GPT-4o
│   ├── desta_2_5.py             # DeSTA2.5-Audio
│   ├── qwen_omni.py             # Qwen2.5-Omni
│   ├── qwen3_omni.py            # Qwen3-Omni
│   └── voxtral.py               # Voxtral (via vLLM)
├── eval/                        # Evaluation scripts
│   ├── utils.py                 # Shared constants & answer parsers
│   ├── eval_surgery_text.py     # Text / Text+Profile columns
│   ├── eval_surgery_asr.py      # ASR column (row['whisper_v3'])
│   ├── eval_surgery_audio_da.py # Audio column, Direct Answer
│   ├── eval_surgery_audio_cot.py # Audio column, Chain-of-Thought
│   ├── eval_surgery_audio_da_gt.py   # Ablation: audio + GT voice desc, DA
│   ├── eval_surgery_audio_cot_gt.py  # Ablation: audio + GT voice desc, CoT
│   ├── eval_surgery_audio_cot_pred.py # Ablation: audio + predicted voice desc
│   └── eval_cohort_detection.py # Demographic detection prerequisite
├── analysis/
│   └── report_demographic_bias.py  # Age & gender bias tables (Tables 3-4)
├── scripts/
│   ├── eval_all_models.sh       # Run all evaluations
│   └── run_analysis.sh          # Generate paper tables
└── requirements.txt

Citation

@article{tam2025medvoicebias,
  title     = {MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making},
  author    = {Tam, Zhi Rui and Chen, Yun-Nung},
  year      = {2025},
  eprint    = {2511.06592},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

About

[ICASSP 26] When Voice Matters: A Study of Audio LLM Behavior in Clinical Decision-Making
