Fun-ASR


Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab. It is trained on tens of millions of hours of real speech data, possessing powerful contextual understanding capabilities and industry adaptability. It supports low-latency real-time transcription and covers 31 languages. It excels in vertical domains such as education and finance, accurately recognizing professional terminology and industry expressions, effectively addressing challenges like "hallucination" generation and language confusion, achieving "clear hearing, understanding meaning, and accurate writing."

Project Kickoff Briefing

uv sync --extra cu128
uv pip install transformers==4.57.6 peft funasr==1.3.1 deepspeed

# Training qwen3-asr requires the additional installation of the following plugins:
uv pip install datasets qwen_asr
# export MAX_JOBS=2
# uv pip install -U flash-attn==2.8.3 --no-build-isolation
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl


Model Repository: modelscope, huggingface

Online Experience: ModelScope Community Space, huggingface space

Qwen3-ASR Repository: Repository

Model Name	Task Details	Training Data	Parameters
Fun-ASR-Nano (⭐ 🤗)	Speech recognition supports Chinese, English, and Japanese. Chinese includes support for 7 dialects (Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin) and 26 regional accents (Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi and more than 20 other regions). English and Japanese cover multiple regional accents. Additional features include lyric recognition and rap speech recognition.	Tens of millions of hours	800M
Fun-ASR-MLT-Nano (⭐ 🤗)	Speech recognition supports 31 languages in total: Chinese, English, Cantonese, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.	Hundreds of thousands of hours	800M

What's New 🔥

2025/12: Fun-ASR-Nano-2512 is an end-to-end speech recognition large model trained on tens of millions of hours of real speech data. It supports low-latency real-time transcription and covers 31 languages.
2024/7: FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR.

Core Features 🎯

Fun-ASR focuses on high-precision speech recognition, multi-language support, and industry customization capabilities.

Far-field High-noise Recognition: Deeply optimized for far-distance sound pickup and high-noise scenarios (such as conference rooms, in-vehicle environments, industrial sites, etc.), improving recognition accuracy to 93%.
Chinese Dialects and Regional Accents:
- Supports 7 major dialects: Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin
- Covers 26 regional accents: including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi and more than 20 other regions
Multi-language Free Speech: Supports recognition of 31 languages, with focused optimization on East and Southeast Asian languages, supporting free language switching and mixed recognition.
Music Background Lyric Recognition: Enhanced speech recognition performance under music background interference, supporting accurate recognition of lyric content in songs.

Environment Setup 🐍

git clone https://github.com/FunAudioLLM/Fun-ASR.git
cd Fun-ASR
uv sync

TODO

Support returning timestamps
Support speaker diarization
Support model training

Usage 🛠️

Inference

Using funasr for inference

from funasr import AutoModel


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
    model = AutoModel(
        model=model_dir,
        trust_remote_code=True,
        remote_code="./model.py",
        device="cuda:0",
        # hub: download models from "ms" (ModelScope) or "hf" (Hugging Face).
        hub="hf"
    )

    wav_path = f"{model.model_path}/example/zh.mp3"
    res = model.generate(
        input=[wav_path],
        cache={},
        batch_size=1,
        hotwords=["开放时间"],
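        # Values for `language`, written as in the original example (Chinese language names).
        # Fun-ASR-Nano-2512: 中文 (Chinese), 英文 (English), 日文 (Japanese).
        # Fun-ASR-MLT-Nano-2512: 中文、英文、粤语、日文、韩文、越南语、印尼语、泰语、马来语、菲律宾语、阿拉伯语、
        # 印地语、保加利亚语、克罗地亚语、捷克语、丹麦语、荷兰语、爱沙尼亚语、芬兰语、希腊语、
        # 匈牙利语、爱尔兰语、拉脱维亚语、立陶宛语、马耳他语、波兰语、葡萄牙语、罗马尼亚语、
        # 斯洛伐克语、斯洛文尼亚语、瑞典语 (the 31 languages listed for Fun-ASR-MLT-Nano above).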
        language="中文",
        itn=True, # or False
    )
    text = res[0]["text"]
    print(text)

    model = AutoModel(
        model=model_dir,
        trust_remote_code=True,
        vad_model="fsmn-vad",
        vad_kwargs={"max_single_segment_time": 30000},
        remote_code="./model.py",
        device="cuda:0",
    )
    res = model.generate(input=[wav_path], cache={}, batch_size=1)
    text = res[0]["text"]
    print(text)


if __name__ == "__main__":
    main()

Direct Inference

from model import FunASRNano


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
    m, kwargs = FunASRNano.from_pretrained(model=model_dir, device="cuda:0")
    m.eval()

    wav_path = f"{kwargs['model_path']}/example/zh.mp3"
    res = m.inference(data_in=[wav_path], **kwargs)
    text = res[0][0]["text"]
    print(text)


if __name__ == "__main__":
    main()

Parameter Description

model_dir: Model name or local disk model path.
trust_remote_code: Whether to trust remote code for loading custom model implementations.
remote_code: Specify the location of specific model code (e.g., model.py in the current directory), supporting both absolute and relative paths.
device: Specify the device to use, such as "cuda:0" or "cpu".

Finetune

Please refer to docs/finetune.md

Performance 📝

We evaluated Fun-ASR against other state-of-the-art models on open-source benchmarks, Chinese dialect datasets, and industry-specific test sets. The results demonstrate that Fun-ASR achieves superior performance across various scenarios.

1. Open-Source Dataset Performance (WER %)

Test set	GLM-ASR-nano	GLM-ASR-nano*	Whisper-large-v3	Seed-ASR	Seed-ASR*	Kimi-Audio	Step-Audio2	FireRed-ASR	Fun-ASR-nano	Fun-ASR
Model Size	1.5B	1.5B	1.6B	-	-	-	-	1.1B	0.8B	7.7B
OpenSource	✅	✅	✅	❌	❌	✅	✅	✅	✅	❌
AIShell1	1.81	2.17	4.72	0.68	1.63	0.71	0.63	0.54	1.80	1.22
AIShell2	-	3.47	4.68	2.27	2.76	2.86	2.10	2.58	2.75	2.39
Fleurs-zh	-	3.65	5.18	3.43	3.23	3.11	2.68	4.81	2.56	2.53
Fleurs-en	5.78	6.95	6.23	9.39	9.39	6.99	3.03	10.79	5.96	4.74
Librispeech-clean	2.00	2.17	1.86	1.58	2.8	1.32	1.17	1.84	1.76	1.51
Librispeech-other	4.19	4.43	3.43	2.84	5.69	2.63	2.42	4.52	4.33	3.03
WenetSpeech Meeting	6.73	8.21	18.39	5.69	7.07	6.24	4.75	4.95	6.60	6.17
WenetSpeech Net	-	6.33	11.89	4.66	4.84	6.45	4.67	4.94	6.01	5.46

Note: Seed-ASR* results are evaluated using the official API on volcengine; GLM-ASR-nano* results are evaluated using the open-source checkpoint.

2. Industry Dataset Performance (WER %)

Test set	GLM-ASR-Nano	Whisper-large-v3	Seed-ASR	FireRed-ASR	Kimi-Audio	Paraformer v2	Fun-ASR-nano	Fun-ASR
Model Size	1.5B	1.6B	-	1.1B	8B	0.2B	0.8B	7.7B
OpenSource	✅	✅	❌	✅	✅	✅	✅	❌
Nearfield	16.95	16.58	7.20	10.10	9.02	8.11	7.79	6.31
Farfield	9.44	22.21	4.59	7.49	10.95	9.55	5.79	4.34
Complex Background	23.79	32.57	12.90	15.56	15.56	15.19	14.59	11.45
English General	16.47	18.56	15.65	21.62	18.12	19.48	15.28	13.73
Opensource	4.67	7.05	3.83	5.31	3.79	6.23	4.22	3.38
Dialect	54.21	66.14	29.45	52.82	71.94	41.16	28.18	15.21
Accent	19.78	36.03	10.23	14.05	27.20	17.80	12.90	10.31
Lyrics	46.56	54.82	30.26	42.87	65.18	50.14	30.85	21.00
Hiphop	43.32	46.56	29.46	33.88	57.25	43.79	30.87	28.58
Average	26.13	33.39	15.95	22.63	31.00	23.49	16.72	12.70
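The Average row appears to be the unweighted mean of the nine test-set rows. A minimal sketch that reproduces it for three of the columns, with the numbers copied from the table above (the variable and function names are illustrative only):

```python
# WER (%) per test set, copied from the industry table above.
# Row order: Nearfield, Farfield, Complex Background, English General,
# Opensource, Dialect, Accent, Lyrics, Hiphop.
wer_by_model = {
    "Seed-ASR": [7.20, 4.59, 12.90, 15.65, 3.83, 29.45, 10.23, 30.26, 29.46],
    "Fun-ASR-nano": [7.79, 5.79, 14.59, 15.28, 4.22, 28.18, 12.90, 30.85, 30.87],
    "Fun-ASR": [6.31, 4.34, 11.45, 13.73, 3.38, 15.21, 10.31, 21.00, 28.58],
}


def mean_wer(values):
    """Unweighted mean over test sets, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


averages = {name: mean_wer(values) for name, values in wer_by_model.items()}
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Because the mean is unweighted, hard test sets such as Dialect and Lyrics influence the average as much as the easier ones.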

Phased Mixed Training

Refer to the official website: https://gitee.com/WangJiaHui202144/funasr-nano/blob/main/docs/fintune_zh.md
Choose the training method that suits your needs based on your data volume.

Dimension	Full Fine-tuning	LoRA
Parameters	All LLM params (GB+)	Low-rank matrices (MB+)
Trainable Ratio	100%	0.1%-1%
Overfitting Risk	Extremely High	Low
Training Cost	Extremely High (VRAM/Time)	Low (Save 70%+ VRAM)
LR Sensitivity	Highly Sensitive (Precise Tuning)	More Tolerant
Data Requirements	1000h+	10h-1000h
Small Data Performance	Prone to Collapse/Degradation	Stable
Difficulty	Extremely Hard to Control	Relatively Easy
Performance Ceiling	Theoretically Highest	Slightly Lower (95%+ of Full)
Catastrophic Forgetting	Severe	Minimal
Inference Overhead	No Extra Cost	Optional Merge/Dynamic Loading
Multi-task Adaptation	Requires Retraining	Multiple LoRAs in Parallel
Convergence Speed	Slower	Faster
Checkpoint Size	Full Model (GB)	LoRA Weights Only (MB)
Impact on General Capability	May Severely Degrade	Largely Preserved
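To see where a trainable ratio in the 0.1%-1% range can come from, here is a rough sketch of the parameter count of a single LoRA adapter relative to the weight matrix it adapts. The layer size and rank are hypothetical values chosen for illustration, not the settings of this project.

```python
def lora_trainable_ratio(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA parameters to the parameters of the frozen weight matrix.

    A LoRA adapter for a (d_out x d_in) weight adds two matrices:
    A with shape (rank x d_in) and B with shape (d_out x rank).
    """
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)
    return lora_params / full_params


# Hypothetical layer size and rank, chosen only for illustration.
ratio = lora_trainable_ratio(d_in=4096, d_out=4096, rank=8)
print(f"{ratio:.2%}")  # prints 0.39%
```

The ratio grows linearly with the rank, so doubling the rank doubles both the trainable parameters and the size of the saved adapter.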

The data is split Train : Valid : Test = 8:1:1:

G1-G66590: training set
G66591-G74915: validation set
G74916-G83238: test set

This gives 83,238 entries in total, with an approximate total duration of 87 hours. To improve the model's generalization, the WenetSpeech dataset, which matches the business scenario, was also used. Because the dataset is small, only the audio adapter layer was fine-tuned.
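As a quick sanity check, the ID ranges can be turned into split sizes and fractions. This is only a sketch that uses the numbers stated above; the dictionary layout is an illustration.

```python
# Inclusive ID ranges taken from the split described above.
splits = {
    "train": (1, 66590),
    "valid": (66591, 74915),
    "test": (74916, 83238),
}

sizes = {name: end - start + 1 for name, (start, end) in splits.items()}
total = sum(sizes.values())
fractions = {name: size / total for name, size in sizes.items()}

for name in splits:
    print(f"{name}: {sizes[name]} entries ({fractions[name]:.1%})")
```

The sizes add up to 83,238 and the fractions round to 80%, 10%, and 10%, matching the stated 8:1:1 ratio.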

Stage 1, Warm-up Training: General Data : Specialized Data = 50:50 (87h:87h). Training: 5-10 epochs. Objective: activate the model's adaptability to diverse speech patterns.
Stage 2, Domain Adaptation: General Data : Specialized Data = 20:80 (20h:87h). Training: 15-20 epochs. Objective: strengthen specialized features while maintaining generalization.
Stage 3, Specialized Data Only: 100% specialized data (87h). Training: 5-10 epochs. Objective: maximize domain accuracy.
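The three mixing ratios can be sketched as a sampling step. This is not the logic of tools/prepare_staged_data.py, which is not shown here; it assumes mixing by sample count (the ratios above are given in hours), and `mix_by_ratio` is a hypothetical helper.

```python
import random


def mix_by_ratio(general, domain, general_share, seed=0):
    """Keep every domain sample and draw enough general samples so that
    general data makes up `general_share` of the mixed set (by count)."""
    if not 0.0 <= general_share < 1.0:
        raise ValueError("general_share must be in [0, 1)")
    wanted = round(len(domain) * general_share / (1.0 - general_share))
    wanted = min(wanted, len(general))  # cannot draw more than is available
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), wanted)
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for JSONL lines.
general = [f"general-{i}" for i in range(1000)]
domain = [f"domain-{i}" for i in range(400)]

stage1 = mix_by_ratio(general, domain, general_share=0.5)  # 50:50
stage2 = mix_by_ratio(general, domain, general_share=0.2)  # 20:80
stage3 = mix_by_ratio(general, domain, general_share=0.0)  # specialized only
```

Keeping all specialized samples in every stage and varying only the amount of general data means the domain set is never subsampled, which matters when it is as small as 87 hours.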

To reduce data preparation complexity, support for mixed-sample data is provided.

1. Generate an SCP file that meets the requirements.

tools/datasets_utils.py: This utility class supports most file conversions, including TXT to SCP, JSON to JSONL, and Excel to JSONL, and it covers the input formats of both Whisper and FunASR. It is recommended to prepare WAV and TXT data according to the following structure and then use this utility class to generate the SCP files.

uv run tools/datasets_utils.py
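For orientation, here is a minimal sketch of what generating SCP files from paired WAV and TXT files can look like. It is not the implementation of tools/datasets_utils.py. It assumes Kaldi-style lines (`utt_id`, a space, then the value), one `.txt` transcript next to each `.wav`, and the file stem as the utterance ID; `build_scp` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def build_scp(data_dir, scp_path, text_path):
    """Write `wav.scp` (utt_id <space> audio path) and `wav.txt`
    (utt_id <space> transcript) for every WAV/TXT pair under data_dir."""
    count = 0
    wav_files = sorted(Path(data_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as scp, \
         open(text_path, "w", encoding="utf-8") as txt:
        for wav in wav_files:
            transcript_file = wav.with_suffix(".txt")
            if not transcript_file.exists():
                continue  # skip audio without a transcript
            raw = transcript_file.read_text(encoding="utf-8")
            transcript = " ".join(raw.split())  # collapse newlines and extra spaces
            scp.write(f"{wav.stem} {wav.resolve()}\n")
            txt.write(f"{wav.stem} {transcript}\n")
            count += 1
    return count


# Demo on a throwaway directory with one complete pair and one orphan WAV.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    audio_dir = root / "audio"
    audio_dir.mkdir()
    (audio_dir / "G1.wav").write_bytes(b"")  # placeholder, not real audio
    (audio_dir / "G1.txt").write_text("开放时间\n", encoding="utf-8")
    (audio_dir / "G2.wav").write_bytes(b"")  # no transcript, will be skipped
    n_pairs = build_scp(audio_dir, root / "wav.scp", root / "wav.txt")
    scp_text = (root / "wav.scp").read_text(encoding="utf-8")
    txt_text = (root / "wav.txt").read_text(encoding="utf-8")
```

Using the same utterance ID in both files is what allows the later SCP-to-JSONL step to join audio paths with transcripts.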

2. Generate JSONL files for input features in Nano format

Linux

# nano
uv run tools/scp2jsonl.py \
  ++scp_file=data/domain/train/wav.scp \
  ++transcript_file=data/domain/train/wav.txt \
  ++jsonl_file=data/domain/train/wav_nano.jsonl

# paraformer models
scp2jsonl \
  ++scp_file_list='["data/domain/train/wav.scp", "data/domain/train/wav.txt"]' \
  ++data_type_list='["source", "target"]' \
  ++jsonl_file_out="data/domain/train/wav_paraformer.jsonl"

Windows

# nano
uv run tools/scp2jsonl.py ++scp_file=data/domain/train/wav.scp ++transcript_file=data/domain/train/wav.txt ++jsonl_file=data/domain/train/wav_nano.jsonl

# paraformer models
scp2jsonl ++scp_file_list='["data/domain/train/wav.scp", "data/domain/train/wav.txt"]' ++data_type_list='["source", "target"]' ++jsonl_file_out="data/domain/train/wav_paraformer.jsonl"
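Conceptually, the conversion joins wav.scp and wav.txt on the utterance ID and writes one JSON object per line. The sketch below illustrates only that join. The field names `key`, `source`, and `target` are an assumption modeled on the `data_type_list` of the paraformer command; the files actually written by scp2jsonl and tools/scp2jsonl.py may use a different schema and carry extra fields.

```python
import json


def read_kaldi_file(lines):
    """Parse 'utt_id<space>value' lines into a dict."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, _, value = line.partition(" ")
        table[utt_id] = value.strip()
    return table


def to_jsonl(scp_lines, text_lines):
    """Join wav.scp and wav.txt on utt_id and emit one JSON string per record."""
    audio = read_kaldi_file(scp_lines)
    text = read_kaldi_file(text_lines)
    records = []
    for utt_id, path in audio.items():
        if utt_id not in text:
            continue  # no transcript for this utterance
        record = {"key": utt_id, "source": path, "target": text[utt_id]}
        records.append(json.dumps(record, ensure_ascii=False))
    return records


jsonl = to_jsonl(
    ["G1 data/domain/train/G1.wav", "G2 data/domain/train/G2.wav"],
    ["G1 开放时间", "G3 unused"],
)
print("\n".join(jsonl))
```

Passing `ensure_ascii=False` keeps Chinese transcripts readable in the output file instead of escaping them.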

3. Use prepare_staged_data.py to blend datasets

If data employs a multilingual isolated storage structure, the following approach can be adopted to standardize the workflow:

If multilingual data is already mixed in storage: Perform the corresponding mixing and processing operations directly within the current data directory without additional steps.
If different language data is stored in separate directories: First perform data mixing operations within each language directory. Then manually create a staged directory and consolidate the processed data from all languages into this directory to complete the multilingual data integration.

Operational Data Preparation for Nano

uv run tools/prepare_staged_data.py \
  --general_train data/general/train/wav_nano.jsonl \
  --general_val data/general/valid/wav_nano.jsonl \
  --domain_train data/domain/train/wav_nano.jsonl \
  --domain_val data/domain/valid/wav_nano.jsonl \
  --output_dir data/staged

Operational Data Preparation for paraformer

uv run tools/prepare_staged_data.py \
  --general_train data/general/train/wav_paraformer.jsonl \
  --general_val data/general/valid/wav_paraformer.jsonl \
  --domain_train data/domain/train/wav_paraformer.jsonl \
  --domain_val data/domain/valid/wav_paraformer.jsonl \
  --output_dir data/staged

# Output:
# data/staged/
# ├── stage1/
# │   ├── train.jsonl (50/50 mix)
# │   └── val.jsonl
# ├── stage2/
# │   ├── train.jsonl (20/80 mix)
# │   └── val.jsonl
# └── stage3/
#     ├── train.jsonl (specialized data only)
#     └── val.jsonl
# data-en: English dataset
# ├── domain
# │   ├── test
# │   ├── train
# │   └── valid
# ├── general
# │   ├── test
# │   ├── train
# │   └── valid
# └── staged
#     ├── stage1
#     ├── stage2
#     └── stage3
# data-zh: Chinese dataset
# ├── domain
# │   ├── test
# │   ├── train
# │   └── valid
# ├── general
# │   ├── test
# │   ├── train
# │   └── valid
# └── staged
#     ├── stage1
#     ├── stage2
#     └── stage3
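When each language has its own directory, as in the data-en and data-zh trees above, the manual consolidation step can be sketched as a plain concatenation of the per-language stage files. The helper name, the output location, and the choice to concatenate without reshuffling are assumptions for illustration and are not part of the repository's tooling.

```python
import tempfile
from pathlib import Path


def consolidate_staged(language_dirs, output_dir,
                       stages=("stage1", "stage2", "stage3"),
                       files=("train.jsonl", "val.jsonl")):
    """Concatenate <lang>/staged/<stage>/<file> from every language directory
    into <output_dir>/<stage>/<file>. Returns the number of lines written."""
    written = 0
    for stage in stages:
        target_dir = Path(output_dir) / stage
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in files:
            with open(target_dir / name, "w", encoding="utf-8") as out:
                for lang_dir in language_dirs:
                    part = Path(lang_dir) / "staged" / stage / name
                    if not part.exists():
                        continue  # this language has no file for this stage
                    for line in part.read_text(encoding="utf-8").splitlines():
                        if line.strip():
                            out.write(line + "\n")
                            written += 1
    return written


# Demo with two tiny per-language stage1 files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = {"data-en": ['{"key": "en1"}', '{"key": "en2"}'],
               "data-zh": ['{"key": "zh1"}']}
    for lang, lines in samples.items():
        stage_dir = root / lang / "staged" / "stage1"
        stage_dir.mkdir(parents=True)
        (stage_dir / "train.jsonl").write_text("\n".join(lines) + "\n", encoding="utf-8")
    total_lines = consolidate_staged([root / "data-en", root / "data-zh"], root / "staged")
    merged_path = root / "staged" / "stage1" / "train.jsonl"
    merged = merged_path.read_text(encoding="utf-8").splitlines()
```

If the training script does not shuffle its input, the consolidated files should be shuffled afterwards so that languages are interleaved.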

4. One-Click Fine-Tuning Training

nano training script reference: finetune_nano.sh
paraformer training script reference: finetune_paraformer.sh
qwen3-asr training script reference: finetune_qwen3asr.sh

# Pre-trained model path
model_name_or_model_dir="models/Fun-ASR-Nano-2512"

# Freeze the audio encoder and the LLM for the whole run; train only the audio adaptor
FREEZE_PARAMS="
++audio_encoder_conf.freeze=true \
++audio_adaptor_conf.freeze=false \
++llm_conf.freeze=true
"

For reference, see https://github.com/modelscope/FunASR/blob/main/examples/industrial_data_pretraining/paraformer/README_zh.md#%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83%E4%B8%8E%E6%B5%8B%E8%AF%95

model_name_or_model_dir: Model path.
audio_encoder_conf: Acoustic encoder; freeze=true (frozen).
audio_adaptor_conf: Acoustic adapter layer; freeze=false (trainable).
llm_conf: High-level semantic module; freeze=true (frozen).
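To make the effect of these overrides concrete, the sketch below parses `++section.key=value` tokens into a nested dictionary and lists which components stay trainable. It is an illustration only, not the parser FunASR uses; `parse_overrides` is a hypothetical helper.

```python
def parse_overrides(text):
    """Turn '++a.b=value' tokens into a nested dict."""
    config = {}
    for token in text.replace("\\", " ").split():
        if not token.startswith("++") or "=" not in token:
            continue
        dotted_key, _, raw = token[2:].partition("=")
        value = {"true": True, "false": False}.get(raw.lower(), raw)
        node = config
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


FREEZE_PARAMS = """
++audio_encoder_conf.freeze=true
++audio_adaptor_conf.freeze=false
++llm_conf.freeze=true
"""

parsed = parse_overrides(FREEZE_PARAMS)
trainable = [name for name, conf in parsed.items() if not conf["freeze"]]
print(trainable)  # prints ['audio_adaptor_conf']
```

With the settings shown, only the audio adaptor receives gradient updates, which matches the decision to tune just the adapter layer on a small dataset.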

# Nano Model Training
nohup bash auto_finetune.sh > full_train_nano.log 2>&1 &
# Paraformer (non-autoregressive) Model Training
nohup bash finetune_paraformer.sh > full_train_paraformer.log 2>&1 &
# qwen3-asr Model Training
nohup bash finetune_qwen3asr.sh > full_train_qwen3asr.log 2>&1 &

Docker Training

This project can be run directly on the host. However, because the server handles many AI training tasks, Docker must be used to ensure environment isolation and internal network migration. Training containers are designed for single use, so during training you must back up and persist the data volumes (model weights, logs, and intermediate outputs). After training completes, create and launch a new container for model evaluation or inference; never reuse the original training container. This follows the principles of immutable infrastructure and single responsibility: it clearly separates the training and evaluation phases, makes problems easier to detect and manage throughout the project lifecycle, reduces the cognitive burden on users, and improves maintainability and reproducibility.

# build image
docker build -t funasr-finetune:Dockerfile .

# qwen3-asr needs flash-attn for acceleration
# CUDA toolkit: https://developer.nvidia.com/cuda-12-8-0-download-archive
docker build -f Dockerfile-flash-attn -t funasr-finetune:Dockerfile-flash-attn .

docker builder prune --filter "until=24h"

nano container training

Do not use the same mounted volume for multiple model containers, as this may lead to data corruption.

mkdir nano-finetune

# Launch a temporary container to copy files to the local machine.
docker run -it --name nano-finetune funasr-finetune:Dockerfile /bin/bash

# In a new terminal, copy the workspace (including any files you want to debug yourself) out of the container
docker cp nano-finetune:/workspace $PWD

# Exit the container and delete the temporary container
docker rm -f nano-finetune

mkdir -p $PWD/workspace/models $PWD/workspace/data $PWD/workspace/outputs
# Move the model into the local workspace
mv <model-path> $PWD/workspace/models

# Move the data into the local workspace
mv <data-path> $PWD/workspace/data

# start container
docker run -it --network host --shm-size=32g \
--gpus all --cpus=12 \
-v $PWD/workspace:/workspace \
--restart=on-failure \
--name nano-finetune funasr-finetune:Dockerfile /bin/bash

# Start training
nohup bash auto_finetune.sh > full_train.log 2>&1 &

--shm-size must be specified explicitly. For --cpus, a value of four times the number of GPUs is recommended.

Paraformer Container Training

Do not use the same mounted volume for multiple model containers, as this may lead to data corruption.

mkdir paraformer-finetune

# Launch a temporary container to copy files to the local machine.
docker run -it --name paraformer-finetune funasr-finetune:Dockerfile /bin/bash

# In a new terminal, copy the workspace (including any files you want to debug yourself) out of the container
docker cp paraformer-finetune:/workspace $PWD

# Exit the container and delete the temporary container
docker rm -f paraformer-finetune

mkdir -p $PWD/workspace/models $PWD/workspace/data $PWD/workspace/outputs
# Move the model into the local workspace
mv <model-path> $PWD/workspace/models

# Move the data into the local workspace
mv <data-path> $PWD/workspace/data

docker run -it --shm-size=8g --gpus=all --cpus=8 \
  -p 10097:10095 \
  -v $PWD/workspace:/workspace \
  -e LANG=C.UTF-8 \
  -e LC_ALL=C.UTF-8 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  --name paraformer-funasr \
  funasr-finetune:Dockerfile /bin/bash

# Start training
nohup bash finetune_paraformer.sh > full_train_paraformer.log 2>&1 &

Multi-card training

As of now, for funasr-nano-2512, you need to add the following configuration to your model settings:

Merge Nano Model

After training completes, if you are using a nano model, you must configure tools/lora_merge.py and run it to perform the final model merge. You may notice that the LoRA training checkpoint is much larger than the few megabytes that LoRA weights alone would need. This is because it also retains the audio encoder and base LLM weights to support resuming training, so you can pause and restart training at any point without the original base model files.

# For specific parameter modifications, refer to the merge script.
uv run tools/lora_merge.py
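For background, merging a LoRA adapter into a base weight follows the standard formula W' = W + (alpha / r) · B · A. The sketch below shows that arithmetic on plain Python lists. It is a generic illustration of LoRA merging, not the contents of tools/lora_merge.py, and the matrices are made-up toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    weight: d_out x d_in, lora_b: d_out x rank, lora_a: rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny hand-made example: a 2x2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x d_in
B = [[0.5], [1.0]]    # d_out x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # prints [[2.0, 2.0], [2.0, 5.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged model adds no extra inference cost.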

Decoding Test

# decode
uv run decode.py  ++model_dir=models/Fun-ASR-Nano-merged   ++scp_file=data/domain/test/wav.scp   ++output_file=output.txt
uv run decode.py  ++model_dir=models/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch   ++scp_file=data/domain/test/wav.scp   ++output_file=output.txt
# itn inverse text normalization
uv run tools/whisper_mix_normalize.py data/val_text.txt data/val_norm.txt
uv run tools/whisper_mix_normalize.py output.txt output_norm.txt
# compute-wer
compute-wer data/val_norm.txt output_norm.txt cer.txt
tail -n8 cer.txt
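The error rate reported in this step is an edit distance divided by the reference length. A minimal sketch of that computation follows, at word level for English and character level for Chinese. It does not reproduce the tokenization or normalization of compute-wer or whisper_mix_normalize.py, and the sample sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by the reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Word-level example: one substitution and one deletion over six words.
wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
# Character-level example: one substitution over four characters.
cer = error_rate(list("开放时间"), list("开放实间"))
print(f"WER {wer:.2f}%  CER {cer:.2f}%")
```

Normalizing both the reference and the hypothesis in the same way before scoring, as the commands above do, keeps formatting differences such as digits versus spelled-out numbers from counting as errors.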

Log Analysis

uv run train_log_analyzer.py log.txt

Remarkable Third-Party Work

vLLM (GPU) Deployment Best Practices: An accelerated implementation of Fun-ASR using vLLM. Repository
llama (GGUF) Inference Best Practices: Repository

Citations

@misc{an2025funasrtechnicalreport,
      title={Fun-ASR Technical Report},
      author={Keyu An and Yanni Chen and Zhigao Chen and Chong Deng and Zhihao Du and Changfeng Gao and Zhifu Gao and Bo Gong and Xiangang Li and Yabin Li and Ying Liu and Xiang Lv and Yunjie Ji and Yiheng Jiang and Bin Ma and Haoneng Luo and Chongjia Ni and Zexu Pan and Yiping Peng and Zhendong Peng and Peiyao Wang and Hao Wang and Haoxu Wang and Wen Wang and Wupeng Wang and Yuzhong Wu and Biao Tian and Zhentao Tan and Nan Yang and Bin Yuan and Jieping Ye and Jixing Yu and Qinglin Zhang and Kun Zou and Han Zhao and Shengkui Zhao and Jingren Zhou and Yanqiao Zhu},
      year={2025},
      eprint={2509.12508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.12508},
}
