[TOC]
Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab. It is trained on tens of millions of hours of real speech data, possessing powerful contextual understanding capabilities and industry adaptability. It supports low-latency real-time transcription and covers 31 languages. It excels in vertical domains such as education and finance, accurately recognizing professional terminology and industry expressions, effectively addressing challenges like "hallucination" generation and language confusion, achieving "clear hearing, understanding meaning, and accurate writing."
uv sync --extra cu128
uv pip install transformers==4.57.6 peft funasr==1.3.1 deepspeed
# Training qwen3-asr requires the additional installation of the following plugins:
uv pip install datasets qwen_asr
# export MAX_JOBS=2
# uv pip install -U flash-attn==2.8.3 --no-build-isolation
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

Model Repository: modelscope, huggingface
Online Experience: ModelScope Community Space, huggingface space
- Qwen3-ASR Repository: Repository
| Model Name | Task Details | Training Data | Parameters |
|---|---|---|---|
| Fun-ASR-Nano (⭐ 🤗) | Speech recognition supports Chinese, English, and Japanese. Chinese includes support for 7 dialects (Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin) and 26 regional accents (Henan, Shanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi and more than 20 other regions). English and Japanese cover multiple regional accents. Additional features include lyric recognition and rap speech recognition. | Tens of millions of hours | 800M |
| Fun-ASR-MLT-Nano (⭐ 🤗) | Speech recognition supports 31 languages in total: Chinese, English, Cantonese, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish. | Hundreds of thousands of hours | 800M |
- 2025/12: Fun-ASR-Nano-2512 is an end-to-end speech recognition large model trained on tens of millions of hours of real speech data. It supports low-latency real-time transcription and covers 31 languages.
- 2024/7: FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR.
Fun-ASR focuses on high-precision speech recognition, multi-language support, and industry customization capabilities.
- Far-field High-noise Recognition: Deeply optimized for far-distance sound pickup and high-noise scenarios (such as conference rooms, in-vehicle environments, industrial sites, etc.), improving recognition accuracy to 93%.
- Chinese Dialects and Regional Accents:
  - Supports 7 major dialects: Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin
  - Covers 26 regional accents: including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi and more than 20 other regions
- Multi-language Free Speech: Supports recognition of 31 languages, with focused optimization on East and Southeast Asian languages, supporting free language switching and mixed recognition.
- Music Background Lyric Recognition: Enhanced speech recognition performance under music background interference, supporting accurate recognition of lyric content in songs.
git clone https://github.com/FunAudioLLM/Fun-ASR.git
cd Fun-ASR
uv sync

- Support returning timestamps
- Support speaker diarization
- Support model training
from funasr import AutoModel


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
    model = AutoModel(
        model=model_dir,
        trust_remote_code=True,
        remote_code="./model.py",
        device="cuda:0",
        # hub: download models from ms (for ModelScope) or hf (for Hugging Face).
        hub="hf",
    )

    wav_path = f"{model.model_path}/example/zh.mp3"
    res = model.generate(
        input=[wav_path],
        cache={},
        batch_size=1,
        hotwords=["开放时间"],
        # 中文、英文、日文 for Fun-ASR-Nano-2512
        # 中文、英文、粤语、日文、韩文、越南语、印尼语、泰语、马来语、菲律宾语、阿拉伯语、
        # 印地语、保加利亚语、克罗地亚语、捷克语、丹麦语、荷兰语、爱沙尼亚语、芬兰语、希腊语、
        # 匈牙利语、爱尔兰语、拉脱维亚语、立陶宛语、马耳他语、波兰语、葡萄牙语、罗马尼亚语、
        # 斯洛伐克语、斯洛文尼亚语、瑞典语 for Fun-ASR-MLT-Nano-2512
        language="中文",
        itn=True,  # or False
    )
    text = res[0]["text"]
    print(text)

    # Same model with VAD enabled for long audio.
    model = AutoModel(
        model=model_dir,
        trust_remote_code=True,
        vad_model="fsmn-vad",
        vad_kwargs={"max_single_segment_time": 30000},
        remote_code="./model.py",
        device="cuda:0",
    )
    res = model.generate(input=[wav_path], cache={}, batch_size=1)
    text = res[0]["text"]
    print(text)


if __name__ == "__main__":
    main()

from model import FunASRNano


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
    m, kwargs = FunASRNano.from_pretrained(model=model_dir, device="cuda:0")
    m.eval()

    wav_path = f"{kwargs['model_path']}/example/zh.mp3"
    res = m.inference(data_in=[wav_path], **kwargs)
    text = res[0][0]["text"]
    print(text)


if __name__ == "__main__":
    main()

Parameter Description

- model_dir: Model name or local disk model path.
- trust_remote_code: Whether to trust remote code for loading custom model implementations.
- remote_code: Specify the location of specific model code (e.g., model.py in the current directory), supporting both absolute and relative paths.
- device: Specify the device to use, such as "cuda:0" or "cpu".
Please refer to docs/finetune.md
We evaluated Fun-ASR against other state-of-the-art models on open-source benchmarks, Chinese dialect datasets, and industry-specific test sets. The results demonstrate that Fun-ASR achieves superior performance across various scenarios.
| Test set | GLM-ASR-nano | GLM-ASR-nano* | Whisper-large-v3 | Seed-ASR | Seed-ASR* | Kimi-Audio | Step-Audio2 | FireRed-ASR | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Size | 1.5B | 1.5B | 1.6B | - | - | - | - | 1.1B | 0.8B | 7.7B |
| OpenSource | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| AIShell1 | 1.81 | 2.17 | 4.72 | 0.68 | 1.63 | 0.71 | 0.63 | 0.54 | 1.80 | 1.22 |
| AIShell2 | - | 3.47 | 4.68 | 2.27 | 2.76 | 2.86 | 2.10 | 2.58 | 2.75 | 2.39 |
| Fleurs-zh | - | 3.65 | 5.18 | 3.43 | 3.23 | 3.11 | 2.68 | 4.81 | 2.56 | 2.53 |
| Fleurs-en | 5.78 | 6.95 | 6.23 | 9.39 | 9.39 | 6.99 | 3.03 | 10.79 | 5.96 | 4.74 |
| Librispeech-clean | 2.00 | 2.17 | 1.86 | 1.58 | 2.8 | 1.32 | 1.17 | 1.84 | 1.76 | 1.51 |
| Librispeech-other | 4.19 | 4.43 | 3.43 | 2.84 | 5.69 | 2.63 | 2.42 | 4.52 | 4.33 | 3.03 |
| WenetSpeech Meeting | 6.73 | 8.21 | 18.39 | 5.69 | 7.07 | 6.24 | 4.75 | 4.95 | 6.60 | 6.17 |
| WenetSpeech Net | - | 6.33 | 11.89 | 4.66 | 4.84 | 6.45 | 4.67 | 4.94 | 6.01 | 5.46 |
Note: Seed-ASR* results are evaluated using the official API on volcengine; GLM-ASR-nano* results are evaluated using the open-source checkpoint.
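The benchmark tables report error rates, where lower is better. The source does not name the metric explicitly; assuming the common convention of character error rate (CER) for Chinese test sets, a minimal sketch of the computation is the edit distance between reference and hypothesis divided by the reference length. The function names below are illustrative and are not part of FunASR or the `compute-wer` tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; spaces are ignored (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One character of the 7-character reference is missing from the hypothesis.
    print(f"{cer('开放时间是九点', '开放时间九点'):.2%}")
```

Real evaluations normally apply text normalization to both sides before scoring, so numbers from this sketch are only comparable when the same normalization is used.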
| Test set | GLM-ASR-Nano | Whisper-large-v3 | Seed-ASR | FireRed-ASR | Kimi-Audio | Paraformer v2 | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|---|
| Model Size | 1.5B | 1.6B | - | 1.1B | 8B | 0.2B | 0.8B | 7.7B |
| OpenSource | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Nearfield | 16.95 | 16.58 | 7.20 | 10.10 | 9.02 | 8.11 | 7.79 | 6.31 |
| Farfield | 9.44 | 22.21 | 4.59 | 7.49 | 10.95 | 9.55 | 5.79 | 4.34 |
| Complex Background | 23.79 | 32.57 | 12.90 | 15.56 | 15.56 | 15.19 | 14.59 | 11.45 |
| English General | 16.47 | 18.56 | 15.65 | 21.62 | 18.12 | 19.48 | 15.28 | 13.73 |
| Opensource | 4.67 | 7.05 | 3.83 | 5.31 | 3.79 | 6.23 | 4.22 | 3.38 |
| Dialect | 54.21 | 66.14 | 29.45 | 52.82 | 71.94 | 41.16 | 28.18 | 15.21 |
| Accent | 19.78 | 36.03 | 10.23 | 14.05 | 27.20 | 17.80 | 12.90 | 10.31 |
| Lyrics | 46.56 | 54.82 | 30.26 | 42.87 | 65.18 | 50.14 | 30.85 | 21.00 |
| Hiphop | 43.32 | 46.56 | 29.46 | 33.88 | 57.25 | 43.79 | 30.87 | 28.58 |
| Average | 26.13 | 33.39 | 15.95 | 22.63 | 31.00 | 23.49 | 16.72 | 12.70 |
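As a quick check on how to read the Average row: for the two Fun-ASR columns it matches the unweighted mean of the nine test-set rows above it. The snippet below only re-does that arithmetic with the values copied from the table; it assumes nothing about how the individual scores were produced.

```python
# Per-test-set error rates copied from the table (Nearfield ... Hiphop).
SCORES = {
    "Fun-ASR-nano": [7.79, 5.79, 14.59, 15.28, 4.22, 28.18, 12.90, 30.85, 30.87],
    "Fun-ASR": [6.31, 4.34, 11.45, 13.73, 3.38, 15.21, 10.31, 21.00, 28.58],
}


def unweighted_mean(values):
    """Plain arithmetic mean: every test set counts equally, regardless of size."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for model, values in SCORES.items():
        print(f"{model}: {unweighted_mean(values):.2f}")
```

Because the mean is unweighted, hard subsets such as Dialect, Lyrics, and Hiphop influence the average as much as the larger general-purpose sets.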
Refer to the official website: https://gitee.com/WangJiaHui202144/funasr-nano/blob/main/docs/fintune_zh.md. Choose the training method that suits your needs based on your data volume.
| Dimension | Full Fine-tuning | LoRA |
|---|---|---|
| Parameters | All LLM params (GB+) | Low-rank matrices (MB+) |
| Trainable Ratio | 100% | 0.1%-1% |
| Overfitting Risk | Extremely High | Low |
| Training Cost | Extremely High (VRAM/Time) | Low (Save 70%+ VRAM) |
| LR Sensitivity | Highly Sensitive (Precise Tuning) | More Tolerant |
| Data Requirements | 1000h+ | 10h-1000h |
| Small Data Performance | Prone to Collapse/Degradation | Stable |
| Difficulty | Extremely Hard to Control | Relatively Easy |
| Performance Ceiling | Theoretically Highest | Slightly Lower (95%+ of Full) |
| Catastrophic Forgetting | Severe | Minimal |
| Inference Overhead | No Extra Cost | Optional Merge/Dynamic Loading |
| Multi-task Adaptation | Requires Retraining | Multiple LoRAs in Parallel |
| Convergence Speed | Slower | Faster |
| Checkpoint Size | Full Model (GB) | LoRA Weights Only (MB) |
| Impact on General Capability | May Severely Degrade | Largely Preserved |
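To make the "Trainable Ratio" row concrete: for one weight matrix of shape d_out × d_in, LoRA trains two low-rank factors with r × (d_in + d_out) parameters instead of d_out × d_in. The sketch below works through that arithmetic; the 4096 × 4096 shape and rank 8 are illustrative assumptions, not values taken from the Fun-ASR configuration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


def trainable_ratio(d_out, d_in, rank):
    """Share of trainable parameters relative to the full d_out x d_in matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    print(f"{trainable_ratio(4096, 4096, 8):.2%}")  # prints 0.39%
```

The ratio for a whole model also depends on which layers receive adapters, so the per-matrix figure is an upper-level intuition rather than an exact model-wide number.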
Train : Valid : Test = 8:1:1

- G1-G66590: train dataset
- G66591-G74915: valid dataset
- G74916-G83238: test dataset

A total of 83,238 entries, with an approximate total duration of 87 hours. Additionally, to enhance the model's generalization capability, the WenetSpeech dataset, which aligns with the business scenario, was also used. Given the dataset's small size, only the audio adapter layer was fine-tuned.
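The 8:1:1 split over sequential IDs can be sketched as follows. The rounding rule (floor for train and test, remainder to valid) is an assumption of this sketch; it happens to reproduce the ID ranges listed for the 83,238 entries, but the source does not say how the boundaries were chosen.

```python
def split_counts(total, train_parts=8, test_parts=1, parts=10):
    """Return (train, valid, test) sizes; valid receives the rounding remainder."""
    n_train = total * train_parts // parts
    n_test = total * test_parts // parts
    n_valid = total - n_train - n_test
    return n_train, n_valid, n_test


def id_ranges(total, prefix="G"):
    """Map each subset to its (first_id, last_id) pair, assuming IDs run 1..total."""
    n_train, n_valid, _ = split_counts(total)
    return {
        "train": (f"{prefix}1", f"{prefix}{n_train}"),
        "valid": (f"{prefix}{n_train + 1}", f"{prefix}{n_train + n_valid}"),
        "test": (f"{prefix}{n_train + n_valid + 1}", f"{prefix}{total}"),
    }


if __name__ == "__main__":
    for subset, (first, last) in id_ranges(83238).items():
        print(f"{subset}: {first}-{last}")
```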
- Warm-up Training
  - General Data : Specialized Data = 50:50 (87h:87h)
  - Training Epochs: 5-10 epochs
  - Objective: Activate the model's adaptability to diverse speech patterns
- Domain Adaptation
  - General Data : Specialized Data = 20:80 (20h:87h)
  - Training Epochs: 15-20 epochs
  - Objective: Strengthen specialized features while maintaining generalization
- Specialized Data Only
  - Specialized Data = 100% (87h)
  - Training Epochs: 5-10 epochs
  - Objective: Maximize domain accuracy
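The three mixing ratios above can be illustrated with a small sampling helper. This is a sketch under stated assumptions, not the logic of `tools/prepare_staged_data.py`: it mixes by sample count rather than by audio duration, keeps all specialized samples, and the function name `mix_samples` is hypothetical.

```python
import random


def mix_samples(general, domain, general_share, seed=0):
    """Keep every domain sample and add general samples to reach `general_share`.

    Solves n_general / (n_general + n_domain) = general_share for n_general.
    """
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be in [0, 1)")
    n_general = round(len(domain) * general_share / (1 - general_share))
    n_general = min(n_general, len(general))  # cannot draw more than is available
    rng = random.Random(seed)                 # fixed seed keeps the mix reproducible
    mixed = rng.sample(general, n_general) + list(domain)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    general = [("general", i) for i in range(100)]
    domain = [("domain", i) for i in range(80)]
    for stage, share in (("stage1", 0.5), ("stage2", 0.2), ("stage3", 0.0)):
        mixed = mix_samples(general, domain, share)
        n_general = sum(1 for source, _ in mixed if source == "general")
        print(stage, len(mixed), n_general)
```

Mixing by duration instead would require per-utterance lengths, which this sketch deliberately leaves out.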
To reduce data preparation complexity, support for mixed-sample data is provided.
tools/datasets_utils.py: This utility class supports most file conversions, including TXT to SCP, JSON to JSONL, Excel to JSONL, and more. It covers Whisper and FunASR input features. When using it, prepare WAV and TXT data according to the following structure, then use the utility to generate SCP files.
uv run tools/datasets_utils.py

Linux

# nano
uv run tools/scp2jsonl.py \
  ++scp_file=data/domain/train/wav.scp \
  ++transcript_file=data/domain/train/wav.txt \
  ++jsonl_file=data/domain/train/wav_nano.jsonl
# paraformer models
scp2jsonl \
  ++scp_file_list='["data/domain/train/wav.scp", "data/domain/train/wav.txt"]' \
  ++data_type_list='["source", "target"]' \
  ++jsonl_file_out="data/domain/train/wav_paraformer.jsonl"

Windows

# nano
uv run tools/scp2jsonl.py ++scp_file=data/domain/train/wav.scp ++transcript_file=data/domain/train/wav.txt ++jsonl_file=data/domain/train/wav_nano.jsonl
# paraformer models
scp2jsonl ++scp_file_list='["data/domain/train/wav.scp", "data/domain/train/wav.txt"]' ++data_type_list='["source", "target"]' ++jsonl_file_out="data/domain/train/wav_paraformer.jsonl"

For multilingual data, the following approach standardizes the workflow, depending on how the data is stored:
- If multilingual data is already mixed in storage: Perform the corresponding mixing and processing operations directly within the current data directory without additional steps.
- If different language data is stored in separate directories: First perform data mixing operations within each language directory. Then manually create a staged directory and consolidate the processed data from all languages into this directory to complete the multilingual data integration.
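For the separate-directories case, the manual consolidation step could be scripted roughly as below. This is only a sketch: it assumes each language directory (for example `data-en` and `data-zh`) already contains `staged/stage1..stage3` with `train.jsonl` and `val.jsonl`, and the helper name `consolidate_staged` is hypothetical rather than a tool shipped with the project.

```python
from pathlib import Path


def consolidate_staged(language_dirs, output_dir,
                       stages=("stage1", "stage2", "stage3"),
                       files=("train.jsonl", "val.jsonl")):
    """Concatenate per-language staged JSONL files into one staged directory.

    Returns a dict mapping "stage/file" to the number of lines written.
    """
    output_dir = Path(output_dir)
    counts = {}
    for stage in stages:
        for name in files:
            target = output_dir / stage / name
            target.parent.mkdir(parents=True, exist_ok=True)
            written = 0
            with target.open("w", encoding="utf-8") as out:
                for lang_dir in language_dirs:
                    source = Path(lang_dir) / "staged" / stage / name
                    if not source.exists():
                        continue  # a language may lack a given stage
                    with source.open(encoding="utf-8") as handle:
                        for line in handle:
                            line = line.strip()
                            if line:  # skip blank lines
                                out.write(line + "\n")
                                written += 1
            counts[f"{stage}/{name}"] = written
    return counts


if __name__ == "__main__":
    print(consolidate_staged(["data-en", "data-zh"], "data/staged"))
```

The files are concatenated in language order; if the training script does not shuffle on its own, shuffling the merged lines afterwards may be worthwhile.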
Operational Data Preparation for Nano
uv run tools/prepare_staged_data.py \
--general_train data/general/train/wav_nano.jsonl \
--general_val data/general/valid/wav_nano.jsonl \
--domain_train data/domain/train/wav_nano.jsonl \
--domain_val data/domain/valid/wav_nano.jsonl \
--output_dir data/staged

Operational Data Preparation for paraformer

uv run tools/prepare_staged_data.py \
--general_train data/general/train/wav_paraformer.jsonl \
--general_val data/general/valid/wav_paraformer.jsonl \
--domain_train data/domain/train/wav_paraformer.jsonl \
--domain_val data/domain/valid/wav_paraformer.jsonl \
--output_dir data/staged

# Output results:
# data/staged/
# ├── stage1/
# │ ├── train.jsonl (50/50 mix)
# │ └── val.jsonl
# ├── stage2/
# │ ├── train.jsonl (20/80 mix)
# │ └── val.jsonl
# └── stage3/
# ├── train.jsonl (specialized data only)
# └── val.jsonl
# data-en: English dataset
# ├── domain
# │ ├── test
# │ ├── train
# │ └── valid
# ├── general
# │ ├── test
# │ ├── train
# │ └── valid
# └── staged
# ├── stage1
# ├── stage2
# └── stage3
# data-zh: Chinese dataset
# ├── domain
# │ ├── test
# │ ├── train
# │ └── valid
# ├── general
# │ ├── test
# │ ├── train
# │ └── valid
# └── staged
# ├── stage1
# ├── stage2
# └── stage3

- nano training script reference: finetune_nano.sh
- paraformer training script reference: finetune_paraformer.sh
- qwen3-asr training script reference: finetune_qwen3asr.sh
# Pre-trained Model Path
model_name_or_model_dir="models/Fun-ASR-Nano-2512"
# Full-process encoder freeze
FREEZE_PARAMS="
++audio_encoder_conf.freeze=true \
++audio_adaptor_conf.freeze=false \
++llm_conf.freeze=true
"

For reference: https://github.com/modelscope/FunASR/blob/main/examples/industrial_data_pretraining/paraformer/README_zh.md#%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83%E4%B8%8E%E6%B5%8B%E8%AF%95
- model_name_or_model_dir: Model path
- audio_encoder_conf: Acoustic encoder, true (frozen)
- audio_adaptor_conf: Acoustic adapter layer, false (unfrozen)
- llm_conf: High-level semantic module, true (frozen)
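When experimenting with different freeze combinations, the `++<module>.freeze=<bool>` overrides can be generated instead of edited by hand. The helper below is a hypothetical convenience, not part of the training scripts; it only builds the same override string that `FREEZE_PARAMS` holds.

```python
def freeze_overrides(freeze_map):
    """Build '++name.freeze=true/false' overrides from a {module_conf: frozen} dict."""
    return " ".join(
        f"++{name}.freeze={'true' if frozen else 'false'}"
        for name, frozen in freeze_map.items()
    )


if __name__ == "__main__":
    # Adapter-only training: encoder and LLM frozen, adapter trainable.
    print(freeze_overrides({
        "audio_encoder_conf": True,
        "audio_adaptor_conf": False,
        "llm_conf": True,
    }))
```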
# Nano Model Training
nohup bash auto_finetune.sh > full_train_nano.log 2>&1 &
# Paraformer Non-autoregressive Model Training
nohup bash finetune_paraformer.sh > full_train_paraformer.log 2>&1 &
# qwen3-asr Model Training
nohup bash finetune_qwen3asr.sh > full_train_qwen3asr.log 2>&1 &

The project can be run directly. However, because the server handles many AI training tasks, Docker is used for environment isolation and for migration within the internal network. Training containers are designed for single use: during training, back up and persist the data volumes (model weights, logs, and intermediate outputs), and after training completes, create and launch a new container for model evaluation or inference. Never reuse the original training container. This follows the principles of immutable infrastructure and single responsibility: it separates the training and evaluation phases, simplifies management across the project lifecycle, and improves maintainability and reproducibility.
# build image
docker build -t funasr-finetune:Dockerfile .
# qwen3-asr needs flash-attn for acceleration
# CUDA toolchain: https://developer.nvidia.com/cuda-12-8-0-download-archive
docker build -f Dockerfile-flash-attn -t funasr-finetune:Dockerfile-flash-attn .
docker builder prune --filter "until=24h"

Do not use the same mounted volume for multiple model containers, as this may lead to data corruption.
mkdir nano-finetune
# Launch a temporary container to copy files to the local machine.
docker run -it --name nano-finetune funasr-finetune:Dockerfile /bin/bash
# In a new terminal, copy the workspace (including any files you want to debug yourself) to the local machine
docker cp nano-finetune:/workspace $PWD
# Exit the container and delete the temporary container
docker rm -f nano-finetune
mkdir $PWD/workspace/models $PWD/workspace/data $PWD/workspace/outputs
# Move the model into the local workspace
mv <model-path> $PWD/workspace/models
# Move the data into the local workspace
mv <data-path> $PWD/workspace/data
# start container
docker run -it --network host --shm-size=32g \
--gpus all --cpus=12 \
-v $PWD/workspace:/workspace \
--restart=on-failure \
--name nano-finetune funasr-finetune:Dockerfile /bin/bash
# start train
nohup bash auto_finetune.sh > full_train.log 2>&1 &

- shm-size: This parameter must be explicitly specified.
- cpus: It is recommended to allocate four times the number of GPUs.
Do not use the same mounted volume for multiple model containers, as this may lead to data corruption.
mkdir paraformer-finetune
# Launch a temporary container to copy files to the local machine.
docker run -it --name paraformer-finetune funasr-finetune:Dockerfile /bin/bash
# In a new terminal, copy the workspace (including any files you want to debug yourself) to the local machine
docker cp paraformer-finetune:/workspace $PWD
# Exit the container and delete the temporary container
docker rm -f paraformer-finetune
mkdir $PWD/workspace/models $PWD/workspace/data $PWD/workspace/outputs
# Move the model into the local workspace
mv <model-path> $PWD/workspace/models
# Move the data into the local workspace
mv <data-path> $PWD/workspace/data
docker run -it --shm-size=8g --gpus=all --cpus=8 \
-p 10097:10095 \
-v $PWD/workspace:/workspace \
-e LANG=C.UTF-8 \
-e LC_ALL=C.UTF-8 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
--name paraformer-funasr \
funasr-finetune:Dockerfile /bin/bash
# Start training
nohup bash finetune_paraformer.sh > full_train_paraformer.log 2>&1 &

As of now, for funasr-nano-2512, you need to add the following configuration to your model settings:
After training completes, if you are using a nano model, you must configure and run tools/lora_merge.py to perform the final model merge. You may notice that the LoRA-trained checkpoint is much larger than the few megabytes expected for adapter weights alone, because it retains both the audio encoder and base LLM weights to support resuming training. This allows you to pause and restart training at any point without needing the original base model files.
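Conceptually, merging a LoRA adapter folds the low-rank update back into the base weight: W' = W + (alpha / r) · B · A. The pure-Python sketch below shows that arithmetic on tiny matrices. It illustrates the standard LoRA formula only and makes no claim about what `tools/lora_merge.py` does internally; the function names and the `alpha`/`rank` arguments are assumptions of this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without modifying the inputs.

    weight: d_out x d_in, lora_b: d_out x rank, lora_a: rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    lora_b = [[1.0], [2.0]]  # d_out x rank
    lora_a = [[3.0, 4.0]]    # rank x d_in
    print(merge_lora(base, lora_a, lora_b, alpha=2, rank=1))
```

After merging, inference uses a single weight matrix, which is why the merged model has no extra inference cost compared with the base model.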
# For specific parameter modifications, refer to the merge script.
uv run tools/lora_merge.py

# decode
uv run decode.py ++model_dir=models/Fun-ASR-Nano-merged ++scp_file=data/domain/test/wav.scp ++output_file=output.txt
uv run decode.py ++model_dir=models/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch ++scp_file=data/domain/test/wav.scp ++output_file=output.txt
# itn inverse text normalization
uv run tools/whisper_mix_normalize.py data/val_text.txt data/val_norm.txt
uv run tools/whisper_mix_normalize.py output.txt output_norm.txt
# compute-wer
compute-wer data/val_norm.txt output_norm.txt cer.txt
tail -n8 cer.txt

uv run train_log_analyzer.py log.txt

- vLLM (GPU) Deployment Best Practices: An accelerated implementation of Fun-ASR using vLLM. Repository
- llama (GGUF) Inference Best Practices: Repository
@misc{an2025funasrtechnicalreport,
title={Fun-ASR Technical Report},
author={Keyu An and Yanni Chen and Zhigao Chen and Chong Deng and Zhihao Du and Changfeng Gao and Zhifu Gao and Bo Gong and Xiangang Li and Yabin Li and Ying Liu and Xiang Lv and Yunjie Ji and Yiheng Jiang and Bin Ma and Haoneng Luo and Chongjia Ni and Zexu Pan and Yiping Peng and Zhendong Peng and Peiyao Wang and Hao Wang and Haoxu Wang and Wen Wang and Wupeng Wang and Yuzhong Wu and Biao Tian and Zhentao Tan and Nan Yang and Bin Yuan and Jieping Ye and Jixing Yu and Qinglin Zhang and Kun Zou and Han Zhao and Shengkui Zhao and Jingren Zhou and Yanqiao Zhu},
year={2025},
eprint={2509.12508},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.12508},
}






