mxr-vector/Fun-ASR-finetune

 
 


Fun-ASR


「简体中文」|「English」

Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab. It is trained on tens of millions of hours of real speech data, possessing powerful contextual understanding capabilities and industry adaptability. It supports low-latency real-time transcription and covers 31 languages. It excels in vertical domains such as education and finance, accurately recognizing professional terminology and industry expressions, effectively addressing challenges like "hallucination" generation and language confusion, achieving "clear hearing, understanding meaning, and accurate writing."

Project Kickoff Briefing

uv sync --extra cu128
uv pip install transformers==4.57.6 peft funasr==1.3.1 deepspeed

# Training qwen3-asr requires the additional installation of the following plugins:
uv pip install datasets qwen_asr
# export MAX_JOBS=2
# uv pip install -U flash-attn==2.8.3 --no-build-isolation
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

| Model Name | Task Details | Training Data | Parameters |
|---|---|---|---|
| Fun-ASR-Nano (🤗) | Speech recognition supports Chinese, English, and Japanese. Chinese includes support for 7 dialects (Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin) and 26 regional accents (Henan, Shanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi and more than 20 other regions). English and Japanese cover multiple regional accents. Additional features include lyric recognition and rap speech recognition. | Tens of millions of hours | 800M |
| Fun-ASR-MLT-Nano (🤗) | Speech recognition supports Chinese, English, Cantonese, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish, 31 languages in total. | Hundreds of thousands of hours | 800M |

What's New 🔥

  • 2025/12: Fun-ASR-Nano-2512 is an end-to-end speech recognition large model trained on tens of millions of hours of real speech data. It supports low-latency real-time transcription and covers 31 languages.
  • 2024/7: FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR.

Core Features 🎯

Fun-ASR focuses on high-precision speech recognition, multi-language support, and industry customization capabilities.

  • Far-field High-noise Recognition: Deeply optimized for far-distance sound pickup and high-noise scenarios (such as conference rooms, in-vehicle environments, industrial sites, etc.), improving recognition accuracy to 93%.
  • Chinese Dialects and Regional Accents:
    • Supports 7 major dialects: Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin
    • Covers 26 regional accents: including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi and more than 20 other regions
  • Multi-language Free Speech: Supports recognition of 31 languages, with focused optimization on East and Southeast Asian languages, supporting free language switching and mixed recognition.
  • Music Background Lyric Recognition: Enhanced speech recognition performance under music background interference, supporting accurate recognition of lyric content in songs.

Environment Setup 🐍

git clone https://github.com/FunAudioLLM/Fun-ASR.git
cd Fun-ASR
uv sync

TODO

  • Support returning timestamps
  • Support speaker diarization
  • Support model training

Usage 🛠️

Inference

Using funasr for inference

from funasr import AutoModel


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
    model = AutoModel(
        model=model_dir,
        trust_remote_code=True,
        remote_code="./model.py",
        device="cuda:0",
        # hub:download models from ms (for ModelScope) or hf (for Hugging Face).
        hub="hf"
    )

    wav_path = f"{model.model_path}/example/zh.mp3"
    res = model.generate(
        input=[wav_path],
        cache={},
        batch_size=1,
        hotwords=["开放时间"],
        # 中文、英文、日文 for Fun-ASR-Nano-2512
        # 中文、英文、粤语、日文、韩文、越南语、印尼语、泰语、马来语、菲律宾语、阿拉伯语、
        # 印地语、保加利亚语、克罗地亚语、捷克语、丹麦语、荷兰语、爱沙尼亚语、芬兰语、希腊语、
        # 匈牙利语、爱尔兰语、拉脱维亚语、立陶宛语、马耳他语、波兰语、葡萄牙语、罗马尼亚语、
        # 斯洛伐克语、斯洛文尼亚语、瑞典语 for Fun-ASR-MLT-Nano-2512
        language="中文",
        itn=True, # or False
    )
    text = res[0]["text"]
    print(text)

    model = AutoModel(
        model=model_dir,
        trust_remote_code=True,
        vad_model="fsmn-vad",
        vad_kwargs={"max_single_segment_time": 30000},
        remote_code="./model.py",
        device="cuda:0",
    )
    res = model.generate(input=[wav_path], cache={}, batch_size=1)
    text = res[0]["text"]
    print(text)


if __name__ == "__main__":
    main()

Direct Inference

from model import FunASRNano


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"
    m, kwargs = FunASRNano.from_pretrained(model=model_dir, device="cuda:0")
    m.eval()

    wav_path = f"{kwargs['model_path']}/example/zh.mp3"
    res = m.inference(data_in=[wav_path], **kwargs)
    text = res[0][0]["text"]
    print(text)


if __name__ == "__main__":
    main()

Parameter Description
  • model_dir: Model name or local disk model path.
  • trust_remote_code: Whether to trust remote code for loading custom model implementations.
  • remote_code: Specify the location of specific model code (e.g., model.py in the current directory), supporting both absolute and relative paths.
  • device: Specify the device to use, such as "cuda:0" or "cpu".
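
The parameters above can be wrapped in a small helper when several files need to be transcribed. The following is a minimal sketch, not part of the repository: `transcribe_files` and `EchoModel` are hypothetical names, and the only assumption about the model object is that it exposes `generate(...)` and returns a list of dicts with a `"text"` key, as in the funasr example above. The stub stands in for a real `AutoModel` so the sketch runs without a GPU.

```python
from typing import Iterable, List, Protocol


class SpeechModel(Protocol):
    """Anything exposing generate(...) like the AutoModel instance above."""

    def generate(self, **kwargs) -> list: ...


def transcribe_files(model: SpeechModel, wav_paths: Iterable[str],
                     language: str = "中文", itn: bool = True) -> List[str]:
    """Transcribe files one at a time and collect the recognized text."""
    texts = []
    for path in wav_paths:
        # Same keyword arguments as the funasr example above.
        res = model.generate(input=[path], cache={}, batch_size=1,
                             language=language, itn=itn)
        texts.append(res[0]["text"])
    return texts


class EchoModel:
    """Hypothetical stand-in for a loaded model, used only for this demo."""

    def generate(self, **kwargs):
        return [{"text": f"transcript of {kwargs['input'][0]}"}]


results = transcribe_files(EchoModel(), ["a.wav", "b.wav"])
print(results)
```

With a real model, pass the `AutoModel` instance created in the inference example in place of `EchoModel()`.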

Finetune

Please refer to docs/finetune.md

Performance 📝

We evaluated Fun-ASR against other state-of-the-art models on open-source benchmarks, Chinese dialect datasets, and industry-specific test sets. The results demonstrate that Fun-ASR achieves superior performance across various scenarios.

1. Open-Source Dataset Performance (WER %)

| Test set | GLM-ASR-nano | GLM-ASR-nano* | Whisper-large-v3 | Seed-ASR | Seed-ASR* | Kimi-Audio | Step-Audio2 | FireRed-ASR | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Size | 1.5B | 1.5B | 1.6B | - | - | - | - | 1.1B | 0.8B | 7.7B |
| AIShell1 | 1.81 | 2.17 | 4.72 | 0.68 | 1.63 | 0.71 | 0.63 | 0.54 | 1.80 | 1.22 |
| AIShell2 | - | 3.47 | 4.68 | 2.27 | 2.76 | 2.86 | 2.10 | 2.58 | 2.75 | 2.39 |
| Fleurs-zh | - | 3.65 | 5.18 | 3.43 | 3.23 | 3.11 | 2.68 | 4.81 | 2.56 | 2.53 |
| Fleurs-en | 5.78 | 6.95 | 6.23 | 9.39 | 9.39 | 6.99 | 3.03 | 10.79 | 5.96 | 4.74 |
| Librispeech-clean | 2.00 | 2.17 | 1.86 | 1.58 | 2.8 | 1.32 | 1.17 | 1.84 | 1.76 | 1.51 |
| Librispeech-other | 4.19 | 4.43 | 3.43 | 2.84 | 5.69 | 2.63 | 2.42 | 4.52 | 4.33 | 3.03 |
| WenetSpeech Meeting | 6.73 | 8.21 | 18.39 | 5.69 | 7.07 | 6.24 | 4.75 | 4.95 | 6.60 | 6.17 |
| WenetSpeech Net | - | 6.33 | 11.89 | 4.66 | 4.84 | 6.45 | 4.67 | 4.94 | 6.01 | 5.46 |

Note: Seed-ASR* results are evaluated using the official API on volcengine; GLM-ASR-nano* results are evaluated using the open-source checkpoint.

2. Industry Dataset Performance (WER %)

| Test set | GLM-ASR-Nano | Whisper-large-v3 | Seed-ASR | FireRed-ASR | Kimi-Audio | Paraformer v2 | Fun-ASR-nano | Fun-ASR |
|---|---|---|---|---|---|---|---|---|
| Model Size | 1.5B | 1.6B | - | 1.1B | 8B | 0.2B | 0.8B | 7.7B |
| Nearfield | 16.95 | 16.58 | 7.20 | 10.10 | 9.02 | 8.11 | 7.79 | 6.31 |
| Farfield | 9.44 | 22.21 | 4.59 | 7.49 | 10.95 | 9.55 | 5.79 | 4.34 |
| Complex Background | 23.79 | 32.57 | 12.90 | 15.56 | 15.56 | 15.19 | 14.59 | 11.45 |
| English General | 16.47 | 18.56 | 15.65 | 21.62 | 18.12 | 19.48 | 15.28 | 13.73 |
| Opensource | 4.67 | 7.05 | 3.83 | 5.31 | 3.79 | 6.23 | 4.22 | 3.38 |
| Dialect | 54.21 | 66.14 | 29.45 | 52.82 | 71.94 | 41.16 | 28.18 | 15.21 |
| Accent | 19.78 | 36.03 | 10.23 | 14.05 | 27.20 | 17.80 | 12.90 | 10.31 |
| Lyrics | 46.56 | 54.82 | 30.26 | 42.87 | 65.18 | 50.14 | 30.85 | 21.00 |
| Hiphop | 43.32 | 46.56 | 29.46 | 33.88 | 57.25 | 43.79 | 30.87 | 28.58 |
| Average | 26.13 | 33.39 | 15.95 | 22.63 | 31.00 | 23.49 | 16.72 | 12.70 |
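
The Average row appears to be the unweighted mean of the nine test-set rows rather than a duration-weighted figure. The short check below reproduces it for two columns using only the numbers from the table; the variable and function names are illustrative.

```python
# WER (%) per industry test set, copied from the table above in row order:
# Nearfield, Farfield, Complex Background, English General, Opensource,
# Dialect, Accent, Lyrics, Hiphop.
WER = {
    "Fun-ASR-nano": [7.79, 5.79, 14.59, 15.28, 4.22, 28.18, 12.90, 30.85, 30.87],
    "Fun-ASR": [6.31, 4.34, 11.45, 13.73, 3.38, 15.21, 10.31, 21.00, 28.58],
}


def mean_wer(values):
    """Unweighted mean over test sets, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


averages = {name: mean_wer(values) for name, values in WER.items()}
print(averages)
```

Because every test set counts equally, a large improvement on a hard set such as Dialect moves the average as much as the same absolute improvement on an easy set.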

Phased Mixed Training

Refer to the official documentation: https://gitee.com/WangJiaHui202144/funasr-nano/blob/main/docs/fintune_zh.md. Choose the training method that suits your data volume.

| Dimension | Full Fine-tuning | LoRA |
|---|---|---|
| Parameters | All LLM params (GB+) | Low-rank matrices (MB+) |
| Trainable Ratio | 100% | 0.1%-1% |
| Overfitting Risk | Extremely High | Low |
| Training Cost | Extremely High (VRAM/Time) | Low (Save 70%+ VRAM) |
| LR Sensitivity | Highly Sensitive (Precise Tuning) | More Tolerant |
| Data Requirements | 1000h+ | 10h-1000h |
| Small Data Performance | Prone to Collapse/Degradation | Stable |
| Difficulty | Extremely Hard to Control | Relatively Easy |
| Performance Ceiling | Theoretically Highest | Slightly Lower (95%+ of Full) |
| Catastrophic Forgetting | Severe | Minimal |
| Inference Overhead | No Extra Cost | Optional Merge/Dynamic Loading |
| Multi-task Adaptation | Requires Retraining | Multiple LoRAs in Parallel |
| Convergence Speed | Slower | Faster |
| Checkpoint Size | Full Model (GB) | LoRA Weights Only (MB) |
| Impact on General Capability | May Severely Degrade | Largely Preserved |

The specialized dataset is split Train : Valid : Test = 8:1:1:

  • G1-G66590: train dataset
  • G66591-G74915: valid dataset
  • G74916-G83238: test dataset

In total there are 83,238 entries with an approximate duration of 87 hours. To improve the model's generalization, the WenetSpeech dataset, which matches the business scenario, was also used. Because the dataset is small, only the audio adapter layer was fine-tuned.
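
The ID ranges above can be turned into a lookup that assigns any utterance to its split and confirms the 8:1:1 proportions. This is a sketch under one assumption: utterance IDs have the form `G<number>`, which is inferred from the ranges and not confirmed by the repository. `split_of` is a hypothetical helper.

```python
# Inclusive ID ranges taken from the split description above.
SPLIT_RANGES = {
    "train": (1, 66590),
    "valid": (66591, 74915),
    "test": (74916, 83238),
}


def split_of(utt_id: str) -> str:
    """Map an ID such as 'G66591' to its split (assumes a 'G<number>' format)."""
    number = int(utt_id.lstrip("G"))
    for name, (first, last) in SPLIT_RANGES.items():
        if first <= number <= last:
            return name
    raise ValueError(f"{utt_id} is outside G1-G83238")


sizes = {name: last - first + 1 for name, (first, last) in SPLIT_RANGES.items()}
total = sum(sizes.values())
shares = {name: round(count / total, 3) for name, count in sizes.items()}
print(sizes, total, shares)
```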

  1. Warm-up Training
     • General Data : Specialized Data = 50:50 (87h : 87h)
     • Training epochs: 5-10
     • Objective: Activate the model's adaptability to diverse speech patterns

  2. Domain Adaptation
     • General Data : Specialized Data = 20:80 (20h : 87h)
     • Training epochs: 15-20
     • Objective: Strengthen specialized features while maintaining generalization

  3. Specialized Data Only
     • Specialized Data: 100% (87h)
     • Training epochs: 5-10
     • Objective: Maximize domain accuracy
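
The three stages differ only in how much general data is mixed into the specialized set. The sketch below illustrates one way to build such mixes; it is not the implementation of `tools/prepare_staged_data.py`. It assumes that all specialized samples are kept, that general samples are drawn at random, and that the ratio is measured in sample counts rather than hours. `mix_stage` and the toy sample lists are hypothetical.

```python
import random


def mix_stage(general, domain, general_share, seed=0):
    """Keep all domain samples and add enough general samples to reach the share.

    general_share is the fraction of the mixed set drawn from general data,
    e.g. 0.5 for 50:50, 0.2 for 20:80, 0.0 for specialized data only.
    """
    if not 0.0 <= general_share < 1.0:
        raise ValueError("general_share must be in [0, 1)")
    wanted = round(len(domain) * general_share / (1.0 - general_share))
    wanted = min(wanted, len(general))  # cannot draw more than is available
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), wanted)
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for JSONL records.
general = [f"general_{i}" for i in range(100)]
domain = [f"domain_{i}" for i in range(80)]
stage_shares = {"stage1": 0.5, "stage2": 0.2, "stage3": 0.0}

stage_sizes = {}
for name, share in stage_shares.items():
    mixed = mix_stage(general, domain, share)
    n_general = sum(item.startswith("general") for item in mixed)
    stage_sizes[name] = (len(mixed), n_general)
print(stage_sizes)
```

Fixing the seed keeps the mix reproducible across runs, which makes it easier to compare stages trained on different days.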

To reduce data preparation complexity, support for mixed-sample data is provided.

1. Generate an SCP file that meets the requirements.

tools/datasets_utils.py: This utility class supports most file conversions, including TXT to SCP, JSON to JSONL, and Excel to JSONL, and covers the input formats of both Whisper and FunASR. It is recommended to prepare the WAV and TXT data according to the following structure and then use this utility class to generate the SCP files.

[Image: recommended WAV and TXT directory structure]

uv run tools/datasets_utils.py
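
For readers who want to see what this step produces, here is a minimal sketch of generating `wav.scp` and `wav.txt`. It does not reproduce `tools/datasets_utils.py`. It assumes a Kaldi-style layout in which each line is `<utt_id> <value>`, that every `x.wav` has a sibling `x.txt` transcript, and that the file stem is the utterance ID; `write_scp_files` is a hypothetical name.

```python
from pathlib import Path
import tempfile


def write_scp_files(data_dir: Path) -> int:
    """Write wav.scp and wav.txt for every x.wav that has a sibling x.txt."""
    scp_lines, txt_lines = [], []
    for wav in sorted(data_dir.rglob("*.wav")):
        transcript = wav.with_suffix(".txt")
        if not transcript.exists():
            continue  # skip audio without a transcript
        utt_id = wav.stem
        # Collapse whitespace so each transcript stays on a single line.
        text = " ".join(transcript.read_text(encoding="utf-8").split())
        scp_lines.append(f"{utt_id} {wav.resolve()}")
        txt_lines.append(f"{utt_id} {text}")
    (data_dir / "wav.scp").write_text("\n".join(scp_lines) + "\n", encoding="utf-8")
    (data_dir / "wav.txt").write_text("\n".join(txt_lines) + "\n", encoding="utf-8")
    return len(scp_lines)


# Demo on a throwaway directory: G2 has no transcript and is skipped.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "G1.wav").write_bytes(b"")
    (root / "G1.txt").write_text("hello world\n", encoding="utf-8")
    (root / "G2.wav").write_bytes(b"")
    count = write_scp_files(root)
    scp_text = (root / "wav.scp").read_text(encoding="utf-8")
    txt_text = (root / "wav.txt").read_text(encoding="utf-8")
print(count)
print(txt_text, end="")
```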

2. Generate JSONL files for input features in Nano format

Linux

# nano
uv run tools/scp2jsonl.py \
  ++scp_file=data/domain/train/wav.scp \
  ++transcript_file=data/domain/train/wav.txt \
  ++jsonl_file=data/domain/train/wav_nano.jsonl

# paraformer models
scp2jsonl \
  ++scp_file_list='["data/domain/train/wav.scp", "data/domain/train/wav.txt"]' \
  ++data_type_list='["source", "target"]' \
  ++jsonl_file_out="data/domain/train/wav_paraformer.jsonl"

Windows

# nano
uv run tools/scp2jsonl.py ++scp_file=data/domain/train/wav.scp ++transcript_file=data/domain/train/wav.txt ++jsonl_file=data/domain/train/wav_nano.jsonl

# paraformer models
scp2jsonl ++scp_file_list='["data/domain/train/wav.scp", "data/domain/train/wav.txt"]' ++data_type_list='["source", "target"]' ++jsonl_file_out="data/domain/train/wav_paraformer.jsonl"
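
Conceptually, both commands join the audio list and the transcript list on the utterance ID and emit one JSON object per line. The sketch below shows only that join. The field names `key`, `source`, and `target` mirror the `data_type_list` argument above, but the real tools may write additional fields (for example lengths or chat-style messages), so treat the schema as an assumption. `read_table` and `join_to_jsonl` are hypothetical helpers.

```python
import json


def read_table(lines):
    """Parse 'utt_id value' lines into a dict keyed by utterance ID."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, _, value = line.partition(" ")
        table[utt_id] = value.strip()
    return table


def join_to_jsonl(scp_lines, txt_lines):
    """Return one JSON string per utterance present in both inputs."""
    sources = read_table(scp_lines)
    targets = read_table(txt_lines)
    records = []
    for utt_id, wav_path in sources.items():
        if utt_id in targets:
            record = {"key": utt_id, "source": wav_path, "target": targets[utt_id]}
            # ensure_ascii=False keeps Chinese transcripts readable in the file.
            records.append(json.dumps(record, ensure_ascii=False))
    return records


scp = ["G1 /data/G1.wav", "G2 /data/G2.wav"]
txt = ["G1 开放时间", "G3 transcript without audio"]
rows = join_to_jsonl(scp, txt)
print("\n".join(rows))
```

Utterances that appear in only one of the two files are dropped silently here; a production script would normally log them.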

3. Use prepare_staged_data.py to blend datasets

If the data is stored separately per language, the following approach standardizes the workflow:

  • If multilingual data is already mixed in storage: Perform the corresponding mixing and processing operations directly within the current data directory without additional steps.
  • If different language data is stored in separate directories: First perform data mixing operations within each language directory. Then manually create a staged directory and consolidate the processed data from all languages into this directory to complete the multilingual data integration.
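
The manual consolidation step for separately stored languages can be scripted. The sketch below concatenates the per-language staged files into one `staged` directory. It assumes the `data-en` / `data-zh` layout and the `stage1`-`stage3` file names shown in this README; `consolidate` is a hypothetical helper, and it neither shuffles nor deduplicates the merged data.

```python
from pathlib import Path
import tempfile

STAGES = ("stage1", "stage2", "stage3")
FILES = ("train.jsonl", "val.jsonl")


def consolidate(language_dirs, output_dir: Path) -> None:
    """Concatenate <lang>/staged/<stage>/<file> from every language into output_dir."""
    for stage in STAGES:
        stage_out = output_dir / stage
        stage_out.mkdir(parents=True, exist_ok=True)
        for name in FILES:
            with (stage_out / name).open("w", encoding="utf-8") as out:
                for lang_dir in language_dirs:
                    part = Path(lang_dir) / "staged" / stage / name
                    if not part.exists():
                        continue  # this language has no data for the stage
                    for line in part.read_text(encoding="utf-8").splitlines():
                        if line.strip():
                            out.write(line + "\n")


# Demo with two tiny language directories.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for lang in ("data-en", "data-zh"):
        stage_dir = root / lang / "staged" / "stage1"
        stage_dir.mkdir(parents=True)
        (stage_dir / "train.jsonl").write_text('{"key": "%s-1"}\n' % lang, encoding="utf-8")
    consolidate([root / "data-en", root / "data-zh"], root / "staged")
    merged = (root / "staged" / "stage1" / "train.jsonl").read_text(encoding="utf-8").splitlines()
print(merged)
```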

Operational Data Preparation for Nano

uv run tools/prepare_staged_data.py \
  --general_train data/general/train/wav_nano.jsonl \
  --general_val data/general/valid/wav_nano.jsonl \
  --domain_train data/domain/train/wav_nano.jsonl \
  --domain_val data/domain/valid/wav_nano.jsonl \
  --output_dir data/staged

Operational Data Preparation for paraformer

uv run tools/prepare_staged_data.py \
  --general_train data/general/train/wav_paraformer.jsonl \
  --general_val data/general/valid/wav_paraformer.jsonl \
  --domain_train data/domain/train/wav_paraformer.jsonl \
  --domain_val data/domain/valid/wav_paraformer.jsonl \
  --output_dir data/staged
# Output results:
# data/staged/
# ├── stage1/
# │   ├── train.jsonl (50/50 mix)
# │   └── val.jsonl
# ├── stage2/
# │   ├── train.jsonl (20/80 mix)
# │   └── val.jsonl
# └── stage3/
#     ├── train.jsonl (specialized data only)
#     └── val.jsonl
# data-en (English dataset)
# ├── domain
# │   ├── test
# │   ├── train
# │   └── valid
# ├── general
# │   ├── test
# │   ├── train
# │   └── valid
# └── staged
#     ├── stage1
#     ├── stage2
#     └── stage3
# data-zh (Chinese dataset)
# ├── domain
# │   ├── test
# │   ├── train
# │   └── valid
# ├── general
# │   ├── test
# │   ├── train
# │   └── valid
# └── staged
#     ├── stage1
#     ├── stage2
#     └── stage3


4. One-Click Fine-Tuning Training

Training script references:

  • nano: finetune_nano.sh
  • paraformer: finetune_paraformer.sh
  • qwen3-asr: finetune_qwen3asr.sh

# Pre-trained Model Path
model_name_or_model_dir="models/Fun-ASR-Nano-2512"

# Full-process encoder freeze
FREEZE_PARAMS="
++audio_encoder_conf.freeze=true \
++audio_adaptor_conf.freeze=false \
++llm_conf.freeze=true
"

For reference, see https://github.com/modelscope/FunASR/blob/main/examples/industrial_data_pretraining/paraformer/README_zh.md#%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83%E4%B8%8E%E6%B5%8B%E8%AF%95

  • model_name_or_model_dir Model path
  • audio_encoder_conf Acoustic encoder, true (frozen)
  • audio_adaptor_conf Acoustic adapter layer, false (unfrozen)
  • llm_conf High-level semantic module, true (frozen)
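
To make the freeze settings easier to vary between experiments, the override tokens can be generated from a small mapping. This is an illustrative sketch only: `freeze_overrides` is a hypothetical helper, and it simply renders the same `++<module>.freeze=<bool>` tokens that appear in `FREEZE_PARAMS` above.

```python
# True means the module is frozen, False means it is trained.
FREEZE = {
    "audio_encoder_conf": True,   # acoustic encoder stays frozen
    "audio_adaptor_conf": False,  # adapter layer is trained
    "llm_conf": True,             # LLM stays frozen
}


def freeze_overrides(freeze):
    """Render {'module_conf': bool} as '++module_conf.freeze=true|false' tokens."""
    return [f"++{module}.freeze={str(flag).lower()}" for module, flag in freeze.items()]


overrides = freeze_overrides(FREEZE)
trainable = [module for module, flag in FREEZE.items() if not flag]
print(" ".join(overrides))
print("trainable:", trainable)
```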
# Nano Model Training
nohup bash auto_finetune.sh > full_train_nano.log 2>&1 &
# Paraformer (non-autoregressive) Model Training
nohup bash finetune_paraformer.sh > full_train_paraformer.log 2>&1 &
# qwen3-asr Model Training
nohup bash finetune_qwen3asr.sh > full_train_qwen3asr.log 2>&1 &

Docker Training

This project can be run directly on the host. However, because the server handles many AI training tasks, Docker is used to isolate environments and to make migration to internal networks easier. Training containers are designed for single use, so during training you must back up and persist the data volumes (model weights, logs, and intermediate outputs). After training completes, create and launch a new container for model evaluation or inference; never reuse the original training container. This follows the principles of immutable infrastructure and single responsibility: it separates the training and evaluation phases, makes each phase easier to inspect and manage over the project lifecycle, and improves maintainability and reproducibility.

# build image
docker build -t funasr-finetune:Dockerfile .

# qwen3-asr needs flash-attn for acceleration
# CUDA toolkit: https://developer.nvidia.com/cuda-12-8-0-download-archive
docker build -f Dockerfile-flash-attn -t funasr-finetune:Dockerfile-flash-attn .

docker builder prune --filter "until=24h"

nano container training

Do not use the same mounted volume for multiple model containers, as this may lead to data corruption.

mkdir nano-finetune

# Launch a temporary container to copy files to the local machine.
docker run -it --name nano-finetune funasr-finetune:Dockerfile /bin/bash

# In a new terminal, copy the workspace (including any files you want to debug yourself) out of the container
docker cp nano-finetune:/workspace $PWD

# Exit the container and delete the temporary container
docker rm -f nano-finetune

mkdir $PWD/workspace/models $PWD/workspace/data  $PWD/workspace/outputs
# Move the model into the local workspace
mv <model-path> $PWD/workspace/models

# Move the data into the local workspace
mv <data-path> $PWD/workspace/data

# start container
docker run -it --network host --shm-size=32g \
--gpus all --cpus=12 \
-v $PWD/workspace:/workspace \
--restart=on-failure \
--name nano-finetune funasr-finetune:Dockerfile /bin/bash

# start train
nohup bash auto_finetune.sh > full_train.log 2>&1 &

The shm-size parameter must be specified explicitly. For cpus, four times the number of GPUs is recommended.
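
The sizing rule above can be captured in a small helper that assembles the `docker run` arguments for a given GPU count. This is a sketch, not a script from the repository: `docker_run_args` is a hypothetical function, and it only reuses the flags, image tag, and container name already shown in the command above.

```python
import os
import shlex


def docker_run_args(num_gpus, shm_size="32g",
                    image="funasr-finetune:Dockerfile", name="nano-finetune"):
    """Build the docker run argument list with cpus = 4 x number of GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cpus = 4 * num_gpus
    workspace = os.path.join(os.getcwd(), "workspace")
    return [
        "docker", "run", "-it", "--network", "host",
        f"--shm-size={shm_size}",          # always passed explicitly
        "--gpus", "all", f"--cpus={cpus}",
        "-v", f"{workspace}:/workspace",
        "--restart=on-failure",
        "--name", name, image, "/bin/bash",
    ]


args = docker_run_args(num_gpus=3)
print(shlex.join(args))
```

With three GPUs the helper yields `--cpus=12`, which matches the value used in the command above.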

Paraformer Container Training

Do not use the same mounted volume for multiple model containers, as this may lead to data corruption.

mkdir paraformer-finetune

# Launch a temporary container to copy files to the local machine.
docker run -it --name paraformer-finetune funasr-finetune:Dockerfile /bin/bash

# In a new terminal, copy the workspace (including any files you want to debug yourself) out of the container
docker cp paraformer-finetune:/workspace $PWD

# Exit the container and delete the temporary container
docker rm -f paraformer-finetune

mkdir $PWD/workspace/models $PWD/workspace/data  $PWD/workspace/outputs
# Move the model into the local workspace
mv <model-path> $PWD/workspace/models

# Move the data into the local workspace
mv <data-path> $PWD/workspace/data

docker run -it --shm-size=8g --gpus=all --cpus=8 \
  -p 10097:10095 \
  -v $PWD/workspace:/workspace \
  -e LANG=C.UTF-8 \
  -e LC_ALL=C.UTF-8 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  --name paraformer-funasr \
  funasr-finetune:Dockerfile /bin/bash

# Start training
nohup bash finetune_paraformer.sh > full_train_paraformer.log 2>&1 &

Multi-card training

As of now, for funasr-nano-2512, you need to add the following configuration to your model settings:

[Image: multi-card training configuration]

Merge Nano Model

After training completes, if you are using a nano model, you must configure and run tools/lora_merge.py to perform the final model merge. You may notice that the checkpoint produced by LoRA training is much larger than the few megabytes expected for LoRA weights alone. This is because it also retains the audio encoder and base LLM weights to support resuming training, so you can pause and restart training at any point without needing the original base model files.

# For specific parameter modifications, refer to the merge script.
uv run tools/lora_merge.py
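
As background for the merge step, standard LoRA folds the low-rank update into the base weight as W' = W + (alpha / r) * B A. The sketch below demonstrates that arithmetic on tiny pure-Python matrices. It is not the code in `tools/lora_merge.py`, whose parameters should be read from the script itself; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    w:      base weight, shape (out, in)
    lora_a: down-projection A, shape (r, in)
    lora_b: up-projection B, shape (out, r)
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Rank-1 example on a 2x2 weight.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (1, 2)
lora_b = [[1.0], [3.0]]    # shape (2, 1)
merged = merge_lora(w, lora_a, lora_b, alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model adds no extra inference cost.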

Decoding Test

# decode
uv run decode.py  ++model_dir=models/Fun-ASR-Nano-merged   ++scp_file=data/domain/test/wav.scp   ++output_file=output.txt
uv run decode.py  ++model_dir=models/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch   ++scp_file=data/domain/test/wav.scp   ++output_file=output.txt
# itn inverse text normalization
uv run tools/whisper_mix_normalize.py data/val_text.txt data/val_norm.txt
uv run tools/whisper_mix_normalize.py output.txt output_norm.txt
# compute-wer
compute-wer data/val_norm.txt output_norm.txt cer.txt
tail -n8 cer.txt
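
The error rate reported in this step is an edit-distance ratio: substitutions, deletions, and insertions divided by the reference length. The sketch below illustrates the metric at character level (CER) and word level (WER). It is not the `compute-wer` tool, and it skips the normalization performed by `tools/whisper_mix_normalize.py`; `edit_distance` and `error_rate` are hypothetical helpers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, unit="char"):
    """Return the error rate in percent; unit is 'char' (CER) or 'word' (WER)."""
    if unit == "char":
        ref_tokens = list(ref.replace(" ", ""))
        hyp_tokens = list(hyp.replace(" ", ""))
    else:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        raise ValueError("reference is empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


cer = error_rate("开放时间是九点", "开放时间九点")                  # one deleted character
wer = error_rate("the cat sat down", "the cat sat up", unit="word")  # one substituted word
print(round(cer, 2), round(wer, 2))
```

Character-level scoring is the usual choice for Chinese, where word boundaries are ambiguous, while word-level scoring is typical for English.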

Log Analysis

uv run train_log_analyzer.py log.txt

[Images: Training Log Analyzer output, two screenshots]

Remarkable Third-Party Work

  • vLLM (GPU) Deployment Best Practices: An accelerated implementation of Fun-ASR using vLLM. Repository
  • llama (GGUF) Best Inference Practices: Repository

Citations

@misc{an2025funasrtechnicalreport,
      title={Fun-ASR Technical Report},
      author={Keyu An and Yanni Chen and Zhigao Chen and Chong Deng and Zhihao Du and Changfeng Gao and Zhifu Gao and Bo Gong and Xiangang Li and Yabin Li and Ying Liu and Xiang Lv and Yunjie Ji and Yiheng Jiang and Bin Ma and Haoneng Luo and Chongjia Ni and Zexu Pan and Yiping Peng and Zhendong Peng and Peiyao Wang and Hao Wang and Haoxu Wang and Wen Wang and Wupeng Wang and Yuzhong Wu and Biao Tian and Zhentao Tan and Nan Yang and Bin Yuan and Jieping Ye and Jixing Yu and Qinglin Zhang and Kun Zou and Han Zhao and Shengkui Zhao and Jingren Zhou and Yanqiao Zhu},
      year={2025},
      eprint={2509.12508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.12508},
}
