FunASR: A Fundamental End-to-End Speech Recognition Toolkit
FunASR aims to bridge the gap between academic research and industrial applications in speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it enables researchers and developers to research, build, and deploy speech recognition models more conveniently, thereby promoting the growth of the speech recognition ecosystem. ASR for Fun!

Highlights | News | Installation | Quick Start | Runtime | Model Zoo | Contact
✨ Features
- FunASR is a fundamental speech recognition toolkit offering a wide range of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization, and multi-talker ASR. It provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
- We have released a large number of academic and industrial pre-trained models on ModelScope and Hugging Face, which can be accessed through our Model Zoo. The representative Paraformer-large, a non-autoregressive end-to-end speech recognition model, offers high accuracy, high efficiency, and convenient deployment, facilitating the rapid construction of speech recognition services. For more details on service deployment, please refer to the service deployment document.
Installation
```shell
pip3 install -U funasr
```
Or install from source code:
```shell
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```
Install modelscope for the pre-trained models (Optional):
```shell
pip3 install -U modelscope
```
Model Zoo
FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the Model License Agreement. Below are some representative models. For more models, please refer to the Model Zoo.
(Note: ⭐ denotes the ModelScope model zoo link, 🤗 denotes the Hugging Face model zoo link)
| Model Name | Task Details | Training Data | Parameters |
|:---|:---|:---|:---|
| paraformer-zh (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
| paraformer-zh-streaming (⭐ 🤗) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
| paraformer-en (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 50000 hours, English | 220M |
| conformer-en (⭐ 🤗) | speech recognition, non-streaming | 50000 hours, English | 220M |
| ct-punc (⭐ 🤗) | punctuation restoration | 100M, Mandarin and English | 1.1G |
| fsmn-vad (⭐ 🤗) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
| fa-zh (⭐ 🤗) | timestamp prediction | 5000 hours, Mandarin | 38M |
| cam++ (⭐ 🤗) | speaker verification/diarization | 5000 hours | 7.2M |
Quick Start
Below is a quick start tutorial. Test audio files are provided (Mandarin, English).
Usage Examples
Basic Usage
Command-line usage
```shell
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=asr_example_zh.wav
```
Note: Supports recognition of a single audio file, as well as a file list in Kaldi-style wav.scp format: `wav_id wav_path` (see the sketch below).
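For reference, a Kaldi-style wav.scp is a plain-text file with one `wav_id wav_path` pair per line. A minimal sketch is shown below; the file name and audio paths are placeholders, and passing the list file through `+input=` follows the same pattern as the single-file command above:

```shell
# wav.scp (hypothetical file), one "wav_id wav_path" pair per line:
#   ID0001 /path/to/asr_example_1.wav
#   ID0002 /path/to/asr_example_2.wav
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=wav.scp
```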
Speech Recognition (Non-streaming)
```python
from funasr import AutoModel

# paraformer-zh is a multi-functional ASR model; vad_model and punc_model are optional
# and enable long-audio segmentation and punctuation restoration, respectively.
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
                     batch_size_s=300,
                     hotword='魔搭')
print(res)
```
Note: `model_hub` specifies the model repository: `ms` selects download from ModelScope, while `hf` selects download from Hugging Face.
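As a minimal sketch of the note above, assuming the `model_hub` keyword argument described there (the exact parameter name may differ across FunASR versions):

```python
from funasr import AutoModel

# Download the same model from Hugging Face instead of ModelScope.
# The model_hub keyword follows the note above and is illustrative only.
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4", model_hub="hf")
res = model.generate(input="asr_example_zh.wav")
print(res)
```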
Advanced Usage
Speech Recognition (Streaming)
```python
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5]: 600 ms chunks with 300 ms lookahead
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms of 16 kHz audio per step

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final,
                         chunk_size=chunk_size,
                         encoder_chunk_look_back=encoder_chunk_look_back,
                         decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)
```
Note: `chunk_size` is the configuration for streaming latency. `[0, 10, 5]` indicates that the real-time display granularity is 10 * 60 = 600 ms and the lookahead is 5 * 60 = 300 ms. Each inference call takes 600 ms of input (16000 * 0.6 = 9600 sample points) and outputs the corresponding text. For the last speech segment, `is_final=True` must be set to force the output of the last word.
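To make the latency arithmetic concrete, here is a small sketch that derives the figures in the note from `chunk_size`, assuming 16 kHz audio and the 60 ms frame unit described above:

```python
SAMPLE_RATE = 16000  # Hz, as used by the streaming example above
FRAME_MS = 60        # each chunk_size unit corresponds to 60 ms of audio

chunk_size = [0, 10, 5]
chunk_ms = chunk_size[1] * FRAME_MS              # 10 * 60 = 600 ms per inference step
lookahead_ms = chunk_size[2] * FRAME_MS          # 5 * 60 = 300 ms of future context
chunk_stride = SAMPLE_RATE * chunk_ms // 1000    # 9600 samples, i.e. chunk_size[1] * 960

print(chunk_ms, lookahead_ms, chunk_stride)      # 600 300 9600
```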
Voice Activity Detection (Non-Streaming)
```python
from funasr import AutoModel

model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
res = model.generate(input=wav_file)
print(res)
```
Voice Activity Detection (Streaming)
```python
import soundfile
from funasr import AutoModel

chunk_size = 200  # ms of audio per streaming step
model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
```
Punctuation Restoration
```python
from funasr import AutoModel

model = AutoModel(model="ct-punc", model_revision="v2.0.4")

res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
```
Timestamp Prediction
```python
from funasr import AutoModel

model = AutoModel(model="fa-zh", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
```
More examples can be found in the docs.
License
This project is licensed under the Model License Agreement.