FunASR: A Fundamental End-to-End Speech Recognition Toolkit
FunASR aims to bridge the gap between academic research and industrial applications in speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it enables researchers and developers to research, build, and deploy speech recognition models more conveniently, thereby promoting the growth of the speech recognition ecosystem. ASR for Fun!

Highlights | News | Installation | Quick Start | Runtime | Model Zoo | Contact
✨ Features
- FunASR is a fundamental speech recognition toolkit offering a wide range of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization, and multi-talker ASR. It provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
- We have released a large number of academic and industrial pre-trained models on ModelScope and Hugging Face, which can be accessed through our Model Zoo. The representative Paraformer-large, a non-autoregressive end-to-end speech recognition model, offers high accuracy, high efficiency, and convenient deployment, facilitating the rapid construction of speech recognition services. For more details on service deployment, please refer to the service deployment document.
Installation
```shell
pip3 install -U funasr
```
Or install from source code:
```shell
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```
Install modelscope for the pre-trained models (Optional):
```shell
pip3 install -U modelscope
```
Model Zoo
FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the Model License Agreement. Below are some representative models. For more models, please refer to the Model Zoo.
(Note: ⭐ denotes the ModelScope model zoo link, 🤗 denotes the Hugging Face model zoo link)
| Model Name | Task Details | Training Data | Parameters |
|:---|:---|:---|:---|
| paraformer-zh (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
| paraformer-zh-streaming (⭐ 🤗) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
| paraformer-en (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 50000 hours, English | 220M |
| conformer-en (⭐ 🤗) | speech recognition, non-streaming | 50000 hours, English | 220M |
| ct-punc (⭐ 🤗) | punctuation restoration | 100M, Mandarin and English | 1.1G |
| fsmn-vad (⭐ 🤗) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
| fa-zh (⭐ 🤗) | timestamp prediction | 5000 hours, Mandarin | 38M |
| cam++ (⭐ 🤗) | speaker verification/diarization | 5000 hours | 7.2M |
Quick Start
Below is a quick start tutorial. Test audio files are provided (Mandarin, English).
Usage Examples
Basic Usage
Command-line usage
```shell
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=asr_example_zh.wav
```
Note: Supports recognition of a single audio file, as well as a file list in Kaldi-style wav.scp format: `wav_id wav_path` (see the sketch below).
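For reference, a Kaldi-style wav.scp is a plain-text file with one `wav_id wav_path` pair per line. A minimal sketch is shown below; the file name and audio paths are placeholders, and passing the list file through `+input=` follows the same pattern as the single-file command above:

```shell
# wav.scp (hypothetical file), one "wav_id wav_path" pair per line:
#   ID0001 /path/to/asr_example_1.wav
#   ID0002 /path/to/asr_example_2.wav
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=wav.scp
```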
Speech Recognition (Non-streaming)
```python
from funasr import AutoModel

# paraformer-zh is a multi-functional ASR model; vad_model and punc_model are optional
# and enable long-audio segmentation and punctuation restoration, respectively.
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
                     batch_size_s=300,
                     hotword='魔搭')
print(res)
```
Note: `model_hub` specifies the model repository: `ms` selects download from ModelScope, while `hf` selects download from Hugging Face.
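As a minimal sketch of the note above, assuming the `model_hub` keyword argument described there (the exact parameter name may differ across FunASR versions):

```python
from funasr import AutoModel

# Download the same model from Hugging Face instead of ModelScope.
# The model_hub keyword follows the note above and is illustrative only.
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4", model_hub="hf")
res = model.generate(input="asr_example_zh.wav")
print(res)
```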
Advanced Usage
Speech Recognition (Streaming)
```python
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5]: 600 ms chunks with 300 ms lookahead
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms of 16 kHz audio per step

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final,
                         chunk_size=chunk_size,
                         encoder_chunk_look_back=encoder_chunk_look_back,
                         decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)
```
Note: `chunk_size` is the configuration for streaming latency. `[0, 10, 5]` indicates that the real-time display granularity is 10 * 60 = 600 ms and the lookahead is 5 * 60 = 300 ms. Each inference call takes 600 ms of input (16000 * 0.6 = 9600 sample points) and outputs the corresponding text. For the last speech segment, `is_final=True` must be set to force the output of the last word.
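To make the latency arithmetic concrete, here is a small sketch that derives the figures in the note from `chunk_size`, assuming 16 kHz audio and the 60 ms frame unit described above:

```python
SAMPLE_RATE = 16000  # Hz, as used by the streaming example above
FRAME_MS = 60        # each chunk_size unit corresponds to 60 ms of audio

chunk_size = [0, 10, 5]
chunk_ms = chunk_size[1] * FRAME_MS              # 10 * 60 = 600 ms per inference step
lookahead_ms = chunk_size[2] * FRAME_MS          # 5 * 60 = 300 ms of future context
chunk_stride = SAMPLE_RATE * chunk_ms // 1000    # 9600 samples, i.e. chunk_size[1] * 960

print(chunk_ms, lookahead_ms, chunk_stride)      # 600 300 9600
```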
Voice Activity Detection (Non-Streaming)
```python
from funasr import AutoModel

model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
res = model.generate(input=wav_file)
print(res)
```
Voice Activity Detection (Streaming)
```python
import soundfile
from funasr import AutoModel

chunk_size = 200  # ms of audio per streaming step
model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
```
Punctuation Restoration
```python
from funasr import AutoModel

model = AutoModel(model="ct-punc", model_revision="v2.0.4")

res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
```
Timestamp Prediction
```python
from funasr import AutoModel

model = AutoModel(model="fa-zh", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
```
More examples can be found in the docs.
License
This project is licensed under the Model License Agreement.