NVIDIA Conformer-Transducer Large (zh-ZH)
This model transcribes Mandarin speech. It is a large-scale Conformer-Transducer model with approximately 120M parameters. For detailed architecture information, see the Model Architecture section below and the NeMo documentation.
Quick Start
Installation
To train, fine-tune, or use the model, you need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
pip install "nemo_toolkit[all]"
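After installation, you can verify that the toolkit imports correctly and check its version (a minimal sanity check):

import nemo
import nemo.collections.asr as nemo_asr

# Confirm the toolkit is importable and print the installed version.
print(nemo.__version__)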
Usage Examples
Basic Usage
Automatically instantiate the model:
import nemo.collections.asr as nemo_asr

# Download the pretrained checkpoint and instantiate the model.
asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")
Advanced Usage
Transcribing using Python
You can transcribe an audio file as follows:
# Transcribe a single 16 kHz mono WAV file; transcribe returns one hypothesis per file.
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
Transcribing many audio files
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_zh_conformer_transducer_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
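If you prefer to stay in Python, you can pass a list of paths directly to transcribe. A minimal sketch (the directory and glob pattern are placeholders, and the batch_size keyword assumes a recent NeMo release):

import glob

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")

# Collect all WAV files in a directory (hypothetical path; adjust to your data).
audio_files = sorted(glob.glob("my_audio_dir/*.wav"))

# batch_size trades GPU memory for throughput.
outputs = asr_model.transcribe(audio_files, batch_size=4)
for path, hyp in zip(audio_files, outputs):
    print(path, "->", hyp.text)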
Input
This model accepts 16 kHz mono-channel audio (WAV files) as input.
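If your recordings are not already 16 kHz mono, resample them before transcription. A minimal sketch using librosa and soundfile (the library choice and file names are assumptions; any resampler works):

import librosa
import soundfile as sf

# librosa resamples to 16 kHz and downmixes to mono on load.
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
sf.write("sample.wav", audio, sr)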
Output
This model provides transcribed speech as a string for a given audio sample.
Documentation
Model Architecture
The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for automatic speech recognition. It uses Transducer loss/decoding instead of CTC loss. More details about this model can be found here: Conformer-Transducer Model.
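Greedy Transducer decoding is the default, but NeMo also exposes beam search through change_decoding_strategy. A minimal sketch (field names follow NeMo's RNNTDecodingConfig; treat the exact keys as assumptions for your NeMo version):

import copy

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")

# Copy the model's current decoding config and switch from greedy to beam search.
decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
decoding_cfg.strategy = "beam"
decoding_cfg.beam.beam_size = 4
asr_model.change_decoding_strategy(decoding_cfg)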
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this example script and this base config.
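To fine-tune on your own data, you can load the pretrained checkpoint, point it at new NeMo-style JSON manifests, and run a standard Lightning training loop. A minimal sketch (the manifest paths and hyperparameters are hypothetical; the example script above is the recommended route):

import pytorch_lightning as pl

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")

# Attach new training and validation manifests (hypothetical paths).
asr_model.setup_training_data({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_validation_data({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=1)
trainer.fit(asr_model)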
Datasets
All the models in this collection are trained on AISHELL2 [4], which consists of Mandarin speech.
Performance
The list of available models in this collection is shown in the following table. The performance of the ASR models is reported in terms of Character Error Rate (CER%) with greedy decoding, the standard metric for character-tokenized Mandarin models.
| Version | Tokenizer | Vocabulary Size | AISHELL2 Test IOS | AISHELL2 Test Android | AISHELL2 Test Mic | Train Dataset |
|---------|-----------|-----------------|-------------------|-----------------------|-------------------|---------------|
| 1.10.0 | Characters | 5026 | 5.3 | 5.7 | 5.6 | AISHELL-2 |
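If you want to score the model on your own test set, compare its hypotheses against reference transcripts with a character-level metric. A minimal sketch using jiwer (the library choice and the toy strings are assumptions):

import jiwer

references = ["今天天气很好"]
hypotheses = ["今天天气不好"]

# Character Error Rate: edit distance over reference characters.
print(jiwer.cer(references, hypotheses))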
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech containing technical terms or vernacular that the model has not been trained on. It may also perform worse for accented speech.
NVIDIA Riva: Deployment
NVIDIA Riva is an accelerated speech AI SDK that can be deployed on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
Additionally, Riva provides:
- World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support
Although this model isn't supported yet by Riva, the list of supported models is here.
Check out the Riva live demo.
References
[1] Conformer: Convolution-augmented Transformer for Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
[4] AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
License
This model is licensed under CC-BY-4.0. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.