NVIDIA Conformer-Transducer Large (zh-ZH)
This model transcribes Mandarin speech. It is a large-scale Conformer-Transducer model with approximately 120M parameters. For detailed architecture information, see the Model Architecture section below and the NeMo documentation.
Quick Start
Installation
To train, fine-tune, or use the model, you need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
pip install "nemo_toolkit[all]"
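After installation, you can verify that the toolkit imports correctly and check its version (a minimal sanity check):

import nemo
import nemo.collections.asr as nemo_asr

# Confirm the toolkit is importable and print the installed version.
print(nemo.__version__)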
Usage Examples
Basic Usage
Automatically instantiate the model:
import nemo.collections.asr as nemo_asr

# Download the pretrained checkpoint and instantiate the model.
asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")
Advanced Usage
Transcribing using Python
You can transcribe an audio file as follows:
# Transcribe a single 16 kHz mono WAV file; transcribe returns one hypothesis per file.
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
Transcribing many audio files
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_zh_conformer_transducer_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
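If you prefer to stay in Python, you can pass a list of paths directly to transcribe. A minimal sketch (the directory and glob pattern are placeholders, and the batch_size keyword assumes a recent NeMo release):

import glob

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")

# Collect all WAV files in a directory (hypothetical path; adjust to your data).
audio_files = sorted(glob.glob("my_audio_dir/*.wav"))

# batch_size trades GPU memory for throughput.
outputs = asr_model.transcribe(audio_files, batch_size=4)
for path, hyp in zip(audio_files, outputs):
    print(path, "->", hyp.text)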
Input
This model accepts 16 kHz mono-channel audio (WAV files) as input.
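If your recordings are not already 16 kHz mono, resample them before transcription. A minimal sketch using librosa and soundfile (the library choice and file names are assumptions; any resampler works):

import librosa
import soundfile as sf

# librosa resamples to 16 kHz and downmixes to mono on load.
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
sf.write("sample.wav", audio, sr)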
Output
This model provides transcribed speech as a string for a given audio sample.
Documentation
Model Architecture
The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for automatic speech recognition. It uses Transducer loss/decoding instead of CTC loss. More details about this model can be found here: Conformer-Transducer Model.
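Greedy Transducer decoding is the default, but NeMo also exposes beam search through change_decoding_strategy. A minimal sketch (field names follow NeMo's RNNTDecodingConfig; treat the exact keys as assumptions for your NeMo version):

import copy

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")

# Copy the model's current decoding config and switch from greedy to beam search.
decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
decoding_cfg.strategy = "beam"
decoding_cfg.beam.beam_size = 4
asr_model.change_decoding_strategy(decoding_cfg)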
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this example script and this base config.
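To fine-tune on your own data, you can load the pretrained checkpoint, point it at new NeMo-style JSON manifests, and run a standard Lightning training loop. A minimal sketch (the manifest paths and hyperparameters are hypothetical; the example script above is the recommended route):

import pytorch_lightning as pl

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")

# Attach new training and validation manifests (hypothetical paths).
asr_model.setup_training_data({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_validation_data({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=1)
trainer.fit(asr_model)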
Datasets
All the models in this collection are trained on AISHELL2 [4], which consists of Mandarin speech.
Performance
The list of available models in this collection is shown in the following table. The performance of the ASR models is reported in terms of Character Error Rate (CER%) with greedy decoding, the standard metric for character-tokenized Mandarin models.
| Version | Tokenizer | Vocabulary Size | AISHELL2 Test IOS | AISHELL2 Test Android | AISHELL2 Test Mic | Train Dataset |
|---------|-----------|-----------------|-------------------|-----------------------|-------------------|---------------|
| 1.10.0 | Characters | 5026 | 5.3 | 5.7 | 5.6 | AISHELL-2 |
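If you want to score the model on your own test set, compare its hypotheses against reference transcripts with a character-level metric. A minimal sketch using jiwer (the library choice and the toy strings are assumptions):

import jiwer

references = ["今天天气很好"]
hypotheses = ["今天天气不好"]

# Character Error Rate: edit distance over reference characters.
print(jiwer.cer(references, hypotheses))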
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech containing technical terms or vernacular that the model has not been trained on. It may also perform worse for accented speech.
NVIDIA Riva: Deployment
NVIDIA Riva is an accelerated speech AI SDK that can be deployed on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded.
Additionally, Riva provides:
- World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support
Although this model isn't supported yet by Riva, the list of supported models is here.
Check out the Riva live demo.
References
[1] Conformer: Convolution-augmented Transformer for Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
[4] AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
License
This model is licensed under CC-BY-4.0. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.