NVIDIA Streaming Conformer-Hybrid Large (en-US)
This collection offers large cache-aware FastConformer-Hybrid models (around 114M parameters) with multiple look-ahead support. These models are trained on a large amount of English speech for streaming ASR and are suitable for applications with different latency requirements.
Quick Start
To use this model, you need to install NVIDIA NeMo. We recommend installing it after installing the latest PyTorch version.
pip install nemo_toolkit['all']
Features
- Cache-aware Design: These models are cache-aware versions of Hybrid FastConformer, trained for streaming ASR.
- Multiple Look-ahead Support: Trained with multiple look-aheads, enabling support for different latencies.
- Multitask Training: Trained in a multitask setup with joint Transducer and CTC decoder loss.
Installation
Install the NeMo toolkit with the following command:
pip install nemo_toolkit['all']
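After installation, you can sanity-check that the toolkit and its ASR collection import correctly. A minimal check, assuming a standard install:

import nemo
import nemo.collections.asr as nemo_asr

# Confirm the toolkit is importable and print its version
print(nemo.__version__)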
Usage Examples
Basic Usage
Simulate Streaming ASR
You can use this script to simulate streaming ASR: cache-aware streaming simulation. Set the context size with --att_context_size; otherwise, the default (1040ms) is used.
Transcribing using Python
First, get a sample audio file:
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
Then, use the following Python code to transcribe:
import nemo.collections.asr as nemo_asr

# Load the pretrained cache-aware hybrid model
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi")

# Select the 1040ms look-ahead ([70,13]); see the supported values below
asr_model.encoder.set_default_att_context_size([70,13])

# Use the Transducer (RNNT) decoder for inference
asr_model.change_decoding_strategy(decoder_type='rnnt')

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
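Since this is a Hybrid checkpoint, the same loaded model can also decode with its CTC head. A minimal sketch reusing asr_model from above; only the decoding strategy changes:

# Switch the already-loaded hybrid model to its CTC decoder
asr_model.change_decoding_strategy(decoder_type='ctc')
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)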
Advanced Usage
Transcribing many audio files
Using Transducer mode inference:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
Using CTC mode inference:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
decoder_type="ctc"
To change between different look-aheads, set att_context_size in the transcribe_speech.py script:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
att_context_size=[70,0]
Supported values for att_context_size: {[70,0]: 0ms, [70,1]: 80ms, [70,6]: 480ms, [70,13]: 1040ms}.
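To compare the latency/accuracy trade-off in Python rather than via transcribe_speech.py, you can loop over the supported look-aheads with the model loaded in the Basic Usage example. A minimal sketch using only the APIs shown above:

# Transcribe the same file at each supported look-ahead
for context, latency_ms in [([70,0], 0), ([70,1], 80), ([70,6], 480), ([70,13], 1040)]:
    asr_model.encoder.set_default_att_context_size(context)
    output = asr_model.transcribe(['2086-149220-0033.wav'])
    print(f"{latency_ms} ms look-ahead: {output[0].text}")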
Documentation
Model Architecture
These models are cache-aware Hybrid FastConformer models for streaming ASR. More information on cache-aware models can be found here: Cache-aware Streaming Conformer [5].
FastConformer [4] is an optimized version of the Conformer model [1]. More details on FastConformer can be found here: Fast-Conformer Model.
The model is trained with joint Transducer and CTC decoder loss [5]. More about Hybrid Transducer-CTC training can be found here: Hybrid Transducer-CTC.
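Conceptually, the hybrid objective is a weighted sum of the two decoder losses. A hedged illustration follows; the weight name mirrors NeMo's ctc_loss_weight config key, but the default value here is an assumption, not the value used to train this model:

def hybrid_loss(rnnt_loss: float, ctc_loss: float, ctc_loss_weight: float = 0.3) -> float:
    # Weighted sum of the Transducer (RNNT) and CTC losses;
    # ctc_loss_weight = 0.3 is a hypothetical illustration only
    return (1 - ctc_loss_weight) * rnnt_loss + ctc_loss_weight * ctc_loss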
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. The models were trained with this example script and this base config. The SentencePiece tokenizers [2] were built using the text transcripts of the train set with this script.
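For illustration, building a SentencePiece BPE tokenizer from a plain-text transcript file looks roughly like the sketch below; the file name, vocabulary size, and model type are assumptions, and in practice NeMo's tokenizer-building script wraps this step for you.

import sentencepiece as spm

# Train a BPE tokenizer on a file with one transcript per line (hypothetical path)
spm.SentencePieceTrainer.train(
    input='train_transcripts.txt',  # assumed transcript file
    model_prefix='tokenizer',       # writes tokenizer.model and tokenizer.vocab
    vocab_size=1024,                # assumed vocabulary size
    model_type='bpe',
)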
Datasets
All models in this collection are trained on a composite dataset (NeMo ASRSET) consisting of several thousand hours of English speech:
- Librispeech: 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN) - 2,000 hours subset
- Mozilla Common Voice (v7.0)
- People's Speech - 12,000 hrs subset
Performance
The performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.
Transducer Decoder
| att_context_sizes | LS test-other ([70,13], 1040ms) | LS test-other ([70,6], 480ms) | LS test-other ([70,1], 80ms) | LS test-other ([70,0], 0ms) | Train Dataset |
|---|---|---|---|---|---|
| [[70,13],[70,6],[70,1],[70,0]] | 5.4 | 5.7 | 6.4 | 7.0 | NeMo ASRSET 3.0 |
CTC Decoder
| att_context_sizes | LS test-other ([70,13], 1040ms) | LS test-other ([70,6], 480ms) | LS test-other ([70,1], 80ms) | LS test-other ([70,0], 0ms) | Train Dataset |
|---|---|---|---|---|---|
| [[70,13],[70,6],[70,1],[70,0]] | 6.2 | 6.7 | 7.8 | 8.4 | NeMo ASRSET 3.0 |
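To check WER on your own data, NeMo ships a word_error_rate helper. A minimal sketch with placeholder strings; in practice, hypotheses come from asr_model.transcribe and references are your ground-truth transcripts:

from nemo.collections.asr.metrics.wer import word_error_rate

hypotheses = ['the cat sat on the mat']  # placeholder model output
references = ['the cat sat on a mat']    # placeholder reference transcript
print(f"WER: {word_error_rate(hypotheses=hypotheses, references=references):.2%}")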
Input
This model accepts 16,000 Hz (16 kHz) mono-channel audio (WAV files) as input.
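If your recordings are not already 16 kHz mono WAV, you can convert them before transcription. A minimal sketch using librosa and soundfile; the file names are placeholders:

import librosa
import soundfile as sf

# Load any audio file, downmixing to mono and resampling to 16 kHz
audio, sr = librosa.load('input_audio.mp3', sr=16000, mono=True)
sf.write('input_audio_16k.wav', audio, sr)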
Output
This model provides transcribed speech as a string for a given audio sample.
Technical Details
These models are cache-aware versions of Hybrid FastConformer, trained for streaming ASR. They are trained with multiple look-aheads to support different latencies. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. More details can be found in the references below.
License
This model is licensed under cc-by-4.0.
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech with technical terms or vernacular that the model has not been trained on. It might also perform worse for accented speech.
NVIDIA Riva: Deployment
NVIDIA Riva is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded. It provides world-class accuracy, run-time word boosting, and customization options. This model is not yet supported by Riva; the list of supported models is here. Check out the Riva live demo.
References
[1] Conformer: Convolution-augmented Transformer for Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
[4] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[5] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition