NVIDIA Streaming Conformer-Hybrid Large (en-US)
This collection offers large cache-aware FastConformer-Hybrid models (around 114M parameters) with multiple look-ahead support. These models are trained on a large amount of English speech for streaming ASR and are suitable for applications with different latency requirements.
Quick Start
To use this model, you need to install NVIDIA NeMo. We recommend installing it after installing the latest PyTorch version.
pip install nemo_toolkit['all']
Features
- Cache-aware Design: These models are cache-aware versions of Hybrid FastConformer, trained for streaming ASR.
- Multiple Look-ahead Support: Trained with multiple look-aheads, enabling support for different latencies.
- Multitask Training: Trained in a multitask setup with joint Transducer and CTC decoder loss.
Installation
Install the NeMo toolkit with the following command:
pip install nemo_toolkit['all']
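After installation, you can sanity-check that the toolkit and its ASR collection import correctly. A minimal check, assuming a standard install:

import nemo
import nemo.collections.asr as nemo_asr

# Confirm the toolkit is importable and print its version
print(nemo.__version__)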
Usage Examples
Basic Usage
Simulate Streaming ASR
You can use this script to simulate streaming ASR: cache-aware streaming simulation. Set the context size with --att_context_size; otherwise, the default (1040ms) is used.
Transcribing using Python
First, get a sample audio file:
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
Then, use the following Python code to transcribe:
import nemo.collections.asr as nemo_asr

# Load the pretrained cache-aware hybrid model
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi")

# Select the 1040ms look-ahead ([70,13]); see the supported values below
asr_model.encoder.set_default_att_context_size([70,13])

# Use the Transducer (RNNT) decoder for inference
asr_model.change_decoding_strategy(decoder_type='rnnt')

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
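Since this is a Hybrid checkpoint, the same loaded model can also decode with its CTC head. A minimal sketch reusing asr_model from above; only the decoding strategy changes:

# Switch the already-loaded hybrid model to its CTC decoder
asr_model.change_decoding_strategy(decoder_type='ctc')
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)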
Advanced Usage
Transcribing many audio files
Using Transducer mode inference:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
Using CTC mode inference:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
decoder_type="ctc"
To change between different look-aheads, set att_context_size in the transcribe_speech.py script:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
att_context_size=[70,0]
Supported values for att_context_size: {[70,0]: 0ms, [70,1]: 80ms, [70,6]: 480ms, [70,13]: 1040ms}.
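To compare the latency/accuracy trade-off in Python rather than via transcribe_speech.py, you can loop over the supported look-aheads with the model loaded in the Basic Usage example. A minimal sketch using only the APIs shown above:

# Transcribe the same file at each supported look-ahead
for context, latency_ms in [([70,0], 0), ([70,1], 80), ([70,6], 480), ([70,13], 1040)]:
    asr_model.encoder.set_default_att_context_size(context)
    output = asr_model.transcribe(['2086-149220-0033.wav'])
    print(f"{latency_ms} ms look-ahead: {output[0].text}")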
Documentation
Model Architecture
These models are cache-aware Hybrid FastConformer models for streaming ASR. More information on cache-aware models can be found here: Cache-aware Streaming Conformer [5].
FastConformer [4] is an optimized version of the Conformer model [1]. More details on FastConformer can be found here: Fast-Conformer Model.
The model is trained with joint Transducer and CTC decoder loss [5]. More about Hybrid Transducer-CTC training can be found here: Hybrid Transducer-CTC.
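Conceptually, the hybrid objective is a weighted sum of the two decoder losses. A hedged illustration follows; the weight name mirrors NeMo's ctc_loss_weight config key, but the default value here is an assumption, not the value used to train this model:

def hybrid_loss(rnnt_loss: float, ctc_loss: float, ctc_loss_weight: float = 0.3) -> float:
    # Weighted sum of the Transducer (RNNT) and CTC losses;
    # ctc_loss_weight = 0.3 is a hypothetical illustration only
    return (1 - ctc_loss_weight) * rnnt_loss + ctc_loss_weight * ctc_loss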
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. The models were trained with this example script and this base config. The SentencePiece tokenizers [2] were built using the text transcripts of the train set with this script.
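For illustration, building a SentencePiece BPE tokenizer from a plain-text transcript file looks roughly like the sketch below; the file name, vocabulary size, and model type are assumptions, and in practice NeMo's tokenizer-building script wraps this step for you.

import sentencepiece as spm

# Train a BPE tokenizer on a file with one transcript per line (hypothetical path)
spm.SentencePieceTrainer.train(
    input='train_transcripts.txt',  # assumed transcript file
    model_prefix='tokenizer',       # writes tokenizer.model and tokenizer.vocab
    vocab_size=1024,                # assumed vocabulary size
    model_type='bpe',
)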
Datasets
All models in this collection are trained on a composite dataset (NeMo ASRSET) consisting of several thousand hours of English speech:
- Librispeech: 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN) - 2,000 hours subset
- Mozilla Common Voice (v7.0)
- People's Speech - 12,000 hrs subset
Performance
The performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.
Transducer Decoder
| att_context_sizes | LS test-other ([70,13], 1040ms) | LS test-other ([70,6], 480ms) | LS test-other ([70,1], 80ms) | LS test-other ([70,0], 0ms) | Train Dataset |
|---|---|---|---|---|---|
| [[70,13],[70,6],[70,1],[70,0]] | 5.4 | 5.7 | 6.4 | 7.0 | NeMo ASRSET 3.0 |
CTC Decoder
| att_context_sizes | LS test-other ([70,13], 1040ms) | LS test-other ([70,6], 480ms) | LS test-other ([70,1], 80ms) | LS test-other ([70,0], 0ms) | Train Dataset |
|---|---|---|---|---|---|
| [[70,13],[70,6],[70,1],[70,0]] | 6.2 | 6.7 | 7.8 | 8.4 | NeMo ASRSET 3.0 |
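To check WER on your own data, NeMo ships a word_error_rate helper. A minimal sketch with placeholder strings; in practice, hypotheses come from asr_model.transcribe and references are your ground-truth transcripts:

from nemo.collections.asr.metrics.wer import word_error_rate

hypotheses = ['the cat sat on the mat']  # placeholder model output
references = ['the cat sat on a mat']    # placeholder reference transcript
print(f"WER: {word_error_rate(hypotheses=hypotheses, references=references):.2%}")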
Input
This model accepts 16,000 Hz (16 kHz) mono-channel audio (WAV files) as input.
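If your recordings are not already 16 kHz mono WAV, you can convert them before transcription. A minimal sketch using librosa and soundfile; the file names are placeholders:

import librosa
import soundfile as sf

# Load any audio file, downmixing to mono and resampling to 16 kHz
audio, sr = librosa.load('input_audio.mp3', sr=16000, mono=True)
sf.write('input_audio_16k.wav', audio, sr)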
Output
This model provides transcribed speech as a string for a given audio sample.
Technical Details
These models are cache-aware versions of Hybrid FastConformer, trained for streaming ASR. They are trained with multiple look-aheads to support different latencies. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. More details can be found in the references below.
License
This model is licensed under cc-by-4.0.
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech with technical terms or vernacular that the model has not been trained on. It might also perform worse for accented speech.
NVIDIA Riva: Deployment
NVIDIA Riva is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded. It provides world-class accuracy, run-time word boosting, and customization options. This model is not yet supported by Riva; the list of supported models is here. Check out the Riva live demo.
References
[1] Conformer: Convolution-augmented Transformer for Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
[4] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[5] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition