🚀 NVIDIA FastConformer-Hybrid Large (fa)
This model transcribes speech in the Persian alphabet. It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters), a hybrid model trained on two losses: Transducer (default) and CTC. For detailed architecture information, refer to the Model Architecture section and the NeMo documentation.
🚀 Quick Start
Installation
To train, fine-tune, or use the model, you need to install NVIDIA NeMo. We recommend installing it after installing the latest PyTorch version.
```bash
pip install nemo_toolkit['all']
```
Usage
The model is available for use in the NeMo toolkit and can serve as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_fa_fastconformer_hybrid_large")
```
Transcribing using Python
After instantiating the model, you can transcribe audio as follows:
```python
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
```
Transcribing many audio files
Using Transducer mode inference:
```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_fa_fastconformer_hybrid_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
Using CTC mode inference:
```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_fa_fastconformer_hybrid_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
  decoder_type="ctc"
```
Input
This model accepts 16000 Hz mono-channel audio (WAV files) as input.
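If your recordings are not already in this format, they can be converted first. Below is a minimal sketch using librosa and soundfile; any resampling tool (e.g., ffmpeg or sox) works equally well, and the file names are placeholders:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write a WAV file for transcription.
audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)
sf.write("sample.wav", audio, sr)
```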
Output
The model provides transcribed speech as a string for a given audio sample.
✨ Features
- Persian Language Support: Specifically designed for transcribing Persian speech.
- Hybrid Model: Trained on both Transducer and CTC losses for better performance.
- Large Model: With around 115M parameters, it can capture complex speech patterns.
📦 Installation
To install the necessary dependencies, run the following command:
```bash
pip install nemo_toolkit['all']
```
💻 Usage Examples
Basic Usage
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_fa_fastconformer_hybrid_large")
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
```
Advanced Usage
```python
# Transcribing multiple audio files in CTC mode
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_fa_fastconformer_hybrid_large")
# Switch the hybrid model to its CTC decoder before transcribing.
asr_model.change_decoding_strategy(decoder_type="ctc")
audio_files = ['file1.wav', 'file2.wav']
output = asr_model.transcribe(audio_files)
for result in output:
    print(result.text)
```
📚 Documentation
Model Architecture
FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. This is a hybrid model trained on two losses: Transducer (default) and CTC. For complete architecture details, see the NeMo documentation.
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this example script and this base config.
The SentencePiece [2] tokenizers for these models were built from the text transcripts of the train set with this script.
This model was initialized with the weights of the English FastConformer Hybrid (Transducer and CTC) Large P&C model and fine-tuned on Persian data.
Datasets
This model was trained on Mozilla CommonVoice Persian Corpus 15.0.
The standard train/dev/test splits were discarded and replaced with custom splits that leverage the entire validated data portion. The custom splits can be reproduced as follows (a sketch follows the list):
- Group utterances with identical transcripts and sort the groups in ascending order by the (transcript occupancy, transcript) pairs.
- Select the first 10540 utterances for the test set.
- Select the second 10540 utterances for the dev set.
- Select the remaining data for the training set.
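A minimal sketch of this procedure, assuming the validated utterances are available as (utterance_id, transcript) pairs and reading "occupancy" as the number of utterances sharing a transcript:

```python
from collections import defaultdict

def make_splits(utterances):
    """utterances: iterable of (utt_id, transcript) pairs from the validated portion."""
    groups = defaultdict(list)
    for utt_id, transcript in utterances:
        groups[transcript].append(utt_id)
    # Sort groups ascendingly by the (transcript occupancy, transcript) pairs.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), kv[0]))
    flat = [utt for _, utts in ordered for utt in utts]
    # First 10540 utterances -> test, next 10540 -> dev, remainder -> train.
    return flat[21080:], flat[10540:21080], flat[:10540]  # train, dev, test
```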
The transcripts were additionally normalized according to the following script (empty results were discarded):
```python
import unicodedata
import string

SKIP = set(
    list(string.ascii_letters)
    + [
        "=",  # occurs only 2x in utterance (transl.): "twenty = xx"
        "ā",  # occurs only 4x together with "š"
        "š",
        # Arabic letters
        "ة",  # TEH MARBUTA
    ]
)

DISCARD = [
    # "(laughter)" in Farsi
    "(خنده)",
    # ASCII
    "!",
    '"',
    "#",
    "&",
    "'",
    "(",
    ")",
    ",",
    "-",
    ".",
    ":",
    ";",
    # Unicode punctuation
    "–",
    "“",
    "”",
    "…",
    "؟",
    "،",
    "؛",
    "ـ",
    # Arabic diacritics
    "ً",
    "ٌ",
    "َ",
    "ُ",
    "ِ",
    "ّ",
    "ْ",
    "ٔ",
    # Other
    "«",
    "»",
]

REPLACEMENTS = {
    "أ": "ا",
    "ۀ": "ە",
    "ك": "ک",
    "ي": "ی",
    "ى": "ی",
    "ﯽ": "ی",
    "ﻮ": "و",
    "ے": "ی",
    "ﺒ": "ب",
    "ﻢ": "ﻡ",
    "٬": " ",
    "ە": "ه",
}

def maybe_normalize(text: str) -> str | None:
    # Skip utterances with banned characters
    if set(text) & SKIP:
        return None  # skip this
    # Remove hashtags - they are not being read in Farsi CV
    text = " ".join(w for w in text.split() if not w.startswith("#"))
    # Replace selected characters with others
    for lhs, rhs in REPLACEMENTS.items():
        text = text.replace(lhs, rhs)
    # Replace selected characters with empty strings
    for tok in DISCARD:
        text = text.replace(tok, "")
    # Unify symbols that have the same meaning but different Unicode representations.
    text = unicodedata.normalize("NFKC", text)
    # Remove hamzas that were not merged with any letter by NFKC.
    text = text.replace("ء", "")
    # Remove double whitespace etc.
    return " ".join(t for t in text.split() if t)
```
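For illustration, two hypothetical inputs:

```python
print(maybe_normalize("سلام، دنیا!"))  # punctuation discarded -> "سلام دنیا"
print(maybe_normalize("ok"))           # contains ASCII letters (in SKIP) -> None
```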
Performance
The performance of Automatic Speech Recognition models is measured using Character Error Rate (CER) and Word Error Rate (WER).
The model obtains the following scores on our custom dev and test splits of Mozilla CommonVoice Persian 15.0:
| Model | dev WER / CER (%) | test WER / CER (%) |
|---|---|---|
| RNNT head | 15.44 / 3.89 | 15.48 / 4.63 |
| CTC head | 13.18 / 3.38 | 13.16 / 3.85 |
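As a sketch of how such scores can be computed, NeMo ships a `word_error_rate` helper that also supports a CER mode; the import path may differ across NeMo versions, and the file names and references below are placeholders:

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Hypothetical evaluation: transcribe some files and compare against references.
hypotheses = [h.text for h in asr_model.transcribe(["file1.wav", "file2.wav"])]
references = ["<reference transcript 1>", "<reference transcript 2>"]

wer = word_error_rate(hypotheses=hypotheses, references=references)
cer = word_error_rate(hypotheses=hypotheses, references=references, use_cer=True)
print(f"WER {100 * wer:.2f}%  CER {100 * cer:.2f}%")
```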
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
NVIDIA Riva: Deployment
NVIDIA Riva is an accelerated speech AI SDK that can be deployed on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:
- World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours.
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization.
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support.
Although this model isn't supported by Riva yet, the list of supported models is here. Check out the Riva live demo.
📄 License
The use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
📚 References
[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
📋 Model Information
| Property | Details |
|---|---|
| Model Type | FastConformer Transducer-CTC |
| Training Data | Mozilla Common Voice 15.0 Persian |
| License | CC-BY-4.0 |

