🚀 NVIDIA FastConformer-Hybrid Large (fa)
This model transcribes speech in the Persian alphabet. It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters), a hybrid model trained on two losses: Transducer (default) and CTC. For detailed architecture information, refer to the Model Architecture section and the NeMo documentation.
🚀 Quick Start
Installation
To train, fine-tune, or use the model, you need to install NVIDIA NeMo. We recommend installing it after installing the latest PyTorch version.
```bash
pip install nemo_toolkit['all']
```
Usage
The model is available for use in the NeMo toolkit and can serve as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_fa_fastconformer_hybrid_large")
```
Transcribing using Python
After instantiating the model, you can transcribe audio as follows:
```python
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
```
Transcribing many audio files
Using Transducer mode inference:
```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_fa_fastconformer_hybrid_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
Using CTC mode inference:
```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_fa_fastconformer_hybrid_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
  decoder_type="ctc"
```
Input
This model accepts 16000 Hz mono-channel audio (WAV files) as input.
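If your recordings are not already in this format, they can be converted first. Below is a minimal sketch using librosa and soundfile; any resampling tool (e.g., ffmpeg or sox) works equally well, and the file names are placeholders:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write a WAV file for transcription.
audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)
sf.write("sample.wav", audio, sr)
```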
Output
The model provides transcribed speech as a string for a given audio sample.
✨ Features
- Persian Language Support: Specifically designed for transcribing Persian speech.
- Hybrid Model: Trained on both Transducer and CTC losses for better performance.
- Large Model: With around 115M parameters, it can capture complex speech patterns.
📦 Installation
To install the necessary dependencies, run the following command:
```bash
pip install nemo_toolkit['all']
```
💻 Usage Examples
Basic Usage
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_fa_fastconformer_hybrid_large")
output = asr_model.transcribe(['sample.wav'])
print(output[0].text)
```
Advanced Usage
```python
# Transcribing multiple audio files in CTC mode
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_fa_fastconformer_hybrid_large")
# Switch the hybrid model to its CTC decoder before transcribing.
asr_model.change_decoding_strategy(decoder_type="ctc")
audio_files = ['file1.wav', 'file2.wav']
output = asr_model.transcribe(audio_files)
for result in output:
    print(result.text)
```
📚 Documentation
Model Architecture
FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. This is a hybrid model trained on two losses: Transducer (default) and CTC. For complete architecture details, see the NeMo documentation.
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this example script and this base config.
The SentencePiece [2] tokenizers for these models were built from the text transcripts of the train set with this script.
This model was initialized with the weights of the English FastConformer Hybrid (Transducer and CTC) Large P&C model and fine-tuned on Persian data.
Datasets
This model was trained on Mozilla CommonVoice Persian Corpus 15.0.
The standard train/dev/test splits were discarded and replaced with custom splits that leverage the entire validated data portion. The custom splits can be reproduced as follows (a sketch follows the list):
- Group utterances with identical transcripts and sort the groups in ascending order by the (transcript occupancy, transcript) pairs.
- Select the first 10540 utterances for the test set.
- Select the second 10540 utterances for the dev set.
- Select the remaining data for the training set.
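A minimal sketch of this procedure, assuming the validated utterances are available as (utterance_id, transcript) pairs and reading "occupancy" as the number of utterances sharing a transcript:

```python
from collections import defaultdict

def make_splits(utterances):
    """utterances: iterable of (utt_id, transcript) pairs from the validated portion."""
    groups = defaultdict(list)
    for utt_id, transcript in utterances:
        groups[transcript].append(utt_id)
    # Sort groups ascendingly by the (transcript occupancy, transcript) pairs.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), kv[0]))
    flat = [utt for _, utts in ordered for utt in utts]
    # First 10540 utterances -> test, next 10540 -> dev, remainder -> train.
    return flat[21080:], flat[10540:21080], flat[:10540]  # train, dev, test
```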
The transcripts were additionally normalized according to the following script (empty results were discarded):
```python
import unicodedata
import string

SKIP = set(
    list(string.ascii_letters)
    + [
        "=",  # occurs only 2x in utterance (transl.): "twenty = xx"
        "ā",  # occurs only 4x together with "š"
        "š",
        # Arabic letters
        "ة",  # TEH MARBUTA
    ]
)

DISCARD = [
    # "(laughter)" in Farsi
    "(خنده)",
    # ASCII
    "!",
    '"',
    "#",
    "&",
    "'",
    "(",
    ")",
    ",",
    "-",
    ".",
    ":",
    ";",
    # Unicode punctuation
    "–",
    "“",
    "”",
    "…",
    "؟",
    "،",
    "؛",
    "ـ",
    # Arabic diacritics
    "ً",
    "ٌ",
    "َ",
    "ُ",
    "ِ",
    "ّ",
    "ْ",
    "ٔ",
    # Other
    "«",
    "»",
]

REPLACEMENTS = {
    "أ": "ا",
    "ۀ": "ە",
    "ك": "ک",
    "ي": "ی",
    "ى": "ی",
    "ﯽ": "ی",
    "ﻮ": "و",
    "ے": "ی",
    "ﺒ": "ب",
    "ﻢ": "ﻡ",
    "٬": " ",
    "ە": "ه",
}

def maybe_normalize(text: str) -> str | None:
    # Skip utterances with banned characters
    if set(text) & SKIP:
        return None  # skip this
    # Remove hashtags - they are not being read in Farsi CV
    text = " ".join(w for w in text.split() if not w.startswith("#"))
    # Replace selected characters with others
    for lhs, rhs in REPLACEMENTS.items():
        text = text.replace(lhs, rhs)
    # Replace selected characters with empty strings
    for tok in DISCARD:
        text = text.replace(tok, "")
    # Unify symbols that have the same meaning but different Unicode representations.
    text = unicodedata.normalize("NFKC", text)
    # Remove hamzas that were not merged with any letter by NFKC.
    text = text.replace("ء", "")
    # Remove double whitespace etc.
    return " ".join(t for t in text.split() if t)
```
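For illustration, two hypothetical inputs:

```python
print(maybe_normalize("سلام، دنیا!"))  # punctuation discarded -> "سلام دنیا"
print(maybe_normalize("ok"))           # contains ASCII letters (in SKIP) -> None
```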
Performance
The performance of Automatic Speech Recognition models is measured using Character Error Rate (CER) and Word Error Rate (WER).
The model obtains the following scores on our custom dev and test splits of Mozilla CommonVoice Persian 15.0:
| Model | dev WER / CER (%) | test WER / CER (%) |
|---|---|---|
| RNNT head | 15.44 / 3.89 | 15.48 / 4.63 |
| CTC head | 13.18 / 3.38 | 13.16 / 3.85 |
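As a sketch of how such scores can be computed, NeMo ships a `word_error_rate` helper that also supports a CER mode; the import path may differ across NeMo versions, and the file names and references below are placeholders:

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Hypothetical evaluation: transcribe some files and compare against references.
hypotheses = [h.text for h in asr_model.transcribe(["file1.wav", "file2.wav"])]
references = ["<reference transcript 1>", "<reference transcript 2>"]

wer = word_error_rate(hypotheses=hypotheses, references=references)
cer = word_error_rate(hypotheses=hypotheses, references=references, use_cer=True)
print(f"WER {100 * wer:.2f}%  CER {100 * cer:.2f}%")
```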
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
NVIDIA Riva: Deployment
NVIDIA Riva is an accelerated speech AI SDK that can be deployed on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:
- World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours.
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization.
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support.
Although this model isn't supported by Riva yet, the list of supported models is here. Check out the Riva live demo.
📄 License
The use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
📚 References
[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
📋 Model Information
| Property | Details |
|---|---|
| Model Type | FastConformer Transducer-CTC |
| Training Data | Mozilla Common Voice 15.0 Persian |
| License | CC-BY-4.0 |

