# NVIDIA Conformer-Transducer Large (fr)
This large-sized Conformer-Transducer model (about 120M parameters) was trained on a composite dataset of over 1,500 hours of French speech, offering high-quality automatic speech recognition for French.
## Quick Start
This model is available for use in the NeMo toolkit and can serve as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
## Installation
To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after installing the latest PyTorch version.

```shell
pip install nemo_toolkit['all']
```
## Usage Examples
### Basic Usage
Automatically instantiate the model:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_fr_conformer_transducer_large")
```
### Advanced Usage
#### Transcribing using Python

First, get a sample:

```shell
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply run:

```python
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
```
#### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_fr_conformer_transducer_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
## Documentation
### Model Information
| Property | Details |
|----------|---------|
| Model Type | Conformer-Transducer |
| Language | fr (French) |
| Params | ~120M |
| Training Data | Mozilla Common Voice 7.0 (356 hours), Multilingual LibriSpeech (1,036 hours), VoxPopuli (182 hours) |
This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input and returns the transcribed speech as a string for a given audio sample.
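Before transcribing, it can be useful to confirm that an input file matches the expected format. Below is a minimal sketch using only the Python standard library; the helper name `check_wav_format` is our own for illustration, not part of the NeMo API.

```python
# Sketch: verify an input WAV matches the model's expected format
# (16 kHz sample rate, mono). `check_wav_format` is a hypothetical helper.
import wave

EXPECTED_RATE = 16_000
EXPECTED_CHANNELS = 1

def check_wav_format(path):
    """Return True if the WAV file is 16 kHz mono, else False."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE
                and wav.getnchannels() == EXPECTED_CHANNELS)

# Write one second of 16-bit silence in the expected format, then verify it.
with wave.open("example_16k_mono.wav", "wb") as wav:
    wav.setnchannels(EXPECTED_CHANNELS)
    wav.setsampwidth(2)              # 16-bit samples
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00\x00" * EXPECTED_RATE)

print(check_wav_format("example_16k_mono.wav"))  # True for a conforming file
```

Files that fail this check (e.g. 44.1 kHz stereo recordings) should be resampled and downmixed before being passed to the model.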
See the model architecture section and NeMo documentation for complete architecture details.
### NVIDIA NeMo: Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models are trained with this example script and this base config.
The SentencePiece tokenizers [2] for these models were built using the text transcripts of the train set with this script.
## Technical Details
The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for automatic speech recognition that uses Transducer loss/decoding instead of CTC loss. You can find more details on this model here: Conformer-Transducer Model.
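The key difference from CTC is that the Transducer's joint network conditions each emission on the previously emitted label, and may emit several labels per encoder frame. The toy sketch below illustrates the greedy decoding loop only; `toy_joint` is a deterministic stand-in for the real joint network, not NeMo code.

```python
# Toy illustration (not NeMo code) of greedy Transducer (RNNT) decoding.
# Unlike CTC, the joint network sees the previously emitted label, and the
# decoder may emit multiple labels per encoder frame before advancing.
BLANK = 0
VOCAB = {0: "<blank>", 1: "b", 2: "o", 3: "n"}

def toy_joint(frame, prev_label):
    """Stand-in for the joint network: returns the label to emit given an
    encoder frame index and the previous non-blank label (deterministic)."""
    schedule = {(0, BLANK): 1, (1, 1): 2, (2, 2): 3}
    return schedule.get((frame, prev_label), BLANK)

def greedy_decode(num_frames):
    prev, out = BLANK, []
    for t in range(num_frames):
        while True:               # may emit several labels on one frame
            label = toy_joint(t, prev)
            if label == BLANK:    # blank means: advance to the next frame
                break
            out.append(VOCAB[label])
            prev = label
    return "".join(out)

print(greedy_decode(4))  # "bon"
```

In the real model, `toy_joint` is replaced by a neural joint network combining the Conformer encoder output with a prediction-network state, which is what makes decoding autoregressive.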
## Performance
The performance of automatic speech recognition models is measured using word error rate (WER). Since this model was trained on multiple domains and a much larger corpus, it will generally perform better at transcribing audio in general.
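For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal self-contained implementation (a sketch, not NeMo's metric code):

```python
# Minimal word error rate (WER): word-level edit distance divided by the
# number of words in the reference transcript.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("le chat est noir", "le chien est noir"))  # 0.25
```

Here one substitution ("chat" → "chien") over four reference words gives a WER of 0.25, i.e. 25%.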
The latest model obtains the following greedy WER scores on the following evaluation datasets:

- 6.85% on MCV7.0 dev
- 7.95% on MCV7.0 test
- 5.05% on MLS dev
- 4.10% on MLS test
Note that these evaluation datasets were filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except hyphens and apostrophes.
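To compare your own hypotheses against such references fairly, you would need to apply the same kind of normalization. The following is a sketch of that filtering under our own assumptions (it is not the exact preprocessing script used for evaluation):

```python
# Sketch of the evaluation-style normalization described above: lowercase,
# then keep only French alphabet characters, hyphens, apostrophes, spaces.
# The exact character set is an assumption, not the official script.
import re

# Lowercase French letters, including accented characters and ligatures.
KEEP = re.compile(r"[^a-zàâäçéèêëîïôöùûüÿœæ' \-]")

def normalize_french(text):
    text = text.lower().replace("’", "'")   # unify apostrophe variants
    text = KEEP.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_french("Où est-elle ? C’est déjà l’été !"))
# où est-elle c'est déjà l'été
```

Question marks and other punctuation are stripped, while the hyphen in "est-elle" and the apostrophes survive, matching the filtering applied to the evaluation sets.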
## Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.
Further, since portions of the training set contain text from both pre- and post-1990 orthographic reform, regularity of punctuation may vary between the two styles.
For downstream tasks requiring more consistency, fine-tuning or downstream processing may be required. If exact orthography is not necessary, the use of a secondary model is advised.
## License
This model is licensed under CC-BY-4.0.
## References

[1] Conformer: Convolution-augmented Transformer for Speech Recognition

[2] Google SentencePiece Tokenizer

[3] NVIDIA NeMo Toolkit