# 🚀 QuartzNet 15x5 CTC Bambara
`stt-bm-quartznet15x5-V0` is a fine-tuned version of NVIDIA's stt_fr_quartznet15x5, optimized for Bambara Automatic Speech Recognition (ASR). It uses a character encoding scheme and transcribes text in the standard character set found in the training data.
## 🚀 Quick Start
`stt-bm-quartznet15x5-V0` is a fine-tuned version of NVIDIA's stt_fr_quartznet15x5, optimized for Bambara ASR. The model does not produce punctuation or capitalization. It uses a character encoding scheme and transcribes text in the character set of the training split of the bam-asr-all dataset. The model was fine-tuned with NVIDIA NeMo and trained with CTC (Connectionist Temporal Classification) loss.
## ⚠️ Important Note
This model, along with its associated resources, is part of an ongoing research effort. Improvements and refinements are expected in future versions. Users should be aware that:
- The model may not generalize very well across all speaking conditions and dialects.
- Community feedback is welcome, and contributions are encouraged to refine the model further.
## ✨ Features
- Optimized for Bambara Automatic Speech Recognition.
- Utilizes CTC Loss for training.
- Based on NVIDIA NeMo toolkit.
## 📦 Installation
To fine-tune or use the model, install NVIDIA NeMo. We recommend installing it after setting up the latest PyTorch version.
```bash
pip install nemo_toolkit['asr']
```
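If the installation succeeded, the ASR collection should import cleanly; a minimal sanity check (the printed version depends on your environment):

```python
# Confirm that NeMo and its ASR collection are importable
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```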
## 💻 Usage Examples
### Basic Usage

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="RobotsMali/stt-bm-quartznet15x5")
```
### Advanced Usage

```python
asr_model.transcribe(['sample_audio.wav'])
```
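`transcribe` also accepts a list of several files at once; in recent NeMo releases a `batch_size` argument controls how many clips are decoded per forward pass (file names below are placeholders):

```python
# Batch transcription of multiple 16 kHz mono WAV files (paths are illustrative)
transcriptions = asr_model.transcribe(
    ["clip_01.wav", "clip_02.wav", "clip_03.wav"],
    batch_size=2,  # larger batches trade GPU memory for throughput
)
print(transcriptions)
```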
### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
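If your recordings are not already 16 kHz mono, convert them first. A minimal sketch using the third-party librosa and soundfile packages (not installed by NeMo itself; file names are illustrative):

```python
# Resample and downmix an arbitrary audio file to 16 kHz mono WAV
import librosa
import soundfile as sf

audio, sr = librosa.load("original_recording.mp3", sr=16000, mono=True)
sf.write("sample_audio.wav", audio, sr)
```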
### Output
This model provides transcribed speech as a string for a given speech sample.
## 📚 Documentation
### Model Architecture

QuartzNet is a convolutional architecture consisting of 1D time-channel separable convolutions optimized for speech recognition. More information on QuartzNet is available in the NVIDIA NeMo documentation (QuartzNet Model).
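To make the idea concrete, a time-channel separable convolution factors a standard 1D convolution into a depthwise convolution along the time axis followed by a pointwise (1x1) convolution that mixes channels. A minimal PyTorch sketch (dimensions are illustrative, not the model's actual layer sizes):

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """Depthwise conv over time + pointwise (1x1) conv over channels."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        # Depthwise: one filter per input channel, sliding along time
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels,
        )
        # Pointwise: recombines channels at every time step
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

block = TimeChannelSeparableConv1d(64, 128, kernel_size=33)
out = block(torch.randn(8, 64, 200))  # -> shape (8, 128, 200)
```

This factorization is what keeps QuartzNet small relative to a full convolution with the same kernel size and channel counts.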
### Training

The NeMo toolkit was used to fine-tune this model for 25,939 steps starting from the stt_fr_quartznet15x5 model. The model was trained with this [base config](https://github.com/RobotsMali-AI/bambara-asr/blob/main/configs/quartznet-20m-config-v2.yaml). The full training configurations, scripts, and experiment logs are available here:

🔗 [Bambara-ASR Experiments](https://github.com/RobotsMali-AI/bambara-asr)
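For readers who want to reproduce a similar run, this is roughly how a NeMo CTC fine-tune is wired up. This is a sketch only: the vocabulary list, manifest path, and trainer settings below are illustrative, and the actual run used the linked base config.

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Start from the pretrained French checkpoint
model = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_fr_quartznet15x5")

# Replace the decoder vocabulary with the Bambara character set
# (truncated, illustrative subset shown here)
model.change_vocabulary(new_vocabulary=[" ", "a", "b", "d", "e", "ɛ", "ɔ", "ɲ", "ŋ"])

# Point the model at a NeMo-style JSON manifest (hypothetical path)
model.setup_training_data({
    "manifest_filepath": "manifests/train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 32,
    "labels": model.decoder.vocabulary,  # reuse the vocabulary set above
})

trainer = pl.Trainer(max_steps=25939, accelerator="gpu", devices=1)
trainer.fit(model)
```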
### Dataset

This model was fine-tuned on the [bam-asr-early](https://huggingface.co/datasets/RobotsMali/bam-asr-early) dataset, which consists of 37 hours of transcribed Bambara speech. The dataset is primarily derived from the Jeli-ASR dataset (~87%).
### Performance
The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER%).
| Version | Tokenizer | Vocabulary Size | bam-asr-all test set (WER %) |
|---------|-----------|-----------------|------------------------------|
| V2 | Character-wise | 45 | 46.5 |
These are greedy WER numbers without an external language model.
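For context, WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. A quick way to score your own transcripts is the third-party jiwer package (the sentences below are illustrative):

```python
# WER = (substitutions + deletions + insertions) / reference word count
from jiwer import wer

reference = "ne bɛ taa sugu la"    # illustrative reference (5 words)
hypothesis = "ne bɛ ta sugu"       # illustrative output: 1 substitution + 1 deletion
print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 / 5 = 40.00%
```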
## 📄 License
This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.
More details are available in the Experimental Technical Report:
📄 [Draft Technical Report - Weights & Biases](https://wandb.ai/yacoudiarra-wl/bam-asr-nemo-training/reports/Draft-Technical-Report-V1--VmlldzoxMTIyOTMzOA).
Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/RobotsMali-AI/bambara-asr/issues) on GitHub if you have any contributions.