🚀 Distilled Medium Whisper ASR Model for Thai
This is a distilled Automatic Speech Recognition (ASR) model based on the Whisper architecture, designed specifically for Thai speech recognition. It has 4 decoder layers (compared to 24 in the teacher model), making it lighter and more efficient than the larger teacher it was distilled from.
✨ Features
- Specifically tailored for Thai language speech recognition.
- Distilled from a larger teacher model to improve performance and efficiency.
- Has 4 decoder layers, reducing complexity compared to the teacher model.
📦 Installation
No dedicated installation step is required; the model loads through the 🤗 Transformers library (e.g. `pip install transformers torch`). The exact versions used during training are listed under Framework versions below.
💻 Usage Examples
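A minimal transcription sketch using the 🤗 Transformers ASR pipeline. `"REPO_ID"` is a placeholder for this model's Hugging Face Hub id, and `"audio.wav"` is an assumed input file; neither is specified in the original document.

```python
# Hedged sketch: transcribe Thai audio with this distilled Whisper model
# via the 🤗 Transformers ASR pipeline. "REPO_ID" is a placeholder for
# the model's actual Hugging Face Hub id.
from transformers import pipeline


def build_transcriber(repo_id: str = "REPO_ID"):
    # chunk_length_s=30 enables chunked long-form inference, matching
    # Whisper's 30-second input window
    return pipeline(
        "automatic-speech-recognition",
        model=repo_id,
        chunk_length_s=30,
    )


if __name__ == "__main__":
    asr = build_transcriber()
    print(asr("audio.wav")["text"])
```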
📚 Documentation
Model Description
This is a distilled Automatic Speech Recognition (ASR) model based on the Whisper architecture, tailored for Thai speech recognition. The model has 4 decoder layers (compared to 24 in the teacher model) and was distilled from a larger teacher to improve inference efficiency while retaining accuracy.
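The card does not say how the 4 student decoder layers were initialized; a common recipe for Whisper distillation copies evenly spaced decoder layers from the teacher. A minimal sketch of that index selection, offered as an assumption rather than a description of this model's actual training:

```python
def spaced_layer_indices(n_teacher: int, n_student: int) -> list[int]:
    # Pick n_student evenly spaced layer indices out of n_teacher,
    # always keeping the first and last teacher layers. This mirrors a
    # common student-initialization recipe; the source does not confirm
    # this model used it.
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]


# e.g. a 4-layer student drawn from a 24-layer teacher decoder
print(spaced_layer_indices(24, 4))  # → [0, 8, 15, 23]
```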
Distillation Details
| Property | Details |
|----------|---------|
| Teacher Model | Medium Whisper ASR model |
| Datasets Used for Distillation | |
Model Performance
- DeepCut Tokenized WER on Common Voice 13 Test Set:
  - Distilled Model: 7.58%
  - Teacher Model: 7.42%
Additional datasets for distillation or more decoder layers might improve the WER. More to come soon!
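Because written Thai has no word boundaries, the WER above is computed over DeepCut word tokens rather than whitespace-split words. A dependency-free sketch of token-level WER via Levenshtein edit distance, assuming tokenization (e.g. with DeepCut) happens upstream:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    # Word error rate: minimum substitutions + insertions + deletions
    # needed to turn the hypothesis into the reference, divided by the
    # number of reference tokens. Single-row dynamic programming.
    d = list(range(len(hypothesis) + 1))
    for i in range(1, len(reference) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hypothesis) + 1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,      # deletion
                d[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return d[-1] / len(reference)


# tokens as produced by a Thai word tokenizer such as DeepCut
print(wer(["สวัสดี", "ครับ"], ["สวัสดี", "ค่ะ"]))  # → 0.5
```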
Intended Use
This model is intended for use in applications requiring Thai language speech recognition.
Limitations
- The model is specifically trained for the Thai language and may not perform well with other languages.
- Performance might vary across different Thai dialects and accents.
- As with any ASR system, background noise and speech clarity can impact recognition accuracy.
Acknowledgments
This model was developed using resources and datasets provided by the speech and language technology community. Special thanks to the teams behind Common Voice, Gowajee, SLSCU, and the Thai Elderly Speech Corpus for their valuable datasets.
Framework versions
| Property | Details |
|----------|---------|
| Transformers | 4.35.2 |
| PyTorch | 2.1.2 |
| Datasets | 2.16.1 |
| Tokenizers | 0.15.0 |
Citation
Cite using BibTeX:
```bibtex
@inproceedings{aung-etal-2024-thonburian,
    title = "Thonburian Whisper: Robust Fine-tuned and Distilled Whisper for {T}hai",
    author = "Aung, Zaw Htet and
      Thavornmongkol, Thanachot and
      Boribalburephan, Atirut and
      Tangsriworakan, Vittavas and
      Pipatsrisawat, Knot and
      Achakulvisut, Titipat",
    editor = "Abbas, Mourad and
      Freihat, Abed Alhakim",
    booktitle = "Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)",
    month = oct,
    year = "2024",
    address = "Trento",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.icnlsp-1.17",
    pages = "149--156",
}
```
📄 License
The model is released under the MIT license.