🚀 Distilled Medium Whisper ASR Model for Thai
This is a distilled Automatic Speech Recognition (ASR) model based on the Whisper architecture, designed specifically for Thai speech recognition. It has 4 decoder layers (compared to 24 in the teacher model), making it lighter and more efficient than the larger teacher it was distilled from.
✨ Features
- Specifically tailored for Thai language speech recognition.
- Distilled from a larger teacher model to improve performance and efficiency.
- Has 4 decoder layers, reducing complexity compared to the teacher model.
📦 Installation
No dedicated installation step is required; the model loads through the 🤗 Transformers library (e.g. `pip install transformers torch`). The exact versions used during training are listed under Framework versions below.
💻 Usage Examples
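A minimal transcription sketch using the 🤗 Transformers ASR pipeline. `"REPO_ID"` is a placeholder for this model's Hugging Face Hub id, and `"audio.wav"` is an assumed input file; neither is specified in the original document.

```python
# Hedged sketch: transcribe Thai audio with this distilled Whisper model
# via the 🤗 Transformers ASR pipeline. "REPO_ID" is a placeholder for
# the model's actual Hugging Face Hub id.
from transformers import pipeline


def build_transcriber(repo_id: str = "REPO_ID"):
    # chunk_length_s=30 enables chunked long-form inference, matching
    # Whisper's 30-second input window
    return pipeline(
        "automatic-speech-recognition",
        model=repo_id,
        chunk_length_s=30,
    )


if __name__ == "__main__":
    asr = build_transcriber()
    print(asr("audio.wav")["text"])
```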
📚 Documentation
Model Description
This is a distilled Automatic Speech Recognition (ASR) model based on the Whisper architecture, tailored for Thai speech recognition. The model has 4 decoder layers (compared to 24 in the teacher model) and was distilled from a larger teacher to improve inference efficiency while retaining accuracy.
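The card does not say how the 4 student decoder layers were initialized; a common recipe for Whisper distillation copies evenly spaced decoder layers from the teacher. A minimal sketch of that index selection, offered as an assumption rather than a description of this model's actual training:

```python
def spaced_layer_indices(n_teacher: int, n_student: int) -> list[int]:
    # Pick n_student evenly spaced layer indices out of n_teacher,
    # always keeping the first and last teacher layers. This mirrors a
    # common student-initialization recipe; the source does not confirm
    # this model used it.
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]


# e.g. a 4-layer student drawn from a 24-layer teacher decoder
print(spaced_layer_indices(24, 4))  # → [0, 8, 15, 23]
```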
Distillation Details
| Property | Details |
|----------|---------|
| Teacher Model | Medium Whisper ASR model |
| Datasets Used for Distillation | |
Model Performance
- DeepCut Tokenized WER on Common Voice 13 Test Set:
  - Distilled Model: 7.58%
  - Teacher Model: 7.42%
Additional datasets for distillation or more decoder layers might improve the WER. More to come soon!
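Because written Thai has no word boundaries, the WER above is computed over DeepCut word tokens rather than whitespace-split words. A dependency-free sketch of token-level WER via Levenshtein edit distance, assuming tokenization (e.g. with DeepCut) happens upstream:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    # Word error rate: minimum substitutions + insertions + deletions
    # needed to turn the hypothesis into the reference, divided by the
    # number of reference tokens. Single-row dynamic programming.
    d = list(range(len(hypothesis) + 1))
    for i in range(1, len(reference) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hypothesis) + 1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,      # deletion
                d[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return d[-1] / len(reference)


# tokens as produced by a Thai word tokenizer such as DeepCut
print(wer(["สวัสดี", "ครับ"], ["สวัสดี", "ค่ะ"]))  # → 0.5
```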
Intended Use
This model is intended for use in applications requiring Thai language speech recognition.
Limitations
- The model is specifically trained for the Thai language and may not perform well with other languages.
- Performance might vary across different Thai dialects and accents.
- As with any ASR system, background noise and speech clarity can impact recognition accuracy.
Acknowledgments
This model was developed using resources and datasets provided by the speech and language technology community. Special thanks to the teams behind Common Voice, Gowajee, SLSCU, and the Thai Elderly Speech Corpus for their valuable datasets.
Framework versions
| Property | Details |
|----------|---------|
| Transformers | 4.35.2 |
| PyTorch | 2.1.2 |
| Datasets | 2.16.1 |
| Tokenizers | 0.15.0 |
Citation
Cite using BibTeX:
```bibtex
@inproceedings{aung-etal-2024-thonburian,
    title = "Thonburian Whisper: Robust Fine-tuned and Distilled Whisper for {T}hai",
    author = "Aung, Zaw Htet and
      Thavornmongkol, Thanachot and
      Boribalburephan, Atirut and
      Tangsriworakan, Vittavas and
      Pipatsrisawat, Knot and
      Achakulvisut, Titipat",
    editor = "Abbas, Mourad and
      Freihat, Abed Alhakim",
    booktitle = "Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)",
    month = oct,
    year = "2024",
    address = "Trento",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.icnlsp-1.17",
    pages = "149--156",
}
```
📄 License
The model is released under the MIT license.