Whisper Small Vi V1.1: Whisper Small for Vietnamese Fine-Tuned by Nam Phung
This is a fine-tuned version of the openai/whisper-small model on Vietnamese speech data. It aims to improve transcription accuracy and robustness for Vietnamese automatic speech recognition (ASR), especially in real-world scenarios.
Features
- Language Specialization: Specifically fine-tuned for Vietnamese, improving performance on Vietnamese ASR tasks.
- Fine-tuning Results: Achieved a Word Error Rate (WER) of 9.3485 on a diverse test set.
Installation
To use the fine-tuned model, install the required dependencies:
```bash
pip install transformers torch librosa soundfile --quiet
```

```python
import torch
import librosa
import soundfile as sf
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

print("Environment setup completed!")
```
Usage Examples
Basic Usage
```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Select GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the fine-tuned model and its processor
model_id = "namphungdn134/whisper-small-vi"
print(f"Loading model from: {model_id}")
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Force Vietnamese transcription during decoding
forced_decoder_ids = processor.get_decoder_prompt_ids(language="vi", task="transcribe")
model.config.forced_decoder_ids = forced_decoder_ids
print(f"Forced decoder IDs for Vietnamese: {forced_decoder_ids}")

# Load audio and resample to the 16 kHz rate Whisper expects
audio_path = "example.wav"
print(f"Loading audio from: {audio_path}")
audio, sr = librosa.load(audio_path, sr=16000)

# Convert the waveform into log-Mel spectrogram features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
print(f"Input features shape: {input_features.shape}")

# Generate and decode the transcription
print("Generating transcription...")
with torch.no_grad():
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
print("Predicted IDs:", predicted_ids[0].tolist())
```
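For quick experiments, the same checkpoint can also be run through the `transformers` ASR `pipeline`, which wraps audio loading, feature extraction, and decoding in one call. The snippet below is a minimal sketch rather than part of the original release; the `chunk_length_s` value and the example file name are illustrative assumptions.

```python
import torch
from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint
asr = pipeline(
    "automatic-speech-recognition",
    model="namphungdn134/whisper-small-vi",
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,  # assumption: chunking handles audio longer than one 30 s window
)

# Transcribe a local file, forcing Vietnamese transcription
result = asr("example.wav", generate_kwargs={"language": "vi", "task": "transcribe"})
print(result["text"])
```

For short clips the `chunk_length_s` argument can be omitted; it only matters for recordings longer than Whisper's 30-second input window.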
Documentation
Model Description
The Whisper small model is a Transformer-based sequence-to-sequence (encoder-decoder) model designed for automatic speech recognition and translation tasks. It was trained on over 680,000 hours of labeled audio data in multiple languages. This fine-tuned version focuses on Vietnamese, aiming to improve transcription accuracy and the handling of local dialects. The model works with the WhisperProcessor, which pre-processes audio inputs into log-Mel spectrograms and decodes the predicted token IDs back into text.
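As a quick illustration of that pre-processing step (not part of the original card), the sketch below feeds one second of silence through the processor bundled with this checkpoint; for the small architecture the feature extractor always pads or truncates audio to a 30-second window of log-Mel frames.

```python
import numpy as np
from transformers import WhisperProcessor

# Load the processor that ships with the fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained("namphungdn134/whisper-small-vi")

# One second of silence at 16 kHz stands in for real speech
audio = np.zeros(16000, dtype=np.float32)

# The feature extractor pads/truncates to 30 s and computes log-Mel spectrograms
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
print(features.shape)  # torch.Size([1, 80, 3000]) for the small architecture
```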
Dataset
- Total Duration: More than 250 hours of high-quality Vietnamese speech data.
- Sources: Public Vietnamese datasets.
- Format: 16 kHz WAV files with corresponding text transcripts.
- Preprocessing: Audio was normalized and segmented, and transcripts were cleaned and tokenized (an illustrative sketch of such steps follows below).
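The exact preprocessing scripts are not published with this card; the sketch below only illustrates the kind of steps the list above describes (resampling to 16 kHz mono, peak normalization, simple transcript cleaning). The function names, file names, and cleaning rules are assumptions for illustration.

```python
import re

import librosa
import soundfile as sf

def preprocess_audio(in_path: str, out_path: str) -> None:
    # Resample to 16 kHz mono and peak-normalize (illustrative, not the card's actual pipeline)
    audio, _ = librosa.load(in_path, sr=16000, mono=True)
    peak = float(max(abs(audio.max()), abs(audio.min()))) or 1.0
    sf.write(out_path, audio / peak, 16000)

def clean_transcript(text: str) -> str:
    # Lowercase, trim whitespace, and strip punctuation (a stand-in for the cleaning described above)
    text = text.lower().strip()
    return re.sub(r"[^\w\s]", "", text)

preprocess_audio("raw_example.wav", "clean_example.wav")
print(clean_transcript("Xin chào, tôi là Nam!"))
```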
Fine-tuning Results
- Word Error Rate (WER): 9.3485
Evaluation was performed on a held-out test set with diverse regional accents and speaking styles; a minimal example of computing WER is sketched below.
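The evaluation script itself is not included in this card; the following is a minimal sketch of how a WER figure can be computed with the `evaluate` library (which requires `jiwer` to be installed), using made-up reference/prediction pairs rather than the actual test set.

```python
import evaluate

# Hypothetical reference transcripts and model outputs (the real test set is not distributed here)
references = ["xin chào việt nam", "hôm nay trời rất đẹp"]
predictions = ["xin chào việt nam", "hôm nay trời thật đẹp"]

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.4f}")  # multiply by 100 to express the score as a percentage
```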
Technical Details
- Model Type: Transformer-based sequence-to-sequence (encoder-decoder) model, small variant.
- Base Model: openai/whisper-small
- Training Data: Over 250 hours of high-quality Vietnamese speech data from public datasets.

| Property | Details |
|----------|---------|
| Model Type | Transformer-based sequence-to-sequence (encoder-decoder) model, small variant |
| Base Model | openai/whisper-small |
| Training Data | Over 250 hours of high-quality Vietnamese speech data from public datasets |
Limitations
Important Note
- This model is fine-tuned specifically for Vietnamese and may not perform well on other languages.
- It struggles with overlapping speech or noisy backgrounds.
- Performance may drop on strong dialectal variations that are not well represented in the training data.
License
This model is licensed under the MIT License.
Citation
If you use this model in your research or application, please cite the original Whisper model and this fine-tuning work as follows:
```bibtex
@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}

@misc{whisper_small_vi_2025,
  title={Whisper Small Vi V1.1 - Nam Phung},
  author={Nam Phùng},
  organization={DUT},
  year={2025},
  url={https://huggingface.co/namphungdn134/whisper-small-vi},
  note={Code: https://github.com/namphung134/ASR-Vietnamese}
}
```
Contact
For questions, collaborations, or suggestions, feel free to reach out via namphungdn134@gmail.com.