Whisper Small Vi V1.1: Whisper Small for Vietnamese Fine-Tuned by Nam Phung
This is a fine-tuned version of the openai/whisper-small model on Vietnamese speech data. It aims to improve transcription accuracy and robustness for Vietnamese automatic speech recognition (ASR), especially in real-world scenarios.
Features
- Language Specialization: Specifically fine-tuned for Vietnamese, improving performance on Vietnamese ASR tasks.
- Fine-tuning Results: Achieved a Word Error Rate (WER) of 9.3485 on a diverse test set.
Installation
To use the fine-tuned model, install the required dependencies:
```bash
pip install transformers torch librosa soundfile --quiet
```

```python
import torch
import librosa
import soundfile as sf
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

print("Environment setup completed!")
```
Usage Examples
Basic Usage
```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Select GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the fine-tuned model and its processor
model_id = "namphungdn134/whisper-small-vi"
print(f"Loading model from: {model_id}")
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Force Vietnamese transcription during decoding
forced_decoder_ids = processor.get_decoder_prompt_ids(language="vi", task="transcribe")
model.config.forced_decoder_ids = forced_decoder_ids
print(f"Forced decoder IDs for Vietnamese: {forced_decoder_ids}")

# Load audio and resample to the 16 kHz rate Whisper expects
audio_path = "example.wav"
print(f"Loading audio from: {audio_path}")
audio, sr = librosa.load(audio_path, sr=16000)

# Convert the waveform into log-Mel spectrogram features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device)
print(f"Input features shape: {input_features.shape}")

# Generate and decode the transcription
print("Generating transcription...")
with torch.no_grad():
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
print("Predicted IDs:", predicted_ids[0].tolist())
```
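For quick experiments, the same checkpoint can also be run through the `transformers` ASR `pipeline`, which wraps audio loading, feature extraction, and decoding in one call. The snippet below is a minimal sketch rather than part of the original release; the `chunk_length_s` value and the example file name are illustrative assumptions.

```python
import torch
from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint
asr = pipeline(
    "automatic-speech-recognition",
    model="namphungdn134/whisper-small-vi",
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,  # assumption: chunking handles audio longer than one 30 s window
)

# Transcribe a local file, forcing Vietnamese transcription
result = asr("example.wav", generate_kwargs={"language": "vi", "task": "transcribe"})
print(result["text"])
```

For short clips the `chunk_length_s` argument can be omitted; it only matters for recordings longer than Whisper's 30-second input window.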
Documentation
Model Description
The Whisper small model is a Transformer-based sequence-to-sequence (encoder-decoder) model designed for automatic speech recognition and translation tasks. It was trained on over 680,000 hours of labeled audio data in multiple languages. This fine-tuned version focuses on Vietnamese, aiming to improve transcription accuracy and the handling of local dialects. The model works with the WhisperProcessor, which pre-processes audio inputs into log-Mel spectrograms and decodes the predicted token IDs back into text.
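As a quick illustration of that pre-processing step (not part of the original card), the sketch below feeds one second of silence through the processor bundled with this checkpoint; for the small architecture the feature extractor always pads or truncates audio to a 30-second window of log-Mel frames.

```python
import numpy as np
from transformers import WhisperProcessor

# Load the processor that ships with the fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained("namphungdn134/whisper-small-vi")

# One second of silence at 16 kHz stands in for real speech
audio = np.zeros(16000, dtype=np.float32)

# The feature extractor pads/truncates to 30 s and computes log-Mel spectrograms
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
print(features.shape)  # torch.Size([1, 80, 3000]) for the small architecture
```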
Dataset
- Total Duration: More than 250 hours of high-quality Vietnamese speech data.
- Sources: Public Vietnamese datasets.
- Format: 16 kHz WAV files with corresponding text transcripts.
- Preprocessing: Audio was normalized and segmented, and transcripts were cleaned and tokenized (an illustrative sketch of such steps follows below).
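The exact preprocessing scripts are not published with this card; the sketch below only illustrates the kind of steps the list above describes (resampling to 16 kHz mono, peak normalization, simple transcript cleaning). The function names, file names, and cleaning rules are assumptions for illustration.

```python
import re

import librosa
import soundfile as sf

def preprocess_audio(in_path: str, out_path: str) -> None:
    # Resample to 16 kHz mono and peak-normalize (illustrative, not the card's actual pipeline)
    audio, _ = librosa.load(in_path, sr=16000, mono=True)
    peak = float(max(abs(audio.max()), abs(audio.min()))) or 1.0
    sf.write(out_path, audio / peak, 16000)

def clean_transcript(text: str) -> str:
    # Lowercase, trim whitespace, and strip punctuation (a stand-in for the cleaning described above)
    text = text.lower().strip()
    return re.sub(r"[^\w\s]", "", text)

preprocess_audio("raw_example.wav", "clean_example.wav")
print(clean_transcript("Xin chào, tôi là Nam!"))
```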
Fine-tuning Results
- Word Error Rate (WER): 9.3485
Evaluation was performed on a held-out test set with diverse regional accents and speaking styles; a minimal example of computing WER is sketched below.
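The evaluation script itself is not included in this card; the following is a minimal sketch of how a WER figure can be computed with the `evaluate` library (which requires `jiwer` to be installed), using made-up reference/prediction pairs rather than the actual test set.

```python
import evaluate

# Hypothetical reference transcripts and model outputs (the real test set is not distributed here)
references = ["xin chào việt nam", "hôm nay trời rất đẹp"]
predictions = ["xin chào việt nam", "hôm nay trời thật đẹp"]

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.4f}")  # multiply by 100 to express the score as a percentage
```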
Technical Details
- Model Type: Transformer-based sequence-to-sequence (encoder-decoder) model, small variant.
- Base Model: openai/whisper-small
- Training Data: Over 250 hours of high-quality Vietnamese speech data from public datasets.

| Property | Details |
|----------|---------|
| Model Type | Transformer-based sequence-to-sequence (encoder-decoder) model, small variant |
| Base Model | openai/whisper-small |
| Training Data | Over 250 hours of high-quality Vietnamese speech data from public datasets |
Limitations
Important Note
- This model is fine-tuned specifically for Vietnamese and may not perform well on other languages.
- It struggles with overlapping speech or noisy backgrounds.
- Performance may drop on strong dialectal variations that are not well represented in the training data.
License
This model is licensed under the MIT License.
Citation
If you use this model in your research or application, please cite the original Whisper model and this fine-tuning work as follows:
```bibtex
@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}

@misc{whisper_small_vi_2025,
  title={Whisper Small Vi V1.1 - Nam Phung},
  author={Nam Phùng},
  organization={DUT},
  year={2025},
  url={https://huggingface.co/namphungdn134/whisper-small-vi},
  note={Code: https://github.com/namphung134/ASR-Vietnamese}
}
```
Contact
For questions, collaborations, or suggestions, feel free to reach out via namphungdn134@gmail.com.