# Fine-tuned Whisper-V3-Turbo for Vietnamese ASR
This project fine-tunes the Whisper-V3-Turbo model to improve its performance on Vietnamese Automatic Speech Recognition (ASR).
## Quick Start
The model is fine-tuned from Whisper-V3-Turbo for Vietnamese ASR. Training ran for 240 hours on a single Nvidia A6000 GPU.
## Features
- Multilingual Adaptation: The base model, Whisper, is a multilingual ASR model; this project fine-tunes it specifically for Vietnamese.
- Diverse Data Utilization: Training draws on a wide range of Vietnamese speech corpora, ensuring broad language coverage.
## Installation
The original README does not include installation steps. The usage example below requires the `torch` and `transformers` packages.
## Usage Examples
### Basic Usage
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "suzii/vi-whisper-large-v3-turbo-v1"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe an audio file; segment-level timestamps are returned as well.
result = pipe("your-audio.mp3", return_timestamps=True)
print(result["text"])
```
### Advanced Usage
The original README does not include an advanced usage example.
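One preprocessing step worth sketching: Whisper's feature extractor expects 16 kHz mono audio, so audio recorded at another rate must be resampled before being passed to the pipeline as a raw array. Below is a minimal linear-interpolation resampler in pure Python, for illustration only; in practice you would use `librosa.resample` or `torchaudio.transforms.Resample`.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal via linear interpolation (illustrative sketch)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# The resampled signal can then be fed to the pipeline as a dict, e.g.:
# result = pipe({"array": np.asarray(audio_16k, dtype=np.float32),
#                "sampling_rate": 16000})
```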
## Documentation
### Data Sources
The training data comes from a variety of Vietnamese speech corpora:
- capleaf/viVoice
- NhutP/VSV-1100
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- google/fleurs (vi_vn)
- doof-ferb/LSVSC
- quocanh34/viet_vlsp
- linhtran92/viet_youtube_asr_corpus_v2
- doof-ferb/infore2_audiobooks
- linhtran92/viet_bud500
### Model
The base model is Whisper-V3-Turbo. Whisper is a multilingual ASR model trained on a large and diverse dataset; the version used here is fine-tuned specifically for Vietnamese.
### Training Configuration
- GPU Used: Nvidia A6000
- Training Time: 240 hours
- Wandb report
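The README records only the hardware and wall-clock time. For readers who want to reproduce a comparable run, a hypothetical `Seq2SeqTrainingArguments` sketch is shown below; every value here is an assumption for illustration, not the project's published configuration.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical settings -- the actual hyperparameters were not published.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-turbo-vi",    # assumed output path
    per_device_train_batch_size=16,     # assumption: fits a 48 GB A6000
    gradient_accumulation_steps=2,
    learning_rate=1e-5,                 # a common Whisper fine-tuning LR
    warmup_steps=500,
    fp16=True,
    predict_with_generate=True,         # needed to compute WER during eval
    generation_max_length=225,
    report_to=["wandb"],                # the README links a wandb report
)
```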
### Acknowledgements
This project would not be possible without the following datasets:
- capleaf/viVoice
- [NhutP/VSV-1100](https://huggingface.co/datasets/nhutp/vsv-1100)
- [doof-ferb/fpt_fosd](https://huggingface.co/datasets/doof-ferb/fpt_fosd)
- [doof-ferb/infore1_25hours](https://huggingface.co/datasets/doof-ferb/infore1_25hours)
- google/fleurs
- [doof-ferb/LSVSC](https://huggingface.co/datasets/doof-ferb/LSVSC)
- [quocanh34/viet_vlsp](https://huggingface.co/datasets/quocanh34/viet-vlsp)
- linhtran92/viet_youtube_asr_corpus_v2
- [doof-ferb/infore2_audiobooks](https://huggingface.co/datasets/doof-ferb/infore2_audiobooks/)
- linhtran92/viet_bud500
### Information Table

| Property | Details |
|----------|---------|
| Datasets | capleaf/viVoice, NhutP/VSV-1100, doof-ferb/fpt_fosd, doof-ferb/infore1_25hours, google/fleurs (vi_vn), doof-ferb/LSVSC, quocanh34/viet_vlsp, linhtran92/viet_youtube_asr_corpus_v2, doof-ferb/infore2_audiobooks, linhtran92/viet_bud500 |
| Language | vi |
| Metrics | wer |
| Base Model | openai/whisper-large-v3-turbo |
| Library Name | transformers |
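The listed metric is WER (word error rate): word-level edit distance divided by the number of reference words. As a reference, here is a minimal pure-Python WER computation; this is an illustrative sketch, not the project's evaluation script, and in practice libraries such as `jiwer` or `evaluate` are used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```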
## Technical Details
The original README does not provide technical details beyond the training configuration above.
## License
No license information is provided in the original README.