A localized Vietnamese-enhanced version of Whisper Large-v3 Turbo optimized with CTranslate2, supporting multilingual speech recognition with high speed and accuracy
This is an optimized speech-to-text model based on the Whisper Large-v3 Turbo architecture, specially enhanced for Vietnamese while supporting multiple languages. The model is optimized with CTranslate2, providing ultra-fast transcription capabilities.
Model Features
Ultra-fast transcription: Processes 30 seconds of audio in approximately 350 ms, supporting real-time transcription.
Multilingual support: Supports 11 languages, with special optimization for 8 Vietnamese regional accents.
High accuracy: Achieves a word error rate (WER) of about 12% for major languages and handles a wide range of accents.
CTranslate2 optimization: Achieves a roughly 2.5x speedup through the CTranslate2 library, suitable for low-latency applications.
Model Capabilities
Speech-to-text
Multilingual recognition
Real-time transcription
Accent adaptation
Use Cases
Real-time transcription
Meeting minutes: Real-time transcription of meeting content into near real-time text records.
Interview records: Automatic transcription of interview audio into fast and accurate interview records.
Accessibility tools
Hearing assistance: Real-time captions for hearing-impaired individuals, improving communication accessibility.
Media production
Video subtitles: Automatic, fast, and accurate subtitle generation for videos.
🚀 EraX-WoW-Turbo V1.1-CT2: Whisper Large-v3 Turbo with CTranslate2 for Vietnamese and then some, Supercharged and Localized!
EraX-WoW-Turbo V1.1-CT2 is a powerful speech recognition model. Built upon Whisper Large-v3 Turbo, it offers real-time transcription, multilingual support, and high accuracy, and was trained on a large dataset (roughly 1,000 hours of audio). It is open-source under the MIT License, providing a great solution for various speech-related applications.
🚀 Quick Start
To start using EraX-WoW-Turbo V1.1-CT2, you need to install the necessary packages and run the provided Python code.
from faster_whisper import WhisperModel
from pydub import AudioSegment

model_path = "erax-ai/EraX-WoW-Turbo-V1.1-CT2"

# Convert the audio to MONO, 16 kHz if necessary
def convert16k(audio_path):
    audio = AudioSegment.from_file(audio_path, format="wav")
    audio = audio.split_to_mono()[0]       # keep the first channel only
    audio = audio.set_frame_rate(16000)    # resample to 16 kHz
    audio.export("test.wav", format="wav")
    return True

convert16k("your_audio.wav")  # replace with the path to your recording

# Run on GPU with BF16 precision
fast_model = WhisperModel(model_path, device="cuda", compute_type="bfloat16")

segments, info = fast_model.transcribe(
    "test.wav",
    beam_size=5,
    # word_timestamps=True,
    language="vi",
    temperature=0.0,
    vad_filter=True,
    # vad_parameters=dict(min_silence_duration_ms=2000),
)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
✨ Features
Blazing Fast
With the CTranslate2 library, EraX-WoW-Turbo can achieve real-time transcription: it can process 30 seconds of audio in about 350 ms, much faster than the original Medium model.
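As a rough way to verify this on your own hardware, the sketch below times a single transcription call, reusing fast_model from the Quick Start above; the file name test.wav is a placeholder for a 16 kHz mono recording. Note that transcribe returns a lazy generator, so the segments must be consumed before the timer is stopped.

import time

start = time.perf_counter()
segments, info = fast_model.transcribe("test.wav", language="vi", beam_size=5)
text = " ".join(segment.text for segment in segments)  # decoding actually runs while the generator is consumed
elapsed = time.perf_counter() - start
print("Transcribed %.1fs of audio in %.0f ms" % (info.duration, elapsed * 1000))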
Multilingual Maestro
The model is fine-tuned on a diverse dataset covering 11 key languages, including Vietnamese, English (US), Chinese (Mandarin), Cantonese, Indonesian, Korean, Japanese, Russian, German, French, and Dutch.
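If the input language is not known in advance, the language argument can simply be omitted and faster-whisper will detect it automatically. A minimal sketch, again reusing fast_model from the Quick Start (sample.wav is a placeholder):

# Omit `language` to let the model detect it automatically.
segments, info = fast_model.transcribe("sample.wav", beam_size=5, vad_filter=True)
print("Detected language: %s (probability %.2f)" % (info.language, info.language_probability))
for segment in segments:
    print(segment.text)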
Accuracy You Can Trust
Preliminary tests show an impressive WER (Word Error Rate) around 12% across major languages, including challenging Vietnamese dialects.
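To measure WER on your own data, one option is the third-party jiwer package (an assumption here, not part of this project), which compares a reference transcript with the model output:

from jiwer import wer

reference = "xin chào tất cả mọi người"     # ground-truth transcript (illustrative)
hypothesis = "xin chào tất cả mọi người ạ"  # model output (illustrative)
print("WER: %.1f%%" % (100 * wer(reference, hypothesis)))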
Trained with Care
It was trained on a substantial dataset (600,000 samples, roughly 1,000 hours) covering real-world audio conditions, so it handles noise well.
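For long or noisy recordings, the Silero VAD filter built into faster-whisper can be tuned to skip non-speech before decoding. A minimal sketch, reusing fast_model; the file name and parameter value are placeholders:

segments, info = fast_model.transcribe(
    "noisy_meeting.wav",
    language="vi",
    vad_filter=True,                                   # drop non-speech segments before decoding
    vad_parameters=dict(min_silence_duration_ms=500),  # treat pauses of 0.5 s or more as silence
)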
Open Source (MIT License)
The model is open-source under the MIT License, allowing free commercial and non-commercial use, modification, and redistribution with minimal restrictions.
Try it
You can try the model with the following audio sample:
"Chị Lan Anh ơi, em xin lỗi vì sự cố mất sóng vừa rồi. Em đã ghi nhận được hầu hết thông tin rồi ạ. Bây giờ em muốn hỏi chị là hiện tại xe của chị đang ở đâu ạ? Xe vẫn còn ở hiện trường hay đã được di chuyển đến gara hay nơi nào khác?"
📦 Installation
To use EraX-WoW-Turbo V1.1-CT2, install the faster-whisper package (which pulls in CTranslate2 as a dependency) and pydub for the audio conversion step, for example with pip install faster-whisper pydub.
The basic usage is shown in the code example above: convert your audio to mono 16 kHz, then run the transcription with faster-whisper.
Advanced Usage
Performance can be tuned further through the CTranslate2 backend, which is what provides the roughly 2.5x speedup over the original implementation and makes the model suitable for applications requiring the lowest possible latency. In particular, the precision used for inference can be selected when loading the model, as shown below.
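A minimal sketch of the available precision settings via the compute_type argument of WhisperModel; the exact speedup and memory savings depend on your hardware:

from faster_whisper import WhisperModel

model_path = "erax-ai/EraX-WoW-Turbo-V1.1-CT2"
cpu_model = WhisperModel(model_path, device="cpu", compute_type="int8")           # 8-bit weights on CPU
gpu_model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")  # 8-bit weights, FP16 compute on GPU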
📚 Documentation
Use Cases
Real-time Transcription: Suitable for live captioning, meetings, interviews, etc.
Voice Assistants: Build responsive and accurate voice-controlled applications.
Media Subtitling: Generate subtitles for videos and podcasts quickly and accurately (see the sketch after this list).
Accessibility Tools: Empower individuals with hearing impairments.
Language Learning: Practice pronunciation and receive instant feedback.
Multilingual Communication: Combine it with the upcoming EraX translator for a complete multilingual communication solution, such as for international conferences or travel apps.
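For the media-subtitling use case above, the segment timestamps returned by transcribe map directly onto the SRT format. A minimal sketch, reusing fast_model from the Quick Start; the input and output file names are placeholders:

def srt_timestamp(seconds):
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, ms)

segments, info = fast_model.transcribe("video_audio.wav", language="vi", vad_filter=True)
with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(segments, start=1):
        f.write("%d\n%s --> %s\n%s\n\n" % (
            i, srt_timestamp(segment.start), srt_timestamp(segment.end), segment.text.strip()))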
Limitations
This model is trained on adult speech and might struggle with the high-pitched cries of infants or very quiet, hushed whispers.
Get Involved
Try it out: Download the model and test it.
Provide feedback: Let the developers know what works, what doesn't, and what features you'd like to see.
Contribute: If you're a developer, consider contributing to the project.
📄 License
This project follows the MIT license, just like Whisper.
📝 Citation
If you find our project useful, we would appreciate it if you could star our repository and cite our work as follows:
@article{EraXWoWTurboV11CT2,
  title={EraX-WoW-Turbo-V1.1-CT2: Lắng nghe để Yêu thương},
  author={Nguyễn Anh Nguyên and Phạm Huỳnh Nhật and Cty Bảo hiểm AAA (504h)},
  organization={EraX},
  year={2025},
  url={https://huggingface.co/erax-ai/EraX-WoW-Turbo-V1.1-CT2}
}