Monsoon-Whisper-Medium-Gigaspeech2
Monsoon-Whisper-Medium-Gigaspeech2 is a Thai Automatic Speech Recognition (ASR) model. It is built on Whisper-Medium and fine-tuned on GigaSpeech2. Originally developed as a scale experiment for research on emergent capabilities in ASR tasks, it performs well in real-world scenarios, including YouTube-sourced audio and noisy environments. More details can be found in our Typhoon-Audio Release Blog.
Quick Start
Monsoon-Whisper-Medium-Gigaspeech2 is a Thai ASR model based on Whisper-Medium and fine-tuned on GigaSpeech2. It is suited to a range of ASR tasks, especially real-world and noisy audio.
Features
- Based on the well-known Whisper-Medium architecture.
- Fine-tuned on GigaSpeech2 for better performance on Thai speech recognition.
- Performs well in real-world scenarios, including YouTube audio and noisy environments.
Installation
The model requires transformers 4.38.0 or newer. You can install it using pip:
pip install "transformers>=4.38.0"
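To confirm that an existing environment satisfies this requirement, a small check along the following lines may help. It is a minimal sketch: `parse_release` and `meets_minimum` are hypothetical helper names introduced here, and the comparison deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 38, 0)  # minimum version stated in this model card


def parse_release(version_string):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # Stop at the first component with no leading digits (e.g. "dev0").
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_string, minimum=MIN_TRANSFORMERS):
    """True if version_string is at least `minimum` (pre-release tags are ignored)."""
    release = parse_release(version_string)
    # Pad short versions so that "4.38" compares equal to (4, 38, 0).
    release += (0,) * (len(minimum) - len(release))
    return release >= minimum


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"transformers {installed}: {status}")
```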
Usage Examples
Basic Usage
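from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

model_path = "scb10x/monsoon-whisper-medium-gigaspeech2"
device = "cuda"  # change to "cpu" if no GPU is available
filepath = "audio.wav"

# Load the processor (feature extractor + tokenizer) and the model weights.
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)
model.to(device)
model.eval()

# Force Thai transcription instead of language detection or translation.
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="th", task="transcribe"
)

# Whisper's feature extractor expects 16 kHz mono audio,
# so downmix the channels and resample if the file differs.
array, sr = torchaudio.load(filepath)
array = array.mean(dim=0)
if sr != 16000:
    array = torchaudio.functional.resample(array, sr, 16000)
    sr = 16000

input_features = (
    processor(array.numpy(), sampling_rate=sr, return_tensors="pt")
    .to(device)
    .to(torch.bfloat16)
    .input_features
)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)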
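The basic example handles a single short clip. Whisper's feature extractor pads or truncates its input to a 30-second window, so a longer recording has to be split before transcription. The sketch below computes non-overlapping chunk boundaries in samples; `chunk_bounds` is a hypothetical helper, not part of transformers, and cutting at fixed offsets can split a word in two, so treat it as a starting point only.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices that cover the audio in fixed-size chunks."""
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and chunk_seconds must be > 0")
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size < 1:
        raise ValueError("chunk is shorter than one sample")
    # The last chunk is clipped to the end of the audio and may be shorter.
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


if __name__ == "__main__":
    # A 70-second recording at 16 kHz yields two full chunks and one 10-second chunk.
    for start, end in chunk_bounds(70 * 16000):
        print(start, end)
```

Each `(start, end)` pair can then be used to slice the waveform, and every slice passed through the processor and `model.generate` as in the basic example. The transformers `pipeline` API also accepts a `chunk_length_s` argument, which may be a more convenient route for long audio.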
Documentation
Model Description
| Property | Details |
|---|---|
| Model Type | Whisper Medium |
| Requirement | transformers 4.38.0 or newer |
| Primary Language(s) | Thai |
| License | Apache 2.0 |
Evaluation Results
| Model | WER (GS2) | WER (CV17) | CER (GS2) | CER (CV17) |
|---|---|---|---|---|
| whisper-large-v3 | 37.02 | 22.63 | 24.03 | 8.49 |
| whisper-medium | 55.64 | 43.01 | 37.55 | 16.41 |
| biodatlab-whisper-th-medium-combined | 31.00 | 14.25 | 21.20 | 5.69 |
| biodatlab-whisper-th-large-v3-combined | 29.02 | 15.72 | 19.96 | 6.32 |
| monsoon-whisper-medium-gigaspeech2 | 22.74 | 20.79 | 14.15 | 6.92 |
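For readers unfamiliar with the metrics, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below illustrates the definitions only. It assumes the scores are percentages, that words are already separated by spaces (Thai text would first need a word segmenter, and this card does not say which one was used), and that spaces are dropped before computing CER. It is not the evaluation script behind the numbers above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate; assumes words are separated by whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate; spaces are removed first (an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")), list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the cat sit"
    print(f"WER: {wer(ref, hyp):.2f}")  # 1 wrong word out of 3
    print(f"CER: {cer(ref, hyp):.2f}")  # 1 wrong character out of 9
```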
Intended Uses & Limitations
Important Note
This model is experimental and may not always be accurate. Developers should carefully assess potential risks in the context of their specific applications.
Follow us & Support
Typhoon Team
Kunat Pipatanakul, Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai
License
This model is licensed under the Apache 2.0 license.