Monsoon-Whisper-Medium-GigaSpeech2 Open-Source Thai Speech Recognition Model

Developed by scb10x

Monsoon-Whisper-Medium-GigaSpeech2 is a Thai automatic speech recognition (ASR) model, based on Whisper-Medium and fine-tuned on the GigaSpeech2 dataset, suitable for speech recognition in real-world scenarios.

Speech Recognition

Transformers

Open Source License: Apache-2.0

Tags: Thai speech recognition, low word error rate, noisy environment adaptation

Release Time: 7/12/2024

Model Overview

This model targets Thai automatic speech recognition and is reported to perform well on YouTube-sourced audio and on speech recorded in noisy environments.

Model Features

Thai speech recognition

Targets Thai speech recognition, with an emphasis on real-world audio rather than clean studio recordings.

Fine-tuned based on Whisper-Medium

Based on the Whisper-Medium architecture and fine-tuned on the GigaSpeech2 dataset.

High performance

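Achieves the lowest WER and CER on GS2 among the compared models (22.74 WER, 14.15 CER). On CV17, the biodatlab fine-tuned models score lower (14.25 WER and 5.69 CER for the medium variant, against 20.79 and 6.92 for this model).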

Model Capabilities

Thai speech recognition

Speech recognition in noisy environments

Use Cases

Speech recognition

YouTube audio transcription

Suitable for transcribing Thai speech content in YouTube videos.

Speech recognition in noisy environments

Reported to hold up well on noisy audio. The model is experimental, so accuracy should be verified for each application.

🚀 Monsoon-Whisper-Medium-Gigaspeech2

Monsoon-Whisper-Medium-Gigaspeech2 is a 🇹🇭 Thai Automatic Speech Recognition (ASR) model. It is built upon Whisper-Medium and fine-tuned on GigaSpeech2. Originally developed as a scale experiment for research on emergent capabilities in ASR tasks, it performs well in real-world scenarios, including with YouTube-sourced audio and in noisy environments. More details can be found in our Typhoon-Audio Release Blog.

🚀 Quick Start

Monsoon-Whisper-Medium-Gigaspeech2 is a Thai ASR model based on Whisper-Medium and fine-tuned on GigaSpeech2. It is suitable for various ASR tasks, especially in real-world and noisy environments.

✨ Features

Based on the well-known Whisper-Medium architecture.
Fine-tuned on GigaSpeech2 for better performance on Thai speech recognition.
Performs well in real-world scenarios, including with YouTube audio and in noisy environments.

📦 Installation

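The model requires transformers 4.38.0 or newer. The usage example also imports torch and torchaudio. You can install them using pip (the quotes keep the shell from treating `>` as a redirect):

pip install "transformers>=4.38.0" torch torchaudio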
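
To confirm at runtime that the installed version satisfies the 4.38.0 requirement, a check along the following lines may help. This is a minimal sketch: `parse_version`, `meets_minimum`, and `installed_transformers_ok` are hypothetical helper names introduced here, and the parsing only compares the leading numeric components of a version string.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.38.0"  # minimum version stated in this model card


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints.

    Parsing stops at the first component with no leading digits (e.g. "dev0"),
    and the result is padded with zeros to at least three components.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, required: str = MIN_TRANSFORMERS) -> bool:
    """True if `installed` is at least `required` (numeric comparison only)."""
    return parse_version(installed) >= parse_version(required)


def installed_transformers_ok() -> bool:
    """Check the locally installed transformers package; False if it is missing."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False
```

Pre-release suffixes such as `rc1` are ignored by this comparison, so a stricter check would need a full version parser.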

💻 Usage Examples

Basic Usage

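from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

model_path = "scb10x/monsoon-whisper-medium-gigaspeech2"
device = "cuda"  # change to "cpu" if no GPU is available
filepath = "audio.wav"

# Load the processor (feature extractor + tokenizer) and the model weights
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)
model.to(device)
model.eval()

# Force Thai transcription instead of language auto-detection or translation
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="th", task="transcribe"
)

# The Whisper feature extractor expects 16 kHz mono audio:
# average the channels and resample if the file uses another rate
array, sr = torchaudio.load(filepath)
array = array.mean(dim=0)
if sr != 16000:
    array = torchaudio.functional.resample(array, sr, 16000)
    sr = 16000

input_features = (
    processor(array.numpy(), sampling_rate=sr, return_tensors="pt")
    .to(device)
    .to(torch.bfloat16)
    .input_features
)
with torch.no_grad():
    predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)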
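
The basic example passes the whole file to the processor in a single call. Whisper's feature extractor works on 30-second windows, so longer recordings (for example, full YouTube videos) are usually split before transcription. The following is a minimal sketch of the splitting step only, assuming 16 kHz audio and non-overlapping windows; `chunk_bounds` is a hypothetical helper, not part of transformers or torchaudio.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Return (start, end) sample indices covering the audio in fixed-size windows.

    The last window may be shorter than the others.
    """
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and chunk_seconds must be > 0")
    chunk_len = int(sample_rate * chunk_seconds)
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# Example: a 70-second recording at 16 kHz yields three windows (30 s, 30 s, 10 s).
bounds = chunk_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice of the mono waveform could then go through the processor and `model.generate` as in the basic example, with the decoded pieces joined afterwards. Cutting at fixed positions can split a word at a boundary; overlapping windows or silence-based splitting are common refinements not shown here.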

📚 Documentation

Model Description

Property	Details
Model Type	Whisper Medium
Requirement	transformers 4.38.0 or newer
Primary Language(s)	Thai 🇹🇭
License	Apache 2.0

Evaluation Results

Model	WER (GS2)	WER (CV17)	CER (GS2)	CER (CV17)
whisper-large-v3	37.02	22.63	24.03	8.49
whisper-medium	55.64	43.01	37.55	16.41
biodatlab-whisper-th-medium-combined	31.00	14.25	21.20	5.69
biodatlab-whisper-th-large-v3-combined	29.02	15.72	19.96	6.32
monsoon-whisper-medium-gigaspeech2	22.74	20.79	14.15	6.92
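
For readers unfamiliar with the metrics in the table, the sketch below shows the standard definitions of word error rate (WER) and character error rate (CER): the edit distance between reference and hypothesis, divided by the reference length. Lower is better. It assumes the table values are percentages. The source does not state how Thai text was segmented into words or normalized before scoring, so tokenization is left to the caller, and the function names are illustrative rather than the evaluation code actually used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by reference length, as a percentage."""
    if len(ref_tokens) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref_words, hyp_words):
    """Word error rate over pre-tokenized word lists."""
    return error_rate(ref_words, hyp_words)


def cer(ref_text, hyp_text):
    """Character error rate over raw strings."""
    return error_rate(list(ref_text), list(hyp_text))


# One substitution and one deletion against a four-word reference
print(wer(["a", "b", "c", "d"], ["a", "x", "c"]))  # 50.0
```

Because Thai is written without spaces between words, WER depends on the word segmenter chosen, while CER does not. That is one reason to read the two columns together.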

Intended Uses & Limitations

⚠️ Important Note

This model is experimental and may not always be accurate. Developers should carefully assess potential risks in the context of their specific applications.

🔗 Follow us & Support

👥 Typhoon Team

Kunat Pipatanakul, Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai

📄 License

This model is licensed under the Apache 2.0 license.
