Whisper-Hindi2Hinglish-Prime open-source model: high-accuracy Hindi-to-Hinglish transcription in noisy environments

Whisper Hindi2Hinglish Prime

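Developed by Oriserve

A mixed Hindi-Hinglish automatic speech recognition model built on an optimized Whisper architecture, supporting high-accuracy transcription in noisy environments

Speech Recognition

Transformers

Multilingual
Open-source license: Apache-2.0
Tags: #Hindi-English mixed speech recognition #Noise-robust speech transcription #Indian accent optimization

Downloads: 1,812

Release date: 1/7/2025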

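Model Overview

This model is a speech recognition system specialized for Indian accents. It transcribes Hindi and mixed Hindi-Hinglish speech accurately, and is tuned in particular for recognition in noisy environments and for suppressing hallucinations.

Model Features

Hinglish support

Adds the ability to transcribe audio into colloquial Hinglish, reducing grammatical errors

Improved noise robustness

Optimized for high-noise Indian environments, with substantially better recognition accuracy under noise

Hallucination mitigation

A dedicated training strategy substantially reduces hallucinations during speech recognition

Performance improvement

About 39% better on average than the original Whisper model on benchmark tests

Indian accent adaptation

Fine-tuned on about 550 hours of Indian-accented data to handle the characteristics of speech recorded in India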

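Model Capabilities

Hindi speech recognition

Hinglish transcription

Speech processing in noisy environments

Chunked processing of long audio

Multi-speaker identification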

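Use Cases

Speech transcription services

Call center recording transcription

Converts Hinglish calls from call centers in India into text

Brings WER down to 32.43% (the figure reported on the Common Voice Hindi test set)

Subtitle generation for educational content

Generates automatic subtitles for locally produced Indian educational videos

Supports bilingual Hindi and Hinglish subtitles

Voice assistants

Voice assistants for Indian speech varieties

Lets users in India talk to a voice assistant in Hinglish

Understands colloquial expressions accurately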

language:
- en
- hi
tags:
- audio
- automatic-speech-recognition
- whisper-event
- pytorch
- hinglish
inference: true
model-index:
- name: Whisper-Hindi2Hinglish-Prime
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs
      type: google/fleurs
      config: hi_in
      split: test
    metrics:
    - type: wer
      value: 28.6806
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_20_0
      type: mozilla-foundation/common_voice_20_0
      config: hi
      split: test
    metrics:
    - type: wer
      value: 32.4314
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Indic-Voices
      type: Indic-Voices
      config: hi
      split: test
    metrics:
    - type: wer
      value: 60.8224
      name: WER
widget:
- src: audios/c0637211-7384-4abc-af69-5aacf7549824_1_2629072_2656224.wav
  output:
    text: Mehnat to poora karte hain.
- src: audios/c0faba11-27ba-4837-a2eb-ccd67be07f40_1_3185088_3227568.wav
  output:
    text: Haan vahi ek aapko bataaya na.
- src: audios/663eb653-d6b5-4fda-b5f2-9ef98adc0a61_0_1098400_1118688.wav
  output:
    text: Aap pandrah log hain.
- src: audios/f5e0178c-354c-40c9-b3a7-687c86240a77_1_2613728_2630112.wav
  output:
    text: Kitne saal ki?
- src: audios/f5e0178c-354c-40c9-b3a7-687c86240a77_1_1152496_1175488.wav
  output:
    text: Lander cycle chaahie.
- src: audios/c0637211-7384-4abc-af69-5aacf7549824_1_2417088_2444224.wav
  output:
    text: Haan haan, dekhe hain.
- src: audios/common_voice_hi_23796065.mp3
  example_title: Speech Example 1
- src: audios/common_voice_hi_41666099.mp3
  example_title: Speech Example 2
- src: audios/common_voice_hi_41429198.mp3
  example_title: Speech Example 3
- src: audios/common_voice_hi_41429259.mp3
  example_title: Speech Example 4
- src: audios/common_voice_hi_40904697.mp3
  example_title: Speech Example 5
pipeline_tag: automatic-speech-recognition
license: apache-2.0
metrics:
- wer
base_model:
- openai/whisper-large-v3
library_name: transformers

Whisper-Hindi2Hinglish-Prime:

GITHUB LINK: github link
SPEECH-TO-TEXT ARENA: Speech-To-Text Arena

Key Features:

Hinglish as a language: Adds the ability to transcribe audio into spoken Hinglish, reducing the chance of grammatical errors.
Whisper Architecture: Based on the Whisper architecture, which makes it easy to use with the transformers package.
Better Noise Handling: The model is resistant to noise and does not return transcriptions for audio that contains only noise.
Hallucination Mitigation: Minimizes transcription hallucinations to enhance accuracy.
Performance Increase: ~39% average relative WER reduction versus the pretrained model (Whisper Large V3) across the benchmarking datasets.

Training:

Data:

Duration: A total of ~550 hours of noisy, Indian-accented Hindi data was used to finetune the model.
Collection: Because ASR-ready Hinglish datasets are scarce, a specially curated proprietary dataset was used.
Labelling: The data was first labeled using a SOTA model, and the transcriptions were then corrected by human annotators.
Quality: Emphasis was placed on collecting noisy data, as the model is intended for Indian environments where background noise is abundant.
Processing: All audio was split into chunks shorter than 30 s, with at most 2 speakers per clip. No further processing was applied, so as not to alter the quality of the source data.
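
The chunking step is not described beyond the 30-second limit. As a rough illustration only, the sketch below splits a mono waveform into consecutive segments shorter than 30 s at Whisper's 16 kHz sampling rate. The function name `chunk_waveform` and the fixed-window strategy are assumptions; the authors may well have split on silence or speaker turns instead.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000      # Whisper models expect 16 kHz audio
MAX_CHUNK_SECONDS = 30.0  # upper bound stated in the model card (exclusive)


def chunk_waveform(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    max_seconds: float = MAX_CHUNK_SECONDS,
) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive chunks strictly shorter than max_seconds.

    Hypothetical helper: fixed-size windows, no overlap, no silence detection.
    """
    # Largest sample count that is still strictly below the limit.
    max_len = int(sample_rate * max_seconds) - 1
    if max_len <= 0:
        raise ValueError("max_seconds is too small for this sample rate")
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]


# 70 s of silence is split into three chunks, each under 30 s.
chunks = chunk_waveform([0.0] * (70 * SAMPLE_RATE))
print([len(c) for c in chunks])
```

Fixed windows can cut a word in half at a boundary, which is one reason a silence-aware splitter may be preferable for real training data.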

Finetuning:

Novel Trainer Architecture: A custom trainer was written to ensure efficient supervised finetuning, with custom callbacks for better observability during training.
Custom Dynamic Layer Freezing: The most active layers in the model were identified by running inference on a subset of the training data with the pre-trained model. These layers were kept unfrozen during training while all other layers were frozen, which enabled faster convergence and more efficient finetuning.
DeepSpeed Integration: DeepSpeed was also used to speed up and optimize the training process.
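
The card does not say how the "most active" layers were measured or how many were kept trainable. The sketch below assumes that a per-layer activity score (for example, a mean absolute activation gathered with forward hooks) has already been computed, keeps the top fraction trainable, and freezes the rest by toggling `requires_grad`. The names `select_active_layers` and `freeze_inactive`, the scores, and the 50% ratio are all hypothetical.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, Set, Tuple


def select_active_layers(scores: Dict[str, float], keep_fraction: float = 0.5) -> Set[str]:
    """Return the names of the highest-scoring layers (at least one)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def freeze_inactive(named_parameters: Iterable[Tuple[str, object]], active: Set[str]) -> int:
    """Set requires_grad=True only for parameters under an active layer prefix.

    Accepts the (name, parameter) pairs produced by torch's model.named_parameters().
    Returns the number of parameters left trainable.
    """
    trainable = 0
    for name, param in named_parameters:
        # The trailing dot stops "layers.1" from also matching "layers.10".
        param.requires_grad = any(name.startswith(prefix + ".") for prefix in active)
        trainable += int(param.requires_grad)
    return trainable


# Made-up activity scores for four decoder layers (illustrative values only).
scores = {
    "model.decoder.layers.0": 0.12,
    "model.decoder.layers.1": 0.87,
    "model.decoder.layers.2": 0.45,
    "model.decoder.layers.3": 0.91,
}
active = select_active_layers(scores, keep_fraction=0.5)

# Stand-ins for torch parameters so the sketch runs without loading a model.
params = [(f"{layer}.fc1.weight", SimpleNamespace(requires_grad=True)) for layer in scores]
n_trainable = freeze_inactive(params, active)
print(sorted(active), n_trainable)
```

With a real model, `model.named_parameters()` would be passed in place of `params`, and only the parameters left trainable would be handed to the optimizer.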

Performance Overview

Qualitative Performance Overview

Audio	Whisper Large V3	Whisper-Hindi2Hinglish-Prime
	maynata pura, canta maynata	Mehnat to poora karte hain.
	Where did they come from?	Haan vahi ek aapko bataaya na.
	A Pantral Logan.	Aap pandrah log hain.
	Thank you, Sanchez.	Kitne saal ki?
	Rangers, I can tell you.	Lander cycle chaahie.
	Uh-huh. They can't.	Haan haan, dekhe hain.

Quantitative Performance Overview

Note:

The WER scores below are for Hinglish text generated by our model and by the original Whisper model.
To check our model's real-world performance against other SOTA models, please head to our Speech-To-Text Arena space.

Dataset	Whisper Large V3 (WER %)	Whisper-Hindi2Hinglish-Prime (WER %)
Common-Voice	61.9432	32.4314
FLEURS	50.8425	28.6806
Indic-Voices	82.5621	60.8224
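
For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below implements that definition and then recomputes the relative WER reduction from the table above. Averaged over the three datasets it comes out at roughly 39%, in line with the figure quoted under Key Features. The text normalization behind the reported scores is not documented, so the simple lowercase-and-split tokenization here is an assumption.

```python
from typing import Dict, Tuple


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# (Whisper Large V3, Whisper-Hindi2Hinglish-Prime) WER values copied from the table.
TABLE: Dict[str, Tuple[float, float]] = {
    "Common-Voice": (61.9432, 32.4314),
    "FLEURS": (50.8425, 28.6806),
    "Indic-Voices": (82.5621, 60.8224),
}

reductions = {name: 100.0 * (base - ours) / base for name, (base, ours) in TABLE.items()}
average_reduction = sum(reductions.values()) / len(reductions)

# One deleted word out of four reference words.
print(wer("aap pandrah log hain", "aap pandrah hain"))
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative WER reduction")
print(f"average: {average_reduction:.1f}%")
```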

Usage:

Using Transformers

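To run the model, first install the Transformers library, together with Accelerate, which the low_cpu_mem_usage=True option used below relies on:

pip install -U transformers accelerate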

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # Specify English language
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text
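
When several recordings need transcribing, the pipeline also accepts a list of inputs and returns one result per item. The helper below is a small sketch built on that behaviour. `transcribe_files` and its `batch_size` default are illustrative choices, and the `pipe` argument is assumed to be the pipeline object created above. Clips that contain only noise are expected to come back with empty text, so those are collected separately.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def transcribe_files(
    pipe: Callable[..., List[dict]],
    paths: Sequence[str],
    batch_size: int = 8,
) -> Tuple[Dict[str, str], List[str]]:
    """Run an ASR pipeline over several files.

    Returns (transcripts, empty): transcripts maps path -> text, and empty lists
    the paths whose transcription came back blank (e.g. noise-only clips).
    """
    results = pipe(list(paths), batch_size=batch_size)
    transcripts: Dict[str, str] = {}
    empty: List[str] = []
    for path, result in zip(paths, results):
        text = result["text"].strip()
        if text:
            transcripts[path] = text
        else:
            empty.append(path)
    return transcripts, empty


# Stand-in for the real pipeline so the sketch runs without downloading the model;
# replace it with the `pipe` object built in the example above.
def fake_pipe(inputs, batch_size=1):
    canned = {"call_01.wav": " Kitne saal ki?", "noise.wav": ""}
    return [{"text": canned[p]} for p in inputs]


transcripts, empty = transcribe_files(fake_pipe, ["call_01.wav", "noise.wav"])
print(transcripts, empty)
```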

Using Flash Attention 2

Flash Attention 2 can be used to speed up transcription. If your GPU supports it, first install Flash Attention:

pip install flash-attn --no-build-isolation

Once installed you can then load the model using the below code:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
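
Because flash-attn only works on supported CUDA GPUs with half-precision weights, it can be convenient to fall back to another attention backend when the package is missing. The sketch below is one way to pick the value for `attn_implementation`. The helper name `pick_attn_implementation` is hypothetical, and `"sdpa"` (PyTorch's scaled dot-product attention) is used as the fallback on the assumption that the installed Transformers version supports it for Whisper.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" if the flash_attn package is importable and a
    CUDA device is available; otherwise fall back to "sdpa"."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if (cuda_available and has_flash) else "sdpa"


# Without a GPU the helper always selects the fallback.
print(pick_attn_implementation(cuda_available=False))

# Intended use with the loading code above (left commented out so that this
# snippet runs without torch or a model download):
# model = AutoModelForSpeechSeq2Seq.from_pretrained(
#     model_id,
#     torch_dtype=torch_dtype,
#     low_cpu_mem_usage=True,
#     attn_implementation=pick_attn_implementation(torch.cuda.is_available()),
# )
```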

Using the OpenAI Whisper module

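First, install the openai-whisper library:

pip install -U openai-whisper tqdm

Convert the Hugging Face checkpoint to a PyTorch checkpoint in the OpenAI Whisper format. The script expects the parameter-name mapping file convert_hf2openai.json in the working directory: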

import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Prime.pt"
save_model(model, model_save_path)
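
The contents of convert_hf2openai.json are not shown here. To illustrate the mechanism, the sketch below applies a two-entry, hypothetical mapping in the same pattern-to-replacement form and reports the keys that match nothing, which `save_model` above silently drops. The two patterns are modelled on the usual naming difference between the Hugging Face and OpenAI Whisper attention layers, but they are examples rather than the file's actual entries.

```python
import re
from collections import OrderedDict
from typing import Dict, Iterable, List, Tuple

# Hypothetical excerpt in the same {regex pattern: replacement} form.
EXAMPLE_MAPPING = OrderedDict([
    (r"^encoder\.layers\.(\d+)\.self_attn\.q_proj\.(weight|bias)$",
     r"encoder.blocks.\1.attn.query.\2"),
    (r"^encoder\.layers\.(\d+)\.self_attn\.out_proj\.(weight|bias)$",
     r"encoder.blocks.\1.attn.out.\2"),
])


def translate_keys(
    keys: Iterable[str], mapping: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Rename each key with the first matching pattern; collect keys that match none."""
    translated: Dict[str, str] = {}
    unmatched: List[str] = []
    for key in keys:
        for pattern, repl in mapping.items():
            if re.match(pattern, key):
                translated[key] = re.sub(pattern, repl, key)
                break
        else:  # no pattern matched this key
            unmatched.append(key)
    return translated, unmatched


translated, unmatched = translate_keys(
    [
        "encoder.layers.3.self_attn.q_proj.weight",
        "encoder.layers.3.self_attn.out_proj.bias",
        "encoder.some_unmapped.weight",
    ],
    EXAMPLE_MAPPING,
)
print(translated)
print("unmatched:", unmatched)
```

Checking the unmatched list before saving is a cheap way to notice an incomplete mapping, since a missing weight would otherwise only surface when the converted checkpoint is loaded.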

Transcribe

import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Prime.pt")
result = model.transcribe("sample.wav")
print(result["text"])

Miscellaneous

This model belongs to a family of Transformer-based ASR models trained by Oriserve. To compare it against other models from the same family or other SOTA models, please head to our Speech-To-Text Arena. To learn more about our other models, or for any queries about AI voice agents, reach out to us by email at ai-team@oriserve.com.