Whisper-Hindi2Hinglish-Swift: Open-Source Speech Recognition Model - High-Accuracy Recognition of Indian-Accented Speech and Speech in Noisy Environments

Whisper Hindi2Hinglish Swift

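Developed by Oriserve

A Hindi-to-Hinglish speech recognition model optimized from the Whisper architecture and designed specifically for Indian accents and noisy environments

Speech recognition

Transformers

Multilingual · Open-source license: Apache-2.0 · #Hindi-English mixed recognition #Noise-environment optimization #Indian accent support

Downloads: 496

Released: 1/7/2025

Model overview

This model is a fine-tuned version of Whisper-base that specializes in transcribing Hindi speech into colloquial Hinglish text, making it suited to speech recognition scenarios in India.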

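Model features

Hinglish support

Adds the ability to transcribe audio into colloquial Hinglish text, reducing the rate of grammatical errors

Noise-environment optimization

Specifically optimized for the background noise common in India, improving recognition accuracy in noisy conditions

Hallucination mitigation

Training techniques minimize hallucinations during transcription and improve the accuracy of the output text

Dynamic layer freezing

A novel training technique that enables fast convergence and efficient fine-tuning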

Model capabilities

Hindi speech recognition

Hinglish text generation

Speech transcription in noisy environments

Long-form audio processing

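Use cases

Speech transcription services

Call center call records

Converts customer support calls in India into written records

Maintains high recognition accuracy even in noisy environments

Meeting minutes

Automatically generates minutes for Hinglish meetings

Handles multi-speaker conversations

Voice assistants

Localized voice command recognition

Provides accurate voice command recognition for users in India

Handles colloquial Hinglish expressions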

language:
- en
- hi
tags:
- audio
- automatic-speech-recognition
- whisper-event
- pytorch
inference: true
model-index:
- name: Whisper-Hindi2Hinglish-Swift
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs
      type: google/fleurs
      config: hi_in
      split: test
    metrics:
    - type: wer
      value: 35.0888
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_20_0
      type: mozilla-foundation/common_voice_20_0
      config: hi
      split: test
    metrics:
    - type: wer
      value: 38.6549
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Indic-Voices
      type: Indic-Voices
      config: hi
      split: test
    metrics:
    - type: wer
      value: 65.2147
      name: WER
widget:
- src: audios/f89b6428-c58a-4355-ad63-0752b69f2d30.wav
  output:
    text: vah bas din mein kitni baar chalti hai?
- src: audios/09cf2547-9d09-4914-926a-cf2043549c15.wav
  output:
    text: >-
      Salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane
      kaise?
- src: audios/6f7df89f-91a7-4cbd-be43-af7bce71a34b.wav
  output:
    text: vah roya aur aur roya.
- src: audios/969bede5-d816-461b-9bf2-bd115e098439.wav
  output:
    text: helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut.
- src: audios/cef43941-72c9-4d28-88dd-cb62808dc056.wav
  output:
    text: usne mujhe chithi ka javaab na dene ke lie daanta.
- src: audios/b27d49fe-fced-4a17-9887-7bfbc5d4a899.wav
  output:
    text: puraana shahar divaaron se ghera hua hai.
- src: audios/common_voice_hi_23796065.mp3
  example_title: Speech Example 1
- src: audios/common_voice_hi_41666099.mp3
  example_title: Speech Example 2
- src: audios/common_voice_hi_41429198.mp3
  example_title: Speech Example 3
- src: audios/common_voice_hi_41429259.mp3
  example_title: Speech Example 4
- src: audios/common_voice_hi_40904697.mp3
  example_title: Speech Example 5
pipeline_tag: automatic-speech-recognition
license: apache-2.0
metrics:
- wer
base_model:
- openai/whisper-base
library_name: transformers

Whisper-Hindi2Hinglish-Swift:

GITHUB LINK: github link
SPEECH-TO-TEXT ARENA: Speech-To-Text Arena

Key Features:

Hinglish as a language: Adds the ability to transcribe audio into spoken Hinglish, reducing the chance of grammatical errors.
Whisper Architecture: Based on the Whisper architecture, making it easy to use with the transformers package.
Hallucination Mitigation: Minimizes transcription hallucinations to enhance accuracy.
Performance Increase: ~57% average performance increase (relative WER reduction) versus the pretrained model across the benchmarking datasets.

Training:

Data:

Duration: A total of ~550 hours of noisy, Indian-accented Hindi audio was used to finetune the model.
Collection: Because few ASR-ready Hinglish datasets are available, a specially curated proprietary dataset was used.
Labelling: The data was labeled using a SOTA model, and the transcriptions were then improved by human intervention.
Quality: Emphasis was placed on collecting noisy data, as the model is intended for Indian environments where background noise is abundant.
Processing: All audio was split into chunks shorter than 30 s, with at most 2 speakers per clip. No further processing was applied, so as not to alter the quality of the source data.
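
The chunk-length constraint described above can be illustrated with a short sketch. The source does not say how clips were split (for example, whether cuts follow silences or speaker turns), so the fixed-stride approach, the function name `chunk_spans`, and the 29-second default are assumptions made purely for illustration.

```python
def chunk_spans(total_seconds, max_chunk_seconds=29.0):
    """Return (start, end) spans in seconds, each at most max_chunk_seconds long.

    Hypothetical helper: a plain fixed-stride split, not the authors' pipeline.
    """
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("duration must be non-negative and chunk length positive")
    spans = []
    start = 0.0
    while start < total_seconds:
        # Clamp the last span so it never runs past the end of the recording
        end = min(start + max_chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans


# A 75-second recording is split into three spans, all shorter than 30 s
spans = chunk_spans(75.0)
print(spans)
```

A fixed stride can cut through the middle of a word; a real pipeline would more likely cut at pauses, which is outside the scope of this sketch.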

Finetuning:

Novel Trainer Architecture: A custom trainer was written to ensure efficient supervised finetuning, with custom callbacks to enable higher observability during the training process.
Custom Dynamic Layer Freezing: The most active layers in the model were identified by running inference on a subset of the training data with the pre-trained model. These layers were kept unfrozen during training while all other layers were frozen. This enabled faster convergence and efficient finetuning.
Deepspeed Integration: DeepSpeed was also used to speed up and optimize the training process.
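
The selection step behind the layer freezing described above might look roughly like the sketch below. The source does not state how "most active" was measured or how many layers were kept, so the activity scores, the layer names, the `keep_top_k` parameter, and the function `select_trainable_layers` are all hypothetical.

```python
def select_trainable_layers(activity_by_layer, keep_top_k):
    """Map each layer name to True (keep unfrozen) or False (freeze).

    activity_by_layer: hypothetical per-layer activity scores, e.g. a mean
    activation magnitude collected while running inference on sample data.
    """
    if keep_top_k < 0:
        raise ValueError("keep_top_k must be non-negative")
    # Rank layers from most to least active and keep the top k trainable
    ranked = sorted(activity_by_layer, key=activity_by_layer.get, reverse=True)
    active = set(ranked[:keep_top_k])
    return {name: name in active for name in activity_by_layer}


# Illustrative scores only; these numbers are not measurements from the model
scores = {
    "encoder.layers.0": 0.12,
    "encoder.layers.1": 0.87,
    "decoder.layers.0": 0.45,
    "decoder.layers.1": 0.05,
}
plan = select_trainable_layers(scores, keep_top_k=2)
print(plan)
```

In PyTorch, a plan like this is typically applied by setting `requires_grad` to `False` on the parameters of every layer marked as frozen before training starts.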

Performance Overview

Qualitative Performance Overview

Audio	Whisper Base	Whisper-Hindi2Hinglish-Swift
	وہاں بس دن میں کتنی بار چلتی ہے	vah bas din mein kitni baar chalti hai?
	سلمان کی ایمیت سے پراوہویت ہوتے ہیں اس کمپنی کے سیر بھاؤ جانے کیسے	salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise?
	تو لویا تو لویا	vah roya aur aur roya.
	حلمت نہ پیننے سے بھارت میں ہر گنٹے ہوتی ہے چار لوگوں کی موت	helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut.
	اوستہ مجھے چٹھیکہ جواب نہ دینے کے لیٹانٹہ	usne mujhe chithi ka javaab na dene ke lie daanta.
	پرانا شاہ دیواروں سے گیرا ہوا ہے	puraana shahar divaaron se ghera hua hai.

Quantitative Performance Overview

Note:

The WER scores below are for Hinglish text generated by our model and by the original Whisper model.
To check our model's real-world performance against other SOTA models, please head to our Speech-To-Text Arena space.
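
For readers unfamiliar with the metric, here is a minimal sketch of word error rate (WER) using the standard word-level edit-distance definition. The source does not say which evaluation tool or text normalization was used for the scores in this section, so this shows the metric itself and is not the authors' evaluation script; the function name `word_error_rate` is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of eight
print(word_error_rate("vah bas din mein kitni baar chalti hai",
                      "vah bas din me kitni baar chalti hai"))
# No matching words plus one extra word: the score exceeds 100
print(word_error_rate("vah roya", "a b c"))
```

Because insertions count as errors, WER is not capped at 100, which is how a transcript that shares no words with the reference can score above 100.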

Dataset	Whisper Base (WER %)	Whisper-Hindi2Hinglish-Swift (WER %)
Common-Voice	106.7936	38.6549
FLEURS	104.2783	35.0888
Indic-Voices	110.8399	65.2147
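
The scores in the table can be turned into relative WER reductions with a few lines of arithmetic. The sketch below assumes that the "~57% average performance increase" listed under Key Features refers to the mean relative WER reduction across the three datasets; the source does not define the figure, so this is one plausible reading.

```python
# (baseline WER, finetuned WER) pairs copied from the table above
wer_scores = {
    "Common-Voice": (106.7936, 38.6549),
    "FLEURS": (104.2783, 35.0888),
    "Indic-Voices": (110.8399, 65.2147),
}


def relative_reduction(baseline, finetuned):
    """Percentage by which the finetuned WER is lower than the baseline WER."""
    return 100.0 * (baseline - finetuned) / baseline


reductions = {
    name: relative_reduction(baseline, finetuned)
    for name, (baseline, finetuned) in wer_scores.items()
}
average = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1f}%")
print(f"average: {average:.1f}%")
```

Under this reading the average comes out at about 57%, in line with the headline figure.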

Usage:

Using Transformers

To run the model, first install the Transformers library (the example below also requires PyTorch):

pip install --upgrade transformers

The model can be used with the pipeline class to transcribe audios of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # Specify English language
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text

Using the OpenAI Whisper module

First, install the openai-whisper library

pip install -U openai-whisper tqdm

Convert the Hugging Face checkpoint to a PyTorch model in the format expected by openai-whisper:

import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)
        return None  # No pattern matched; the caller skips this parameter

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Swift.pt"
save_model(model,model_save_path)
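
The conversion script relies on `convert_hf2openai.json`, whose contents are not shown here. As a rough illustration of how such a regex mapping renames parameters, the sketch below uses two hand-written entries. Both patterns, and the `translate` helper, are assumptions based on the usual Hugging Face and OpenAI Whisper parameter names; they are not taken from the actual file.

```python
import re
from collections import OrderedDict

# Hypothetical mapping entries (pattern -> replacement); the real JSON file
# may use different patterns and covers many more parameters
example_mapping = OrderedDict([
    (r"^encoder\.layers\.(\d+)\.self_attn\.q_proj\.(\w+)$",
     r"encoder.blocks.\1.attn.query.\2"),
    (r"^encoder\.layers\.(\d+)\.fc1\.(\w+)$",
     r"encoder.blocks.\1.mlp.0.\2"),
])


def translate(param_name, mapping):
    """Return the renamed parameter, or None when no pattern matches."""
    for pattern, repl in mapping.items():
        if re.match(pattern, param_name):
            # Captured groups carry the layer index and the weight/bias suffix
            return re.sub(pattern, repl, param_name)
    return None


print(translate("encoder.layers.3.self_attn.q_proj.weight", example_mapping))
print(translate("encoder.layers.0.fc1.bias", example_mapping))
print(translate("proj_out.weight", example_mapping))
```

Names that match no pattern come back as `None`, which mirrors how the conversion script leaves unmatched parameters out of the converted state dict.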

Transcribe

import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Swift.pt")
result = model.transcribe("sample.wav")
print(result["text"])

Miscellaneous

This model belongs to a family of transformers-based ASR models trained by Oriserve. To compare it against other models from the same family or other SOTA models, please head to our Speech-To-Text Arena. To learn more about our other models, or for other queries regarding AI voice agents, you can reach us at ai-team@oriserve.com