Whisper-Hindi2Hinglish-Prime open-source model: high-accuracy Hindi-to-Hinglish transcription in noisy environments

Whisper Hindi2Hinglish Prime

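Developed by Oriserve

An automatic speech recognition model for Hindi and Hinglish, optimized on top of the Whisper architecture, with high-accuracy transcription in noisy environments

Speech recognition

Transformers

Multilingual · Open-source license: Apache-2.0 · #Hinglish speech recognition #Noise-robust transcription #Indian accent optimization

Downloads: 1,812

Released: 1/7/2025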

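Model overview

This model is a speech recognition system designed for Indian accents. It transcribes Hindi and Hinglish (mixed Hindi-English) audio into text, and is specifically optimized for recognition in noisy environments and for suppressing hallucinated output.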

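Model features

Hinglish support

Adds the ability to transcribe audio into colloquial Hinglish, reducing grammatical errors

Improved noise robustness

Optimized for the noisy environments common in India, with markedly better recognition accuracy on noisy audio

Hallucination suppression

A dedicated training strategy substantially reduces hallucinated transcriptions

Performance gains

Roughly 39% better on average than the original Whisper model on the benchmark test sets

Indian accent adaptation

Fine-tuned on about 550 hours of Indian-accented data to fit local speech characteristics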

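Model capabilities

Hindi speech recognition

Hinglish transcription

Speech processing in noisy environments

Segmented processing of long audio

Multi-speaker audio (training segments contained at most two speakers)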

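Use cases

Speech transcription services

Customer service call transcription

Transcribe Hinglish calls from contact centers in India into text

WER of 32.43% on the Common-Voice benchmark, compared with 61.94% for Whisper Large V3

Subtitle generation for educational content

Automatically generate subtitles for educational videos produced in India

Supports subtitles for both Hindi and Hinglish speech

Voice assistants

Voice assistants for Indian language varieties

Lets users in India talk to a voice assistant in Hinglish

Accurately understands colloquial expressions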

🚀 Whisper-Hindi2Hinglish-Prime

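Whisper-Hindi2Hinglish-Prime is an automatic speech recognition model based on the Whisper architecture. It transcribes audio into Hinglish (mixed Hindi-English) with fewer grammatical errors, is more robust to noise, produces fewer hallucinated transcriptions, and outperforms the pretrained Whisper model on several benchmark datasets.

🚀 Quick Start

To run the model, first install the Transformers library:

pip install -U transformers

The following example uses the pipeline class to transcribe audio of arbitrary length: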

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # English language token; the output is romanized Hinglish
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text

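✨ Key Features

Hinglish support: transcribes audio into Hinglish with fewer grammatical errors.
Based on the Whisper architecture: easy to use with the transformers package.
Strong noise robustness: the model is resistant to noise and does not produce transcriptions for noise-only audio.
Fewer hallucinations: hallucinated transcriptions are minimized, which improves accuracy.
Significant performance gains: about 39% better on average than the pretrained model across several benchmark datasets.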

📦 Installation

Using Transformers

pip install -U transformers

Using Flash Attention 2

If your GPU supports Flash Attention, install it first:

pip install flash-attn --no-build-isolation

Using the OpenAI Whisper module

pip install -U openai-whisper tqdm

💻 Usage Examples

Basic usage

Run the model with the Transformers library:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # English language token; the output is romanized Hinglish
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text

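Advanced usage

Use Flash Attention 2 to speed up transcription (this reuses model_id and torch_dtype from the basic example):

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",  # Requires the flash-attn package
)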
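
Since Flash Attention 2 is only usable when the flash-attn package is installed and a CUDA GPU is present, it can help to choose the attention backend at run time. The sketch below shows one possible way to do that. `pick_attn_implementation` is a hypothetical helper, not part of Transformers, and it assumes that `"sdpa"` (PyTorch scaled-dot-product attention) is an acceptable fallback for this model.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return an attn_implementation string for from_pretrained (hypothetical helper)."""
    # The flash-attn package installs an importable module named "flash_attn".
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    # Assumed fallback: PyTorch's built-in scaled-dot-product attention.
    return "sdpa"


# Without a GPU the helper always falls back to "sdpa".
print(pick_attn_implementation(cuda_available=False))
```

The returned string could then be passed as `attn_implementation=pick_attn_implementation(torch.cuda.is_available())` when loading the model.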

To use the OpenAI Whisper module, first convert the Hugging Face checkpoint to the OpenAI format:

import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns; the first matching pattern wins
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)
        return None  # No pattern matched; the caller skips this parameter

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Prime.pt"
save_model(model, model_save_path)
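
The contents of convert_hf2openai.json are not reproduced here. To illustrate how the regex-based renaming in `reverse_translate` behaves, the sketch below uses a hypothetical two-entry mapping. The patterns are illustrative assumptions modeled on the usual Hugging Face and OpenAI Whisper parameter names, and the real file may differ.

```python
import re
from collections import OrderedDict

# Hypothetical excerpt; the real convert_hf2openai.json may use different patterns.
example_translation = OrderedDict([
    (r"^encoder\.layers\.(\d+)\.self_attn\.q_proj\.(\w+)$",
     r"encoder.blocks.\1.attn.query.\2"),
    (r"^decoder\.layers\.(\d+)\.fc1\.(\w+)$",
     r"decoder.blocks.\1.mlp.0.\2"),
])


def reverse_translate(name, translation=example_translation):
    """Rewrite a Hugging Face parameter name using the first matching pattern."""
    for pattern, repl in translation.items():
        if re.match(pattern, name):
            return re.sub(pattern, repl, name)
    return None  # No pattern matched; save_model skips such keys


print(reverse_translate("encoder.layers.3.self_attn.q_proj.weight"))
# -> encoder.blocks.3.attn.query.weight
print(reverse_translate("decoder.layers.0.fc1.bias"))
# -> decoder.blocks.0.mlp.0.bias
print(reverse_translate("proj_out.weight"))
# -> None
```

Because unmatched names return `None` and are dropped silently, it may be worth counting the skipped keys when converting a real checkpoint.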

Transcribe audio with the converted model:

import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Prime.pt")
result = model.transcribe("sample.wav")
print(result["text"])

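📚 Documentation

Training

Data

Duration: the model was fine-tuned on about 550 hours of noisy, Indian-accented Hindi data.
Collection: because no suitable Hinglish dataset for automatic speech recognition was available, a specially curated proprietary dataset was used.
Labeling: the data was labeled with a state-of-the-art model, and the transcriptions were then improved through human review.
Quality: since the model is intended for Indian environments with heavy background noise, data collection focused on noisy audio.
Processing: all audio was split into segments shorter than 30 seconds, with at most 2 speakers per segment. No further processing was applied, to avoid altering the quality of the source data.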
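
The source does not say how the audio was segmented. As a minimal sketch of the length constraint only, the snippet below cuts a sample sequence into fixed windows shorter than 30 seconds. `split_into_chunks` and the 29-second window are assumptions for illustration. A real pipeline would more likely cut at pauses and would also need to check the number of speakers, which this sketch does not do.

```python
def split_into_chunks(samples, sample_rate, max_seconds=29.0):
    """Split audio samples into consecutive chunks of at most max_seconds each."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    max_len = max(1, int(sample_rate * max_seconds))
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]


# 70 seconds of silence at 16 kHz, as a stand-in for real audio.
sample_rate = 16_000
audio = [0.0] * (70 * sample_rate)
chunks = split_into_chunks(audio, sample_rate)
print([len(c) / sample_rate for c in chunks])  # -> [29.0, 29.0, 12.0]
```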

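Fine-tuning

Novel trainer architecture: a custom trainer was written for efficient supervised fine-tuning, with custom callbacks that give better observability during training.
Custom dynamic layer freezing: the most active layers in the model were identified by running inference on a subset of the training data with the pretrained model. During training these layers were kept unfrozen while all other layers were frozen, which gave faster convergence and more efficient fine-tuning.
DeepSpeed integration: DeepSpeed was also used to speed up and optimize training.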
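
The source does not specify how layer activity was measured or how many layers stayed trainable. The sketch below covers only the selection step of such a scheme, assuming a per-layer activity score (for example, mean absolute activation) has already been computed. `select_trainable_layers`, `keep_fraction`, and the scores are hypothetical.

```python
def select_trainable_layers(activity, keep_fraction=0.25):
    """Return the names of the most active layers; all others would be frozen."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(activity) * keep_fraction))
    ranked = sorted(activity, key=activity.get, reverse=True)
    return set(ranked[:n_keep])


# Made-up activity scores, one per layer.
activity = {
    "encoder.layers.0": 0.12,
    "encoder.layers.1": 0.48,
    "decoder.layers.0": 0.91,
    "decoder.layers.1": 0.30,
}
trainable = select_trainable_layers(activity, keep_fraction=0.5)
frozen = set(activity) - trainable
print(sorted(trainable))  # -> ['decoder.layers.0', 'encoder.layers.1']
print(sorted(frozen))     # -> ['decoder.layers.1', 'encoder.layers.0']
```

In PyTorch, the freezing itself would typically be applied by calling `requires_grad_(False)` on the parameters of the layers in `frozen`.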

Performance Overview

Qualitative performance overview

Audio clip	Whisper Large V3	Whisper-Hindi2Hinglish-Prime
Clip 1	maynata pura, canta maynata	Mehnat to poora karte hain.
Clip 2	Where did they come from?	Haan vahi ek aapko bataaya na.
Clip 3	A Pantral Logan.	Aap pandrah log hain.
Clip 4	Thank you, Sanchez.	Kitne saal ki?
Clip 5	Rangers, I can tell you.	Lander cycle chaahie.
Clip 6	Uh-huh. They can't.	Haan haan, dekhe hain.

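Quantitative performance overview

⚠️ Important note

The word error rate (WER) scores that follow were computed on Hinglish text generated by both our model and the original Whisper model. To see how our model compares with other state-of-the-art models in real-world use, visit our Speech-to-Text Arena.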
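
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below implements that standard definition. The source does not state which tool or text normalization was used for its scores, and the example sentence pair is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical pair: one substituted word out of four reference words.
print(word_error_rate("aap pandrah log hain", "aap pandra log hain"))  # -> 0.25
```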

Dataset	Whisper Large V3 (WER %)	Whisper-Hindi2Hinglish-Prime (WER %)
Common-Voice	61.9432	32.4314
FLEURS	50.8425	28.6806
Indic-Voices	82.5621	60.8224
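
The source does not say how the "about 39%" average improvement was derived. One reading that lands close to that figure is the mean relative WER reduction across the three datasets, which the following sketch computes from the table values. Treat this as a plausible reconstruction, not the authors' stated method.

```python
# WER values copied from the table above (lower is better).
wer = {
    "Common-Voice": (61.9432, 32.4314),
    "FLEURS": (50.8425, 28.6806),
    "Indic-Voices": (82.5621, 60.8224),
}

# Relative WER reduction per dataset, in percent.
reductions = {
    name: 100 * (baseline - tuned) / baseline
    for name, (baseline, tuned) in wer.items()
}
mean_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1f}%")
print(f"mean: {mean_reduction:.1f}%")  # -> mean: 39.2%
```

The per-dataset reductions are uneven (roughly 48%, 44%, and 26%), so the average hides a noticeably smaller gain on Indic-Voices.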

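🔧 Technical Details

The model is based on the Whisper architecture and was developed with the transformers library. Training used a custom trainer, dynamic layer freezing, and DeepSpeed to improve training efficiency and model performance. The model was also fine-tuned specifically to support Hinglish transcription.

📄 License

This model is released under the Apache 2.0 license.

📋 Miscellaneous

This model belongs to a family of Transformer-based automatic speech recognition models trained by Oriserve. To compare it with other models in the same family or with other state-of-the-art models, visit our Speech-to-Text Arena. To learn more about our other models, or for other questions about AI voice agents, contact us by email at ai-team@oriserve.com.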