Whisper-Hindi2Hinglish-Prime open-source model - high-accuracy Hindi-to-Hinglish transcription in noisy environments

Whisper Hindi2Hinglish Prime

Developed by Oriserve

A Whisper-based automatic speech recognition model tuned for Hindi and Hinglish, built for accurate transcription in noisy environments

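Speech recognition

Transformers

Multilingual. License: Apache-2.0. Tags: Hinglish speech recognition, noise-robust transcription, Indian accent optimization

Downloads: 1,812

Released: 1/7/2025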

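Model overview

This model is a speech recognition system designed for Indian accents. It transcribes Hindi and Hinglish (mixed Hindi-English) audio into text, and is specifically tuned for recognition in noisy environments and for suppressing hallucinated output.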

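Model features

Hinglish support

Adds the ability to transcribe audio into colloquial Hinglish, with fewer grammatical errors

Improved noise robustness

Tuned for the high-noise environments common in India, with markedly better recognition accuracy on noisy audio

Hallucination suppression

A dedicated training strategy substantially reduces hallucinated text in transcriptions

Better performance

Average improvement of about 39% over the original Whisper model on benchmark test sets

Indian accent adaptation

Fine-tuned on 550 hours of Indian-accented data to match local speech characteristics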

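Model capabilities

Hindi speech recognition

Hinglish transcription

Speech processing in noisy environments

Segmented processing of long audio

Multi-speaker recognition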

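Use cases

Speech transcription services

Customer service call transcription

Transcribes Hinglish calls from customer service centers in India into text

WER reduced to 32.43% in noisy conditions (the Common-Voice figure in the benchmark table below)

Educational subtitle generation

Automatically generates subtitles for Indian educational videos

Supports bilingual subtitles in Hindi and Hinglish

Voice assistants

Voice assistants for Indian language varieties

Lets users in India interact with voice assistants in Hinglish

Accurately understands colloquial expressions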

🚀 Whisper-Hindi2Hinglish-Prime

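Whisper-Hindi2Hinglish-Prime is an automatic speech recognition model based on the Whisper architecture. It transcribes audio into Hinglish with fewer grammatical errors, is more robust to noise, hallucinates less in its transcriptions, and outperforms the pretrained model on several benchmark datasets.

🚀 Quick start

To run the model, first install the Transformers library:

pip install -U transformers

The example below uses the pipeline class to transcribe audio of arbitrary length: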

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # Specify English language
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text

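✨ Key features

Hinglish support: transcribes audio into Hinglish with fewer grammatical errors.
Whisper-based: built on the Whisper architecture, so it integrates easily with the transformers package.
Strong noise robustness: the model is resistant to noise and does not produce a transcription for audio that contains only noise.
Fewer hallucinations: transcription hallucinations are minimized, which improves accuracy.
Significant performance gain: about 39% better on average than the pretrained model across several benchmark datasets.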

📦 Installation

Using Transformers

pip install -U transformers

Using Flash Attention 2

If your GPU supports Flash Attention, install it first:

pip install flash-attn --no-build-isolation

Using the OpenAI Whisper module

pip install -U openai-whisper tqdm

💻 Usage examples

Basic usage

Running the model with the Transformers library uses the same code as the pipeline example under Quick start above.

Advanced usage

Speed up transcription with Flash Attention 2:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
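
Flash Attention 2 needs both the flash-attn package and a supported GPU, so a loader may want a fallback for machines that lack either. The following is a minimal sketch and not part of the original model card: the helper name `pick_attn_implementation` is hypothetical, and falling back to "sdpa" (PyTorch scaled-dot-product attention, accepted by recent Transformers versions) is an assumption about a reasonable default.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Return a value for the attn_implementation argument of from_pretrained.

    Hypothetical helper: selects Flash Attention 2 only when a CUDA device is
    present and the flash_attn package can be found, otherwise falls back to
    PyTorch's scaled-dot-product attention ("sdpa").
    """
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_installed:
        return "flash_attention_2"
    return "sdpa"


# Without a GPU the helper never selects Flash Attention 2.
print(pick_attn_implementation(has_cuda=False))  # prints "sdpa"
```

In the loading call above, the result could be passed as attn_implementation=pick_attn_implementation(torch.cuda.is_available()).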

To use the OpenAI Whisper module, first convert the Hugging Face checkpoint to the OpenAI format:

import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Prime"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Prime.pt"
save_model(model,model_save_path)
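
The contents of convert_hf2openai.json are not shown here. To illustrate how the regex-based renaming in reverse_translate behaves, the sketch below uses two made-up pattern/replacement pairs. They are hypothetical examples in the style of the Hugging Face and OpenAI Whisper parameter names, and the real file's entries may differ.

```python
import re
from collections import OrderedDict

# Hypothetical entries for illustration only; not copied from convert_hf2openai.json.
example_mapping = OrderedDict([
    (r"^encoder\.layers\.(\d+)\.self_attn\.q_proj\.(\w+)$",
     r"encoder.blocks.\1.attn.query.\2"),
    (r"^decoder\.layers\.(\d+)\.fc1\.(\w+)$",
     r"decoder.blocks.\1.mlp.0.\2"),
])


def translate_key(hf_key, mapping):
    """Return the renamed key for the first matching pattern, else None."""
    for pattern, repl in mapping.items():
        if re.match(pattern, hf_key):
            return re.sub(pattern, repl, hf_key)
    return None  # the caller skips keys that have no mapping


print(translate_key("encoder.layers.3.self_attn.q_proj.weight", example_mapping))
# prints "encoder.blocks.3.attn.query.weight"
print(translate_key("proj_out.weight", example_mapping))
# prints "None"
```

Because unmatched keys return None, save_model silently drops any parameter that the mapping file does not cover, so it is worth checking the size of the converted state dict after conversion.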

Then transcribe audio with the converted model:

import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Prime.pt")
result = model.transcribe("sample.wav")
print(result["text"])
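
Besides "text", the dictionary returned by whisper's transcribe also carries a "segments" list whose entries have "start", "end", and "text" fields. For the subtitle use case mentioned earlier, a minimal sketch of turning such segments into SRT-style text could look like the following. The helper names are hypothetical, and the demo timestamps are made up so the example runs without a model.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts that have 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


# Stand-in data shaped like result["segments"]; the timestamps are invented.
demo_segments = [
    {"start": 0.0, "end": 2.5, "text": " Aap pandrah log hain."},
    {"start": 2.5, "end": 4.0, "text": " Kitne saal ki?"},
]
print(segments_to_srt(demo_segments))
```

With a real transcription, segments_to_srt(result["segments"]) would replace the stand-in data.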

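📚 Detailed documentation

Training

Data

Duration: the model was fine-tuned on about 550 hours of noisy, Indian-accented Hindi data.
Collection: because no suitable Hinglish ASR dataset was available, a specially curated proprietary dataset was used.
Labeling: the data was labeled with a state-of-the-art model, and the transcriptions were then improved through human intervention.
Quality: the model is intended for Indian environments with heavy background noise, so collection focused on noisy data.
Processing: all audio was split into segments shorter than 30 seconds, with at most 2 speakers per segment. No further processing was applied, to avoid altering the quality of the source data.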
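
The card does not say how the segments were cut or how the two-speaker limit was enforced. As a rough illustration of the length constraint only, the sketch below computes fixed-window segment boundaries. The function name and the fixed-window strategy are assumptions, and the cap defaults to 30 seconds, so pass a smaller max_seconds to stay strictly below it.

```python
def segment_bounds(num_samples: int, sample_rate: int = 16_000,
                   max_seconds: float = 30.0):
    """Yield (start, end) sample indices for consecutive windows.

    Each window covers at most max_seconds of audio; the last one may be
    shorter. This is plain fixed-window slicing with no silence or speaker
    detection.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")
    window = int(sample_rate * max_seconds)
    for start in range(0, num_samples, window):
        yield start, min(start + window, num_samples)


# A 70-second clip at 16 kHz splits into two full windows and a 10-second tail.
print(list(segment_bounds(70 * 16_000)))
# prints [(0, 480000), (480000, 960000), (960000, 1120000)]
```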

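Fine-tuning

Novel trainer architecture: a custom trainer was written for efficient supervised fine-tuning, with custom callbacks that give better observability during training.
Custom dynamic layer freezing: the most active layers in the model were identified by running inference with the pretrained model on a subset of the training data. Those layers were kept unfrozen during training while all other layers were frozen, which gave faster convergence and more efficient fine-tuning.
DeepSpeed integration: DeepSpeed was also used to speed up and optimize training.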
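
The custom trainer is not published here, and the card does not define how layer "activity" was measured. The sketch below only illustrates the general idea of keeping the top-scoring layers trainable and freezing the rest. The function names, the per-layer scores, and the parameter-name prefixes are all hypothetical; with PyTorch, the second function would be given model.named_parameters().

```python
from types import SimpleNamespace


def select_active_layers(activity_by_layer: dict, keep: int) -> set:
    """Return the names of the `keep` layers with the highest activity score."""
    ranked = sorted(activity_by_layer, key=activity_by_layer.get, reverse=True)
    return set(ranked[:keep])


def freeze_inactive(named_parameters, active_layers) -> int:
    """Leave requires_grad=True only for parameters under an active layer.

    named_parameters is an iterable of (name, parameter) pairs. Returns the
    number of parameters left trainable.
    """
    trainable = 0
    for name, param in named_parameters:
        is_active = any(name.startswith(layer + ".") for layer in active_layers)
        param.requires_grad = is_active
        trainable += int(is_active)
    return trainable


# Made-up activity scores and stand-in parameter objects, for illustration only.
scores = {"encoder.layers.0": 0.12, "encoder.layers.1": 0.87, "decoder.layers.0": 0.45}
params = [(f"{layer}.fc1.weight", SimpleNamespace(requires_grad=True)) for layer in scores]
active = select_active_layers(scores, keep=2)
print(sorted(active), freeze_inactive(params, active))
# prints ['decoder.layers.0', 'encoder.layers.1'] 2
```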

Performance overview

Qualitative performance overview

Whisper Large V3	Whisper-Hindi2Hinglish-Prime
maynata pura, canta maynata	Mehnat to poora karte hain.
Where did they come from?	Haan vahi ek aapko bataaya na.
A Pantral Logan.	Aap pandrah log hain.
Thank you, Sanchez.	Kitne saal ki?
Rangers, I can tell you.	Lander cycle chaahie.
Uh-huh. They can't.	Haan haan, dekhe hain.

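Quantitative performance overview

⚠️ Important note

The word error rate (WER) scores below are computed on the Hinglish text generated by our model and by the original Whisper model. To see how our model compares with other state-of-the-art models in real-world use, visit our Speech-To-Text Arena.

Dataset	Whisper Large V3	Whisper-Hindi2Hinglish-Prime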
Common-Voice	61.9432	32.4314
FLEURS	50.8425	28.6806
Indic-Voices	82.5621	60.8224
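
For readers unfamiliar with the metric, the sketch below shows a plain word-level WER computation and then the relative WER reduction for each row of the table above. The card does not say how its roughly 39% average was derived or what text normalization was applied before scoring; that the figure is the mean relative WER reduction is an assumption, though the numbers are consistent with it.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100


# WER values copied from the table: (Whisper Large V3, Whisper-Hindi2Hinglish-Prime)
table = {
    "Common-Voice": (61.9432, 32.4314),
    "FLEURS": (50.8425, 28.6806),
    "Indic-Voices": (82.5621, 60.8224),
}
reductions = {name: relative_reduction(b, n) for name, (b, n) in table.items()}
mean_reduction = sum(reductions.values()) / len(reductions)
print(round(mean_reduction, 1))  # prints 39.2
```

The per-dataset reductions are uneven (roughly 48%, 44%, and 26%), so the average hides a noticeably smaller gain on Indic-Voices.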

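🔧 Technical details

The model is based on the Whisper architecture and was developed with the transformers library. Training used a custom trainer, dynamic layer freezing, and DeepSpeed to improve training efficiency and model performance. The model was also fine-tuned specifically to support Hinglish transcription.

📄 License

This model is released under the Apache 2.0 license.

📋 Miscellaneous

This model belongs to a family of Transformer-based automatic speech recognition models trained by Oriserve. To compare it with other models in the family or with other state-of-the-art models, visit our Speech-To-Text Arena. To learn more about our other models, or for other questions about AI voice agents, contact us by email at ai-team@oriserve.com.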