Whisper-Hindi2Hinglish-Swift: an open-source speech recognition model for accurate transcription of Indian-accented speech in noisy environments

Whisper Hindi2Hinglish Swift

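Developed by Oriserve

A Whisper-based speech recognition model that transcribes Hindi speech as Hinglish, tuned for Indian accents and noisy environments

Speech recognition

Transformers

Supports multiple languages

License: Apache-2.0

#Hinglish recognition #Noise robustness #Indian accent adaptation

Downloads: 496

Released: 1/7/2025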

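Model overview

This model is a fine-tuned version of Whisper-base that transcribes Hindi speech into colloquial Hinglish text. It is intended for speech recognition in India.

Model features

Hinglish support

Adds the ability to transcribe audio into colloquial Hinglish, which reduces the likelihood of grammatical errors

Noise robustness

Tuned for the background noise common in India, improving recognition accuracy in noisy conditions

Hallucination suppression

Training techniques minimize transcription hallucinations, improving the accuracy of the output text

Dynamic layer freezing

A custom training technique that keeps only the most active layers trainable, for faster convergence and efficient fine-tuning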

Model capabilities

Hindi speech recognition

Hinglish text generation

Speech transcription in noisy environments

Long-audio processing

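Use cases

Speech transcription services

Customer service call transcription

Transcribes customer service calls from India into text records

Maintains relatively high recognition accuracy in noisy environments

Meeting notes

Automatically produces Hinglish meeting minutes

Supports multi-speaker conversations

Voice assistants

Localized voice command recognition

Provides more accurate voice command recognition for users in India

Supports colloquial Hinglish expressions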

🚀 Whisper-Hindi2Hinglish-Swift

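Whisper-Hindi2Hinglish-Swift is an automatic speech recognition (ASR) model based on the Whisper architecture. It transcribes audio into Hinglish, which reduces the likelihood of grammatical errors, and it is trained to minimize transcription hallucinations and so improve recognition accuracy.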

🚀 Quick start

The model can be used through the transformers library or the openai-whisper module. See the usage examples section for details.

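✨ Key features

Hinglish support: transcribes audio into Hinglish, reducing the likelihood of grammatical errors.
Whisper architecture: based on Whisper, so it integrates easily with the transformers package.
Fewer transcription hallucinations: training minimizes hallucinated output, improving recognition accuracy.
Performance gain: roughly 57% better on average than the pretrained model on the benchmark datasets.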
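
The source does not say how the figure of roughly 57% is computed. One reading that reproduces it is the mean relative WER reduction over the three benchmark rows in the quantitative results table of this card. The sketch below checks that reading; the helper name `relative_reduction` is illustrative, and the averaging method is an assumption.

```python
# WER values copied from the quantitative results table in this card:
# dataset -> (Whisper Base, Whisper-Hindi2Hinglish-Swift)
WER = {
    "Common-Voice": (106.7936, 38.6549),
    "FLEURS": (104.2783, 35.0888),
    "Indic-Voices": (110.8399, 65.2147),
}


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning."""
    return (baseline - finetuned) / baseline


# Assumption: the headline number is the unweighted mean over datasets.
reductions = {name: relative_reduction(b, f) for name, (b, f) in WER.items()}
mean_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
print(f"mean: {mean_reduction:.1%}")
```

Under this assumption the per-dataset reductions come to about 64%, 66% and 41%, and their mean is about 57%.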

📦 Installation

Using the `transformers` library

pip install --upgrade transformers

Using the `openai-whisper` module

pip install -U openai-whisper tqdm

💻 Usage examples

Basic usage

Transcribe audio with the transformers library:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # Specify English language
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text

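Advanced usage

Transcribe audio with the openai-whisper module. The script first converts the Hugging Face checkpoint to the openai-whisper format, so it needs both transformers and openai-whisper installed: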

import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

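# Load parameter name mapping from HF to OpenAI format
# (convert_hf2openai.json must be present in the working directory)
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

# Keep the rules in file order: the first matching pattern wins
reverse_translation = OrderedDict(reverse_translation)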

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Swift.pt"
save_model(model, model_save_path)

import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model(model_save_path)
result = model.transcribe("sample.wav")
print(result["text"])
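
The conversion above relies on regex rules that rename Hugging Face parameter names to the names openai-whisper expects. The contents of convert_hf2openai.json are not shown in this card, so the two rules below are illustrative assumptions rather than the real mapping. They only demonstrate how `re.match` and `re.sub` with capture groups perform the renaming; `EXAMPLE_RULES` and `rename` are hypothetical names.

```python
import re
from collections import OrderedDict

# Hypothetical rules for illustration only; the real mapping file may differ.
EXAMPLE_RULES = OrderedDict([
    (r"^encoder\.layers\.(\d+)\.self_attn\.q_proj\.(weight|bias)$",
     r"encoder.blocks.\1.attn.query.\2"),
    (r"^decoder\.embed_tokens\.weight$",
     r"decoder.token_embedding.weight"),
])


def rename(param_name, rules=EXAMPLE_RULES):
    """Return the renamed parameter, or None if no rule matches."""
    for pattern, repl in rules.items():
        if re.match(pattern, param_name):
            # Capture groups (layer index, weight/bias) are copied into repl.
            return re.sub(pattern, repl, param_name)
    return None


print(rename("encoder.layers.3.self_attn.q_proj.weight"))
print(rename("decoder.embed_tokens.weight"))
print(rename("some.unmapped.parameter"))
```

A name with no matching rule yields None, which is why the conversion code skips keys whose new name is None.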

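📚 Detailed documentation

Training data

Duration: the model was fine-tuned on about 550 hours of noisy, Indian-accented Hindi speech.
Data collection: no ready-made Hinglish dataset for ASR was available, so a specially curated proprietary dataset was used.
Data labeling: the data was labeled with a SOTA model, and the transcripts were then improved through human intervention.
Data quality: because the model is intended for Indian environments with heavy background noise, collection focused on noisy audio.
Data processing: audio was split into segments shorter than 30 seconds, with at most 2 speakers per segment. No further processing was applied, so that the quality of the source data stayed unchanged.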
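
The source does not describe how the audio was segmented. As a minimal sketch, assuming plain fixed-length splitting, the length constraint can be expressed as follows. A real pipeline would more likely cut at pauses and would also need to enforce the two-speaker limit, which this sketch does not attempt. `split_into_chunks` is a hypothetical helper, not part of any library.

```python
import math


def split_into_chunks(samples, sample_rate, max_seconds=30.0):
    """Split a sequence of audio samples into consecutive chunks,
    each strictly shorter than max_seconds."""
    # Largest sample count whose duration is still below max_seconds.
    chunk_len = math.ceil(max_seconds * sample_rate) - 1
    if chunk_len < 1:
        raise ValueError("max_seconds is too small for this sample rate")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 65 seconds of silence at 16 kHz (the sample rate Whisper models expect).
sample_rate = 16_000
audio = [0.0] * (65 * sample_rate)
chunks = split_into_chunks(audio, sample_rate)

# Number of samples in each chunk.
print([len(c) for c in chunks])
```

The function works on any sliceable sequence, so a NumPy array of samples can be passed in place of the list.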

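Fine-tuning

Custom trainer architecture: a custom trainer was written for efficient supervised fine-tuning, with custom callbacks that provide better observability during training.
Custom dynamic layer freezing: the most active layers were identified by running inference with the pretrained model on a subset of the training data. These layers were kept unfrozen during training while the remaining layers stayed frozen, which gave faster convergence and more efficient fine-tuning.
DeepSpeed integration: DeepSpeed was used to speed up and optimize training.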
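
The source does not define how layer activity is measured or how many layers stay trainable. The sketch below covers only the select-and-freeze step of dynamic layer freezing, assuming a per-layer activity score has already been collected (for example, a mean absolute activation recorded during inference on a data subset, which is itself an assumption). `select_active_layers` and `apply_freezing` are hypothetical helpers. `apply_freezing` only sets the `requires_grad` attribute, so it accepts the `(name, parameter)` pairs returned by PyTorch's `model.named_parameters()`.

```python
from types import SimpleNamespace


def select_active_layers(activity, top_k):
    """Return the names of the top_k layers with the highest activity score."""
    ranked = sorted(activity, key=activity.get, reverse=True)
    return set(ranked[:top_k])


def apply_freezing(named_parameters, trainable_layers):
    """Keep parameters of the selected layers trainable and freeze the rest.

    A parameter belongs to a layer when its name starts with "<layer>.".
    """
    for name, param in named_parameters:
        param.requires_grad = any(
            name.startswith(layer + ".") for layer in trainable_layers
        )


# Hypothetical activity scores per layer (made-up numbers for the demo).
activity = {
    "encoder.layers.0": 0.12,
    "encoder.layers.1": 0.48,
    "decoder.layers.0": 0.31,
    "decoder.layers.1": 0.05,
}

# Stand-ins for model.named_parameters(); with PyTorch, pass that instead.
params = [
    (f"{layer}.fc1.weight", SimpleNamespace(requires_grad=True))
    for layer in activity
]

trainable = select_active_layers(activity, top_k=2)
apply_freezing(params, trainable)
print(sorted(name for name, p in params if p.requires_grad))
```

Matching on the layer name plus a trailing dot keeps a layer such as `encoder.layers.1` from also selecting `encoder.layers.10`.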

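🔧 Technical details

The model is based on the Whisper architecture and was fine-tuned on about 550 hours of noisy, Indian-accented Hindi speech. Fine-tuning used a custom trainer architecture, custom dynamic layer freezing, and DeepSpeed integration to improve model performance and training efficiency.

📄 License

This model is released under the Apache-2.0 license.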

Performance overview

Qualitative performance overview

Audio	Whisper Base	Whisper-Hindi2Hinglish-Swift
	وہاں بس دن میں کتنی بار چلتی ہے	vah bas din mein kitni baar chalti hai?
	سلمان کی ایمیت سے پراوہویت ہوتے ہیں اس کمپنی کے سیر بھاؤ جانے کیسے	salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise?
	تو لویا تو لویا	vah roya aur aur roya.
	حلمت نہ پیننے سے بھارت میں ہر گنٹے ہوتی ہے چار لوگوں کی موت	helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut.
	اوستہ مجھے چٹھیکہ جواب نہ دینے کے لیٹانٹہ	usne mujhe chithi ka javaab na dene ke lie daanta.
	پرانا شاہ دیواروں سے گیرا ہوا ہے	puraana shahar divaaron se ghera hua hai.

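Quantitative performance overview

⚠️ Important note

The WER scores below are computed on the Hinglish text generated by this model and by the original Whisper model. To see how this model compares with other SOTA models in real-world scenarios, visit the Speech-To-Text Arena.

Dataset	Whisper Base (WER, %)	Whisper-Hindi2Hinglish-Swift (WER, %)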
Common-Voice	106.7936	38.6549
FLEURS	104.2783	35.0888
Indic-Voices	110.8399	65.2147
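
For readers unfamiliar with the metric: word error rate (WER) is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. Because inserted words count as errors, WER can exceed 100%, as the Whisper Base column shows. The following is a minimal sketch of the standard definition, not the evaluation script used for this table (which the source does not provide); text normalization is omitted, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # ref word missing from hyp
            insertion = current[j - 1] + 1  # extra word in hyp
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "vah bas din mein kitni baar chalti hai"
print(word_error_rate(reference, reference))
print(word_error_rate(reference, "vah bus din mein kitni baar chalti hai"))
# Every word wrong plus one extra word: more errors than reference words.
print(word_error_rate("a b c", "w x y z"))
```

The last call shows the above-100% case: three substitutions and one insertion against a three-word reference.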

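Additional information

This model belongs to a family of Transformer-based ASR models trained by Oriserve. To compare it with other models in the same family or with other SOTA models, visit the Speech-To-Text Arena. For more information about our other models, or for other questions about AI voice agents, contact us by email at ai-team@oriserve.com.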