whisper-large-v3-russian: Open-Source Russian Speech Recognition Model

Whisper Large V3 Russian

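Developed by antony66

A Russian speech recognition model fine-tuned from OpenAI Whisper-large-v3 and optimized for Russian recognition performance

Speech recognition

Transformers

Other #Russian speech recognition #Phone recording optimization #Low word error rate

Downloads: 6,665

Released: 5/17/2024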

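Model Overview

This model is a Russian-optimized version of Whisper-large-v3, fine-tuned specifically for Russian speech recognition to improve recognition accuracy on Russian audio.

Model Features

Russian optimization

Fine-tuned specifically for Russian speech recognition, improving accuracy on Russian audio

High performance

On the Russian portion of Common Voice 17.0, WER drops from 9.84 to 6.39

Phone recording optimization

Tuned for phone-call audio; preprocessing recordings is recommended for best results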

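Model Capabilities

Russian speech recognition

Automatic speech-to-text

Timestamp output supported

Use Cases

Speech transcription

Phone recording transcription

Automatically transcribes Russian phone calls into text

WER 6.39

Speech content analysis

Automated analysis and processing of Russian speech content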

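🚀 Speech Recognition Model

This project fine-tunes the openai/whisper-large-v3 model for better Russian speech recognition. Fine-tuning used the Russian portion of the Common Voice 17.0 dataset, which contains over 200,000 speech samples.

🚀 Quick Start

This model is a fine-tuned version of openai/whisper-large-v3, intended to better support Russian.

The dataset used for fine-tuning is the Russian portion of Common Voice 17.0, which contains over 200,000 rows.

The original dataset was preprocessed by merging all of its splits and re-splitting them 0.95/0.05 into new train and test sets of 225,761 and 11,883 rows respectively. After this preprocessing, the original Whisper v3 model has a word error rate (WER) of 9.84, while the fine-tuned version currently shows 6.39.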
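The re-split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `shuffle_and_split`, the fixed seed, and rounding the test size up are all hypothetical choices, since the source does not say how the shuffle or the rounding was done. Rounding up happens to reproduce the reported row counts.

```python
import math
import random


def shuffle_and_split(rows, test_fraction=0.05, seed=0):
    """Shuffle all rows together, then split them into (train, test).

    Assumption: the test size is rounded up; the seed is arbitrary.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = math.ceil(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Stand-in for the merged dataset: one integer per row.
total_rows = 225761 + 11883
train, test = shuffle_and_split(range(total_rows))
print(len(train), len(test))  # 225761 11883
```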

Fine-tuning took more than 60 hours on two Tesla A100 80 GB GPUs.

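✨ Key Features

Language support: fine-tuned for Russian, so it recognizes Russian speech better.
Performance: word error rate (WER) is lower than the original model's (6.39 versus 9.84).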
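For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is typically computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the source does not say which tool or text normalization produced its scores, so this is not necessarily the exact procedure behind the reported figures. Assuming those figures are percentages, the drop from 9.84 to 6.39 is a relative reduction of roughly 35%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("привет как дела", "привет так дела"))

# Relative WER reduction implied by the reported scores.
relative_reduction = (9.84 - 6.39) / 9.84
print(f"{relative_reduction:.1%}")
```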

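💻 Usage Examples

Basic usage

When working with phone calls, it is strongly recommended to preprocess the recordings and adjust their volume before running automatic speech recognition (ASR). For example, you can use the following command: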

sox record.wav -r 16k record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2
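To apply the same preprocessing to many recordings from Python, the command above can be wrapped with `subprocess`. This is a sketch only: the helper names `build_sox_command` and `normalize_recording` are hypothetical, and it assumes the `sox` binary is installed and on the PATH. The effect chain is copied unchanged from the command above.

```python
import subprocess

# Effect chain copied verbatim from the sox command above.
SOX_EFFECTS = [
    "norm", "-0.5",
    "compand", "0.3,1", "-90,-90,-70,-70,-60,-20,0,0", "-5", "0", "0.2",
]


def build_sox_command(src: str, dst: str) -> list[str]:
    """Build the argument list: resample to 16 kHz, then apply the effects."""
    return ["sox", src, "-r", "16k", dst, *SOX_EFFECTS]


def normalize_recording(src: str, dst: str) -> None:
    """Run sox on one file; raises CalledProcessError if sox fails."""
    subprocess.run(build_sox_command(src, dst), check=True)


print(" ".join(build_sox_command("record.wav", "record-normalized.wav")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file names contain spaces.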

Advanced usage

The following Python code runs automatic speech recognition:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16 # set your preferred type here 

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    setattr(torch.distributed, "is_initialized", lambda : False) # monkey patching
device = torch.device(device)

whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# read the raw bytes of your wav file into the variable wav
# (the pipeline accepts a file path, raw audio bytes, or a NumPy array, not a BytesIO object;
# decoding raw bytes requires ffmpeg to be installed)
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

# get the transcription
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)

print(asr['text'])
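The call above passes return_timestamps=False, which overrides the pipeline default, so only the text is returned. With return_timestamps=True, the Transformers ASR pipeline additionally returns a "chunks" list in which each entry has a "timestamp" pair and a "text" field. The sketch below formats such a result. The helper name `format_chunks` and the sample dictionary are made up for illustration and are not real model output.

```python
def format_chunks(result: dict) -> list[str]:
    """Turn a pipeline result with "chunks" into "[start - end] text" lines."""

    def label(seconds):
        # A boundary can be None, e.g. when the audio is cut off mid-segment.
        return f"{seconds:.2f}" if seconds is not None else "?"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{label(start)} - {label(end)}] {chunk['text'].strip()}")
    return lines


# Hypothetical result in the shape returned when return_timestamps=True.
example = {
    "text": " Добрый день. Чем могу помочь?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Добрый день."},
        {"timestamp": (1.5, None), "text": " Чем могу помочь?"},
    ],
}

for line in format_chunks(example):
    print(line)
```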