Open-source wav2vec2 speech recognition model for Russian

Wav2vec2 Large 100k Voxpopuli Ft Common Voice Plus TTS Dataset Plus Data Augmentation Russian

Developed by Edresson

A speech recognition model fine-tuned for Russian from Facebook's Wav2vec2 Large 100k Voxpopuli model, using Common Voice 7.0, the M-AILABS dataset, and data augmentation.

Speech recognition

Transformers

License: Apache-2.0 #Russian speech recognition #Multi-dataset fine-tuning #Data augmentation

Downloads: 23

Released: 3/2/2022

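Model overview

This model is an automatic speech recognition (ASR) system fine-tuned for Russian; it converts Russian speech to text.

Model features

Multi-dataset fine-tuning

Trained on the Common Voice 7.0 and M-AILABS datasets to improve recognition accuracy.

Data augmentation

Uses data augmentation based on TTS and voice conversion to improve generalization.

Russian-specific fine-tuning

Fine-tuned specifically on Russian speech data for Russian recognition tasks.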

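Model capabilities

Russian speech recognition

Speech-to-text

Automatic speech recognition

Use cases

Speech transcription

Russian speech transcription

Automatically converts Russian speech to text.

Achieves a word error rate (WER) of 19.46% on the Common Voice 7.0 test set.

Voice assistants

Russian voice command recognition

Recognizes spoken commands in Russian voice assistants.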

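🚀 Wav2vec2 Large 100k Voxpopuli fine-tuned for Russian

This project fine-tunes the Wav2vec2 Large 100k Voxpopuli model on Russian data, using Common Voice 7.0 and the M-AILABS dataset together with data augmentation based on TTS and voice conversion.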

🚀 Quick start

Installing dependencies

This project uses Python and PyTorch. Install the required libraries with:

pip install torch transformers torchaudio datasets jiwer
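
After installation, one quick way to confirm that the packages are visible to the interpreter is to look each one up without importing it. The sketch below is only a convenience check and is not part of the original project; the helper name `missing_packages` is hypothetical, and the list assumes `torch` is wanted alongside the packages named in the install command.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the libraries used by the examples in this document.
    required = ["torch", "transformers", "torchaudio", "datasets", "jiwer"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

An empty result means every listed package can be located; it does not check versions.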

Using the model

from transformers import AutoTokenizer, Wav2Vec2ForCTC

# Load the fine-tuned tokenizer and CTC model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian")
model = Wav2Vec2ForCTC.from_pretrained("Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian")
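
`Wav2Vec2ForCTC` produces one score vector per audio frame, and a transcript is usually obtained by greedy CTC decoding: take the best token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates only that collapse step in plain Python. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_collapse` are assumptions for illustration and are not the model's real vocabulary; in practice the Hugging Face CTC tokenizer handles this step when decoding.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame token ids: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary (NOT the model's real one); id 0 is the CTC blank.
toy_vocab = {0: "<blank>", 1: "д", 2: "а", 3: "н", 4: "е", 5: "т"}

# Per-frame argmax ids for a made-up utterance: д д <blank> а а
frame_ids = [1, 1, 0, 2, 2]
text = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # да
```

Note that a blank between two identical ids keeps both of them, which is how CTC represents doubled letters.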

💻 Usage examples

Advanced usage

Testing with the Common Voice dataset

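from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio
import re
from jiwer import wer

MODEL_ID = "Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian"

# Load the processor (feature extractor + tokenizer) and the model.
# Assumption: the model repository ships a feature-extractor config.
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# Load the Russian test split from a local Common Voice 7.0 download
dataset = load_dataset("common_voice", "ru", split="test", data_dir="./cv-corpus-7.0-2021-07-21")

# Resample from 48 kHz (assumed source rate) to the 16 kHz the model expects
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

# Punctuation to strip from the reference sentences
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

# Load and resample each clip, and normalize its reference sentence
def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

# Apply the preprocessing to the whole dataset
ds = dataset.map(map_to_array)

# Run the model and greedily decode the CTC output
def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding="longest")
    input_values = features.input_values
    attention_mask = features.attention_mask

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(predicted_ids)
    batch["target"] = batch["sentence"]
    return batch

# Run prediction
result = ds.map(map_to_pred, batched=True, batch_size=1, remove_columns=list(ds.features.keys()))

# Compute WER; jiwer's wer takes the references first, then the predictions
print(wer(result["target"], result["predicted"]))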
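
The reference sentences are normalized before scoring so that punctuation and casing do not count as recognition errors. As a standalone illustration of that step, the sketch below applies the same punctuation filter and lowercasing to a single string. The function name `normalize_sentence` and the sample sentence are assumptions for illustration.

```python
import re

# Same punctuation set as the evaluation script: , ? . ! - ; : "
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"]')


def normalize_sentence(sentence):
    """Strip the ignored punctuation, lowercase, and unify the apostrophe."""
    cleaned = CHARS_TO_IGNORE.sub("", sentence)
    # Map the typographic apostrophe (U+2019) to the plain ASCII one.
    return cleaned.lower().replace("\u2019", "'")


print(normalize_sentence("Привет, мир!"))  # привет мир
```

Applying the same normalization to references (and, if needed, to predictions) keeps the WER comparison focused on the words themselves.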

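📚 Detailed documentation

Model information

Property	Details
Model type	Speech recognition model fine-tuned from Wav2vec2 Large 100k Voxpopuli
Training data	Common Voice 7.0 and M-AILABS datasets, combined with data augmentation based on TTS and voice conversion
Evaluation metric	Word error rate (WER)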
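
WER is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words: WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words. The following is a minimal pure-Python sketch of that definition for a single sentence pair, assuming whitespace tokenization. The function name `word_error_rate` and the sample sentences are illustrative; a library such as jiwer would normally be used instead.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (standard Levenshtein recurrence).
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("я иду домой сегодня", "я иду домой завтра"))  # 0.25
```

Because insertions are counted against the reference length, WER can exceed 100% when the prediction is much longer than the reference.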