Open-source wav2vec2 speech recognition model for Russian, free to use

Wav2vec2 Large 100k Voxpopuli Ft Common Voice Plus TTS Dataset Plus Data Augmentation Russian

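Developed by Edresson

A speech recognition model based on Facebook's Wav2vec2 Large 100k Voxpopuli, fine-tuned for Russian on Common Voice 7.0 and the M-AILABS dataset with data augmentation.

Speech recognition

Transformers

License: Apache-2.0 (open source) #Russian speech recognition #multi-dataset fine-tuning #data augmentation

Downloads: 23

Released: 3/2/2022

Model overview

This model is an automatic speech recognition (ASR) system fine-tuned for Russian. It converts Russian speech to text.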

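Model features

Multi-dataset fine-tuning

Trained on both Common Voice 7.0 and the M-AILABS dataset to improve recognition accuracy.

Data augmentation

Uses data augmentation based on TTS and voice conversion to improve generalization.

Tuned for Russian

Fine-tuned specifically on Russian speech; the reported result is a 19.46% word error rate on the Common Voice 7.0 test set.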

Model capabilities

Russian speech recognition

Speech to text

Automatic speech recognition

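Use cases

Speech transcription

Russian transcription

Automatically converts Russian speech to text.

Reported word error rate (WER): 19.46% on the Common Voice 7.0 test set.

Voice assistants

Russian voice command recognition

Recognizes spoken commands in Russian-language voice assistants.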

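🚀 Wav2vec2 Large 100k Voxpopuli fine-tuned on Russian

This project fine-tunes the Wav2vec2 Large 100k Voxpopuli model on Russian data, using Common Voice 7.0 and the M-AILABS dataset together with data augmentation based on TTS and voice conversion.

🚀 Quick start

Install dependencies

This project uses Python and PyTorch. Install the required libraries with: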

pip install transformers torchaudio datasets jiwer

Using the model

from transformers import AutoTokenizer, Wav2Vec2ForCTC
  
tokenizer = AutoTokenizer.from_pretrained("Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian")
model = Wav2Vec2ForCTC.from_pretrained("Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian")

💻 Usage examples

Basic usage

from transformers import AutoTokenizer, Wav2Vec2ForCTC
  
tokenizer = AutoTokenizer.from_pretrained("Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian")
model = Wav2Vec2ForCTC.from_pretrained("Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian")

Advanced usage

Testing with the Common Voice dataset

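import re

import torch
import torchaudio
from datasets import load_dataset
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-russian"

# AutoTokenizer only yields the text tokenizer, so raw audio goes through the
# processor, which bundles the feature extractor and the tokenizer.
# Assumption: the model repository ships a preprocessor config.
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# Load the Russian test split from a local copy of Common Voice 7.0
dataset = load_dataset("common_voice", "ru", split="test", data_dir="./cv-corpus-7.0-2021-07-21")

# Common Voice clips are 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

# Punctuation stripped from the reference transcripts before scoring
chars_to_ignore_regex = r'[,?.!\-;:"]'

# Read each clip, resample it, and normalize its reference transcript
def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

# Apply the preprocessing to the whole dataset
ds = dataset.map(map_to_array)

# Run the model on a batch and decode the most likely token in each frame
def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding="longest")

    with torch.no_grad():
        logits = model(**features).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(predicted_ids)
    batch["target"] = batch["sentence"]
    return batch

# Run prediction, keeping only the "predicted" and "target" columns
result = ds.map(map_to_pred, batched=True, batch_size=1, remove_columns=list(ds.features.keys()))

# Compute WER; jiwer.wer takes the references first, then the hypotheses
print(wer(list(result["target"]), list(result["predicted"])))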
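
The argmax and `batch_decode` steps in the script above amount to greedy (best-path) CTC decoding: take the top token in each frame, merge consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that rule, not the library's implementation. The toy vocabulary, `BLANK_ID`, and the `one_hot` helper are hypothetical and exist only for illustration; the real vocabulary ships with the model's tokenizer, where, as far as I know, the padding token doubles as the CTC blank and `|` marks word boundaries.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "д", 3: "а", 4: "н", 5: "е", 6: "т"}


def greedy_ctc_decode(frame_scores, vocab=VOCAB, blank_id=BLANK_ID, word_sep="|"):
    """Decode per-frame scores with the greedy (best-path) CTC rule."""
    # 1. Pick the highest-scoring token id in every frame (the argmax step).
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Merge runs of identical ids, then drop the blank token.
    collapsed = [tok for tok, _ in groupby(best_ids) if tok != blank_id]
    # 3. Map ids to characters; the separator token becomes a space.
    text = "".join(vocab[tok] for tok in collapsed)
    return text.replace(word_sep, " ").strip()


def one_hot(token_id, size=len(VOCAB)):
    """Build a fake score vector that puts all its mass on one token."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


frames = [one_hot(i) for i in [2, 2, 0, 3, 3, 1, 4, 5, 0, 6, 6]]
print(greedy_ctc_decode(frames))  # prints "да нет"
```

A blank between two identical tokens is what lets CTC emit a doubled letter: the frame ids `[3, 0, 3]` decode to "аа", whereas `[3, 3]` would collapse to a single "а".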

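📚 Detailed documentation

Model information

Property	Details
Model type	Speech recognition model fine-tuned from Wav2vec2 Large 100k Voxpopuli
Training data	Common Voice 7.0 and M-AILABS, plus data augmentation based on TTS and voice conversion
Evaluation metric	Word error rate (WER)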
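
The table names WER as the evaluation metric. As a rough illustration of what that number measures, the sketch below computes a corpus-level WER: the word-level edit distance (substitutions, deletions, insertions) summed over all sentences, divided by the total number of reference words. It is a simplified stand-in written for this page, not the `jiwer` implementation, and the function names `normalize` and `word_error_rate` are assumptions. The punctuation set mirrors the one the evaluation script strips from references.

```python
import re

# Punctuation removed from references before scoring (same set as the script).
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"]')


def normalize(text):
    """Remove punctuation and lowercase, as done before scoring."""
    return CHARS_TO_IGNORE.sub("", text).lower()


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Single-row dynamic programme for the word-level Levenshtein distance.
        row = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            diagonal, row[0] = row[0], i
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                diagonal, row[j] = row[j], min(
                    row[j] + 1,       # deletion (reference word missing)
                    row[j - 1] + 1,   # insertion (extra hypothesis word)
                    diagonal + cost,  # substitution or exact match
                )
        total_edits += row[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


refs = [normalize("Привет, как дела?"), normalize("Спасибо!")]
hyps = ["привет как дело", "спасибо"]
print(f"{word_error_rate(refs, hyps):.2%}")  # prints 25.00%
```

Here one substituted word out of four reference words gives 25%. On the same scale, the 19.46% reported for this model corresponds to roughly one word error per five reference words.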