wave2vec2-large-xlsr-hindi: an open-source Hindi speech recognition model

Wave2vec2 Large Xlsr Hindi

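Developed by shiwangi27

A Hindi speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53. It was trained on the OpenSLR Hindi dataset, evaluated on the Common Voice Hindi test set, and expects speech input sampled at 16 kHz.

Speech recognition

Transformers

Open-source license: Apache-2.0 #Hindi speech recognition #XLSR fine-tuning #Low-resource optimization

Downloads: 63

Release date: 3/2/2022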

Model overview

This is an automatic speech recognition (ASR) model built for Hindi. It is based on the Wav2Vec2 architecture and converts Hindi speech to text.

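Model features

Multiple datasets

Trained on the combined OpenSLR Hindi train and test sets to increase data variety, and evaluated on the Common Voice Hindi test set

Sampling-rate handling

Accepts 16 kHz input; the 8 kHz training data was upsampled to 16 kHz

No language model required

Works out of the box, with no additional language model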

Model capabilities

Hindi speech recognition

Speech-to-text

Automatic speech transcription

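Use cases

Speech transcription

Hindi speech transcription

Converts spoken Hindi content to text

WER of 46.055% on the Common Voice test set

Voice assistants

Hindi voice command recognition

Serves as the speech recognition module in Hindi voice assistants or voice control systems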

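🚀 Wav2Vec2-Large-XLSR-Hindi

This project fine-tunes facebook/wav2vec2-large-xlsr-53 on Hindi, using the OpenSLR Hindi dataset for training and the Common Voice Hindi test set for evaluation. The resulting model can be used for automatic speech recognition in Hindi.

✨ Key features

Multiple datasets: trained on OpenSLR Hindi and evaluated on Common Voice.
Metric-based evaluation: performance is reported as word error rate (WER).
Language-specific tuning: fine-tuned specifically for Hindi speech recognition.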

💻 Usage examples

Basic usage

The model can be used directly, without a language model:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Common Voice Hindi test split.
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")

# The resampler assumes 48 kHz source audio and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token for every audio frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
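
The argmax-then-decode step at the end of this script is greedy CTC decoding: repeated frame predictions are collapsed, blank tokens are dropped, and the remaining characters are joined. The sketch below shows the idea in plain Python. It assumes the usual Wav2Vec2 conventions of a `<pad>` blank token and a `|` word delimiter; the toy vocabulary and the name `greedy_ctc_decode` are hypothetical and are not taken from this model's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only; the real mapping
# is stored in the model's tokenizer files.
BLANK = "<pad>"        # assumed CTC blank token
WORD_DELIMITER = "|"   # assumed word separator token
VOCAB = [BLANK, WORD_DELIMITER, "न", "म", "स", "्", "त", "े"]


def greedy_ctc_decode(frame_ids, vocab=VOCAB):
    """Collapse repeated ids, drop blanks, and join the tokens into text."""
    # 1. Merge consecutive duplicates, e.g. [2, 2, 0, 2] -> [2, 0, 2].
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove blanks; they only separate genuinely repeated characters.
    tokens = [vocab[i] for i in collapsed if vocab[i] != BLANK]
    # 3. Turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# One id per audio frame, as torch.argmax(logits, dim=-1) would produce.
frames = [2, 2, 0, 3, 4, 4, 5, 6, 0, 7, 7, 1, 1]
print(greedy_ctc_decode(frames))
```

Because no language model rescoring is involved, the output is only as good as the per-frame predictions, which is one reason a CTC model used this way can show a relatively high WER.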

Advanced usage

The model can be evaluated on the Hindi test data as follows:

import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "hi", split="test")
# load_metric is deprecated in newer datasets releases; evaluate.load("wer") is its replacement.
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
model.to("cuda")

# Punctuation stripped from the reference transcripts before scoring.
# A raw string avoids invalid-escape warnings in the Python literal.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\�\।\']'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on one batch and store the decoded predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
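
The evaluation script strips punctuation from the reference transcripts before scoring, so that the model is not penalized for marks it never emits. The short sketch below isolates that normalization step. The helper name `normalize_transcript` is hypothetical, and the pattern is a simplified version of the script's character class (the unprintable replacement character is left out).

```python
import re

# Simplified character class based on the evaluation script: common
# punctuation plus the Devanagari danda. This is an illustrative subset.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%।\']'


def normalize_transcript(sentence):
    """Remove the ignored punctuation and lowercase the text."""
    # lower() only affects cased scripts such as Latin; Devanagari has no case.
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_transcript("Hello, World!"))
print(normalize_transcript("यह क्या है?"))
```

Only the references are normalized this way; the predictions come straight from the decoder, so any punctuation the pattern misses would count as word errors.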

📚 Detailed documentation

Test results

Dataset	Word error rate (WER)
Common Voice Hindi, test split	46.055%
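
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following sketch computes it in plain Python. It is an illustration only, not the `wer` metric implementation used in the evaluation script, and the function names are hypothetical.

```python
def _edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, using one rolling row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += _edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# One substitution (b -> x) and one deletion (d) over four reference words.
print(100 * word_error_rate(["a b c d"], ["a x c"]))
```

Under this definition, a WER of about 46% means that roughly 46 word edits are needed for every 100 reference words.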

Code source

The notebook used to train this model is available at shiwangi27/googlecolab. Training used a modified version of run_common_voice.py.

Notes

When using this model, make sure the speech input is sampled at 16 kHz.
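
One way to guard against the wrong sampling rate is to check the file header before running inference. The sketch below does this for uncompressed PCM WAV files using only the Python standard library. The helper name `check_wav_sample_rate` is hypothetical, and other formats such as MP3 would need a different reader.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the rate the model expects


def check_wav_sample_rate(path, expected=EXPECTED_RATE):
    """Return the WAV file's sample rate, raising if it is not `expected`."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
    if rate != expected:
        raise ValueError(
            f"{path}: sample rate is {rate} Hz, expected {expected} Hz; "
            "resample the audio first"
        )
    return rate
```

A file that fails the check can be resampled with `torchaudio.transforms.Resample`, as the usage examples do for their 48 kHz input.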

Model iterations

This is the first fine-tuning iteration. The model will be updated if the word error rate (WER) improves in future experiments.

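🔧 Technical details

The model was trained on a randomly sampled subset of the OpenSLR Hindi data of size 10,000. To increase data variety, the OpenSLR train and test sets were merged and used together as training data. Because the OpenSLR audio is sampled at 8 kHz, it was upsampled to 16 kHz for training. Evaluation was done on the Common Voice test set.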
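
The data preparation described here can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the author's training code: the function names, the fixed random seed and the order of merging before sampling are all hypothetical, and the naive linear interpolation stands in for whichever resampler was actually used.

```python
import random


def merge_and_sample(train_items, test_items, sample_size, seed=0):
    """Pool both splits and draw a random subset without replacement."""
    pooled = list(train_items) + list(test_items)
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return rng.sample(pooled, min(sample_size, len(pooled)))


def upsample_2x_linear(samples):
    """Double the sample rate (e.g. 8 kHz -> 16 kHz) by inserting midpoints."""
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2)
    # The last sample has no right-hand neighbour, so it is repeated.
    out.extend([samples[-1], samples[-1]])
    return out


subset = merge_and_sample(range(5), range(5, 8), sample_size=4)
print(len(subset))
print(upsample_2x_linear([0.0, 1.0, 0.0]))
```

Production resamplers such as `torchaudio.transforms.Resample` use band-limited interpolation rather than straight lines, which avoids the artifacts this simple version would introduce.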

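📄 License

This project is released under the Apache-2.0 license.

Model information

Property	Details
Model type	Fine-tuned XLSR Wav2Vec2 large model for Hindi
Datasets	OpenSLR Hindi (training), Common Voice Hindi (evaluation)
Metric	Word error rate (WER)
License	Apache-2.0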

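Model index

Name: Fine-tuned XLSR Wav2Vec2 large model for Hindi
Results:
- Task:
  - Name: Speech recognition
  - Type: Automatic speech recognition
- Datasets:
  - Name: Common Voice Hindi
    - Type: Common Voice
    - Arguments: Hindi
  - Name: OpenSLR Hindi
    - Link: https://www.openslr.org/resources/103/
- Metrics:
  - Name: Test WER
  - Type: WER
  - Value: 46.05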