wav2vec2-large-xlsr-53-hungarian: Open-Source Speech Recognition Model

Wav2vec2 Large Xlsr 53 Hungarian

Developed by sarpba

An automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Hungarian Common Voice dataset

Speech recognition

Transformers

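License: Apache-2.0 #Hungarian speech recognition #Low word error rate #Common Voice fine-tuning

Downloads: 17

Released: 3/2/2025

Model overview

An automatic speech recognition (ASR) model optimized for Hungarian. It is fine-tuned on the Mozilla Common Voice 17.0 Hungarian dataset and transcribes Hungarian speech to text.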

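Model features

Optimized for Hungarian

Fine-tuned specifically for Hungarian speech recognition

High performance

Reaches a word error rate (WER) of 17.28% on the Common Voice 17.0 Hungarian test set, against 46.20% for the earlier jonatasgrosman/wav2vec2-large-xlsr-53-hungarian model

Based on the Wav2Vec2 architecture

Uses Facebook's wav2vec2-large-xlsr-53 as the base model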

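Model capabilities

Hungarian speech recognition

Speech-to-text

Automatic speech transcription

Use cases

Speech transcription

Hungarian speech transcription

Converts Hungarian speech content to text

Word error rate: 17.28%

Voice assistants

Hungarian voice command recognition

Serves as the speech recognition module of a Hungarian voice assistant or voice control system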

🚀 wav2vec2-large-xlsr-53-hungarian

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the MOZILLA-FOUNDATION/COMMON_VOICE_17_0 - HU dataset. It transcribes Hungarian speech to text.
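
For plain transcription of a single recording, a minimal sketch using the `pipeline` helper from transformers might look like the following. The function name `transcribe` and the file name `example.wav` are hypothetical, and the sketch assumes that transformers, torch, and ffmpeg (needed to decode audio passed as a file path) are installed.

```python
MODEL_ID = "sarpba/wav2vec2-large-xlsr-53-hungarian"


def transcribe(audio_path: str) -> str:
    """Return the transcription of one audio file (illustrative helper)."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import pipeline

    # Downloads the model from the Hugging Face Hub on first use.
    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    return asr(audio_path)["text"]


# Usage (hypothetical file name):
# print(transcribe("example.wav"))
```

Building the pipeline inside the function keeps the sketch short; for many files it would be cheaper to create the pipeline once and reuse it.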

🚀 Quick start

On the evaluation set, the model achieves the following results:

Loss: 0.1748
WER: 0.2997

The WER logged during training (0.2997) differs from the WER measured on the test set (17.28%) because the measurement ignores certain characters.
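
To make the metric concrete, here is a minimal sketch of word error rate as word-level edit distance divided by the number of reference words. The function name `word_error_rate` is illustrative and is not part of this model's tooling; the evaluation script in the usage example relies on `evaluate.load("wer")` instead. The example pair shows how punctuation and casing alone can move the score, which is why the choice of ignored characters matters.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Raw text: every word differs because of casing or attached punctuation.
raw = word_error_rate("Jó napot, kívánok!", "jó napot kívánok")
# Same pair after stripping punctuation and upper-casing both sides.
normalized = word_error_rate("JÓ NAPOT KÍVÁNOK", "JÓ NAPOT KÍVÁNOK")
print(raw, normalized)
```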

✨ Key features

Model comparison

Evaluated on CV17, this model outperforms the previous best wav2vec model:

Model	WER (%)	CER (%)
jonatasgrosman/wav2vec2-large-xlsr-53-hungarian	46.199835320230555	9.85170677112479
sarpba/wav2vec2-large-xlsr-53-hungarian	17.27824914378453	3.151354554132789
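
As a quick reading aid, the relative error reduction implied by the table can be computed directly from its four numbers. The helper name `relative_reduction` is illustrative.

```python
# Values copied from the comparison table (percent).
BASELINE = {"WER": 46.199835320230555, "CER": 9.85170677112479}
FINE_TUNED = {"WER": 17.27824914378453, "CER": 3.151354554132789}


def relative_reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


for metric in ("WER", "CER"):
    drop = relative_reduction(BASELINE[metric], FINE_TUNED[metric])
    print(f"{metric}: {drop:.1f}% relative reduction")
```

On these figures the WER falls by roughly 63% and the CER by roughly 68% relative to the baseline.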

The following characters are ignored during evaluation:

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

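The list above is turned into a single regular-expression character class and applied to each reference sentence before scoring. The sketch below shows that step in isolation. It uses a shortened, illustrative subset of the list (`CHARS_TO_IGNORE_SUBSET`) and a hypothetical helper name `normalize`; the full list works the same way.

```python
import re

# Shortened subset of CHARS_TO_IGNORE, for illustration only.
CHARS_TO_IGNORE_SUBSET = [",", "?", ".", "!", ";", ":", "(", ")", "…"]

# re.escape keeps characters such as "?" and "(" from being read as
# regex syntax once they are placed inside the character class.
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE_SUBSET))}]"


def normalize(sentence: str) -> str:
    """Remove ignored characters, then upper-case the sentence."""
    return re.sub(chars_to_ignore_regex, "", sentence).upper()


print(normalize("Szia, hogy vagy?"))  # SZIA HOGY VAGY
```
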
💻 Usage example

Basic usage

import torch
import librosa
import re
import warnings
from datasets import load_dataset
import evaluate
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

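LANG_ID = "hu"
# Evaluate this model; set MODEL_ID to "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian" to reproduce the baseline row.
MODEL_ID = "sarpba/wav2vec2-large-xlsr-53-hungarian"
# Fall back to CPU when no GPU is available.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"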

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("mozilla-foundation/common_voice_17_0", LANG_ID, split="test")

wer = evaluate.load("wer")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = evaluate.load("cer")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py


chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference on a batch.
# Take the most likely token per frame, then let the processor decode the ids into text.
# Named evaluate_batch so it does not shadow the imported `evaluate` module.
def evaluate_batch(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate_batch, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references) * 100}")
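
The `torch.argmax` plus `processor.batch_decode` step in the script amounts to greedy CTC decoding. The sketch below illustrates the collapse rule in plain Python on a toy vocabulary: merge consecutive repeats, then drop the blank token. The vocabulary, the ids, and the function name `ctc_greedy_decode` are hypothetical; the real tokenizer also handles details such as the word delimiter token.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame argmax ids into text using the CTC rule."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary; id 0 plays the role of the CTC blank (assumption).
TOY_VOCAB = {0: "<pad>", 1: "S", 2: "Z", 3: "I", 4: "A"}

# Repeated frames collapse to one character each.
print(ctc_greedy_decode([1, 1, 0, 2, 2, 3, 0, 4, 4], TOY_VOCAB))  # SZIA
# A blank between two identical ids keeps both characters.
print(ctc_greedy_decode([4, 0, 4], TOY_VOCAB))  # AA
```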

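🔧 Technical details

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 0.0003
Train batch size: 16
Eval batch size: 8
Seed: 42
Distributed type: multi-GPU
Number of devices: 2
Gradient accumulation steps: 2
Total train batch size: 64
Total eval batch size: 16
Optimizer: OptimizerNames.ADAMW_TORCH with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
LR scheduler type: linear
LR scheduler warmup steps: 500
Number of epochs: 15.0
Mixed precision training: Native AMP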
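
The totals in the list follow from the per-device values: 16 × 2 devices × 2 accumulation steps = 64 for training, and 8 × 2 devices = 16 for evaluation. The sketch below checks that arithmetic and shows the shape of a linear schedule with 500 warmup steps. It assumes the decay reaches zero at the final step and takes the 11,370 total steps from the last row of the training results table; the exact scheduler implementation used in the run is not shown in this card, and the function name `linear_lr` is illustrative.

```python
PER_DEVICE_TRAIN_BATCH = 16
PER_DEVICE_EVAL_BATCH = 8
NUM_DEVICES = 2
GRAD_ACCUM_STEPS = 2

BASE_LR = 3e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 11370  # final step in the training results table

# Effective batch sizes; gradient accumulation applies to training only.
total_train_batch = PER_DEVICE_TRAIN_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS
total_eval_batch = PER_DEVICE_EVAL_BATCH * NUM_DEVICES


def linear_lr(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(total_train_batch, total_eval_batch)
for step in (0, 250, 500, 5935, 11370):
    print(step, f"{linear_lr(step):.6f}")
```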

Training results

Training loss	Epoch	Step	Validation loss	WER
3.7968	1.0	758	0.2848	0.5295
0.2547	2.0	1516	0.1908	0.4222
0.1929	3.0	2274	0.1753	0.4000
0.1532	4.0	3032	0.1558	0.3710
0.1297	5.0	3790	0.1512	0.3536
0.1167	6.0	4548	0.1574	0.3514
0.101	7.0	5306	0.1483	0.3374
0.0859	8.0	6064	0.1490	0.3299
0.0791	9.0	6822	0.1523	0.3250
0.0702	10.0	7580	0.1608	0.3192
0.0629	11.0	8338	0.1664	0.3146
0.0559	12.0	9096	0.1641	0.3103
0.0527	13.0	9854	0.1665	0.3063
0.0468	14.0	10612	0.1691	0.3011
0.0443	15.0	11370	0.1748	0.2998
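
Reading the table programmatically can make its trend easier to see. The sketch below copies the epoch, validation loss, and WER columns from the table and reports where each reaches its minimum. The variable names are illustrative, and the sketch says nothing about which checkpoint was published, since the card does not state that.

```python
# (epoch, validation loss, WER) copied from the training results table.
RESULTS = [
    (1, 0.2848, 0.5295), (2, 0.1908, 0.4222), (3, 0.1753, 0.4000),
    (4, 0.1558, 0.3710), (5, 0.1512, 0.3536), (6, 0.1574, 0.3514),
    (7, 0.1483, 0.3374), (8, 0.1490, 0.3299), (9, 0.1523, 0.3250),
    (10, 0.1608, 0.3192), (11, 0.1664, 0.3146), (12, 0.1641, 0.3103),
    (13, 0.1665, 0.3063), (14, 0.1691, 0.3011), (15, 0.1748, 0.2998),
]

best_by_loss = min(RESULTS, key=lambda row: row[1])
best_by_wer = min(RESULTS, key=lambda row: row[2])

print(f"Lowest validation loss: epoch {best_by_loss[0]} ({best_by_loss[1]})")
print(f"Lowest WER: epoch {best_by_wer[0]} ({best_by_wer[2]})")
```

In this table the validation loss is lowest at epoch 7, while the WER keeps falling through epoch 15.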