wav2vec2-large-xlsr-53-hungarian: open-source speech recognition model

Wav2vec2 Large Xlsr 53 Hungarian

Developed by sarpba

An automatic speech recognition model based on facebook/wav2vec2-large-xlsr-53 and fine-tuned on the Hungarian Common Voice dataset

Speech recognition

Transformers

Open-source license: Apache-2.0 #Hungarian speech recognition #Low word error rate #Common Voice fine-tuning

Downloads: 17

Release date: 3/2/2025

Model overview

This is an automatic speech recognition (ASR) model for Hungarian. It is fine-tuned on the Hungarian subset of Mozilla Common Voice 17.0 and converts Hungarian speech to text.

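Model features

Optimized for Hungarian

Fine-tuned specifically for Hungarian speech recognition

High performance

Achieves a 17.28% word error rate on the Common Voice test set, compared with 46.20% for jonatasgrosman/wav2vec2-large-xlsr-53-hungarian in the model comparison below

Wav2Vec2 architecture

Uses Facebook's Wav2Vec2-large-xlsr-53 as the base model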

Model capabilities

Hungarian speech recognition

Speech-to-text conversion

Automatic speech transcription

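Use cases

Speech transcription

Hungarian speech transcription

Converts Hungarian spoken content to text

Word error rate: 17.28%

Voice assistants

Hungarian voice command recognition

Serves as the speech recognition module in Hungarian voice assistants and voice control systems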

🚀 wav2vec2-large-xlsr-53-hungarian

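This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the MOZILLA-FOUNDATION/COMMON_VOICE_17_0 - HU dataset. It achieves the following results on the evaluation set:

Loss: 0.1748
WER: 0.2997

The WER logged during training differs from the WER measured afterwards because the measurement strips a set of ignored characters (listed below) before scoring.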
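
As a rough illustration of why stripping characters changes the score, the sketch below computes a word-level error rate in plain Python. The `word_error_rate` helper and the example sentences are hypothetical and are not part of the original evaluation, which relies on the `evaluate` library.

```python
import re

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: the model outputs no punctuation
reference = "Jó napot, hogy vagy?"
hypothesis = "jó napot hogy vagy"

# Punctuation left in the reference: "napot," and "vagy?" count as errors
raw_wer = word_error_rate(reference.lower(), hypothesis)

# Punctuation stripped from the reference before scoring
clean_reference = re.sub(r"[,?]", "", reference).lower()
clean_wer = word_error_rate(clean_reference, hypothesis)

print(raw_wer, clean_wer)
```

With punctuation left in the reference, words such as "napot," never match the model output, so the score is inflated even when every word was recognized correctly.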

✨ Key Features

Model comparison (evaluated on CV17)

Model	WER (%)	CER (%)
jonatasgrosman/wav2vec2-large-xlsr-53-hungarian	46.199835320230555	9.85170677112479
sarpba/wav2vec2-large-xlsr-53-hungarian	17.27824914378453	3.151354554132789
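
To put the two rows in proportion, the relative error reduction can be computed directly from the table values. This is a small arithmetic sketch; the variable names are illustrative only.

```python
# Values copied from the comparison table above (percentages)
baseline_wer, model_wer = 46.199835320230555, 17.27824914378453
baseline_cer, model_cer = 9.85170677112479, 3.151354554132789

# Relative reduction = (baseline - model) / baseline
wer_reduction = (baseline_wer - model_wer) / baseline_wer
cer_reduction = (baseline_cer - model_cer) / baseline_cer

print(f"Relative WER reduction: {wer_reduction:.1%}")
print(f"Relative CER reduction: {cer_reduction:.1%}")
```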

Characters ignored during evaluation:

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

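A minimal sketch of how such a list can be turned into a regular expression and applied to a reference sentence. It uses a small subset of the list above, and the `normalize` helper and example sentence are hypothetical.

```python
import re

# A small subset of CHARS_TO_IGNORE, chosen for illustration
CHARS_SUBSET = [",", "?", ".", "!", "„", "”"]

# re.escape protects characters that have a special meaning in a regex
chars_regex = f"[{re.escape(''.join(CHARS_SUBSET))}]"

def normalize(sentence: str) -> str:
    """Remove ignored characters, then upper-case the sentence."""
    return re.sub(chars_regex, "", sentence).upper()

cleaned = normalize("„Jó reggelt, világ!”")
print(cleaned)
```

Upper-casing both references and predictions makes the comparison case-insensitive.
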
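📚 Detailed Documentation

Intended uses and limitations

More information will be provided later.

Training and evaluation

Training was performed with the PyTorch scripts from transformers.

Evaluation code: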

import torch
import librosa
import re
import warnings
from datasets import load_dataset
import evaluate
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

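LANG_ID = "hu"
# Model to evaluate; swap in "sarpba/wav2vec2-large-xlsr-53-hungarian" to score this model
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian"
# Fall back to CPU when no GPU is available
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"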

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("mozilla-foundation/common_voice_17_0", LANG_ID, split="test")

wer = evaluate.load("wer")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = evaluate.load("cer")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py


chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on a batch of audio arrays and decode the predictions.
# Named `predict` to avoid shadowing the imported `evaluate` module.
def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    # Greedy decoding: pick the most likely token for every frame
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(predict, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references) * 100}")
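
The evaluation code above takes the per-frame argmax of the logits and passes the ids to `processor.batch_decode`. For a CTC model, that decoding step collapses consecutive repeats and drops the blank token. The sketch below shows the idea in plain Python with a toy vocabulary; the vocabulary, the blank id, and the `greedy_ctc_decode` helper are assumptions for illustration, and the real mapping is stored with the model's tokenizer.

```python
# Toy vocabulary (hypothetical); the real one ships with the model's tokenizer
VOCAB = {0: "<pad>", 1: "|", 2: "s", 3: "z", 4: "i", 5: "a"}
BLANK_ID = 0          # assumed CTC blank token
WORD_DELIMITER = "|"  # assumed word separator token

def greedy_ctc_decode(frame_ids):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank
        if idx != previous and idx != BLANK_ID:
            chars.append(VOCAB[idx])
        previous = idx
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()

# One id per audio frame, as produced by an argmax over the logits
frame_ids = [0, 2, 2, 0, 3, 4, 4, 0, 5, 5, 1, 0, 2, 3, 4, 5]
decoded = greedy_ctc_decode(frame_ids)
print(decoded)
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.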

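Training procedure

Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 0.0003
Training batch size: 16
Evaluation batch size: 8
Seed: 42
Distributed type: multi-GPU
Number of devices: 2
Gradient accumulation steps: 2
Total training batch size: 64
Total evaluation batch size: 16
Optimizer: OptimizerNames.ADAMW_TORCH with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
Learning rate scheduler type: linear
Learning rate scheduler warmup steps: 500
Number of epochs: 15.0
Mixed precision training: Native AMP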
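
The totals in this list follow from the per-device values, and the learning rate follows a warmup-then-decay shape. The sketch below reproduces the arithmetic in plain Python. It assumes the common linear schedule (linear warmup to the peak rate, then linear decay to zero) and takes the total step count of 11370 from the training results; the constant and function names are illustrative.

```python
PER_DEVICE_TRAIN_BATCH = 16
PER_DEVICE_EVAL_BATCH = 8
NUM_DEVICES = 2
GRAD_ACCUM_STEPS = 2

# Samples contributing to one optimizer update
total_train_batch = PER_DEVICE_TRAIN_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS
# Gradient accumulation does not apply during evaluation
total_eval_batch = PER_DEVICE_EVAL_BATCH * NUM_DEVICES

PEAK_LR = 3e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 11370  # last step reported in the training results

def linear_schedule_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(total_train_batch, total_eval_batch)
for step in (0, 250, 500, 5935, 11370):
    print(step, linear_schedule_lr(step))
```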

Training results

Training loss	Epoch	Step	Validation loss	WER
3.7968	1.0	758	0.2848	0.5295
0.2547	2.0	1516	0.1908	0.4222
0.1929	3.0	2274	0.1753	0.4000
0.1532	4.0	3032	0.1558	0.3710
0.1297	5.0	3790	0.1512	0.3536
0.1167	6.0	4548	0.1574	0.3514
0.101	7.0	5306	0.1483	0.3374
0.0859	8.0	6064	0.1490	0.3299
0.0791	9.0	6822	0.1523	0.3250
0.0702	10.0	7580	0.1608	0.3192
0.0629	11.0	8338	0.1664	0.3146
0.0559	12.0	9096	0.1641	0.3103
0.0527	13.0	9854	0.1665	0.3063
0.0468	14.0	10612	0.1691	0.3011
0.0443	15.0	11370	0.1748	0.2998
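
In the table above, validation loss and WER do not reach their minimum at the same epoch. A short sketch that reads this off the reported values (the `RESULTS` list is simply the table re-typed; the variable names are illustrative):

```python
# (epoch, step, validation_loss, wer) copied from the training results table
RESULTS = [
    (1, 758, 0.2848, 0.5295), (2, 1516, 0.1908, 0.4222), (3, 2274, 0.1753, 0.4000),
    (4, 3032, 0.1558, 0.3710), (5, 3790, 0.1512, 0.3536), (6, 4548, 0.1574, 0.3514),
    (7, 5306, 0.1483, 0.3374), (8, 6064, 0.1490, 0.3299), (9, 6822, 0.1523, 0.3250),
    (10, 7580, 0.1608, 0.3192), (11, 8338, 0.1664, 0.3146), (12, 9096, 0.1641, 0.3103),
    (13, 9854, 0.1665, 0.3063), (14, 10612, 0.1691, 0.3011), (15, 11370, 0.1748, 0.2998),
]

# Epoch with the lowest validation loss and epoch with the lowest WER
best_loss_epoch = min(RESULTS, key=lambda row: row[2])[0]
best_wer_epoch = min(RESULTS, key=lambda row: row[3])[0]

# Optimizer steps per epoch, derived from the step column
steps_per_epoch = RESULTS[0][1] // RESULTS[0][0]

print(best_loss_epoch, best_wer_epoch, steps_per_epoch)
```

Validation loss is lowest at epoch 7, while WER keeps improving through epoch 15, so the choice of checkpoint depends on which metric is prioritized.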