wav2vec2-large-xlsr-53-greek開源希臘語語音識別模型

首頁

Wav2vec2 Large Xlsr 53 Greek

由jonatasgrosman開發

這是一個針對希臘語語音識別任務微調的XLSR-53大模型，基於facebook/wav2vec2-large-xlsr-53模型，使用Common Voice 6.1和CSS10數據集訓練。

語音識別其他開源協議:Apache-2.0 #希臘語語音識別 #XLSR-53微調 #低詞錯誤率

下載量 130.81k

發布時間 : 3/2/2022

模型概述

該模型專門用於希臘語自動語音識別(ASR)，能夠將希臘語語音轉換為文本。

模型特點

高性能希臘語識別

在Common Voice希臘語測試集上達到11.62%的詞錯誤率(WER)和3.36%的字符錯誤率(CER)

基於XLSR-53大模型

基於facebook/wav2vec2-large-xlsr-53模型微調，具有強大的語音特徵提取能力

多數據集訓練

使用Common Voice 6.1和CSS10數據集進行訓練，覆蓋多樣化的語音場景

模型能力

希臘語語音識別

16kHz音頻處理

無語言模型直接使用

使用案例

語音轉文字

希臘語語音轉錄

將希臘語語音內容轉換為文本

準確率達到88.38%(1-WER)

語音助手

希臘語語音指令識別

用於希臘語語音助手或控制系統的指令識別

🚀 用於希臘語語音識別的微調XLSR - 53大模型

本項目微調了 facebook/wav2vec2-large-xlsr-53 模型，使其適用於希臘語語音識別。微調過程使用了 Common Voice 6.1 和 CSS10 的訓練集和驗證集。使用該模型時，請確保語音輸入的採樣率為16kHz。

此模型的微調得益於 OVHcloud 慷慨提供的GPU計算資源。訓練腳本可在此處找到。

🚀 快速開始

本模型可直接用於希臘語語音識別，無需額外的語言模型。以下為你介紹兩種使用方式。

📦 安裝指南

本模型使用時依賴一些Python庫，可通過以下命令安裝：

pip install huggingsound datasets transformers librosa torch

💻 使用示例

基礎用法

使用 HuggingSound 庫進行語音識別：

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-greek")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

高級用法

編寫自己的推理腳本：

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "el"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-greek"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

以下是部分識別結果示例：

參考文本	預測文本
ΤΟ ΒΑΣΙΛΌΠΟΥΛΟ, ΠΟΥ ΜΟΙΆΖΕΙ ΛΕΟΝΤΑΡΆΚΙ ΚΑΙ ΑΕΤΟΥΔΆΚΙ	ΤΟ ΒΑΣΙΛΌΠΟΥΛΟ ΠΟΥ ΜΙΑΣΕ ΛΙΟΝΤΑΡΑΚΉ ΚΑΙ ΑΪΤΟΥΔΆΚΙ
ΣΥΝΆΜΑ ΞΕΠΡΌΒΑΛΑΝ ΑΠΌ ΜΈΣΑ ΑΠΌ ΤΑ ΔΈΝΤΡΑ, ΔΕΞΙΆ, ΑΡΜΑΤΩΜΈΝΟΙ ΚΑΒΑΛΑΡΈΟΙ.	ΣΥΝΆΜΑ ΚΑΙ ΤΡΌΒΑΛΑΝ ΑΠΌ ΜΈΣΑ ΑΠΌ ΤΑ ΔΈΝΤΡΑ ΔΕΞΙΆ ΑΡΜΑΤΩΜΈΝΟΙ ΚΑΒΑΛΑΡΈΟΙ
ΤΑ ΣΥΣΚΕΥΑΣΜΈΝΑ ΒΙΟΛΟΓΙΚΆ ΛΑΧΑΝΙΚΆ ΔΕΝ ΠΕΡΙΈΧΟΥΝ ΣΥΝΤΗΡΗΤΙΚΆ ΚΑΙ ΟΡΜΌΝΕΣ	ΤΑ ΣΥΣΚΕΦΑΣΜΈΝΑ ΒΙΟΛΟΓΙΚΆ ΛΑΧΑΝΙΚΆ ΔΕΝ ΠΕΡΙΈΧΟΥΝ ΣΙΔΗΡΗΤΙΚΆ ΚΑΙ ΟΡΜΌΝΕΣ
ΑΚΟΛΟΥΘΉΣΕΤΕ ΜΕ!	ΑΚΟΛΟΥΘΉΣΤΕ ΜΕ
ΚΑΙ ΠΟΎ ΜΠΟΡΏ ΝΑ ΤΟΝ ΒΡΩ;	Ε ΠΟΎ ΜΠΟΡΏ ΝΑ ΤΙ ΕΒΡΩ
ΝΑΙ! ΑΠΟΚΡΊΘΗΚΕ ΤΟ ΠΑΙΔΊ	ΝΑΙ ΑΠΟΚΡΊΘΗΚΕ ΤΟ ΠΑΙΔΊ
ΤΟ ΠΑΛΆΤΙ ΜΟΥ ΤΟ ΠΡΟΜΉΘΕΥΕ.	ΤΟ ΠΑΛΆΤΙ ΜΟΥ ΤΟ ΠΡΟΜΉΘΕΥΕ
ΉΛΘΕ ΜΉΝΥΜΑ ΑΠΌ ΤΟ ΘΕΊΟ ΒΑΣΙΛΙΆ;	ΉΛΘΑ ΜΕΊΝΕΙ ΜΕ ΑΠΌ ΤΟ ΘΕΊΟ ΒΑΣΊΛΙΑ
ΠΑΡΑΚΆΤΩ, ΈΝΑ ΡΥΆΚΙ ΜΟΥΡΜΟΎΡΙΖΕ ΓΛΥΚΆ, ΚΥΛΏΝΤΑΣ ΤΑ ΚΡΥΣΤΑΛΛΈΝΙΑ ΝΕΡΆ ΤΟΥ ΑΝΆΜΕΣΑ ΣΤΑ ΠΥΚΝΆ ΧΑΜΌΔΕΝΤΡΑ.	ΠΑΡΑΚΆΤΩ ΈΝΑ ΡΥΆΚΙ ΜΟΥΡΜΟΎΡΙΖΕ ΓΛΥΚΆ ΚΥΛΏΝΤΑΣ ΤΑ ΚΡΥΣΤΑΛΛΈΝΙΑ ΝΕΡΆ ΤΟΥ ΑΝΆΜΕΣΑ ΣΤΑ ΠΥΚΡΆ ΧΑΜΌΔΕΝΤΡΑ
ΠΡΆΓΜΑΤΙ, ΕΊΝΑΙ ΑΣΤΕΊΟ ΝΑ ΠΆΡΕΙ Ο ΔΙΆΒΟΛΟΣ	ΠΡΆΓΜΑΤΗ ΕΊΝΑΙ ΑΣΤΕΊΟ ΝΑ ΠΆΡΕΙ Ο ΔΙΆΒΟΛΟΣ

🔧 技術細節

評估模型

可使用以下腳本在Common Voice的希臘語測試數據上評估模型：

import torch
import re
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "el"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-greek"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")

測試結果

以下表格展示了模型的單詞錯誤率（WER）和字符錯誤率（CER）。我在2021 - 04 - 22也對其他模型運行了上述評估腳本。請注意，下表結果可能與已報告的結果不同，這可能是由於使用的其他評估腳本的特殊性導致的。

模型	單詞錯誤率（WER）	字符錯誤率（CER）
lighteternal/wav2vec2-large-xlsr-53-greek	10.13%	2.66%
jonatasgrosman/wav2vec2-large-xlsr-53-greek	11.62%	3.36%
vasilis/wav2vec2-large-xlsr-53-greek	19.09%	5.88%
PereLluis13/wav2vec2-large-xlsr-53-greek	20.16%	5.71%

📄 許可證

本模型使用的許可證為 Apache 2.0。

📚 引用

如果你想引用此模型，可以使用以下格式：

@misc{grosman2021xlsr53-large-greek,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {G}reek},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-greek}},
  year={2021}
}

模型信息

屬性	詳情
模型類型	用於希臘語語音識別的微調XLSR - 53大模型
訓練數據	Common Voice 6.1和CSS10的訓練集和驗證集
評估指標	單詞錯誤率（WER）、字符錯誤率（CER）