wav2vec2-large-fr-voxpopuli-french: an open-source French speech recognition model


Wav2vec2 Large Fr Voxpopuli French

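Developed by jonatasgrosman

A French speech recognition model fine-tuned from facebook/wav2vec2-large-fr-voxpopuli and trained on the French portion of Common Voice 6.1. It expects 16 kHz audio input.

Speech recognition · French · License: Apache-2.0 · #French speech recognition #Low word error rate #Optimized for Common Voice

Downloads: 51

Released: 3/2/2022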

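Model overview

An automatic speech recognition (ASR) model optimized for French, built on the Voxpopuli wav2vec2 architecture and suited to French speech-to-text tasks.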

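Model features

Strong French recognition accuracy

Reaches 17.62% WER and 6.04% CER on the Common Voice French test set.

Voxpopuli pre-training

Fine-tuned from the facebook/wav2vec2-large-fr-voxpopuli model, so it inherits that model's pre-trained speech feature extractor.

16 kHz audio support

Optimized for speech input sampled at 16 kHz.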

Model capabilities

French speech recognition

Audio-to-text conversion

Automatic speech recognition

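Use cases

Speech transcription

French speech transcription

Converts spoken French into text.

Word accuracy of roughly 82.38% (that is, 100% minus the 17.62% WER)

Voice assistants

French voice command recognition

Serves as the front-end speech recognition module of a French voice assistant.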

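🚀 Fine-tuned French Voxpopuli wav2vec2 large model for speech recognition in French

This model was obtained by fine-tuning facebook/wav2vec2-large-fr-voxpopuli on French, using the training and validation splits of Common Voice 6.1. When using the model, make sure that the speech input is sampled at 16 kHz.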
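As a quick sanity check of the 16 kHz requirement, the sampling rate of a PCM WAV file can be read from its header before transcription. The following is a minimal sketch using only the Python standard library; the helper names `wav_sample_rate` and `is_model_ready` are hypothetical and are not part of the model or of any library mentioned in this card.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) that the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_model_ready(path, expected_rate=EXPECTED_RATE):
    """Return True if the file is already at the expected sampling rate."""
    return wav_sample_rate(path) == expected_rate
```

Files recorded at another rate do not have to be rejected: they can be resampled while loading, for example with `librosa.load(path, sr=16_000)`, which is how the inference scripts in this card read audio.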

The fine-tuning of this model was made possible by GPU compute generously provided by OVHcloud 👍

The training script is available here: https://github.com/jonatasgrosman/wav2vec2-sprint

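🚀 Quick start

✨ Main features

Fine-tuned from the pre-trained facebook/wav2vec2-large-fr-voxpopuli model for French speech recognition.
Trained on the training and validation splits of Common Voice 6.1, a crowdsourced corpus recorded by many different speakers.
Fine-tuned on GPU compute provided by OVHcloud.

💻 Usage examples

Basic usage

Using the HuggingSound library: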

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-fr-voxpopuli-french")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fr"
MODEL_ID = "jonatasgrosman/wav2vec2-large-fr-voxpopuli-french"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Example predictions:

Reference	Prediction
CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE.	CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE
CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ACHÉMÉNIDE ET SEPT DES SASSANIDES.	CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNESTIE ACHÉMÉNIDE ET SEPT DES SACENNIDES
J'AI DIT QUE LES ACTEURS DE BOIS AVAIENT, SELON MOI, BEAUCOUP D'AVANTAGES SUR LES AUTRES.	JAI DIT QUE LES ACTEURS DE BOIS AVAIENT SELON MOI BEAUCOUP DAVANTAGE SUR LES AUTRES
LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS.	LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS
IL Y A MAINTENANT UNE GARE ROUTIÈRE.	IL A MAINTENANT GULA E RETIREN
HUIT	HUIT
DANS L’ATTENTE DU LENDEMAIN, ILS NE POUVAIENT SE DÉFENDRE D’UNE VIVE ÉMOTION	DANS LATTENTE DU LENDEMAIN IL NE POUVAIT SE DÉFENDRE DUNE VIVE ÉMOTION
LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES.	LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZ ÉPISODES
ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES.	ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES
ZÉRO	ZÉRO
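The inference script above takes the per-frame argmax of the logits and hands the resulting ids to `processor.batch_decode`. Conceptually this is greedy CTC decoding: consecutive repeated ids are collapsed, blank ids are dropped, and the remaining ids are mapped to characters. The sketch below illustrates that idea on a toy example. It assumes the usual wav2vec2 convention that the padding token acts as the CTC blank and that `|` separates words; the vocabulary, the frame ids, and the function name `greedy_ctc_decode` are all made up for illustration and are not the model's real vocabulary or the library's implementation.

```python
def greedy_ctc_decode(token_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    chars = []
    previous = None
    for token_id in token_ids:
        # A token is emitted only when it differs from the previous frame
        # and is not the CTC blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real model ships its own vocabulary file.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "U", 4: "I", 5: "T"}

# Per-frame argmax ids for one utterance (invented for this example).
frames = [0, 2, 2, 0, 3, 3, 3, 4, 0, 0, 5, 5, 1, 0]
print(greedy_ctc_decode(frames, TOY_VOCAB))  # HUIT
```

Because a blank between two identical ids keeps them separate, the sequence `[2, 0, 2]` decodes to `HH`, while `[2, 2]` decodes to a single `H`.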

📚 Detailed documentation

Evaluation

The model can be evaluated on the French (fr) test data of Common Voice as follows:

import torch
import re
import warnings  # used below to silence librosa loading warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fr"
MODEL_ID = "jonatasgrosman/wav2vec2-large-fr-voxpopuli-french"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on DEVICE and decode the predicted token ids to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
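The reference normalization in the evaluation script is easy to miss: every character listed in `CHARS_TO_IGNORE` is removed with one regular-expression character class, and the result is upper-cased. The following sketch shows the same pattern on a small, illustrative subset of that list; `CHARS_SUBSET`, `subset_regex`, and `normalize` are hypothetical names introduced only for this example.

```python
import re

# Small illustrative subset of the CHARS_TO_IGNORE list used above.
CHARS_SUBSET = [",", "?", ".", "!", ";", ":", '"', "«", "»", "…"]

# re.escape keeps characters such as "." and "?" literal inside the class.
subset_regex = f"[{re.escape(''.join(CHARS_SUBSET))}]"


def normalize(sentence):
    """Strip the ignored characters, then upper-case the sentence."""
    return re.sub(subset_regex, "", sentence).upper()


print(normalize("Ce dernier a évolué tout au long de l'histoire romaine."))
# CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE
```

The straight apostrophe is not in the list, so it survives normalization and the reference keeps `L'HISTOIRE` as written.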

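Test results

The table below shows the word error rate (WER) and character error rate (CER) of this model and of other models. The evaluation script was run on 16 May 2021. Note that these results may differ from previously reported ones, possibly because of particularities of the other evaluation scripts used.

Model	WER	CER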
jonatasgrosman/wav2vec2-large-xlsr-53-french	15.90%	5.29%
jonatasgrosman/wav2vec2-large-fr-voxpopuli-french	17.62%	6.04%
Ilyes/wav2vec2-large-xlsr-53-french	19.67%	6.70%
Nhut/wav2vec2-large-xlsr-french	24.09%	8.42%
facebook/wav2vec2-large-xlsr-53-french	25.45%	10.35%
MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-French	28.22%	9.70%
Ilyes/wav2vec2-large-xlsr-53-french_punctuation	29.80%	11.79%
facebook/wav2vec2-base-10k-voxpopuli-ft-fr	61.06%	33.31%
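For readers unfamiliar with the two metrics in the table above, both are edit distances normalized by the reference length: WER counts word-level substitutions, deletions, and insertions, and CER does the same at character level. The sketch below is a plain-Python illustration of that standard definition on a single sentence pair taken from the example predictions (with punctuation removed). It is not the author's `wer.py` or `cer.py`, which may aggregate over the corpus differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_item != hyp_item)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate of one hypothesis against one reference."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS"
prediction = "LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS"
print(f"WER: {wer(reference, prediction):.2%}")  # WER: 28.57%
print(f"CER: {cer(reference, prediction):.2%}")  # CER: 4.44%
```

Two of the seven reference words are wrong, giving a WER of 2/7, while only two of the 45 reference characters are missing, giving a CER of 2/45. This is why a model's CER is typically much lower than its WER.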

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

To cite this model, you can use the following BibTeX entry:

@misc{grosman2021voxpopuli-fr-wav2vec2-large-french,
  title={Fine-tuned {F}rench {V}oxpopuli wav2vec2 large model for speech recognition in {F}rench},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-fr-voxpopuli-french}},
  year={2021}
}