wav2vec2-large-fr-voxpopuli-french: an open-source French speech recognition model

Wav2vec2 Large Fr Voxpopuli French

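Developed by jonatasgrosman

A French speech recognition model fine-tuned from facebook/wav2vec2-large-fr-voxpopuli on the French portion of Common Voice 6.1. It expects 16 kHz audio input.

Speech recognition · French · License: Apache-2.0 · #FrenchSpeechRecognition #LowWordErrorRate #CommonVoiceOptimized

Downloads: 51

Released: 3/2/2022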

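Model overview

An automatic speech recognition (ASR) model for French, built on the wav2vec2 large model pretrained on VoxPopuli and intended for French speech-to-text tasks.

Model features

French recognition accuracy

Reaches 17.62% WER and 6.04% CER on the Common Voice French test set.

Based on VoxPopuli pretraining

Fine-tuned from facebook/wav2vec2-large-fr-voxpopuli, reusing the speech representations learned during pretraining.

16 kHz audio support

Expects speech input sampled at 16 kHz.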

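Model capabilities

French speech recognition

Audio-to-text conversion

Automatic speech recognition

Use cases

Speech transcription

French speech transcription

Converts spoken French into text.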

WER of 17.62% on the Common Voice French test set (roughly 82.38% word accuracy, taken as 100% minus WER)

Voice assistants

French voice command recognition

Serves as the front-end speech recognition module of a French voice assistant.

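🚀 Fine-tuned French VoxPopuli wav2vec2 large model for speech recognition in French

This model was obtained by fine-tuning facebook/wav2vec2-large-fr-voxpopuli on French, using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16 kHz.

The fine-tuning was made possible by GPU compute resources generously provided by OVHcloud 👍

The training script can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint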
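
The 16 kHz requirement above can be checked before any audio reaches the model. The sketch below is not part of the original model card: it reads the sampling rate from a WAV header with Python's standard `wave` module, and the helper names (`wav_sample_rate`, `needs_resampling`, `write_silence`) are hypothetical. It only handles PCM WAV files; other formats such as MP3 would need an audio library.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo on two generated files: one at 16 kHz, one at 44.1 kHz.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_16k = os.path.join(tmp_dir, "ok.wav")
    path_44k = os.path.join(tmp_dir, "cd_quality.wav")
    write_silence(path_16k, 16_000)
    write_silence(path_44k, 44_100)
    checks = {
        "ok.wav": needs_resampling(path_16k),
        "cd_quality.wav": needs_resampling(path_44k),
    }

print(checks)
```

The inference scripts in this document sidestep the issue by loading audio with `librosa.load(..., sr=16_000)`, which resamples while loading; a check like this is mainly useful when audio is read by other means.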

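🚀 Quick start

✨ Key features

Fine-tuned from the pretrained facebook/wav2vec2-large-fr-voxpopuli model for French speech recognition.
Trained on the train and validation splits of Common Voice 6.1.
Fine-tuned on GPU compute resources provided by OVHcloud.

💻 Usage examples

Basic usage

Using the HuggingSound library: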

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-fr-voxpopuli-french")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
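
As a follow-up to the snippet above, here is a minimal sketch of how the results might be paired with their input files. It assumes that `model.transcribe` returns one dictionary per audio file with the recognized text under a `"transcription"` key; that shape is an assumption here, not something stated in this model card, and `pair_transcriptions` is a hypothetical helper. Stand-in data is used so the sketch runs without downloading the model.

```python
def pair_transcriptions(audio_paths, transcriptions):
    """Map each audio path to its recognized text.

    Assumes each item in `transcriptions` is a dict with a
    "transcription" key (assumed HuggingSound output shape).
    """
    return {
        path: result["transcription"]
        for path, result in zip(audio_paths, transcriptions)
    }


# Stand-in data shaped like the assumed output of model.transcribe().
example_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
example_results = [{"transcription": "huit"}, {"transcription": "zéro"}]

paired = pair_transcriptions(example_paths, example_results)
for path, text in paired.items():
    print(f"{path}: {text}")
```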

Advanced usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fr"
MODEL_ID = "jonatasgrosman/wav2vec2-large-fr-voxpopuli-french"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Example predictions:

Reference	Prediction
"CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE."	CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE
CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ACHÉMÉNIDE ET SEPT DES SASSANIDES.	CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNESTIE ACHÉMÉNIDE ET SEPT DES SACENNIDES
"J'AI DIT QUE LES ACTEURS DE BOIS AVAIENT, SELON MOI, BEAUCOUP D'AVANTAGES SUR LES AUTRES."	JAI DIT QUE LES ACTEURS DE BOIS AVAIENT SELON MOI BEAUCOUP DAVANTAGE SUR LES AUTRES
LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS.	LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS
IL Y A MAINTENANT UNE GARE ROUTIÈRE.	IL A MAINTENANT GULA E RETIREN
HUIT	HUIT
DANS L’ATTENTE DU LENDEMAIN, ILS NE POUVAIENT SE DÉFENDRE D’UNE VIVE ÉMOTION	DANS LATTENTE DU LENDEMAIN IL NE POUVAIT SE DÉFENDRE DUNE VIVE ÉMOTION
LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES.	LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZ ÉPISODES
ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES.	ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES
ZÉRO	ZÉRO
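
The `torch.argmax` followed by `processor.batch_decode` step in the inference script above amounts to greedy CTC decoding: pick the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python on a hand-written frame sequence. It is a simplified illustration rather than the library's implementation, and the `<pad>` blank and `|` word delimiter are assumed to match this model's vocabulary.

```python
from itertools import groupby

BLANK = "<pad>"       # assumed CTC blank token (the tokenizer's padding token)
WORD_DELIMITER = "|"  # assumed token marking a word boundary


def greedy_ctc_decode(frame_tokens):
    """Collapse a per-frame token sequence into text, greedy CTC style."""
    # 1. Merge consecutive duplicates: H H <pad> U U -> H <pad> U
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # 2. Drop blanks and turn word delimiters into spaces.
    chars = [
        " " if token == WORD_DELIMITER else token
        for token in collapsed
        if token != BLANK
    ]
    # 3. Join and squeeze any repeated or trailing spaces.
    return " ".join("".join(chars).split())


# Hand-written stand-in for the argmax output of the acoustic model.
frames = ["H", "H", "<pad>", "U", "I", "I", "<pad>", "T", "T", "|", "|",
          "Z", "É", "<pad>", "R", "O", "O"]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

The blank token is what lets the model emit genuine double letters: `["L", "<pad>", "L"]` decodes to `LL`, whereas `["L", "L"]` collapses to a single `L`.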

📚 Documentation

Evaluation

The model can be evaluated on the French (fr) test data of Common Voice as follows:

import re
import warnings  # used below to silence audio-loading warnings

import librosa
import torch
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fr"
MODEL_ID = "jonatasgrosman/wav2vec2-large-fr-voxpopuli-french"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We transcribe each batch and store the decoded predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
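
To make the reference normalization in the evaluation script above concrete, here is a small standalone sketch of the same idea: strip a set of punctuation characters with a regex character class, then uppercase. It uses a short illustrative subset of `CHARS_TO_IGNORE` rather than the full list, and the names `CHARS_SUBSET` and `normalize` are hypothetical.

```python
import re

# Illustrative subset of the CHARS_TO_IGNORE list used by the script above.
CHARS_SUBSET = [",", "?", ".", "!", ";", ":", '"', "«", "»", "’"]

# re.escape keeps characters such as "." and "?" literal inside the class.
subset_regex = f"[{re.escape(''.join(CHARS_SUBSET))}]"


def normalize(sentence):
    """Remove the ignored characters, then uppercase the sentence."""
    return re.sub(subset_regex, "", sentence).upper()


normalized = normalize("Les Pays-Bas ont remporté toutes les éditions.")
print(normalized)
```

Note that the hyphen and the ASCII apostrophe are not in the ignore list, so they survive normalization, while the typographic apostrophe `’` is removed.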

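Test results

The table below shows the word error rate (WER) and character error rate (CER) of this model and of other models. The evaluation script was run on 2021-05-16. Note that these results may differ from previously reported ones, possibly because of particularities of the other evaluation scripts used.

Model	Word error rate (WER)	Character error rate (CER)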
jonatasgrosman/wav2vec2-large-xlsr-53-french	15.90%	5.29%
jonatasgrosman/wav2vec2-large-fr-voxpopuli-french	17.62%	6.04%
Ilyes/wav2vec2-large-xlsr-53-french	19.67%	6.70%
Nhut/wav2vec2-large-xlsr-french	24.09%	8.42%
facebook/wav2vec2-large-xlsr-53-french	25.45%	10.35%
MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-French	28.22%	9.70%
Ilyes/wav2vec2-large-xlsr-53-french_punctuation	29.80%	11.79%
facebook/wav2vec2-base-10k-voxpopuli-ft-fr	61.06%	33.31%
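
For readers unfamiliar with the two metrics in the table above, the following sketch computes WER and CER for a single sentence pair as edit distance divided by reference length. It is a plain-Python illustration, not the `wer.py` and `cer.py` scripts used for the reported numbers: it assumes a non-empty reference, counts spaces as characters for CER, and scores one sentence rather than aggregating over a whole test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Sentence pair taken from the example predictions in this document.
reference = "LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES"
hypothesis = "LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZ ÉPISODES"

print(f"WER: {wer(reference, hypothesis) * 100:.2f}%")
print(f"CER: {cer(reference, hypothesis) * 100:.2f}%")
```

One wrong word out of eight gives a WER of 12.5%, while the single missing character barely moves the CER, which is why CER is typically much lower than WER for the same model.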

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

To cite this model, you can use the following BibTeX entry:

@misc{grosman2021voxpopuli-fr-wav2vec2-large-french,
  title={Fine-tuned {F}rench {V}oxpopuli wav2vec2 large model for speech recognition in {F}rench},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-fr-voxpopuli-french}},
  year={2021}
}