wav2vec2-large-xlsr-53-french open-source model: French speech-to-text with a low word error rate

Wav2vec2 Large Xlsr 53 French

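Developed by jonatasgrosman

A French speech recognition model fine-tuned from the XLSR-53 large model and trained on the Common Voice dataset for French speech-to-text.

Speech recognition | French | License: Apache-2.0 | #FrenchSpeechRecognition #LowWordErrorRate #XLSR53FineTuned

Downloads: 47.83k

Released: 3/2/2022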

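Model overview

An automatic speech recognition (ASR) model for French, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint, that converts French speech into text.

Model features

Accurate French recognition

Reaches a 17.65% word error rate (WER) and a 4.89% character error rate (CER) on the Common Voice French test set

Optional language model

With a language model added, WER drops to 13.59% and CER to 3.91%, a relative WER reduction of about 23%

16 kHz sampling rate

Expects speech input sampled at 16 kHz; audio recorded at other rates should be resampled first

Open-source license

Released under the Apache-2.0 license, which permits both commercial and research use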

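Model capabilities

French speech recognition

Real-time speech-to-text

Batch audio processing

Use cases

Speech transcription

French speech to text

Converts spoken French into editable text

Reaches a 17.65% WER on the Common Voice French test set (13.59% with a language model)

Voice assistants

French voice command recognition

Recognizes spoken commands for French voice assistants or control systems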

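🚀 Fine-tuned XLSR-53 large model for speech recognition in French

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on French using the train and validation splits of Common Voice 6.1. When using it, make sure the speech input is sampled at 16 kHz.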
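
Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The sketch below reads the rate from a WAV header using only the Python standard library. The helper names (`wav_sample_rate`, `needs_resampling`, `write_silence`) and the temporary files are illustrative assumptions, not part of the model's tooling, and the `wave` module only handles PCM WAV files, not MP3.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != expected_rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo on two throwaway files: one at 16 kHz, one at 44.1 kHz.
tmp_dir = tempfile.mkdtemp()
path_16k = os.path.join(tmp_dir, "ok_16k.wav")
path_44k = os.path.join(tmp_dir, "needs_resample_44k.wav")
write_silence(path_16k, 16_000)
write_silence(path_44k, 44_100)

print(path_16k, "needs resampling:", needs_resampling(path_16k))
print(path_44k, "needs resampling:", needs_resampling(path_44k))
```

A file flagged here would need to be resampled before inference; loading it with `librosa.load(path, sr=16_000)` resamples on load.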

The fine-tuning was made possible by GPU credits generously provided by OVHcloud. The training script is available at: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Key features

Datasets: trained on common_voice and mozilla-foundation/common_voice_6_0.
Metrics: evaluated with word error rate (WER) and character error rate (CER).
Use case: automatic speech recognition for French.
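
To make the two metrics concrete, here is a minimal sketch of WER and CER for a single sentence pair, computed as Levenshtein edit distance divided by the reference length. The functions are hypothetical helpers written for illustration, not the evaluation code behind the reported scores. The sentence pair is taken from this page's sample results, with the final period removed by hand on the assumption that punctuation is normalized away before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS"
hypothesis = "LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS"

print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

Here two of the seven reference words are wrong, while only two characters are missing, which is why CER is usually much lower than WER. Scores over a whole test set are normally computed from total edits over total reference length rather than by averaging per-sentence values.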

💻 Usage examples

Basic usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-french")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fr"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-french"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
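
The `torch.argmax` and `processor.batch_decode` steps in the script amount to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that rule in plain Python on a toy vocabulary. `ctc_greedy_decode` and `toy_vocab` are hypothetical. Using `<pad>` as the blank and `|` as the word delimiter follows the usual wav2vec2 tokenizer convention and is an assumption here, not something read from this model's files.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    # 1. Merge runs of identical consecutive ids (one id per run).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the blank token, which only separates repeated characters.
    tokens = [id_to_token[t] for t in collapsed if t != blank_id]
    # 3. Turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (illustrative only).
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "U", 4: "I", 5: "T", 6: "L", 7: "E"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one clip.
huit_frames = [0, 2, 2, 0, 3, 4, 4, 0, 5, 5, 0]
print(ctc_greedy_decode(huit_frames, toy_vocab, blank_id=0))

# A blank between the two L frames keeps the double letter in "ELLE".
elle_frames = [7, 6, 0, 6, 7, 1, 2, 3, 4, 5]
print(ctc_greedy_decode(elle_frames, toy_vocab, blank_id=0))
```

Greedy decoding looks at each frame in isolation. Decoding with a language model instead scores whole word sequences, which is the kind of setup the "+LM" results on this page refer to.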

Sample recognition results

Reference	Prediction
CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE.	CE DERNIER ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE
CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ACHÉMÉNIDE ET SEPT DES SASSANIDES.	CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ASHEMÉNID ET SEPT DES SASANDNIDES
J'AI DIT QUE LES ACTEURS DE BOIS AVAIENT, SELON MOI, BEAUCOUP D'AVANTAGES SUR LES AUTRES.	JAI DIT QUE LES ACTEURS DE BOIS AVAIENT SELON MOI BEAUCOUP DAVANTAGES SUR LES AUTRES
LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS.	LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS
IL Y A MAINTENANT UNE GARE ROUTIÈRE.	IL AMNARDIGAD LE TIRAN
HUIT	HUIT
DANS L’ATTENTE DU LENDEMAIN, ILS NE POUVAIENT SE DÉFENDRE D’UNE VIVE ÉMOTION	DANS L'ATTENTE DU LENDEMAIN IL NE POUVAIT SE DÉFENDRE DUNE VIVE ÉMOTION
LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES.	LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES
ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES.	ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES
ZÉRO	ZEGO

📚 Detailed documentation

Evaluation

To evaluate on the test split of mozilla-foundation/common_voice_6_0:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-french --dataset mozilla-foundation/common_voice_6_0 --config fr --split test

To evaluate on speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-french --dataset speech-recognition-community-v2/dev_data --config fr --split validation --chunk_length_s 5.0 --stride_length_s 1.0
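
The second command passes `--chunk_length_s 5.0 --stride_length_s 1.0`. Assuming `eval.py` forwards these values to a chunked ASR pipeline (the script is not reproduced on this page), a long recording would be cut into overlapping 5-second windows, with 1 second at each interior edge serving as context that is discarded when the pieces are merged. The sketch below is a simplified illustration of that windowing, not the library's implementation, and `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(total_s, chunk_s, stride_s):
    """Yield (start, end, left_stride, right_stride) windows in seconds.

    Each window is at most chunk_s long. Interior edges carry stride_s of
    overlap used only as context; the first window has no left stride and
    the last window has no right stride.
    """
    step = chunk_s - 2 * stride_s  # new audio covered per window
    if step <= 0:
        raise ValueError("chunk_s must be greater than 2 * stride_s")
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        left = 0.0 if start == 0.0 else stride_s
        right = 0.0 if end >= total_s else stride_s
        yield (start, end, left, right)
        if end >= total_s:
            break
        start += step


# A 12-second clip with the values from the command above.
windows = list(chunk_windows(12.0, 5.0, 1.0))
for start, end, left, right in windows:
    kept = (start + left, end - right)  # region whose output is kept
    print(f"window {start:4.1f}-{end:4.1f} s, kept {kept[0]:4.1f}-{kept[1]:4.1f} s")
```

With these settings each window advances by 3 seconds, and the kept regions tile the clip without gaps, so every part of the audio is transcribed once while still being seen with surrounding context.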

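Model information

Property	Details
Model type	Fine-tuned XLSR-53 large model for speech recognition in French
Training data	common_voice, mozilla-foundation/common_voice_6_0

Evaluation results

Task	Dataset	Metric	Value (%)
Automatic speech recognition	Common Voice fr	Test WER	17.65
Automatic speech recognition	Common Voice fr	Test CER	4.89
Automatic speech recognition	Common Voice fr	Test WER (+LM)	13.59
Automatic speech recognition	Common Voice fr	Test CER (+LM)	3.91
Automatic speech recognition	Robust Speech Event - Dev Data	Dev WER	34.35
Automatic speech recognition	Robust Speech Event - Dev Data	Dev CER	14.09
Automatic speech recognition	Robust Speech Event - Dev Data	Dev WER (+LM)	24.72
Automatic speech recognition	Robust Speech Event - Dev Data	Dev CER (+LM)	12.33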
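
As a quick reading aid for the table, the short sketch below works out how much the language model lowers each error rate in relative terms. The numbers are copied from the table; the `RESULTS` layout and `relative_reduction` helper are illustrative.

```python
# (dataset, metric) -> (score without LM, score with LM), in percent.
RESULTS = {
    ("Common Voice fr", "WER"): (17.65, 13.59),
    ("Common Voice fr", "CER"): (4.89, 3.91),
    ("Robust Speech Event - Dev Data", "WER"): (34.35, 24.72),
    ("Robust Speech Event - Dev Data", "CER"): (14.09, 12.33),
}


def relative_reduction(baseline, improved):
    """Relative error reduction in percent: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline * 100


for (dataset, metric), (without_lm, with_lm) in RESULTS.items():
    gain = relative_reduction(without_lm, with_lm)
    print(f"{dataset:32s} {metric}: {without_lm:5.2f} -> {with_lm:5.2f} "
          f"({gain:.1f}% relative reduction)")
```

The relative WER reduction is about 23% on Common Voice and about 28% on the Robust Speech Event dev data, while absolute error rates on the latter remain roughly twice as high.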

📄 License

This project is released under the Apache 2.0 license.

🔗 Citation

To cite this model, you can use the following BibTeX entry:

@misc{grosman2021xlsr53-large-french,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {F}rench},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-french}},
  year={2021}
}