asr-wav2vec2-ctc-french: an open-source French speech recognition model

Asr Wav2vec2 Ctc French

Developed by bofenghuang

A French automatic speech recognition (ASR) model fine-tuned from wav2vec2-FR-7K-large and trained on more than 2,200 hours of French speech data.

Speech recognition

Transformers

French  License: Apache-2.0  #French-speech-recognition #multi-dialect-robustness #Wav2Vec2-large-model

Downloads: 520

Released: 11/25/2022

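Model overview

The model is built specifically for French speech recognition. It expects audio sampled at 16 kHz and has been evaluated on several French speech datasets.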

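Model features

Multi-dataset training

The model is trained on several French speech datasets, including Common Voice 11.0, Multilingual LibriSpeech, and Voxpopuli, covering a range of speech scenarios.

Language model support

The model can be combined with a language model, which lowers the word error rate (WER), for example from 11.44% to 9.66% on the Common Voice 11.0 test set.

African accent support

The training data includes African-accented French, so the model can recognize French spoken with African accents.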

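Model capabilities

French speech recognition

Processing of audio sampled at 16 kHz

Language model integration

Speech recognition across multiple scenarios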

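Use cases

Speech transcription

French speech to text

Converts French speech into text

WER of 11.44% without a language model and 9.66% with one on the Common Voice 11.0 test set

Speech analysis

African-accented French recognition

Recognizes French speech spoken with African accents

WER of 16.22% without a language model and 15.39% with one on the African Accented French test set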
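The WER figures quoted above count word-level substitutions, deletions, and insertions relative to the number of reference words. The sketch below illustrates that definition with a plain edit-distance computation. `word_error_rate` is a hypothetical helper written for illustration, and it assumes whitespace tokenization with no text normalization, which may differ from what the model's evaluation script does.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent (illustrative helper, not from the model card)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("bonjour tout le monde", "bonjour tous le monde"))
```

Because the denominator is the reference length, the WER can exceed 100% when the hypothesis contains many inserted words.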

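🚀 Fine-tuned wav2vec2-FR-7K-large model for French automatic speech recognition

This model targets French automatic speech recognition. It is fine-tuned on a large collection of French speech datasets to improve recognition accuracy and robustness across a variety of French speech scenarios.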

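🚀 Quick start

This model is a fine-tuned version of LeBenchmark/wav2vec2-FR-7K-large. It was trained on a composite dataset of more than 2,200 hours of French speech audio, drawn from the training and validation splits of Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure the speech input is sampled at 16 kHz.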

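✨ Key features

Fine-tuned: built by fine-tuning a pretrained model, so it is better adapted to French speech recognition.
Multi-dataset training: trained on several French speech datasets to improve generalization.
Language model support: a language model can optionally be used to further improve recognition accuracy.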

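📦 Installation

The source documentation does not describe installation steps. If needed, refer to the official installation guides of the relevant libraries (such as transformers, torch, and torchaudio).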
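Since no installation steps are given, a quick way to see what is missing is to check which packages can be imported. The sketch below is only an illustration: `missing_packages` is a hypothetical helper, and the package lists are an assumption based on the imports in the usage examples, plus `pyctcdecode` and `kenlm`, which `Wav2Vec2ProcessorWithLM` generally relies on for language-model decoding.

```python
import importlib.util

# Assumed requirements, inferred from the usage examples (not stated by the model card).
REQUIRED = ["torch", "torchaudio", "transformers"]
# Assumed to be needed only when decoding with the language model.
OPTIONAL_LM = ["pyctcdecode", "kenlm"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing required packages:", missing_packages(REQUIRED))
print("Missing LM decoding packages:", missing_packages(OPTIONAL_LM))
```

Any package reported as missing can then be installed by following that library's own installation guide.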

💻 Usage examples

Basic usage

Processing a local audio file with a language model

import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono; single-channel audio is unchanged

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with beam search over the language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
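The `# normalize` step in the example above is handled by the processor's feature extractor. As a rough illustration of what that typically means for wav2vec2 models, the sketch below rescales a waveform to zero mean and unit variance. It assumes that normalization is enabled in this model's feature extractor configuration, and `zero_mean_unit_var` is a hypothetical pure-Python helper rather than the library's implementation.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Rescale a sequence of audio samples to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    # eps keeps the division stable for silent (constant) input.
    scale = math.sqrt(variance + eps)
    return [(s - mean) / scale for s in samples]


normalized = zero_mean_unit_var([0.1, 0.2, 0.3, 0.4])
print(normalized)
```

In practice there is no need to do this by hand; calling the processor on the raw waveform, as in the example above, is enough.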

Processing a local audio file without a language model

import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono; single-channel audio is unchanged

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
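Without a language model, the example above takes the most likely token for each frame and lets the tokenizer turn that sequence into text. The sketch below shows the usual greedy CTC rule behind that step: collapse consecutive repeats, then drop the blank token. The vocabulary, token ids, and the `ctc_greedy_decode` helper are hypothetical and chosen for illustration; the real model's vocabulary differs.

```python
from itertools import groupby

# Hypothetical toy vocabulary: id 0 plays the role of the CTC blank, "|" separates words.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "o", 3: "u", 4: "i", 5: "n"}


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, remove blanks, and join tokens into text."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only marks "no new character".
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


frame_ids = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 2, 0, 5]
print(ctc_greedy_decode(frame_ids, TOY_VOCAB))
```

The blank token is what allows doubled letters: `[5, 0, 5]` decodes to "nn", whereas `[5, 5, 5]` collapses to a single "n".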

📚 Detailed documentation

Evaluation

Evaluating on `mozilla-foundation/common_voice_11_0`

python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"

Evaluating on `speech-recognition-community-v2/dev_data`

python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
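The `--chunk_length_s` and `--stride_length_s` flags in the second command suggest that long recordings are transcribed in overlapping windows. The sketch below assumes the flags behave like the parameters of the same name in the transformers speech recognition pipeline, where the stride applies to both sides of a chunk, so consecutive windows advance by `chunk_length_s - 2 * stride_length_s`. Whether `eval.py` does exactly this is an assumption, and `chunk_windows` is a hypothetical, simplified helper.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) times in seconds of overlapping windows covering the audio."""
    # Each window overlaps its neighbours by the stride on both sides.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than twice stride_length_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording with the values from the command above.
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

The overlap gives the model acoustic context at chunk boundaries; predictions from the overlapping margins are usually discarded when the chunks are stitched back together.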

Model metrics

Property	Details
Model type	Fine-tuned model based on the Wav2Vec2-CTC architecture
Training data	Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, Multilingual TEDx, MediaSpeech, African Accented French

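Evaluation results

Task	Dataset	Test WER (%)	Test WER (%) (+LM)
Automatic speech recognition	Common Voice 11.0	11.44	9.66
Automatic speech recognition	Multilingual LibriSpeech (MLS)	5.93	5.13
Automatic speech recognition	VoxPopuli	9.33	8.51
Automatic speech recognition	African Accented French	16.22	15.39
Automatic speech recognition	Robust Speech Event - Dev Data	16.56	12.96
Automatic speech recognition	Fleurs	10.10	8.84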
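To make the effect of the language model easier to compare across datasets, the sketch below computes the relative WER reduction for each row of the table. The numbers are copied from the table above; `relative_reduction` is a hypothetical helper added for illustration.

```python
# (WER without LM, WER with LM) in percent, copied from the results table.
RESULTS = {
    "Common Voice 11.0": (11.44, 9.66),
    "Multilingual LibriSpeech (MLS)": (5.93, 5.13),
    "VoxPopuli": (9.33, 8.51),
    "African Accented French": (16.22, 15.39),
    "Robust Speech Event - Dev Data": (16.56, 12.96),
    "Fleurs": (10.10, 8.84),
}


def relative_reduction(wer_without_lm, wer_with_lm):
    """Return the relative WER reduction in percent obtained by adding the LM."""
    return 100.0 * (wer_without_lm - wer_with_lm) / wer_without_lm


reductions = {name: relative_reduction(*pair) for name, pair in RESULTS.items()}
for name, value in sorted(reductions.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.1f}% relative WER reduction with the LM")
```

On these figures the language model helps most on the Robust Speech Event dev data (about 21.7% relative) and least on African Accented French (about 5.1% relative).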