whisper-large-v3-french: an open-source French speech recognition model that predicts casing, punctuation, and numbers

Whisper Large V3 French

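Developed by bofenghuang

A French automatic speech recognition model fine-tuned from OpenAI Whisper-large-v3, with support for predicting casing, punctuation, and numbers

Speech recognition

Transformers

French | License: MIT | #FrenchSpeechRecognition #MultiScenarioAdaptation #LowWER

Downloads: 771

Released: 11/27/2023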

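Model overview

An automatic speech recognition system optimized for French. It performs well on several French datasets and supports long-form transcription and fast inference.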

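Model features

Multiple formats

Converted to several formats for use with transformers, openai-whisper, faster-whisper, and other libraries

Efficient long-form processing

Supports chunked, parallel processing of long audio, with inference up to 9x faster than sequential processing

Speculative decoding

Supports speculative decoding with a distilled draft model, giving a 2x speedup while producing the same outputs

Broad dataset coverage

Performs well on several French datasets, including Common Voice, Multilingual LibriSpeech, and VoxPopuli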

Model capabilities

French speech recognition

Long-form audio transcription

Punctuation prediction

Casing prediction

Number formatting

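Use cases

Speech to text

Meeting notes

Automatically convert recordings of French meetings into written records

Accuracy above 90%

Media subtitle generation

Automatically generate subtitles for French video content

Supports a range of French accents

Speech analytics

Call center speech analytics

Analyze the content of customer service calls

Maintains good performance in noisy environments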

🚀 Whisper-Large-V3-French

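Whisper-Large-V3-French is fine-tuned from openai/whisper-large-v3 to further improve its performance on French. The model has been trained to predict casing, punctuation, and numbers. Although this may slightly sacrifice performance, we believe it allows for broader usage.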

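🚀 Quick Start

Whisper-Large-V3-French can be used for French speech recognition. It has been converted to several formats for use across different libraries, including transformers, openai-whisper, faster-whisper, whisper.cpp, candle, and mlx.

✨ Key Features

Fine-tuned from openai/whisper-large-v3, with better performance on French.
Predicts casing, punctuation, and numbers.
Available in several formats for use with different libraries.

📦 Installation

Depending on your use case, install the library you need: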

OpenAI Whisper

pip install -U openai-whisper

Faster Whisper

pip install faster-whisper

Whisper.cpp

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make

Candle

git clone https://github.com/huggingface/candle.git
cd candle/candle-examples/examples/whisper

MLX

git clone https://github.com/ml-explore/mlx-examples.git
cd mlx-examples/whisper
pip install -r requirements.txt

💻 Usage Examples

Basic usage

Hugging Face Pipeline

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    # chunk_length_s=30,  # for long-form transcription
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])

Hugging Face Low-level APIs

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Advanced usage

Speculative Decoding

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load draft model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])
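
To see why speculative decoding can leave the output unchanged, the toy sketch below may help. It is not the Transformers implementation: `target` and `draft` are hypothetical stand-ins for the large model and the distilled draft model, each reduced to a function that returns the next greedy token for a given prefix. The draft proposes a few tokens, the target checks them, and only the target's own tokens are kept, so the result matches plain greedy decoding. In a real system the check would run as one batched forward pass, which is where the speedup comes from; here it is sequential for readability.

```python
from typing import Callable, List

# Hypothetical model interface: token prefix -> next greedy token
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int, eos: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(max_new):
        tok = target(out)
        out.append(tok)
        if tok == eos:
            break
    return out[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, eos: int, k: int = 4
) -> List[int]:
    """Toy greedy speculative decoding; k (>= 1) is the number of drafted tokens per round."""
    out = list(prompt)
    n_prompt = len(prompt)
    while len(out) - n_prompt < max_new:
        budget = min(k, max_new - (len(out) - n_prompt))
        # 1) The cheap draft model proposes up to `budget` tokens.
        ctx = list(out)
        proposal = []
        for _ in range(budget):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target verifies the proposal. Its own token is always the one kept,
        #    so a wrong draft token costs time but never changes the output.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok or expected == eos:
                break
        if out[-1] == eos:
            break
    return out[n_prompt:]
```

The better the draft model agrees with the target, the more tokens are accepted per round and the fewer target passes are needed.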

OpenAI Whisper

import whisper
from datasets import load_dataset

# Load model
model = whisper.load_model("./models/whisper-large-v3-french/original_model.pt")

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

# Transcribe
result = model.transcribe(sample, language="fr")
print(result["text"])

Faster Whisper

from datasets import load_dataset
from faster_whisper import WhisperModel

# Load model
model = WhisperModel("./models/whisper-large-v3-french/ctranslate2", device="cuda", compute_type="float16")  # Run on GPU with FP16

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
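
For the subtitle use case, timestamped segments like the ones printed above could be written out in SRT format. The sketch below is one possible way to do that and is not part of faster-whisper; `srt_timestamp` and `to_srt` are hypothetical helper names, and the input is assumed to be an iterable of `(start_seconds, end_seconds, text)` tuples.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build an SRT document from (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Subtitle blocks are separated by a blank line.
    return "\n".join(blocks)
```

With faster-whisper, the tuples could be built as `[(s.start, s.end, s.text) for s in segments]`.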

Whisper.cpp

./main -m ./models/whisper-large-v3-french/ggml-model-q5_0.bin -l fr -f /path/to/audio/file --print-colors

Candle

cargo run --example whisper --release -- --model large-v3 --model-id bofenghuang/whisper-large-v3-french --language fr --input /path/to/audio/file

To use CUDA, add --features cuda to the command line:

cargo run --example whisper --release --features cuda -- --model large-v3 --model-id bofenghuang/whisper-large-v3-french --language fr --input /path/to/audio/file

MLX

import whisper

result = whisper.transcribe("/path/to/audio/file", path_or_hf_repo="mlx_models/whisper-large-v3-french", language="fr")
print(result["text"])

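📚 Detailed Documentation

Performance evaluation

We evaluated the model on both short-form and long-form transcription, and tested it on both in-distribution and out-of-distribution datasets to comprehensively analyze its accuracy, generalizability, and robustness.

Note that the reported WER (word error rate) is computed after converting numbers to text, removing punctuation (except apostrophes and hyphens), and converting all characters to lowercase.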
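
As a rough illustration of this scoring scheme, the sketch below lowercases both texts, strips punctuation other than apostrophes and hyphens, and computes WER as word-level edit distance divided by the reference length. It is not the evaluation code used for the reported results: `normalize` and `wer` are hypothetical helpers, the punctuation set is an assumption, and the number-to-text step is left out because it needs a French number speller.

```python
import re
import string

# Punctuation to strip: ASCII punctuation except apostrophe and hyphen,
# plus a few typographic marks common in French text (an assumed set).
_PUNCT = "".join(c for c in string.punctuation if c not in "'-") + "«»…“”"
_STRIP_TABLE = str.maketrans(_PUNCT, " " * len(_PUNCT))


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (keeping ' and -), and collapse whitespace."""
    text = text.lower().translate(_STRIP_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(wer("Bonjour, tout le monde !", "bonjour tout le monde"))
```

Because casing and punctuation are normalized away before scoring, the metric measures word recognition only, not the quality of the predicted casing and punctuation.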

Evaluation results on all public datasets can be found here.

Short-form transcription

[Figure: short-form evaluation results (eval-short-form)]

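Because readily available out-of-domain (OOD) and long-form French test sets are lacking, we evaluated on internal test sets from Zaion Lab. These sets consist of human-annotated audio-transcription pairs from call center conversations, which are notable for their heavy background noise and domain-specific terminology.

Long-form transcription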

[Figure: long-form evaluation results (eval-long-form)]

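Long-form transcription was run with the 🤗 Hugging Face pipeline for quicker evaluation. Audio files were split into 30-second segments and processed in parallel.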
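
The chunking idea can be sketched as follows. This is only an illustration, not the pipeline's internal code: `chunk_bounds` is a hypothetical helper, and the 5-second overlap between neighbouring windows is an assumed value chosen for the example.

```python
def chunk_bounds(num_samples, sampling_rate=16000, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    if num_samples <= 0:
        return []
    chunk = int(chunk_length_s * sampling_rate)
    step = chunk - int(overlap_s * sampling_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step
    return bounds


# A 70-second recording at 16 kHz is covered by three windows.
for start, end in chunk_bounds(70 * 16000):
    print(start / 16000, end / 16000)
```

Since each window is independent of the others, the windows can be stacked into batches and transcribed together, which is what makes this faster than decoding the recording sequentially. The overlap gives the merging step some shared context at the window boundaries.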

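Training details

We collected a composite dataset of over 2,500 hours of French speech recognition data, which includes Common Voice 13.0, Multilingual LibriSpeech, VoxPopuli, Fleurs, Multilingual TEDx, MediaSpeech, African Accented French, and other datasets.

Because some datasets, such as MLS, only provide text without casing or punctuation, we used a customized version of 🤗 Speechbox to restore casing and punctuation from a limited set of symbols, using the bofenghuang/whisper-large-v2-cv11-french model.

However, even within these datasets we observed quality issues. These included mismatches between audio and transcription in language or content, poorly segmented utterances, and missing words in scripted speech. We built a pipeline to filter out many of these problematic utterances in order to improve dataset quality. As a result, we excluded more than 10% of the data, and when we retrained the model we observed a significant reduction in hallucinations.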
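
The text does not describe the actual filters, so the following is only a guess at what one such check might look like. It assumes each utterance comes with the dataset transcript, a hypothesis from some existing ASR model, and a duration; the `Utterance` class, the `keep` function, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Utterance:
    transcript: str    # label shipped with the dataset
    hypothesis: str    # output of an existing ASR model on the same audio (assumed available)
    duration_s: float  # audio length in seconds


def keep(utt: Utterance, min_similarity: float = 0.6, max_words_per_second: float = 6.0) -> bool:
    """Heuristic filter: drop utterances whose label looks inconsistent with the audio."""
    ref = utt.transcript.lower().split()
    hyp = utt.hypothesis.lower().split()
    if not ref or utt.duration_s <= 0:
        return False
    # Far more words than could plausibly be spoken suggests a segmentation problem.
    if len(ref) / utt.duration_s > max_words_per_second:
        return False
    # Low word-level overlap suggests a language or content mismatch, or missing words.
    return SequenceMatcher(None, ref, hyp).ratio() >= min_similarity
```

A list of utterances could then be filtered with `[u for u in utterances if keep(u)]`, with the thresholds tuned by inspecting what gets dropped.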

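For training, we used the script available in the 🤗 Transformers repository. The model was trained on the Jean-Zay supercomputer at GENCI, and we thank the IDRIS team for their responsive support throughout the project.

Acknowledgements

Thanks to OpenAI for creating and open-sourcing the Whisper model.
Thanks to 🤗 Hugging Face for integrating the Whisper model into the Transformers repository and providing the training codebase.
Thanks to GENCI for generously contributing GPU hours to this project.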

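🔧 Technical Details

Evaluation metric

WER (word error rate) is used to measure the model's accuracy on speech transcription.

Training script

Training used the run_speech_recognition_seq2seq.py script from the 🤗 Transformers repository.

Training environment

Training was carried out on the Jean-Zay supercomputer.

📄 License

This project is released under the MIT License.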

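Model information

Property	Details
Model type	Speech recognition model fine-tuned from `openai/whisper-large-v3`
Training data	A composite dataset of over 2,500 hours of French speech recognition data, including Common Voice 13.0, Multilingual LibriSpeech, VoxPopuli, Fleurs, Multilingual TEDx, MediaSpeech, African Accented French, and others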