whisper-large-v3-distil-fr-v0.2 open-source model: efficient French speech-to-text with dependable accuracy

Whisper Large V3 Distil Fr V0.2

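Developed by bofenghuang

A distilled version of Whisper optimized for French speech-to-text. It keeps only 2 decoder layers, trading some accuracy for faster inference.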

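Speech recognition

Transformers

French | License: MIT | #French speech-to-text #Long-form transcription optimization #Speculative decoding acceleration

Downloads: 385

Released: 8/22/2024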

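Model overview

A French-optimized distilled model based on OpenAI Whisper-large-v3. It achieves efficient speech recognition by reducing the number of decoder layers and using a "patient" teacher distillation strategy.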

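Model highlights

Efficient inference

5.8x faster than the original model with only 49% of the parameters, suitable for resource-constrained settings.

Long-form optimization

Trained on 30-second audio segments, which strengthens long-form transcription and can reduce hallucinated output.

Multi-framework compatibility

Supports multiple inference frameworks, including transformers, faster-whisper, and whisper.cpp.

Speculative decoding support

Can serve as a draft model for a 2x speedup while producing the same output as the original model.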

Model capabilities

French speech-to-text

Long audio transcription

Real-time speech recognition

Noisy speech processing

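Use cases

Customer service

Customer service call transcription

Handles customer service recordings that contain background noise and domain-specific terminology.

Performs well on an internal test set.

Multimedia processing

French video subtitle generation

Automatically generates subtitles for French video content.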

🚀 Whisper-Large-V3-Distil-French-v0.2

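A distilled version of Whisper with 2 decoder layers, optimized for French speech-to-text. It speeds up inference while retaining much of the accuracy.

The model is a distilled version of Whisper with only 2 decoder layers, optimized for French speech-to-text. Compared to v0.1, this version extends training to 30-second audio segments to preserve long-form transcription ability. Distillation used a "patient" teacher, meaning longer training and more aggressive data augmentation, which improved overall performance.

The model uses openai/whisper-large-v3 as the teacher and keeps the encoder architecture unchanged. This makes it suitable as a draft model for speculative decoding: by adding only 2 extra decoder layers and running the encoder just once, it can potentially deliver 2x faster inference while producing identical outputs. It can also run as a standalone model, trading some accuracy for efficiency: it runs 5.8x faster with only 49% of the parameters. The Distil-Whisper paper also suggests that distilled models may actually hallucinate less than the full model during long-form transcription.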
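
To illustrate why speculative decoding can keep the output identical to the main model, here is a minimal toy sketch of the greedy variant. It does not use Whisper or any real library: `main_model`, `draft_model`, and the token arithmetic are hypothetical stand-ins, and a real implementation verifies all drafted positions in a single forward pass of the main model rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int, eos: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        tok = model(out)
        out.append(tok)
        if tok == eos:
            break
    return out


def speculative_decode(main: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, eos: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the main model verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            tok = draft(out + proposal)
            proposal.append(tok)
            if tok == eos:
                break
        # 2) The main model checks each proposed token. Its own choice is
        #    always the one that is kept, so the result matches plain greedy
        #    decoding with the main model.
        for tok in proposal:
            expected = main(out)
            out.append(expected)
            if expected != tok or expected == eos:
                break  # mismatch (already corrected) or end of sequence
        if out[-1] == eos:
            break
    return out


# Hypothetical toy models: the draft agrees with the main model most of the time.
def main_model(prefix: List[int]) -> int:
    return (3 * sum(prefix) + len(prefix)) % 10


def draft_model(prefix: List[int]) -> int:
    guess = main_model(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


EOS = 9
reference = greedy_decode(main_model, [1, 2], max_new=20, eos=EOS)
speculative = speculative_decode(main_model, draft_model, [1, 2], max_new=20, eos=EOS)
print(reference == speculative)
```

The speedup in practice comes from the draft being much cheaper than the main model and from most of its proposals being accepted; the sketch only demonstrates the output-equivalence property.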

The model has been converted to multiple formats for broad compatibility with libraries including transformers, openai-whisper, faster-whisper, whisper.cpp, candle, and mlx.

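🚀 Quick start

The model handles French speech-to-text and is available in several formats and usage modes to fit different scenarios.

✨ Key features

Optimized long-form transcription: training was extended to 30-second audio segments to preserve long-form transcription ability.
Efficient inference: up to 2x faster inference as a draft model for speculative decoding; 5.8x faster as a standalone model, with only 49% of the parameters.
Fewer hallucinations: may hallucinate less than the full model during long-form transcription.
Broad compatibility: converted to multiple formats for use with many libraries.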

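💻 Usage examples

Basic usage

Hugging Face Pipeline

The model works directly with the 🤗 Hugging Face pipeline class for audio transcription. For long-form transcription (over 30 seconds), it uses sequential decoding as described in the OpenAI Whisper paper. For faster inference, set the chunk_length_s parameter to enable chunked parallel decoding, which can be up to 9x faster but may perform slightly worse than OpenAI's sequential algorithm.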
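
As a rough illustration of what chunking means, the sketch below computes overlapping 30-second windows over a waveform. It is not the pipeline's actual algorithm, which also merges the overlapping predictions; the helper name `chunk_bounds` and the 5-second overlap are assumptions made for this example.

```python
from typing import List, Tuple


def chunk_bounds(num_samples: int, sampling_rate: int = 16_000,
                 chunk_length_s: float = 30.0, overlap_s: float = 5.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping fixed-length windows."""
    if num_samples <= 0:
        return []
    chunk = int(chunk_length_s * sampling_rate)
    step = chunk - int(overlap_s * sampling_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reaches the end of the audio
        start += step
    return bounds


# 70 seconds of 16 kHz audio split into 30-second windows with 5 seconds of overlap.
windows = chunk_bounds(70 * 16_000)
for start, end in windows:
    print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

Each window can be transcribed independently, which is what makes batching possible, while the overlap gives a merging step some context to reconcile words cut at the boundaries.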

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-fr-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    # chunk_length_s=30,  # for chunked decoding
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])

Advanced usage

Other ways to run the model

Besides the Hugging Face pipeline, you can transcribe with the Hugging Face low-level API, speculative decoding, OpenAI Whisper, Faster Whisper, Whisper.cpp, Candle, or MLX.

Hugging Face low-level API

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-fr-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Speculative decoding

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "openai/whisper-large-v3"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load draft model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-distil-fr-v0.2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])

OpenAI Whisper

import whisper
from datasets import load_dataset

# Load model
model_name_or_path = "./models/whisper-large-v3-distil-fr-v0.2/original_model.pt"
model = whisper.load_model(model_name_or_path)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

# Transcribe
result = model.transcribe(sample, language="fr")
print(result["text"])

Faster Whisper

from datasets import load_dataset
from faster_whisper import WhisperModel

# Load model
model_name_or_path = "./models/whisper-large-v3-distil-fr-v0.2/ctranslate2"
model = WhisperModel(model_name_or_path, device="cuda", compute_type="float16")  # Run on GPU with FP16

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
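
The `start`, `end`, and `text` fields printed above are enough to build subtitles, one of the use cases mentioned for this model. The following is a minimal sketch that formats such segments as SRT; the helper names and the sample segments are hypothetical, and with faster-whisper you would pass `(s.start, s.end, s.text)` for each segment.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Turn (start, end, text) tuples into the text of an SRT subtitle file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments standing in for the transcription output.
example = [(0.0, 2.5, " Bonjour."), (2.5, 4.0, " Merci.")]
srt_text = segments_to_srt(example)
print(srt_text)
```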

Whisper.cpp

./main -m ./models/whisper-large-v3-distil-fr-v0.2/ggml-model-q5_0.bin -l fr -f /path/to/audio/file --print-colors

Candle

cargo run --example whisper --release -- --model large-v3 --model-id bofenghuang/whisper-large-v3-distil-fr-v0.2 --language fr --input /path/to/audio/file

MLX

import whisper

result = whisper.transcribe("/path/to/audio/file", path_or_hf_repo="mlx_models/whisper-large-v3-distil-fr-v0.2", language="fr")
print(result["text"])

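📚 Detailed documentation

Evaluation

The model was evaluated on both short-form and long-form transcription, using in-distribution (ID) and out-of-distribution (OOD) datasets to assess accuracy, generalization, and robustness.

Note that the word error rate (WER) results shown here are computed after normalization, which includes lowercasing the text and removing symbols and punctuation.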
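
For readers unfamiliar with normalized WER, here is a minimal sketch of the idea in plain Python. The normalization is a simplification (lowercase, then replace every Unicode punctuation or symbol character with a space) and is not necessarily the exact normalizer behind the reported numbers; the function names are chosen for this example.

```python
import unicodedata
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and symbols, and split into words."""
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text.lower()
    )
    return cleaned.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Case and punctuation differences do not count as errors after normalization.
print(wer("Bonjour, tout le monde !", "bonjour tout le monde"))
# One substituted word out of five.
print(wer("Il fait beau ce matin.", "il fait chaud ce matin"))
```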

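Evaluation results on all public datasets can be found here.

Short-form transcription

[Figure: eval-short-form]

Italics denote in-distribution (ID) evaluation, where the test set matches the data distribution seen during training and typically yields better results than out-of-distribution (OOD) evaluation. ~~Italics with strikethrough~~ denotes possible test-set contamination: for example, when training and evaluation use different versions of Common Voice, the data may overlap.

Because OOD and long-form French test sets are scarce, the model was also evaluated on an internal test set from Zaion Lab, consisting of human-annotated call-center conversations with substantial background noise and domain-specific terminology.

Long-form transcription

Long-form transcription was evaluated with the 🤗 Hugging Face pipeline, using both chunked decoding (chunk_length_s=30) and the original sequential decoding.

[Figure: eval-long-form]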

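Training details

A dataset of over 22,000 hours of labeled and semi-labeled French speech was assembled. After decoding it with Whisper-Large-V3 and filtering out segments with a WER above 20%, about 10,000 hours of high-quality audio remained.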

Dataset	Total duration (h)	Duration after filtering (h), WER < 20%
mcv	800.37	687.02
mls	1076.58	1043.87
voxpopuli	199.03	177.11
mtedx	170.31	147.48
african_accented_french	7.69	7.69
yodas-fr000	2395.82	1502.82
yodas-fr100	4978.16	1887.36
yodas-fr101	4966.07	1882.39
yodas-fr102	4992.84	1877.40
yodas-fr103	3161.39	1189.32
Total	22748.26	10402.46
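
To make the effect of the 20% WER filter easier to read, the sketch below recomputes the totals and the share of audio retained per dataset from the figures in the table. Only the numbers come from the table; the variable names are chosen for this example.

```python
# (total hours, hours kept after filtering), copied from the table above.
hours = {
    "mcv": (800.37, 687.02),
    "mls": (1076.58, 1043.87),
    "voxpopuli": (199.03, 177.11),
    "mtedx": (170.31, 147.48),
    "african_accented_french": (7.69, 7.69),
    "yodas-fr000": (2395.82, 1502.82),
    "yodas-fr100": (4978.16, 1887.36),
    "yodas-fr101": (4966.07, 1882.39),
    "yodas-fr102": (4992.84, 1877.40),
    "yodas-fr103": (3161.39, 1189.32),
}

total_hours = sum(total for total, _ in hours.values())
kept_hours = sum(kept for _, kept in hours.values())
overall_retention = kept_hours / total_hours

# Share of each dataset that survives the filter.
for name, (total, kept) in hours.items():
    print(f"{name:<26}{kept / total:6.1%}")
print(f"{'overall':<26}{overall_retention:6.1%}")
```

Overall, a little under half of the audio is retained. The curated corpora (mcv, mls, voxpopuli, mtedx) keep most of their hours, while the yodas subsets lose well over a third.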