whisper-large-v3-french: an open-source French speech recognition model that predicts casing, punctuation, and numbers

Whisper Large V3 French

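Developed by bofenghuang

A French automatic speech recognition model fine-tuned from OpenAI Whisper-large-v3, with support for predicting casing, punctuation, and numbers

Speech recognition

Transformers

French | License: MIT | #FrenchSpeechRecognition #MultiScenario #LowWER

Downloads: 771

Released: 11/27/2023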

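Model overview

A French-optimized automatic speech recognition system that performs well across several French datasets and supports long-form transcription and fast inference.

Model features

Multiple formats

Converted to several formats for use with libraries such as transformers, openai-whisper, and faster-whisper.

Efficient long-form processing

Long audio can be split into chunks and transcribed in parallel, with inference reported as about 9x faster than sequential processing.

Speculative decoding

A distilled model can serve as the draft model for speculative decoding, giving a reported 2x speedup while keeping the output identical.

Broad dataset coverage

Performs well on several French datasets, including Common Voice, Multilingual LibriSpeech, and VoxPopuli.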

Model capabilities

French speech recognition

Long-form audio transcription

Punctuation prediction

Casing prediction

Number formatting

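Use cases

Speech to text

Meeting notes

Automatically transcribe recordings of French meetings into text

Accuracy above 90%

Media subtitle generation

Automatically generate subtitles for French video content

Supports a range of French accents

Speech analytics

Call center speech analytics

Analyze the content of customer service calls

Maintains good performance in noisy environments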

🚀 Whisper-Large-V3-French

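Whisper-Large-V3-French is fine-tuned from openai/whisper-large-v3 to further improve its performance on French. The model has been trained to predict casing, punctuation, and numbers. While this may cost some performance, we believe it makes the model more broadly useful.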

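🚀 Quick start

Whisper-Large-V3-French can be used for French speech recognition. It has been converted to several formats for use across different libraries, including transformers, openai-whisper, faster-whisper, whisper.cpp, candle, and mlx.

✨ Key features

Fine-tuned from openai/whisper-large-v3 for better performance on French.
Predicts casing, punctuation, and numbers.
Available in multiple formats for use with different libraries.

📦 Installation

Install the library that matches your use case: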

OpenAI Whisper

pip install -U openai-whisper

Faster Whisper

pip install faster-whisper

Whisper.cpp

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make

Candle

git clone https://github.com/huggingface/candle.git
cd candle/candle-examples/examples/whisper

MLX

git clone https://github.com/ml-explore/mlx-examples.git
cd mlx-examples/whisper
pip install -r requirements.txt

💻 Usage examples

Basic usage

Hugging Face Pipeline

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    # chunk_length_s=30,  # for long-form transcription
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])

Hugging Face Low-level APIs

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Advanced usage

Speculative Decoding

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load draft model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])
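
To illustrate why speculative decoding leaves the transcription unchanged, here is a toy sketch of the greedy accept/reject loop, independent of Transformers. The two "models" are hypothetical stand-in functions that map a token prefix to the next token; `greedy_decode`, `speculative_greedy`, `toy_target`, and `toy_draft` are illustrative names, not library APIs. A real implementation scores all drafted positions in a single forward pass of the main model, which is where the speedup comes from; this sketch calls the target once per position for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def greedy_decode(target_next: NextToken, prompt: List[int], max_new: int, eos: int) -> List[int]:
    """Plain greedy decoding with the target (main) model only."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        tok = target_next(out)
        out.append(tok)
        if tok == eos:
            break
    return out


def speculative_greedy(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new: int,
    eos: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap draft model proposes up to k tokens.
        draft: List[int] = []
        for _ in range(k):
            tok = draft_next(out + draft)
            draft.append(tok)
            if tok == eos:
                break
        # 2) The target checks each proposal in order. Its own token is always
        #    the one appended, so the result matches plain greedy decoding.
        for tok in draft:
            expected = target_next(out)
            out.append(expected)
            if expected == eos or len(out) >= limit:
                return out
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return out


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (sum(prefix) * 7 + 3) % 11


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical stand-in for the distilled model: wrong at every third position."""
    tok = toy_target(prefix)
    return tok if len(prefix) % 3 else (tok + 1) % 11


if __name__ == "__main__":
    plain = greedy_decode(toy_target, [1, 2], max_new=20, eos=0)
    fast = speculative_greedy(toy_target, toy_draft, [1, 2], max_new=20, eos=0)
    print(plain == fast)
```

The more often the draft agrees with the target, the more tokens are accepted per verification step; disagreements only cost speed, never accuracy.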

OpenAI Whisper

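import whisper
from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Download the checkpoint (assumes the Hub repo publishes original_model.pt,
# as the local path below implies)
model_dir = "./models/whisper-large-v3-french"
hf_hub_download(
    repo_id="bofenghuang/whisper-large-v3-french",
    filename="original_model.pt",
    local_dir=model_dir,
)

# Load model
model = whisper.load_model(f"{model_dir}/original_model.pt")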

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

# Transcribe
result = model.transcribe(sample, language="fr")
print(result["text"])

Faster Whisper

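from datasets import load_dataset
from faster_whisper import WhisperModel
from huggingface_hub import snapshot_download

# Download the converted weights (assumes the Hub repo has a ctranslate2/ folder,
# as the local path below implies)
model_dir = "./models/whisper-large-v3-french"
snapshot_download(
    repo_id="bofenghuang/whisper-large-v3-french",
    local_dir=model_dir,
    allow_patterns="ctranslate2/*",
)

# Load model
model = WhisperModel(f"{model_dir}/ctranslate2", device="cuda", compute_type="float16")  # Run on GPU with FP16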

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
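
For the subtitle use case, the timestamped segments printed above can be turned into SubRip (SRT) text. The helper below is a sketch, not part of faster-whisper: `format_timestamp` and `to_srt` are hypothetical names, and the code only assumes that each segment exposes `start` and `end` in seconds plus a `text` string, as in the loop above. A small `Segment` stand-in is used so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Stand-in with the same three fields read from the transcription segments."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [Segment(0.0, 2.5, " Bonjour à tous."), Segment(2.5, 5.0, " Merci d'être là.")]
    print(to_srt(demo))
```

Passing the `segments` generator from the example above to `to_srt` consumes it, so do that instead of, not after, the printing loop.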

Whisper.cpp

./main -m ./models/whisper-large-v3-french/ggml-model-q5_0.bin -l fr -f /path/to/audio/file --print-colors

Candle

cargo run --example whisper --release -- --model large-v3 --model-id bofenghuang/whisper-large-v3-french --language fr --input /path/to/audio/file

To use CUDA, add --features cuda to the command line:

cargo run --example whisper --release --features cuda -- --model large-v3 --model-id bofenghuang/whisper-large-v3-french --language fr --input /path/to/audio/file

MLX

import whisper

result = whisper.transcribe("/path/to/audio/file", path_or_hf_repo="mlx_models/whisper-large-v3-french", language="fr")
print(result["text"])

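📚 Detailed documentation

Evaluation

We evaluated the model on both short-form and long-form transcription, and tested it on both in-distribution and out-of-distribution datasets, to assess its accuracy, generalizability, and robustness.

Note that the reported WER (word error rate) is computed after converting numbers to text, removing punctuation (except apostrophes and hyphens), and converting all characters to lowercase.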
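
A minimal sketch of this kind of normalization and scoring, in plain Python. It is not the evaluation code behind the reported numbers: `normalize` and `wer` are hypothetical helpers, the number-to-text step is left out (it would need a French number-spelling library), and the punctuation rule is approximated with a regular expression that keeps letters, digits, apostrophes, and hyphens.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation except apostrophes and hyphens."""
    text = text.lower().replace("\u2019", "'")  # typographic apostrophe -> ASCII
    text = re.sub(r"[^\w\s'-]", " ", text)      # other punctuation becomes a space
    return " ".join(text.split())               # collapse repeated whitespace


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(normalize("Bonjour, l'Assemblée est-elle prête ?"))
    print(wer("Bonjour tout le monde.", "bonjour tous le monde"))
```

Because casing and punctuation are stripped before scoring, the reported WER does not reflect how well the model predicts them.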

Evaluation results on all public datasets can be found here.

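Short-form transcription

[Figure: eval-short-form]

Because no readily available out-of-domain (OOD) and long-form French test sets exist, we evaluated on internal test sets from Zaion Lab. These consist of human-annotated audio-transcription pairs from call center conversations, notable for their heavy background noise and domain-specific terminology.

Long-form transcription

[Figure: eval-long-form]

Long-form transcription was run with the 🤗 Hugging Face pipeline for faster evaluation. Audio files were split into 30-second chunks and processed in parallel.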
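
The chunking step can be pictured with the sketch below, which computes overlapping window boundaries over a long recording. It illustrates the idea only and is not the pipeline's internal code: `chunk_bounds` is a hypothetical helper, and the 5-second overlap on each side is an assumption chosen to mirror the pipeline's documented default stride of `chunk_length_s / 6`.

```python
from typing import List, Tuple


def chunk_bounds(
    total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover a recording of total_s seconds.

    Consecutive windows advance by chunk_s - 2 * stride_s, so each one overlaps
    its neighbours and gives the model some context around every cut point.
    """
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    step = chunk_s - 2 * stride_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


if __name__ == "__main__":
    for start, end in chunk_bounds(70.0):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window can be transcribed independently, which is what allows batching. With the pipeline from the basic example this corresponds to enabling the commented-out `chunk_length_s=30` argument, optionally together with a `batch_size`.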

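Training details

We collected a composite dataset of over 2,500 hours of French speech recognition data, including Common Voice 13.0, Multilingual LibriSpeech, Voxpopuli, Fleurs, Multilingual TEDx, MediaSpeech, and African Accented French.

Because some datasets, such as MLS, only provide text without casing or punctuation, we used a customized version of 🤗 Speechbox to restore casing and punctuation from a limited set of symbols, using the bofenghuang/whisper-large-v2-cv11-french model.

However, even within these datasets we observed quality issues: mismatches between audio and transcription in language or content, poorly segmented utterances, and missing words in scripted speech, among others. We built a pipeline to filter out many of these problematic utterances and improve dataset quality. As a result, we excluded more than 10% of the data, and when we retrained the model we saw a marked reduction in hallucinations.

For training, we used the script provided in the 🤗 Transformers repository. The model was trained on the Jean-Zay supercomputer at GENCI, and we thank the IDRIS team for their responsive support throughout the project.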

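Acknowledgments

Thanks to OpenAI for creating and open-sourcing the Whisper model.
Thanks to 🤗 Hugging Face for integrating Whisper into the Transformers repository and providing the training codebase.
Thanks to Genci for generously contributing GPU hours to this project.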

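🔧 Technical details

Evaluation metric

WER (word error rate) is used to measure the model's accuracy on speech transcription.

Training script

Training used the run_speech_recognition_seq2seq.py script from the 🤗 Transformers repository.

Training environment

Training ran on the Jean-Zay supercomputer.

📄 License

This project is released under the MIT license.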

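Information table

Property	Details
Model type	Speech recognition model fine-tuned from `openai/whisper-large-v3`
Training data	Composite dataset of over 2,500 hours of French speech recognition data, including Common Voice 13.0, Multilingual LibriSpeech, Voxpopuli, Fleurs, Multilingual TEDx, MediaSpeech, and African Accented French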