🚀 Whisper-Large-V3-French
Whisper-Large-V3-French is fine-tuned on openai/whisper-large-v3 to enhance its performance in French. The model is trained to predict casing, punctuation, and numbers. Although this may come at a slight cost in raw accuracy, we believe it enables broader usage. The model has been converted into various formats to facilitate its use across different libraries, such as transformers, openai-whisper, faster-whisper, whisper.cpp, candle, mlx, etc.
🚀 Quick Start
This section provides a brief overview of how to quickly get started with the Whisper-Large-V3-French model.
✨ Features
- Fine-tuned on openai/whisper-large-v3 for better French performance.
- Predicts casing, punctuation, and numbers.
- Converted into multiple formats for use in different libraries.
📦 Installation
The installation steps vary depending on the library you want to use. Here are some common installation commands:
OpenAI Whisper
```bash
pip install -U openai-whisper
```
Faster Whisper
```bash
pip install faster-whisper
```
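Hugging Face Transformers
The 🤗 Transformers examples below also rely on datasets (and accelerate, which low_cpu_mem_usage requires). This package set is an assumption; adjust it as needed:
```bash
pip install -U transformers datasets accelerate
```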
💻 Usage Examples
Basic Usage
Hugging Face Pipeline
The model can be easily used with the 🤗 Hugging Face pipeline class for audio transcription.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Build the transcription pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    max_new_tokens=128,
)

# Run inference on an example sample
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
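The pipeline also accepts a local file path or a raw waveform in place of a dataset sample (ffmpeg is needed to decode compressed formats); the path below is a placeholder:
```python
# Transcribe a local audio file (placeholder path)
result = pipe("/path/to/audio.wav")
print(result["text"])
```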
Advanced Usage
Speculative Decoding
Speculative decoding can be achieved using a draft model, here a distilled version of Whisper. It guarantees identical outputs to using the main Whisper model alone, delivers roughly 2x faster inference, and incurs only a slight increase in memory overhead.
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the main model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the distilled draft model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Pass the draft model to the pipeline via generate_kwargs
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
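A note on the design, as we read it: speculative decoding requires the draft model to share the main model's tokenizer, which the distilled French model does. Loading it with AutoModelForCausalLM runs only its lightweight decoder (two layers, per the dec2 suffix), while the encoder pass is shared with the main model, which is what keeps the extra memory cost small.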
📚 Documentation
Performance
We evaluated our model on short and long-form transcriptions, and tested it on both in-distribution and out-of-distribution datasets to comprehensively assess its accuracy, generalizability, and robustness.
The reported WER is the result after converting numbers to text, removing punctuation (except for apostrophes and hyphens), and converting all characters to lowercase.
All evaluation results on public datasets can be found here.
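For illustration, that normalization can be sketched as follows (a minimal sketch, not the exact evaluation script; number-to-text conversion is language-specific and omitted here):
```python
import re

def normalize_for_wer(text: str) -> str:
    # Numbers are assumed to already be spelled out as words upstream.
    # Drop punctuation except apostrophes and hyphens.
    text = re.sub(r"[^\w\s'’-]", "", text)
    # Collapse whitespace and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize_for_wer("Très bien, merci !"))  # "très bien merci"
```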
Short-Form Transcription

Due to the lack of readily available out-of-domain (OOD) and long-form test sets in French, we evaluated using internal test sets from Zaion Lab. These sets consist of human-annotated audio-transcription pairs from call center conversations, which have significant background noise and domain-specific terminology.
Long-Form Transcription

The long-form transcription was run using the 🤗 Hugging Face pipeline for quicker evaluation. Audio files were segmented into 30-second chunks and processed in parallel.
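As a rough sketch of that setup, the pipeline can be rebuilt with chunking enabled, reusing the model and processor loaded in the Quick Start example (the batch_size value is an assumption; tune it to your GPU memory):
```python
# Rebuild the pipeline with chunked long-form transcription enabled:
# audio is cut into 30-second windows that are batched and decoded in parallel.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,  # 30-second segments, as described above
    batch_size=16,      # assumed value; adjust to available memory
    max_new_tokens=128,
)
result = pipe(sample)
print(result["text"])
```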
Usage
Hugging Face Low-level APIs
You can use the 🤗 Hugging Face low-level APIs for transcription, which offer more control over the process.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract log-mel features from the raw waveform
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token ids and decode them to text
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
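If you want to pin the language and task instead of relying on detection, recent versions of transformers let you pass them directly to generate (treat the exact kwargs as version-dependent):
```python
# Force French transcription instead of relying on language detection.
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device),
    language="fr",
    task="transcribe",
    max_new_tokens=128,
)
```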
OpenAI Whisper
You can use the sequential long-form decoding algorithm with a sliding window and temperature fallback, as described in OpenAI's original paper.
```python
import whisper
from datasets import load_dataset

model = whisper.load_model("./models/whisper-large-v3-french/original_model.pt")

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

result = model.transcribe(sample, language="fr")
print(result["text"])
```
Faster Whisper
Faster Whisper is a reimplementation of OpenAI's Whisper models and the sequential long-form decoding algorithm in the CTranslate2 format.
```python
from datasets import load_dataset
from faster_whisper import WhisperModel

model = WhisperModel(
    "./models/whisper-large-v3-french/ctranslate2", device="cuda", compute_type="float16"
)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
Whisper.cpp
Whisper.cpp is a reimplementation of OpenAI's Whisper models in plain C/C++.
```bash
# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make

# Download the quantized model from the Hugging Face Hub
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='bofenghuang/whisper-large-v3-french', filename='ggml-model-q5_0.bin', local_dir='./models/whisper-large-v3-french')"

# Transcribe an audio file
./main -m ./models/whisper-large-v3-french/ggml-model-q5_0.bin -l fr -f /path/to/audio/file --print-colors
```
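Note that whisper.cpp expects 16 kHz 16-bit WAV input, so other formats generally need converting first, for example with ffmpeg (the file names are placeholders):
```bash
# Convert any input to 16 kHz mono 16-bit PCM WAV for whisper.cpp
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```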
Training Details
The model is fine-tuned on openai/whisper-large-v3 using the following datasets:
| Property | Details |
|----------|---------|
| Model Type | Whisper-Large-V3-French |
| Training Data | mozilla-foundation/common_voice_13_0, facebook/multilingual_librispeech, facebook/voxpopuli, google/fleurs, gigant/african_accented_french |
Acknowledgements
We would like to thank all the contributors and the open-source community for their support.
📄 License
This project is licensed under the MIT License.