Fine-tuned whisper-small model for ASR in French
This is a fine-tuned version of the openai/whisper-small model, trained on the French dataset of mozilla-foundation/common_voice_11_0. It can predict casing and punctuation, and requires speech input to be sampled at 16 kHz.

Quick Start
This model is a fine-tuned version of openai/whisper-small, trained on the mozilla-foundation/common_voice_11_0 French dataset. Make sure your speech input is sampled at 16 kHz before passing it to the model. The output includes casing and punctuation.
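The 16 kHz requirement can be checked before transcription. The following is a minimal sketch, not part of the model's own tooling: the helper name `check_sample_rate` is hypothetical, and the audio dictionary layout (`"array"` and `"sampling_rate"` keys) is an assumption modeled on what the `datasets` audio feature returns in the usage examples.

```python
EXPECTED_SAMPLE_RATE = 16_000  # Hz, the rate stated in this model card


def check_sample_rate(audio: dict) -> dict:
    """Return `audio` unchanged if it is sampled at 16 kHz, else raise ValueError.

    `audio` is assumed to look like {"array": [...], "sampling_rate": int}.
    """
    rate = audio["sampling_rate"]
    if rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {rate} Hz; "
            "resample before transcribing"
        )
    return audio
```

A guard like this only detects a mismatch. The actual resampling is better left to a dedicated library such as `torchaudio`, as in the Advanced Usage example.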
Features
- Accurate ASR: Trained on the mozilla-foundation/common_voice_11_0 French dataset, it provides high-quality automatic speech recognition for French.
- Predict Casing and Punctuation: The model can predict casing and punctuation, which is very useful for practical applications.
Documentation
Performance
Below are the WERs of the pre-trained models on [Common Voice 9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), Multilingual LibriSpeech, VoxPopuli, and FLEURS. These results are reported in the original paper.
Below are the WERs of the fine-tuned models on [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), Multilingual LibriSpeech, VoxPopuli, and FLEURS. Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as WER (greedy search) / WER (beam search with beam width 5).
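To make the evaluation setup concrete, here is a rough sketch of text normalization followed by a word-level WER computation. It is not the author's evaluation script. The function names `normalize` and `wer`, the exact letter set in `FRENCH_LETTERS`, and the lowercasing step are all assumptions made for illustration.

```python
import unicodedata

# Assumed French alphabet, including common accented letters and ligatures.
FRENCH_LETTERS = set(
    unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ")
)


def normalize(text: str) -> str:
    """Lowercase, keep French letters and apostrophes, collapse whitespace."""
    text = unicodedata.normalize("NFC", text.lower())
    text = text.replace("\u2019", "'")  # unify typographic and ASCII apostrophes
    # Replace every other character (punctuation, digits, symbols) with a space.
    kept = (ch if ch in FRENCH_LETTERS or ch == "'" else " " for ch in text)
    return " ".join("".join(kept).split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

Because punctuation and casing are stripped before scoring, a hypothesis such as `"bonjour, le monde."` scores the same as `"Bonjour le monde"` under this sketch. The model's casing and punctuation predictions therefore do not affect WER computed this way.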
Usage Examples
Basic Usage
import torch
from datasets import load_dataset
from transformers import pipeline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-small-cv11-french", device=device)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]
Advanced Usage
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-small-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-small-cv11-french", language="french", task="transcribe")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")
model_sample_rate = processor.feature_extractor.sampling_rate
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
License
This model is licensed under the Apache-2.0 license.
Model Index
| Property | Details |
|---|---|
| Model Name | Fine-tuned whisper-small model for ASR in French |
| Task | Automatic Speech Recognition |
| Datasets | mozilla-foundation/common_voice_11_0, facebook/multilingual_librispeech, facebook/voxpopuli, google/fleurs, gigant/african_accented_french |
| Metrics | WER |
| Results | See the Performance section above for detailed results on different datasets. |