Fine-tuned whisper-large-v2 model for ASR in French
This model is a fine-tuned version of openai/whisper-large-v2 for automatic speech recognition (ASR) in French. It was trained on a large composite dataset of more than 2,200 hours of French speech audio.
Quick Start
This fine-tuned whisper-large-v2 model is ready to use for automatic speech recognition in French. Make sure the input audio is sampled at 16 kHz. Note that the model does not predict casing or punctuation.
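The 16 kHz requirement can be checked, and if necessary enforced, before audio reaches the model. The sketch below is a minimal illustration and is not part of the original model card: the helper name to_16k is hypothetical, and it resamples by plain linear interpolation with no anti-aliasing filter, which is cruder than the filtered resampling offered by libraries such as torchaudio.

```python
TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def to_16k(samples, sample_rate, target_rate=TARGET_RATE):
    """Return `samples` resampled to `target_rate` by linear interpolation.

    Hypothetical helper for illustration only: no anti-aliasing filter is applied.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if sample_rate == target_rate or len(samples) < 2:
        return list(samples)

    # Number of output samples covering the same duration as the input.
    n_out = max(1, round(len(samples) * target_rate / sample_rate))
    step = sample_rate / target_rate  # input samples advanced per output sample

    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

Audio that is already at 16 kHz is returned unchanged, so the helper can be called unconditionally on every input.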
Features
- Fine-tuned: Based on the powerful openai/whisper-large-v2 model, fine-tuned on over 2,200 hours of French speech audio.
- Multidataset training: Trained on a composite dataset including Common Voice 11.0, Multilingual LibriSpeech, and others.
- Low WER: Achieves low Word Error Rates (WER) on multiple French speech datasets.
Installation
No specific installation steps are provided. The usage examples below import the torch, transformers, datasets, and torchaudio packages.
Usage Examples
Basic Usage
Inference with 🤗 Pipeline
import torch
from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-large-v2-french", device=device)

# Force French transcription during generation
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Stream one example from the Common Voice 11.0 French test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Transcribe
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]
Advanced Usage
Inference with 🤗 low-level APIs
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the model and its processor
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-large-v2-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-large-v2-french", language="french", task="transcribe")

# Force French transcription during generation
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16 kHz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Stream one example from the Common Voice 11.0 French test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample only when the audio does not already match the model's rate
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate token IDs and decode them to text
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Documentation
Performance
Pre-trained models' WER
Below are the WERs of the pre-trained models on Common Voice 9.0, Multilingual LibriSpeech, VoxPopuli, and FLEURS. These results are reported in the original paper.
Fine-tuned models' WER
Below are the WERs of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, and FLEURS. Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as WER (greedy search) / WER (beam search with beam width 5).
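To make the evaluation setup concrete, the sketch below normalizes a transcript (keeping French letters and apostrophes, dropping other punctuation) and computes WER as the word-level edit distance divided by the number of reference words. It is an illustration under stated assumptions, not the author's evaluation script: the function names, the exact character set, the lowercasing step, and the choice to replace removed characters with spaces are all assumptions.

```python
import re

# Assumed character set: lowercase French letters (with common diacritics and
# ligatures), apostrophe, and space. The original filter may differ.
_NOT_ALLOWED = re.compile(r"[^a-zàâäçéèêëîïôöùûüÿœæ' ]")


def normalize(text):
    """Lowercase, unify apostrophes, and replace disallowed characters with spaces."""
    text = text.lower().replace("’", "'")
    text = _NOT_ALLOWED.sub(" ", text)
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")

    # Edit distance computed row by row; `prev` holds the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because both sides pass through the same normalization, a cased and punctuated reference can be compared directly with the model's uncased, unpunctuated output.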
License
This model is licensed under the apache-2.0 license.
