whisper-small-cv11-french Open-source Model - Support for French Speech Recognition and Punctuation Prediction

Whisper Small Cv11 French

Developed by bofenghuang

A French automatic speech recognition model fine-tuned from openai/whisper-small on the Common Voice 11.0 French dataset. It predicts casing and punctuation.

Speech Recognition

Transformers

French
Open Source License: Apache-2.0
#French Speech Recognition #Multi-dialect Support #Low WER

Downloads 266

Release Time: 1/5/2023

Model Overview

This model is a version of Whisper-small fine-tuned for French speech recognition. It is evaluated on Common Voice 11.0, MLS, VoxPopuli, and Fleurs, and is intended for French speech-to-text tasks.

Model Features

French Optimization

Specially fine-tuned for French speech recognition, outperforming the original Whisper-small model on French datasets.

Punctuation Prediction

Predicts casing and punctuation, producing formatted text.

Multi-dataset Support

Performs well on multiple French speech datasets including Common Voice, MLS, and VoxPopuli.

Model Capabilities

French Speech Recognition

Speech-to-Text

Punctuation Prediction

Use Cases

Speech Transcription

French Meeting Minutes

Automatically transcribe French meeting recordings into text records

WER (Word Error Rate): 8.91-14.45, depending on the evaluation dataset and decoding strategy

French Subtitle Generation

Automatically generate subtitles for French video content

Voice Assistants

French Voice Command Recognition

Used for voice command recognition in French voice assistants

🚀 Fine-tuned whisper-small model for ASR in French

This is a fine-tuned version of the openai/whisper-small model, trained on the French subset of mozilla-foundation/common_voice_11_0. It predicts casing and punctuation, and requires speech input sampled at 16 kHz.

🚀 Quick Start

This model is a fine-tuned version of openai/whisper-small, trained on the French subset of mozilla-foundation/common_voice_11_0. When using the model, make sure your speech input is sampled at 16 kHz. The model also predicts casing and punctuation.
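As a quick sanity check for the 16 kHz requirement, the sketch below reads the sample rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical, and the check only covers uncompressed WAV files; other formats would need an audio library.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model description asks for


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


# Demo: write one second of 48 kHz mono silence, then inspect it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_48k.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 16-bit PCM
        out.setframerate(48_000)
        out.writeframes(b"\x00\x00" * 48_000)
    print(wav_sample_rate(demo_path), needs_resampling(demo_path))
```

A file flagged by `needs_resampling` should be resampled to 16 kHz before it is passed to the model.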

✨ Features

Accurate ASR: Fine-tuned on the mozilla-foundation/common_voice_11_0 French dataset for automatic speech recognition in French.
Predict Casing and Punctuation: The model outputs cased, punctuated text, which is useful when transcripts are meant to be read directly.

📚 Documentation

Performance

Below are the WERs of the pre-trained models on [Common Voice 9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), Multilingual LibriSpeech (MLS), VoxPopuli, and Fleurs. These results are reported in the original paper.

| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
| --- | --- | --- | --- | --- |
| openai/whisper-small | 22.7 | 16.2 | 15.7 | 15.0 |
| openai/whisper-medium | 16.0 | 8.9 | 12.2 | 8.7 |
| openai/whisper-large | 14.7 | 8.9 | 11.0 | 7.7 |
| openai/whisper-large-v2 | 13.9 | 7.3 | 11.4 | 8.3 |
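For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation script behind these numbers; the function name `word_error_rate` is hypothetical and the tokenisation is a plain whitespace split.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (chat -> chien) and one deletion (le) over 6 reference words.
wer = word_error_rate("le chat dort sur le canapé", "le chien dort sur canapé")
print(f"{100 * wer:.2f}")  # prints 33.33
```

The tables report this ratio multiplied by 100.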

Below are the WERs of the fine-tuned models on [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), Multilingual LibriSpeech (MLS), VoxPopuli, and Fleurs. Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as WER (greedy search) / WER (beam search with beam width 5).

| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
| --- | --- | --- | --- | --- |
| bofenghuang/whisper-small-cv11-french | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| bofenghuang/whisper-medium-cv11-french | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| bofenghuang/whisper-medium-french | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| bofenghuang/whisper-large-v2-cv11-french | 8.05 / 7.67 | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| bofenghuang/whisper-large-v2-french | 8.15 / 7.83 | 4.20 / 4.03 | 9.10 / 8.66 | 5.22 / 4.98 |
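To get a feel for how much beam search helps over greedy decoding, the short calculation below computes the relative WER reduction for the whisper-small-cv11-french row. The numbers are copied from that row; the helper `relative_reduction` is a hypothetical name used only for this illustration.

```python
# (greedy WER, beam-5 WER) pairs copied from the whisper-small-cv11-french row.
small_cv11 = {
    "Common Voice 11.0": (11.76, 10.99),
    "MLS": (9.65, 8.91),
    "VoxPopuli": (14.45, 13.66),
    "Fleurs": (10.76, 9.83),
}


def relative_reduction(greedy: float, beam: float) -> float:
    """Relative WER reduction, in percent, when moving from greedy to beam search."""
    return 100 * (greedy - beam) / greedy


reductions = {
    dataset: relative_reduction(greedy, beam)
    for dataset, (greedy, beam) in small_cv11.items()
}

for dataset, value in reductions.items():
    print(f"{dataset}: {value:.1f}% relative WER reduction")
```

On these four datasets the relative gain stays in the single digits, which may or may not justify the extra decoding cost of beam search for a given application.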

💻 Usage Examples

Basic Usage

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-small-cv11-french", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
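
The normalisation step is left open in the snippet above. One possible sketch, loosely mirroring the evaluation preprocessing described in the Performance section (letters only, punctuation removed except apostrophes), is shown below. The function name `normalize_french`, the lowercasing, and the choice to turn hyphens into spaces are assumptions made for illustration, not the author's actual preprocessing.

```python
import re
import unicodedata


def normalize_french(text: str) -> str:
    """Lowercase and keep only letters, apostrophes, and single spaces.

    Assumptions (not taken from the model card): lowercasing, mapping the
    typographic apostrophe to "'", and replacing hyphens with spaces.
    """
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace("\u2019", "'")  # typographic apostrophe -> ASCII
    # Replace punctuation, digits, and underscores with spaces;
    # accented letters are matched by \w and therefore kept.
    text = re.sub(r"[^\w\s']|[\d_]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_french("Bonjour, comment ça va ?"))  # prints: bonjour comment ça va
```

Whether to apply such a step depends on the use case: for WER scoring against normalised references it keeps the comparison fair, while for subtitles or meeting minutes the model's cased, punctuated output is usually what you want to keep.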

Advanced Usage

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-small-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-small-cv11-french", language="french", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
# Cast to float32 for resampling and feature extraction
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Get feat
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary

📄 License

This model is licensed under the Apache-2.0 license.

Model Index

| Property | Details |
| --- | --- |
| Model Name | Fine-tuned whisper-small model for ASR in French |
| Task | Automatic Speech Recognition |
| Datasets | mozilla-foundation/common_voice_11_0, facebook/multilingual_librispeech, facebook/voxpopuli, google/fleurs, gigant/african_accented_french |
| Metrics | WER |
| Results | See the performance section above for detailed results on different datasets. |
