🚀 Whisper-Large-V3-French
Whisper-Large-V3-French is fine-tuned on openai/whisper-large-v3 to enhance its performance in French. The model is trained to predict casing, punctuation, and numbers. Although this may come at a slight cost in raw accuracy, we believe it enables broader usage. The model has been converted into various formats to facilitate its use across different libraries, such as transformers, openai-whisper, faster-whisper, whisper.cpp, candle, mlx, etc.
🚀 Quick Start
This section provides a brief overview of how to quickly get started with the Whisper-Large-V3-French model.
✨ Features
- Fine-tuned on openai/whisper-large-v3 for better French performance.
- Predicts casing, punctuation, and numbers.
- Converted into multiple formats for use in different libraries.
📦 Installation
The installation steps vary depending on the library you want to use. Here are some common installation commands:
OpenAI Whisper
```bash
pip install -U openai-whisper
```
Faster Whisper
```bash
pip install faster-whisper
```
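Hugging Face Transformers
The 🤗 Transformers examples below also rely on datasets (and accelerate, which low_cpu_mem_usage requires). This package set is an assumption; adjust it as needed:
```bash
pip install -U transformers datasets accelerate
```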
💻 Usage Examples
Basic Usage
Hugging Face Pipeline
The model can be easily used with the 🤗 Hugging Face pipeline class for audio transcription.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Build the transcription pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    max_new_tokens=128,
)

# Run inference on an example sample
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
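The pipeline also accepts a local file path or a raw waveform in place of a dataset sample (ffmpeg is needed to decode compressed formats); the path below is a placeholder:
```python
# Transcribe a local audio file (placeholder path)
result = pipe("/path/to/audio.wav")
print(result["text"])
```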
Advanced Usage
Speculative Decoding
Speculative decoding can be achieved using a draft model, here a distilled version of Whisper. It guarantees identical outputs to using the main Whisper model alone, delivers roughly 2x faster inference, and incurs only a slight increase in memory overhead.
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the main model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the distilled draft model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Pass the draft model to the pipeline via generate_kwargs
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
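A note on the design, as we read it: speculative decoding requires the draft model to share the main model's tokenizer, which the distilled French model does. Loading it with AutoModelForCausalLM runs only its lightweight decoder (two layers, per the dec2 suffix), while the encoder pass is shared with the main model, which is what keeps the extra memory cost small.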
📚 Documentation
Performance
We evaluated our model on short and long-form transcriptions, and tested it on both in-distribution and out-of-distribution datasets to comprehensively assess its accuracy, generalizability, and robustness.
The reported WER is the result after converting numbers to text, removing punctuation (except for apostrophes and hyphens), and converting all characters to lowercase.
All evaluation results on public datasets can be found here.
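For illustration, that normalization can be sketched as follows (a minimal sketch, not the exact evaluation script; number-to-text conversion is language-specific and omitted here):
```python
import re

def normalize_for_wer(text: str) -> str:
    # Numbers are assumed to already be spelled out as words upstream.
    # Drop punctuation except apostrophes and hyphens.
    text = re.sub(r"[^\w\s'’-]", "", text)
    # Collapse whitespace and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize_for_wer("Très bien, merci !"))  # "très bien merci"
```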
Short-Form Transcription

Due to the lack of readily available out-of-domain (OOD) and long-form test sets in French, we evaluated using internal test sets from Zaion Lab. These sets consist of human-annotated audio-transcription pairs from call center conversations, which have significant background noise and domain-specific terminology.
Long-Form Transcription

The long-form transcription was run using the 🤗 Hugging Face pipeline for quicker evaluation. Audio files were segmented into 30-second chunks and processed in parallel.
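As a rough sketch of that setup, the pipeline can be rebuilt with chunking enabled, reusing the model and processor loaded in the Quick Start example (the batch_size value is an assumption; tune it to your GPU memory):
```python
# Rebuild the pipeline with chunked long-form transcription enabled:
# audio is cut into 30-second windows that are batched and decoded in parallel.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,  # 30-second segments, as described above
    batch_size=16,      # assumed value; adjust to available memory
    max_new_tokens=128,
)
result = pipe(sample)
print(result["text"])
```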
Usage
Hugging Face Low-level APIs
You can use the 🤗 Hugging Face low-level APIs for transcription, which offer more control over the process.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract log-mel features from the raw waveform
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token ids and decode them to text
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
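If you want to pin the language and task instead of relying on detection, recent versions of transformers let you pass them directly to generate (treat the exact kwargs as version-dependent):
```python
# Force French transcription instead of relying on language detection.
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device),
    language="fr",
    task="transcribe",
    max_new_tokens=128,
)
```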
OpenAI Whisper
You can use the sequential long-form decoding algorithm with a sliding window and temperature fallback, as described in OpenAI's original paper.
```python
import whisper
from datasets import load_dataset

model = whisper.load_model("./models/whisper-large-v3-french/original_model.pt")

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

result = model.transcribe(sample, language="fr")
print(result["text"])
```
Faster Whisper
Faster Whisper is a reimplementation of OpenAI's Whisper models and the sequential long-form decoding algorithm in the CTranslate2 format.
```python
from datasets import load_dataset
from faster_whisper import WhisperModel

model = WhisperModel(
    "./models/whisper-large-v3-french/ctranslate2", device="cuda", compute_type="float16"
)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
Whisper.cpp
Whisper.cpp is a reimplementation of OpenAI's Whisper models in plain C/C++.
```bash
# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
make

# Download the quantized model from the Hugging Face Hub
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='bofenghuang/whisper-large-v3-french', filename='ggml-model-q5_0.bin', local_dir='./models/whisper-large-v3-french')"

# Transcribe an audio file
./main -m ./models/whisper-large-v3-french/ggml-model-q5_0.bin -l fr -f /path/to/audio/file --print-colors
```
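Note that whisper.cpp expects 16 kHz 16-bit WAV input, so other formats generally need converting first, for example with ffmpeg (the file names are placeholders):
```bash
# Convert any input to 16 kHz mono 16-bit PCM WAV for whisper.cpp
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```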
Training Details
The model is fine-tuned on openai/whisper-large-v3 using the following datasets:
| Property | Details |
|----------|---------|
| Model Type | Whisper-Large-V3-French |
| Training Data | mozilla-foundation/common_voice_13_0, facebook/multilingual_librispeech, facebook/voxpopuli, google/fleurs, gigant/african_accented_french |
Acknowledgements
We would like to thank all the contributors and the open-source community for their support.
📄 License
This project is licensed under the MIT License.