🚀 Whisper-Large-V3-French-Distil-Dec8
Whisper-Large-V3-French-Distil is a series of distilled versions of Whisper-Large-V3-French, obtained by reducing the number of decoder layers from 32 to 16, 8, 4, or 2 and distilling on large-scale datasets, as described in this paper.

The distilled variants reduce memory usage and inference time while maintaining performance (depending on the number of retained layers) and lowering the risk of hallucination, particularly in long-form transcription. Moreover, they can be combined with the original Whisper-Large-V3-French model for speculative decoding, improving inference speed while keeping outputs consistent with the standalone model.

This model has been converted into various formats for use in different libraries, such as transformers, openai-whisper, faster-whisper, whisper.cpp, candle, mlx, etc.
🚀 Quick Start
The model can be used in several ways; the sections below walk through the different usage options.
⨠Features
- Distilled Design: Reduces decoder layers and uses large-scale dataset distillation to reduce memory usage and inference time.
- Performance Maintenance: Maintains performance and reduces the risk of hallucinations, especially in long-form transcriptions.
- Speculative Decoding: Can be combined with the original model for speculative decoding, improving inference speed and output consistency.
- Multi-format Support: Converted into various formats for use in different libraries.
📦 Installation
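The Hugging Face examples below additionally require the transformers, datasets, and accelerate packages (accelerate is needed for low_cpu_mem_usage=True):

```bash
pip install -U transformers datasets accelerate
```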
OpenAI Whisper
```bash
pip install -U openai-whisper
```
Faster Whisper
```bash
pip install faster-whisper
```
Whisper.cpp
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
```
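After cloning, build the project and download a converted model file into the models directory. The commands below are a sketch: follow the whisper.cpp README for the current build instructions, and ggml-model.bin is a placeholder for whichever converted/quantized GGML file is published in the model repository.

```bash
# Build whisper.cpp (see its README for current build options)
make

# Placeholder file name: substitute the actual converted GGML weight file
wget https://huggingface.co/bofenghuang/whisper-large-v3-french-distil-dec8/resolve/main/ggml-model.bin -P ./models/whisper-large-v3-french-distil-dec8
```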
💻 Usage Examples
Hugging Face Pipeline
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the distilled model and its processor
model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec8"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Build the ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    max_new_tokens=128,
)

# Transcribe an example sample
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
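Besides dataset samples, the pipeline also accepts a local file path or a raw waveform (a quick sketch; audio.wav is a placeholder for your own recording):

```python
# "audio.wav" is a placeholder path for a local recording
result = pipe("audio.wav")
print(result["text"])
```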
Hugging Face Low-level APIs
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the distilled model and its processor
model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec8"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Extract log-mel features and generate token IDs
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)

# Decode token IDs back to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
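Generation can also be steered explicitly through generate kwargs; for example, transformers allows pinning the language and task (a sketch reusing the objects above, assuming the model's generation config retains Whisper's language tokens):

```python
# Force French transcription instead of relying on automatic language detection
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device),
    language="fr",
    task="transcribe",
    max_new_tokens=128,
)
```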
Speculative Decoding
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the original model as the main (verifier) model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the 2-layer distilled model as the draft (assistant) model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
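Speculative decoding also works with the low-level API by passing the assistant directly to generate (a sketch, assuming input_features prepared as in the low-level example above; assisted generation requires num_beams=1):

```python
# Draft tokens come from the distilled decoder and are verified by the full model,
# so the output matches what the full model would produce on its own
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device),
    assistant_model=assistant_model,
    max_new_tokens=128,
)
```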
OpenAI Whisper
```python
import whisper
from datasets import load_dataset

# Load the model converted to the openai-whisper checkpoint format
model = whisper.load_model("./models/whisper-large-v3-french-distil-dec8/original_model.pt")

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

result = model.transcribe(sample, language="fr")
print(result["text"])
```
Faster Whisper
```python
from datasets import load_dataset
from faster_whisper import WhisperModel

# Load the model converted to the CTranslate2 format
model = WhisperModel(
    "./models/whisper-large-v3-french-distil-dec8/ctranslate2",
    device="cuda",
    compute_type="float16",
)

dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
📚 Documentation
Performance
We evaluated the model on both short and long-form transcriptions and tested it on in-distribution and out-of-distribution datasets for a comprehensive analysis of its accuracy, generalizability, and robustness. Note that the reported WER is the result after converting numbers to text, removing punctuation (except for apostrophes and hyphens), and converting all characters to lowercase. All evaluation results on public datasets can be found here.
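As an illustration, the normalization described above can be approximated as follows (a sketch, not the exact evaluation script; it assumes the num2words package for the number-to-word conversion):

```python
import re

from num2words import num2words  # assumption: used here for number-to-word conversion

def normalize(text: str) -> str:
    # Convert digit sequences to French words
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="fr"), text)
    # Remove punctuation except apostrophes and hyphens
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Lowercase and collapse whitespace
    return " ".join(text.lower().split())

print(normalize("Il est 10 heures, déjà !"))  # -> "il est dix heures déjà"
```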
Short-Form Transcription
Due to the lack of readily available out-of-domain (OOD) and long-form test sets in French, we used internal test sets from Zaion Lab. These sets consist of human-annotated audio-transcription pairs from call center conversations with significant background noise and domain-specific terminology.
Long-Form Transcription
The long-form transcription was run using the 🤗 Hugging Face pipeline for quicker evaluation. Audio files were segmented into 30-second chunks and processed in parallel.
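A minimal sketch of such a setup, reusing the model and processor from the pipeline example above (the values shown are illustrative, not the exact evaluation configuration):

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,  # split long audio into 30-second windows
    batch_size=16,      # illustrative value: chunks are transcribed in parallel
)
```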
Training Details
The model is a distilled version of Whisper-Large-V3-French, reducing the number of decoder layers from 32 to 16, 8, 4, or 2 and using large-scale datasets for distillation, as described in this paper.
📄 License
This project is licensed under the MIT license.