Whisper-Large-V3-French-Distil-Dec16
Whisper-Large-V3-French-Distil represents a series of distilled versions of Whisper-Large-V3-French, which reduce memory usage and inference time while maintaining performance and mitigating the risk of hallucinations. This repository hosts the variant with 16 decoder layers, as indicated by the Dec16 suffix.
🚀 Quick Start
This model is designed for automatic speech recognition. It has been converted into various formats, making it easy to use across different libraries.
✨ Features
- Distilled Variants: Reduce memory usage and inference time while maintaining performance and mitigating the risk of hallucinations.
- Multiple Formats: Converted into various formats for use across different libraries, including transformers, openai-whisper, faster-whisper, whisper.cpp, candle, mlx, etc.
- Speculative Decoding: Can be combined with the original Whisper-Large-V3-French model for speculative decoding, resulting in improved inference speed and consistent outputs.
📦 Installation
The installation process depends on the library you choose to use. Here are some examples:
OpenAI Whisper
```bash
pip install -U openai-whisper
```
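Once installed, the model can be used with the openai-whisper API. A minimal sketch, assuming you have first downloaded the openai-whisper-format checkpoint from the model repository; the local path below is hypothetical:

```python
import whisper

# Hypothetical local path: download the openai-whisper-format checkpoint
# from the model repository first and point load_model at it.
model = whisper.load_model("./models/whisper-large-v3-french-distil-dec16/original_model.pt")

# Transcribe a French audio file
result = model.transcribe("audio.wav", language="fr")
print(result["text"])
```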
Faster Whisper
```bash
pip install faster-whisper
```
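A minimal sketch for faster-whisper, assuming you have downloaded the CTranslate2 conversion of the model from the repository; the local directory path is hypothetical:

```python
from faster_whisper import WhisperModel

# Hypothetical local path: faster-whisper expects the CTranslate2
# conversion of the model, downloaded from the model repository.
model = WhisperModel(
    "./models/whisper-large-v3-french-distil-dec16/ctranslate2",
    device="cuda",
    compute_type="float16",
)

# transcribe() returns a generator of segments plus metadata
segments, info = model.transcribe("audio.wav", language="fr")
for segment in segments:
    print(segment.text)
```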
Whisper.cpp
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
```
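After cloning, build the project and run inference on the GGML conversion of this model. A sketch under stated assumptions: the checkpoint filename below is hypothetical, and build steps vary by whisper.cpp version (consult its README):

```bash
# Build whisper.cpp (platform-specific options are covered in its README)
make

# Transcribe with the GGML conversion of this model; download the
# converted weights from the model repository first -- the filename
# below is a placeholder.
./main -m ./models/ggml-whisper-large-v3-french-distil-dec16.bin -l fr -f ./audio.wav
```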
💻 Usage Examples
Basic Usage
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the distilled model and its processor
model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec16"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Build the ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    max_new_tokens=128,
)

# Run inference on a sample from a dummy French ASR dataset
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
Advanced Usage
Speculative Decoding
```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the original (non-distilled) model as the main model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the 2-decoder-layer distilled variant as the draft (assistant) model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Pass the assistant model to generation via generate_kwargs
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

# Run inference on a sample from a dummy French ASR dataset
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
📚 Documentation
Performance
We evaluated our model on both short- and long-form transcription, using in-distribution and out-of-distribution datasets, for a comprehensive assessment of its accuracy, generalizability, and robustness.
Please note that the reported WER is the result after converting numbers to text, removing punctuation (except for apostrophes and hyphens), and converting all characters to lowercase.
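For illustration, a normalization of this kind might look like the sketch below; the num2words package and the exact regular expressions are assumptions, not the evaluation's actual code:

```python
import re

from num2words import num2words  # assumption: pip install num2words


def normalize(text: str) -> str:
    # Spell out digits as French words.
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="fr"), text)
    # Keep apostrophes and hyphens; replace other punctuation with spaces.
    text = re.sub(r"[^\w\s'’-]", " ", text)
    # Lowercase and collapse whitespace.
    return " ".join(text.lower().split())


print(normalize("Il est 3 heures, d'accord !"))  # -> "il est trois heures d'accord"
```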
All evaluation results on the public datasets can be found here.
Short-Form Transcription

Due to the lack of readily available out-of-domain (OOD) and long-form test sets in French, we evaluated using internal test sets from Zaion Lab. These sets comprise human-annotated audio-transcription pairs from call center conversations, which are notable for their significant background noise and domain-specific terminology.
Long-Form Transcription

The long-form transcription was run using the 🤗 Hugging Face pipeline for quicker evaluation. Audio files were segmented into 30-second chunks and processed in parallel.
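To reproduce this setup with the pipeline from the Basic Usage example above, chunking and batching can be enabled at call time; the batch size here is illustrative:

```python
# Reusing `pipe` from the Basic Usage example: split long audio into
# 30-second chunks and decode them in parallel batches.
result = pipe("long_audio.wav", chunk_length_s=30, batch_size=8)
print(result["text"])
```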
Training Details
The distilled variants were obtained by reducing the number of decoder layers from 32 to 16, 8, 4, or 2 and performing distillation on a large-scale dataset, as outlined in this paper.
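This layer reduction is visible in each checkpoint's configuration, which offers a quick way to verify it:

```python
from transformers import AutoConfig

# Compare decoder depth between the original model and this distilled variant.
for repo_id in (
    "bofenghuang/whisper-large-v3-french",
    "bofenghuang/whisper-large-v3-french-distil-dec16",
):
    config = AutoConfig.from_pretrained(repo_id)
    print(repo_id, "->", config.decoder_layers, "decoder layers")
```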
Acknowledgements
We would like to thank the contributors and the open-source community for their support and contributions.
📄 License
This project is licensed under the MIT License.