Distil-Whisper: Distil-Large-v3.5
Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3. It is introduced in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. As the latest member of the Distil-Whisper English family, Distil-Large-v3.5 combines high efficiency with improved performance.
Compared to earlier models, Distil-Large-v3.5 has been trained on over 4 times more diverse public data (98k hours). During distillation, it uses a "patient" teacher with an extended training schedule and aggressive data augmentation (SpecAugment). This results in better robustness and accuracy than previous Distil-Whisper models, making it a suitable drop-in replacement.
| Model | Params / M | Rel. RTFx | Short-Form OOD WER | Long-Form OOD WER |
|---|---|---|---|---|
| large-v3-turbo | 809 | 1.0 | 7.30 | 10.25 |
| distil-large-v3 | 756 | 1.44 | 7.53 | 11.6 |
| distil-large-v3.5 | 756 | 1.46 | 7.08 | 11.39 |
Why consider Distil-Large-v3.5 when Whisper-Large-v3-Turbo already exists?
- It offers a different balance between accuracy and efficiency. It is ~1.5x faster than Whisper-Large-v3-Turbo, performs slightly better on short-form transcription, and is only about 1% behind on long-form transcription.
- It works well as a draft model for speculative decoding with Whisper-Large-v3. By keeping the encoder frozen during training, we only need to load two extra decoder layers and forward the encoder once. This achieves ~2x faster inference than Whisper-Large-v3 while maintaining the same outputs.
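As a rough sketch of the second point (not an official recipe; it assumes the assisted-generation support in recent 🤗 Transformers releases and the dependencies installed in the Quick Start below), Distil-Large-v3.5 is passed as the `assistant_model` when generating with Whisper-Large-v3:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Target model: Whisper-Large-v3
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# Draft model: Distil-Large-v3.5 (trained with the Whisper-Large-v3 encoder kept frozen)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3.5", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")  # replace with your own audio file
print(result["text"])
```

Because the two models share the same encoder, the only extra memory cost is the distilled decoder: the assistant proposes draft tokens that the larger model then verifies, so the outputs match Whisper-Large-v3.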
This model is a 🤗 collaborative effort between Bofeng Huang, Eustache Le Bihan, Steven Zheng, and Vaibhav Srivastav.
Quick Start
Distil-Large-v3.5 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
⨠Features
- Knowledge Distillation: Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3, offering a balance between accuracy and efficiency.
- Enhanced Training: Trained on over 4 times more diverse public data with a "patient" teacher and aggressive data augmentation, resulting in better robustness and accuracy.
- Multiple Usage Modes: Supports short-form and long-form transcription, as well as speculative decoding.
- Library Compatibility: Compatible with various libraries such as Whisper.cpp, Faster-Whisper, OpenAI Whisper, Transformers.js, and Candle.
Installation
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
Usage Examples
Basic Usage - Short-Form Transcription
The model can be used with the `pipeline` class to transcribe short-form audio files (< 30 seconds) as follows:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
- result = pipe(sample)
+ result = pipe("audio.mp3")
For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
Advanced Usage - More Control over Generation Parameters
For more control over the generation parameters, use the model + processor API directly:
Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam search, `return_timestamps` for segment-level timestamps, and `prompt_ids` for prompting. See the docstrings for more details.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
input_features = processor(
sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
input_features = input_features.to(device, dtype=torch_dtype)
gen_kwargs = {
"max_new_tokens": 128,
"num_beams": 1,
"return_timestamps": False,
}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
print(pred_text)
Advanced Usage - Sequential Long-Form
Unlike previous Distil-Whisper releases, Distil-Large-v3 and Distil-Large-v3.5 are specifically designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30 seconds), and returns more accurate transcriptions compared to the chunked long-form algorithm.
The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and latency is less of a consideration
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm described below. For a detailed explanation of the different algorithms, refer to Section 5 of the Distil-Whisper paper.
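As a minimal sketch of the sequential algorithm with the model + processor API (reusing `model`, `processor`, `device` and `torch_dtype` from the examples above; the long-form dataset used here is only an assumed placeholder, so swap in your own audio):

```python
from datasets import Audio, load_dataset

# Load a long (> 30 s) audio sample; replace this with your own file.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

# Pass the full audio without truncation so generate() can slide over it in 30 s windows.
inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
print(pred_text)
```

The chunked algorithm, by contrast, is enabled directly in the pipeline by setting `chunk_length_s` (and optionally `batch_size` to decode chunks in parallel). A sketch reusing the pipeline components from the short-form example:

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,  # split long audio into ~25 s chunks
    batch_size=16,      # number of chunks decoded in parallel
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(sample)
print(result["text"])
```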
Documentation
Performance
The model was evaluated on both short and long-form transcriptions, using in-distribution (ID) and out-of-distribution (OOD) datasets to assess accuracy, generalizability, and robustness.
Note that Word Error Rate (WER) results shown here are post-normalization, which includes converting text to lowercase, removing symbols and punctuation, and more.
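To reproduce this kind of normalized WER on your own transcriptions, a minimal sketch (assuming the `evaluate` and `jiwer` packages, and using a simplified normalizer rather than the exact Whisper English normalizer) could look like:

```python
import re

import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")

def normalize(text: str) -> str:
    # Simplified stand-in for the Whisper English normalizer:
    # lowercase, strip punctuation/symbols, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reference = "Hello, world! This is a test."
hypothesis = "hello world this is the test"
wer = wer_metric.compute(
    predictions=[normalize(hypothesis)],
    references=[normalize(reference)],
)
print(f"WER: {100 * wer:.2f}%")
```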
Short-Form Evaluation
We've evaluated the model on 5 in-distribution (ID) test sets and 2 out-of-distribution (OOD) test sets for short-form transcription, as done on the 🤗 Open ASR Leaderboard.
| Dataset | Size / h | large-v3 | large-v3-turbo | distil-v3 | distil-v3.5 |
|---|---|---|---|---|---|
| *ID test sets* | | | | | |
| AMI | 8.68 | 15.95 | 16.13 | 15.16 | 14.63 |
| Gigaspeech | 35.36 | 10.02 | 10.14 | 10.08 | 9.84 |
| LS Clean | 5.40 | 2.01 | 2.10 | 2.54 | 2.37 |
| LS Other | 5.34 | 3.91 | 4.24 | 5.19 | 5.04 |
| Tedlium | 2.61 | 3.86 | 3.57 | 3.86 | 3.64 |
| *OOD test sets* | | | | | |
| Earnings22 | 5.43 | 11.29 | 11.63 | 11.79 | 11.29 |
| SPGISpeech | 100.00 | 2.94 | 2.97 | 3.27 | 2.87 |
| ID Average | | 7.15 | 7.24 | 7.37 | 7.10 |
| OOD Average | | 7.12 | 7.30 | 7.53 | 7.08 |
| Average | | 7.14 | 7.25 | 7.41 | 7.10 |
Note: ID/OOD classification is based on distil-v3 and distil-v3.5 training data. Large-v3 and large-v3-turbo training corpus details are unknown, so this categorization might not represent their true in-domain vs. out-of-domain performance.
Long-Form Evaluation
We've evaluated the model on 1 in-distribution (ID) test set and 4 out-of-distribution (OOD) test sets for long-form transcription, using the sequential decoding algorithm (condition_on_prev_tokens=False, return_timestamps=True).
| Dataset | Size / h | large-v3-turbo | distil-v2 | distil-v3 | distil-v3.5 |
|---|---|---|---|---|---|
| *ID test set* | | | | | |
| tedlium-long-form | 2.47 | 3.07 | 9.66 | 3.9 | 4.63 |
| *OOD test sets* | | | | | |
| meanwhile | 1.01 | 5.03 | 16.75 | 7.04 | 6.79 |
| earnings21 | 39.26 | 9.84 | 15.09 | 10.54 | 10.6 |
| earnings22 | 119.89 | 13.32 | 19.11 | 15.06 | 14.19 |
| rev16 | 16.16 | 12.82 | 21.15 | 13.76 | 13.98 |
| ID Average | | 3.07 | 9.66 | 3.9 | 4.63 |
| OOD Average | | 10.25 | 18.03 | 11.6 | 11.39 |
| Average | | 8.82 | 16.35 | 10.06 | 10.04 |
Note: ID/OOD classification is based on distil-v3 and distil-v3.5 training data. Large-v3 and large-v3-turbo training corpus details are unknown, so this categorization might not represent their true in-domain vs. out-of-domain performance.
Below are the Real Time Factor (RTFx, the ratio of audio duration to transcription time, so higher is faster) measurements, showing that Distil-Large-v3.5 is approximately 1.5x faster than Whisper-Large-v3-Turbo on long-form transcription. A rough timing sketch follows the table.
| Dataset | large-v3-turbo | distil-v2 | distil-v3 | distil-v3.5 |
|---|---|---|---|---|
| tedlium-long-form | 34.33 | 27.96 | 44.95 | 45.19 |
| meanwhile | 26.55 | 28.01 | 40.84 | 42.48 |
| earnings21 | 35.25 | 36.66 | 54.69 | 54.3 |
| earnings22 | 39.08 | 42.09 | 57.28 | 58.8 |
| rev16 | 33.86 | 23.87 | 45.43 | 45.91 |
| Average | 33.81 | 31.72 | 48.64 | 49.34 |
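As a rough way to measure RTFx on your own hardware (reusing `pipe` and a long-form `sample` from the examples above; results will vary with batch size, precision, and GPU):

```python
import time

# RTFx = audio duration / transcription time (higher is faster).
audio_duration = len(sample["array"]) / sample["sampling_rate"]

start = time.time()
result = pipe(sample)
elapsed = time.time() - start

print(f"Audio: {audio_duration:.1f} s, inference: {elapsed:.1f} s, RTFx: {audio_duration / elapsed:.2f}")
```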
Library Integrations
- Whisper.cpp: Integrate Distil-Large-v3.5 with Whisper.cpp for efficient inference.
- Faster-Whisper: Use Faster-Whisper to speed up the transcription process (a minimal sketch follows this list).
- OpenAI Whisper: Compatible with OpenAI Whisper, allowing for seamless integration.
- Transformers.js: Run the model in the browser using Transformers.js.
- Candle: Integrate with Candle for efficient inference on various hardware platforms.
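For example, a minimal Faster-Whisper sketch might look like the following; the model identifier is an assumption (it presumes a CTranslate2 conversion of this checkpoint is available locally or on the Hub), so substitute whatever converted weights you actually use:

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

# Assumed path/repo id of a CTranslate2 conversion of Distil-Large-v3.5.
model = WhisperModel("distil-whisper/distil-large-v3.5-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=1, language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```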
Training
Training Details
Distil-Large-v3.5 is trained using knowledge distillation from OpenAI's Whisper-Large-v3. It uses a "patient" teacher and aggressive data augmentation during training.
Training Data
Trained on over 4 times more diverse public data (98k hours) compared to earlier models.
Technical Details
Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3. The knowledge distillation process involves training a smaller model (Distil-Large-v3.5) to mimic the behavior of a larger model (Whisper-Large-v3). During training, a "patient" teacher is used with an extended training schedule and aggressive data augmentation (SpecAugment). This helps the smaller model to learn more effectively and achieve better performance.
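As a purely illustrative sketch of this kind of objective (not the authors' training code; the loss weight and temperature below are placeholder assumptions), the student is optimized on a weighted sum of a cross-entropy term over the teacher's pseudo-labels and a KL-divergence term between the student's and teacher's token distributions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels, alpha=0.8, temperature=2.0):
    """Weighted sum of pseudo-label cross-entropy and student/teacher KL divergence.

    Shapes: logits are (batch, seq_len, vocab); pseudo_labels are (batch, seq_len).
    alpha and temperature are illustrative placeholders, not the paper's settings.
    """
    # Cross-entropy against the teacher-generated pseudo-labels.
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), pseudo_labels, ignore_index=-100
    )
    # KL divergence between softened student and teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha * ce + (1 - alpha) * kl
```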
License
This project is licensed under the MIT License.
Citation
If you use Distil-Large-v3.5 in your research, please cite the following paper:
@misc{gandhi2023distilwhisper,
  title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
  author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
  year={2023},
  eprint={2311.00430},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Acknowledgements
This model is a 🤗 collaborative effort between Bofeng Huang, Eustache Le Bihan, Steven Zheng, and Vaibhav Srivastav.

