Distil-Whisper: Distil-Large-v3.5
Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3. It is introduced in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. As the latest member of the Distil-Whisper English family, Distil-Large-v3.5 combines high efficiency with improved performance.
Compared to earlier models, Distil-Large-v3.5 has been trained on over 4 times more diverse public data (98k hours). During distillation, it uses a "patient" teacher with an extended training schedule and aggressive data augmentation (SpecAugment). This results in better robustness and accuracy than previous Distil-Whisper models, making it a suitable drop-in replacement.
| Model | Params / M | Rel. RTFx | Short-Form OOD WER | Long-Form OOD WER |
|---|---|---|---|---|
| large-v3-turbo | 809 | 1.0 | 7.30 | 10.25 |
| distil-large-v3 | 756 | 1.44 | 7.53 | 11.6 |
| distil-large-v3.5 | 756 | 1.46 | 7.08 | 11.39 |
Why consider Distil-Large-v3.5 when Whisper-Large-v3-Turbo already exists?
- It offers a different balance between accuracy and efficiency. It is ~1.5x faster than Whisper-Large-v3-Turbo, performs slightly better on short-form transcription, and is only about 1% behind on long-form transcription.
- It works well as a draft model for speculative decoding with Whisper-Large-v3. By keeping the encoder frozen during training, we only need to load two extra decoder layers and forward the encoder once. This achieves ~2x faster inference than Whisper-Large-v3 while maintaining the same outputs.
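As a rough sketch of the second point (not an official recipe; it assumes the assisted-generation support in recent 🤗 Transformers releases and the dependencies installed in the Quick Start below), Distil-Large-v3.5 is passed as the `assistant_model` when generating with Whisper-Large-v3:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Target model: Whisper-Large-v3
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# Draft model: Distil-Large-v3.5 (trained with the Whisper-Large-v3 encoder kept frozen)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3.5", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")  # replace with your own audio file
print(result["text"])
```

Because the two models share the same encoder, the only extra memory cost is the distilled decoder: the assistant proposes draft tokens that the larger model then verifies, so the outputs match Whisper-Large-v3.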
This model is a 🤗 collaborative effort between Bofeng Huang, Eustache Le Bihan, Steven Zheng, and Vaibhav Srivastav.
Quick Start
Distil-Large-v3.5 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
⨠Features
- Knowledge Distillation: Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3, offering a balance between accuracy and efficiency.
- Enhanced Training: Trained on over 4 times more diverse public data with a "patient" teacher and aggressive data augmentation, resulting in better robustness and accuracy.
- Multiple Usage Modes: Supports short-form and long-form transcription, as well as speculative decoding.
- Library Compatibility: Compatible with various libraries such as Whisper.cpp, Faster-Whisper, OpenAI Whisper, Transformers.js, and Candle.
Installation
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
Usage Examples
Basic Usage - Short-Form Transcription
The model can be used with the `pipeline` class to transcribe short-form audio files (< 30 seconds) as follows:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
- result = pipe(sample)
+ result = pipe("audio.mp3")
For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
Advanced Usage - More Control over Generation Parameters
For more control over the generation parameters, use the model + processor API directly:
Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam search, `return_timestamps` for segment-level timestamps, and `prompt_ids` for prompting. See the docstrings for more details.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
input_features = processor(
sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
input_features = input_features.to(device, dtype=torch_dtype)
gen_kwargs = {
"max_new_tokens": 128,
"num_beams": 1,
"return_timestamps": False,
}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
print(pred_text)
Advanced Usage - Sequential Long-Form
Unlike previous Distil-Whisper releases, Distil-Large-v3 and Distil-Large-v3.5 are specifically designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30 seconds), and returns more accurate transcriptions compared to the chunked long-form algorithm.
The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and latency is less of a consideration
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm described below. For a detailed explanation of the different algorithms, refer to Section 5 of the Distil-Whisper paper.
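As a minimal sketch of the sequential algorithm with the model + processor API (reusing `model`, `processor`, `device` and `torch_dtype` from the examples above; the long-form dataset used here is only an assumed placeholder, so swap in your own audio):

```python
from datasets import Audio, load_dataset

# Load a long (> 30 s) audio sample; replace this with your own file.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

# Pass the full audio without truncation so generate() can slide over it in 30 s windows.
inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
print(pred_text)
```

The chunked algorithm, by contrast, is enabled directly in the pipeline by setting `chunk_length_s` (and optionally `batch_size` to decode chunks in parallel). A sketch reusing the pipeline components from the short-form example:

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,  # split long audio into ~25 s chunks
    batch_size=16,      # number of chunks decoded in parallel
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(sample)
print(result["text"])
```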
Documentation
Performance
The model was evaluated on both short and long-form transcriptions, using in-distribution (ID) and out-of-distribution (OOD) datasets to assess accuracy, generalizability, and robustness.
Note that Word Error Rate (WER) results shown here are post-normalization, which includes converting text to lowercase, removing symbols and punctuation, and more.
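To reproduce this kind of normalized WER on your own transcriptions, a minimal sketch (assuming the `evaluate` and `jiwer` packages, and using a simplified normalizer rather than the exact Whisper English normalizer) could look like:

```python
import re

import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")

def normalize(text: str) -> str:
    # Simplified stand-in for the Whisper English normalizer:
    # lowercase, strip punctuation/symbols, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reference = "Hello, world! This is a test."
hypothesis = "hello world this is the test"
wer = wer_metric.compute(
    predictions=[normalize(hypothesis)],
    references=[normalize(reference)],
)
print(f"WER: {100 * wer:.2f}%")
```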
Short-Form Evaluation
We've evaluated the model on 5 in-distribution (ID) test sets and 2 out-of-distribution (OOD) test sets for short-form transcription, as done on the 🤗 Open ASR Leaderboard.
| Dataset | Size / h | large-v3 | large-v3-turbo | distil-v3 | distil-v3.5 |
|---|---|---|---|---|---|
| *ID test sets* | | | | | |
| AMI | 8.68 | 15.95 | 16.13 | 15.16 | 14.63 |
| Gigaspeech | 35.36 | 10.02 | 10.14 | 10.08 | 9.84 |
| LS Clean | 5.40 | 2.01 | 2.10 | 2.54 | 2.37 |
| LS Other | 5.34 | 3.91 | 4.24 | 5.19 | 5.04 |
| Tedlium | 2.61 | 3.86 | 3.57 | 3.86 | 3.64 |
| *OOD test sets* | | | | | |
| Earnings22 | 5.43 | 11.29 | 11.63 | 11.79 | 11.29 |
| SPGISpeech | 100.00 | 2.94 | 2.97 | 3.27 | 2.87 |
| ID Average | | 7.15 | 7.24 | 7.37 | 7.10 |
| OOD Average | | 7.12 | 7.30 | 7.53 | 7.08 |
| Average | | 7.14 | 7.25 | 7.41 | 7.10 |
Note: ID/OOD classification is based on distil-v3 and distil-v3.5 training data. Large-v3 and large-v3-turbo training corpus details are unknown, so this categorization might not represent their true in-domain vs. out-of-domain performance.
Long-Form Evaluation
We've evaluated the model on 1 in-distribution (ID) test set and 4 out-of-distribution (OOD) test sets for long-form transcription, using the sequential decoding algorithm (condition_on_prev_tokens=False, return_timestamps=True).
| Dataset | Size / h | large-v3-turbo | distil-v2 | distil-v3 | distil-v3.5 |
|---|---|---|---|---|---|
| *ID test set* | | | | | |
| tedlium-long-form | 2.47 | 3.07 | 9.66 | 3.9 | 4.63 |
| *OOD test sets* | | | | | |
| meanwhile | 1.01 | 5.03 | 16.75 | 7.04 | 6.79 |
| earnings21 | 39.26 | 9.84 | 15.09 | 10.54 | 10.6 |
| earnings22 | 119.89 | 13.32 | 19.11 | 15.06 | 14.19 |
| rev16 | 16.16 | 12.82 | 21.15 | 13.76 | 13.98 |
| ID Average | | 3.07 | 9.66 | 3.9 | 4.63 |
| OOD Average | | 10.25 | 18.03 | 11.6 | 11.39 |
| Average | | 8.82 | 16.35 | 10.06 | 10.04 |
Note: ID/OOD classification is based on distil-v3 and distil-v3.5 training data. Large-v3 and large-v3-turbo training corpus details are unknown, so this categorization might not represent their true in-domain vs. out-of-domain performance.
Below are the Real Time Factor (RTFx, the ratio of audio duration to transcription time, so higher is faster) measurements, showing that Distil-Large-v3.5 is approximately 1.5x faster than Whisper-Large-v3-Turbo on long-form transcription. A rough timing sketch follows the table.
| Dataset | large-v3-turbo | distil-v2 | distil-v3 | distil-v3.5 |
|---|---|---|---|---|
| tedlium-long-form | 34.33 | 27.96 | 44.95 | 45.19 |
| meanwhile | 26.55 | 28.01 | 40.84 | 42.48 |
| earnings21 | 35.25 | 36.66 | 54.69 | 54.3 |
| earnings22 | 39.08 | 42.09 | 57.28 | 58.8 |
| rev16 | 33.86 | 23.87 | 45.43 | 45.91 |
| Average | 33.81 | 31.72 | 48.64 | 49.34 |
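As a rough way to measure RTFx on your own hardware (reusing `pipe` and a long-form `sample` from the examples above; results will vary with batch size, precision, and GPU):

```python
import time

# RTFx = audio duration / transcription time (higher is faster).
audio_duration = len(sample["array"]) / sample["sampling_rate"]

start = time.time()
result = pipe(sample)
elapsed = time.time() - start

print(f"Audio: {audio_duration:.1f} s, inference: {elapsed:.1f} s, RTFx: {audio_duration / elapsed:.2f}")
```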
Library Integrations
- Whisper.cpp: Integrate Distil-Large-v3.5 with Whisper.cpp for efficient inference.
- Faster-Whisper: Use Faster-Whisper to speed up the transcription process (a minimal sketch follows this list).
- OpenAI Whisper: Compatible with OpenAI Whisper, allowing for seamless integration.
- Transformers.js: Run the model in the browser using Transformers.js.
- Candle: Integrate with Candle for efficient inference on various hardware platforms.
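For example, a minimal Faster-Whisper sketch might look like the following; the model identifier is an assumption (it presumes a CTranslate2 conversion of this checkpoint is available locally or on the Hub), so substitute whatever converted weights you actually use:

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

# Assumed path/repo id of a CTranslate2 conversion of Distil-Large-v3.5.
model = WhisperModel("distil-whisper/distil-large-v3.5-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=1, language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```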
Training
Training Details
Distil-Large-v3.5 is trained using knowledge distillation from OpenAI's Whisper-Large-v3. It uses a "patient" teacher and aggressive data augmentation during training.
Training Data
Trained on over 4 times more diverse public data (98k hours) compared to earlier models.
Technical Details
Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3. The knowledge distillation process involves training a smaller model (Distil-Large-v3.5) to mimic the behavior of a larger model (Whisper-Large-v3). During training, a "patient" teacher is used with an extended training schedule and aggressive data augmentation (SpecAugment). This helps the smaller model to learn more effectively and achieve better performance.
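As a purely illustrative sketch of this kind of objective (not the authors' training code; the loss weight and temperature below are placeholder assumptions), the student is optimized on a weighted sum of a cross-entropy term over the teacher's pseudo-labels and a KL-divergence term between the student's and teacher's token distributions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels, alpha=0.8, temperature=2.0):
    """Weighted sum of pseudo-label cross-entropy and student/teacher KL divergence.

    Shapes: logits are (batch, seq_len, vocab); pseudo_labels are (batch, seq_len).
    alpha and temperature are illustrative placeholders, not the paper's settings.
    """
    # Cross-entropy against the teacher-generated pseudo-labels.
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), pseudo_labels, ignore_index=-100
    )
    # KL divergence between softened student and teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha * ce + (1 - alpha) * kl
```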
License
This project is licensed under the MIT License.
Citation
If you use Distil-Large-v3.5 in your research, please cite the following paper:
@misc{gandhi2023distilwhisper,
  title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
  author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
  year={2023},
  eprint={2311.00430},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Acknowledgements
This model is a 🤗 collaborative effort between Bofeng Huang, Eustache Le Bihan, Steven Zheng, and Vaibhav Srivastav.

