Distil-Whisper: Distil-Large-v3.5
Distil-Whisper is a knowledge-distilled version of OpenAI's Whisper-Large-v3 for automatic speech recognition, offering high efficiency with accuracy that improves on earlier Distil-Whisper releases.
Distil-Whisper is the knowledge-distilled version of OpenAI's Whisper-Large-v3, described in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. As the newest addition to the Distil-Whisper English family, Distil-Large-v3.5 maintains the high efficiency of its predecessors while delivering better performance.
Compared to earlier models, Distil-Large-v3.5 has been trained on over 4× more diverse public data (98k hours) and uses a "patient" teacher with an extended training schedule and aggressive data augmentation (SpecAugment) during distillation. This results in enhanced robustness and accuracy compared to previous Distil-Whisper models, making it suitable as a drop-in replacement.
Why consider Distil-Large-v3.5 when Whisper-Large-v3-Turbo already exists?
- It offers a different balance between accuracy and efficiency: it remains ~1.5x faster than Whisper-Large-v3-Turbo while performing slightly better on short-form transcription and falling ~1% behind on long-form transcription.
- It works perfectly as a draft model for speculative decoding with Whisper-Large-v3. Because the encoder was kept frozen during training, only two extra decoder layers need to be loaded, and the encoder forward pass runs just once. This achieves ~2x faster inference compared to Whisper-Large-v3 while maintaining identical outputs.
This model is a 🤗 collaborative effort between Bofeng Huang, Eustache Le Bihan, Steven Zheng, Vaibhav Srivastav, and Joshua Lochner.
Quick Start
Distil-Whisper's Distil-Large-v3.5 is a powerful model for automatic speech recognition, and you can get started with it in just a few steps.
Installation
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
```bash
npm i @huggingface/transformers
```
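Alternatively, Transformers.js can be loaded directly from a CDN for quick in-browser experiments without a bundler. The snippet below is a minimal sketch: the unversioned jsdelivr URL is an assumption, and in practice you should pin a specific release.

```js
// Minimal sketch: import Transformers.js from a CDN inside a <script type="module"> tag.
// The unversioned URL is an assumption; pin a release, e.g. .../npm/@huggingface/transformers@<version>.
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';
```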
Usage Examples
Basic Usage
You can then transcribe audio as follows:
```js
import { pipeline } from '@huggingface/transformers';

// Create an automatic speech recognition pipeline with the ONNX weights
const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v3.5-ONNX');

// Transcribe an audio file from a URL
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
console.log(output.text);
```
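The pipeline also accepts options for longer audio and richer output. The sketch below uses option names from the Transformers.js ASR pipeline (chunked long-form transcription, timestamps, and device/dtype selection); exact availability may vary with your installed version, so treat it as a hedged example rather than a definitive reference.

```js
import { pipeline } from '@huggingface/transformers';

// Optionally pick a device and quantization level when creating the pipeline
// (e.g. WebGPU in the browser). These options are a sketch based on the
// Transformers.js API and may vary by version.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'distil-whisper/distil-large-v3.5-ONNX',
  { device: 'webgpu', dtype: 'fp32' },
);

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';

// Split long audio into 30-second windows with a 5-second stride and
// return timestamped segments alongside the full text.
const output = await transcriber(url, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
});

console.log(output.text);    // full transcription
console.log(output.chunks);  // timestamped segments
```

Chunking with a stride lets the model handle audio longer than Whisper's 30-second context window by transcribing overlapping windows and stitching the segments back together.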
License
This project is licensed under the MIT license.