🚀 Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on a large amount of labeled data, it generalizes well to many datasets and domains without fine-tuning.
🚀 Quick Start
Whisper is a powerful pre-trained model for automatic speech recognition and speech translation. It can handle both English-only and multilingual tasks, and comes in different sizes to meet various needs.
✨ Features
- Multilingual Support: Supports a wide range of languages, including English, Chinese, German, and Spanish.
- Dual-Task Training: Trained for both speech recognition and speech translation.
- Multiple Model Sizes: Available in five configurations of varying model size.
📚 Documentation
Model Details
Whisper is a Transformer-based encoder-decoder model, trained on 680k hours of labeled speech data using large-scale weak supervision.
The models are trained on either English-only or multilingual data. English-only models are trained for speech recognition, while multilingual models are trained for both speech recognition and speech translation.
Whisper checkpoints come in five sizes: tiny, base, small, medium, and large. The smallest four are available in both English-only and multilingual versions; the large checkpoints are multilingual only. All pre-trained checkpoints are available on the Hugging Face Hub:
| Size | Parameters | English-only | Multilingual |
|---|---|---|---|
| tiny | 39 M | [🔗](https://huggingface.co/openai/whisper-tiny.en) | [🔗](https://huggingface.co/openai/whisper-tiny) |
| base | 74 M | [🔗](https://huggingface.co/openai/whisper-base.en) | [🔗](https://huggingface.co/openai/whisper-base) |
| small | 244 M | [🔗](https://huggingface.co/openai/whisper-small.en) | [🔗](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | [🔗](https://huggingface.co/openai/whisper-medium.en) | [🔗](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | x | [🔗](https://huggingface.co/openai/whisper-large) |
| large-v2 | 1550 M | x | [🔗](https://huggingface.co/openai/whisper-large-v2) |
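Each link above points to a checkpoint that can be loaded by its Hub identifier; English-only checkpoints carry a `.en` suffix. A minimal sketch loading the tiny English-only model:
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> # English-only checkpoints end in ".en"; drop the suffix for the multilingual version
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")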
Usage
To transcribe audio samples, use the model together with a `WhisperProcessor`. The `WhisperProcessor` pre-processes audio inputs (converting them to log-Mel spectrogram features for the model) and post-processes model outputs (converting token ids back to text).
Context tokens are used to inform the model of the task (transcription or translation). A typical sequence of context tokens might be:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
These tokens can be forced or un-forced. Forced tokens control the output language and task, while un-forced tokens let the model predict them automatically.
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
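Conversely, leaving the tokens un-forced lets the model predict the language and task itself. A minimal sketch of both modes, assuming the `processor` and `model` instances created in the examples below:
>>> # force French transcription
>>> model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
>>> # un-force the tokens so the model predicts language and task automatically
>>> model.config.forced_decoder_ids = None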
💻 Usage Examples
Basic Usage
Transcription - English to English
In this example, the context tokens are 'un-forced', and the model automatically predicts the output language (English) and task (transcribe).
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset
>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> model.config.forced_decoder_ids = None
>>> # load dummy dataset and read audio files
>>> ds = load_dataset("hf - internal - testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
>>> # generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
French to French
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset
>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids)
['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Un vrai travail intéressant va enfin être mené sur ce sujet.']
Advanced Usage
Translation - French to English
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset
>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']
Evaluation
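This example evaluates whisper-small on the LibriSpeech test-clean split and reports the word error rate (WER). It assumes a CUDA-capable GPU; `_normalize` is a private tokenizer helper that applies Whisper's English text normalizer before scoring.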
>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load
>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")
>>> def map_to_pred(batch):
>>>     audio = batch["audio"]
>>>     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
>>>     batch["reference"] = processor.tokenizer._normalize(batch['text'])
>>>
>>>     with torch.no_grad():
>>>         predicted_ids = model.generate(input_features.to("cuda"))[0]
>>>     transcription = processor.decode(predicted_ids)
>>>     batch["prediction"] = processor.tokenizer._normalize(transcription)
>>>     return batch
>>> result = librispeech_test_clean.map(map_to_pred)
>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
3.432213777886737
Long-Form Transcription
The Whisper model is designed for audio samples of up to 30 s. Using a chunking algorithm via the Transformers `pipeline`, it can nevertheless transcribe audio of arbitrary length. Enable chunking by setting `chunk_length_s=30` when instantiating the pipeline.
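A minimal sketch, reusing the whisper-small checkpoint and the dummy LibriSpeech sample from the examples above (the GPU device is optional):
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset
>>> # run on GPU when available, otherwise fall back to CPU
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
>>> pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30, device=device)
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> # the pipeline accepts a dict with "array" and "sampling_rate" keys
>>> prediction = pipe(sample.copy())["text"]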
📄 License
This model is licensed under the Apache-2.0 license.

