# Whisper

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation. Trained on a large amount of labeled data, it handles speech recognition and translation tasks effectively and generalizes well in zero-shot scenarios.
## Quick Start

First, you need to install the necessary libraries to run the Whisper large-v3-turbo model:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
Here is a basic example of using the model to transcribe audio:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
## Features

- Multilingual Support: Supports a wide range of languages, including en, zh, de, es, ru, and more.
- High Performance: Trained on >5M hours of labeled data, showing strong generalization ability.
- Multiple Decoding Strategies: Compatible with various decoding strategies, such as temperature fallback and conditioning on previous tokens.
- Automatic Language Prediction: Automatically predicts the language of the source audio; the language and task can also be set explicitly, as in the sketch below.
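A minimal sketch of these options, assuming the `pipe` object and `sample` from the Quick Start example above (the same settings are covered in more detail under Usage Examples):

```python
# Language detection happens automatically when no language is specified.
result_auto = pipe(sample)

# The source language and the task can also be set explicitly via generate_kwargs.
result_english = pipe(sample, generate_kwargs={"language": "english"})
result_translated = pipe(sample, generate_kwargs={"task": "translate"})
```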
## Installation

To install the necessary libraries for running the model, use the following commands:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
## Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
### Advanced Usage

#### Transcribing a Local Audio File

```python
result = pipe("audio.mp3")
```

#### Transcribing Multiple Audio Files in Parallel

```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
#### Enabling Decoding Heuristics

```python
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
```
#### Specifying the Source Audio Language

```python
result = pipe(sample, generate_kwargs={"language": "english"})
```

#### Performing Speech Translation

```python
result = pipe(sample, generate_kwargs={"task": "translate"})
```
#### Predicting Timestamps

```python
# Sentence-level timestamps
result = pipe(sample, return_timestamps=True)
print(result["chunks"])

# Word-level timestamps
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```
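Each entry in `result["chunks"]` is a dict with a `timestamp` tuple of start and end times in seconds and the corresponding `text`, so the predicted segments can be iterated over directly (a minimal sketch, assuming `result` from either call above):

```python
# Print each predicted segment with its start/end time in seconds.
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start}s -> {end}s]{chunk['text']}")
```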
#### Using Model + Processor API Directly

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```
## Technical Details

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. There are English-only and multilingual versions. The English-only models are trained for English speech recognition, while the multilingual models are trained for both multilingual speech recognition and speech translation.

Whisper checkpoints come in five configurations of different model sizes. The smallest four are available in both English-only and multilingual versions, and the largest checkpoints are multilingual only. All pre-trained checkpoints are available on the Hugging Face Hub.
| Size | Parameters | English-only | Multilingual |
|------|------------|--------------|--------------|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | x | ✓ |
| large-v2 | 1550 M | x | ✓ |
| large-v3 | 1550 M | x | ✓ |
| large-v3-turbo | 809 M | x | ✓ |
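Any of these checkpoints can be loaded by swapping the repository id. A minimal sketch, assuming the Hub naming pattern `openai/whisper-<size>`, with a `.en` suffix for the English-only variants:

```python
from transformers import AutoModelForSpeechSeq2Seq

# Multilingual and English-only variants of the smallest checkpoint.
multilingual_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")
english_only_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en")
```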
## Documentation

### Additional Speed & Memory Improvements

#### Chunked Long-Form

Whisper has a 30-second receptive field. For audio longer than this, one of two long-form algorithms can be used: sequential or chunked. The sequential algorithm is preferable when transcription accuracy is the priority or when transcribing batches of long audio files; the chunked algorithm is preferable when transcription speed is the priority or when transcribing a single long audio file. Transformers uses the sequential algorithm by default; to enable the chunked algorithm, pass the `chunk_length_s` parameter to the pipeline, as in the example below.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
#### Torch compile

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])
```
#### Flash Attention 2

If your GPU supports it and you are not using `torch.compile`, we recommend using Flash Attention 2. First, install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then use the following code:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2"
)
```
#### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, use PyTorch scaled dot-product attention (SDPA). Check whether your PyTorch version is compatible:

```python
from transformers.utils import is_torch_sdpa_available

print(is_torch_sdpa_available())
```

If it returns `True`, SDPA is activated by default. If it returns `False`, upgrade your PyTorch version. You can also request SDPA explicitly:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa"
)
```
### Fine-Tuning

The pre-trained Whisper model can be fine-tuned for better performance on specific languages and tasks. Refer to the blog post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper) for a step-by-step guide.
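As a rough illustration of the setup described in that guide, the sketch below pins the target language and task on the generation config and builds typical training arguments; the language choice, output path, and hyperparameters are purely illustrative, and dataset preparation plus the `Seq2SeqTrainer` setup follow the blog post:

```python
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model_id = "openai/whisper-large-v3-turbo"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)

# Pin the language and task so the fine-tuned model does not have to re-learn them
# (Hindi transcription is just an example target).
model.generation_config.language = "hindi"
model.generation_config.task = "transcribe"

# Illustrative starting hyperparameters - adjust for your dataset and hardware.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-finetuned",  # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)
```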
### Evaluated Use

The primary intended users are AI researchers, but the model can also be useful for developers, especially for English speech recognition. Caution is needed when using the model: for example, it should not be used to transcribe recordings made without consent or for subjective classification.
### Performance and Limitations

The models show improved robustness and near-state-of-the-art accuracy, but because of their weakly supervised training they can hallucinate text that was never spoken. Performance varies across languages, and the models are prone to generating repetitive text.
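The decoding heuristics shown under Advanced Usage can reduce, though not eliminate, repetition and hallucination. A minimal sketch, assuming the `pipe` object and `sample` from the usage examples:

```python
# Temperature fallback combined with compression-ratio, log-prob, and
# no-speech thresholds helps suppress repetitive or hallucinated output.
result = pipe(
    sample,
    generate_kwargs={
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        "compression_ratio_threshold": 1.35,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6,
        "condition_on_prev_tokens": False,
    },
)
```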
## License

The model uses the MIT license.