Whisper
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on over 5 million hours of labeled data, Whisper shows strong generalization ability across various datasets and domains in a zero-shot setting.
Whisper large-v3-turbo is a distilled version of Whisper large-v3. That is, it's essentially the same model, but the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, though there's a slight degradation in quality.
Disclaimer: Part of the content for this model card was written by the 🤗 Hugging Face team, and part was copied from the original model card.
🚀 Quick Start
Prerequisites
Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
Basic Usage
The model can be used with the `pipeline` class to transcribe audio of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
✨ Features
- Multilingual Support: Supports multiple languages including en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su.
- Automatic Language Prediction: Automatically predicts the language of the source audio (see the sketch after this list).
- Speech Translation: Can perform speech translation tasks, converting speech to English text.
- Timestamp Prediction: Can predict both sentence-level and word-level timestamps.
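As a quick illustration of automatic language prediction, here is a minimal sketch. It assumes the `model`, `processor`, `device`, `torch_dtype`, and `sample` objects defined in the usage examples below, and a clip shorter than 30 seconds:
# Minimal sketch: inspect the language Whisper predicts for a clip.
# Assumes model, processor, device, torch_dtype and sample from the
# usage examples below; the clip is assumed to be shorter than 30 seconds.
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).to(device, dtype=torch_dtype)
# With no language forced, generate() first predicts a language token.
pred_ids = model.generate(inputs.input_features, max_new_tokens=64)
# Decoding without skipping special tokens exposes that token, e.g. "<|en|>".
print(processor.batch_decode(pred_ids, skip_special_tokens=False)[0])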
📦 Installation
To install the necessary libraries for using Whisper large-v3-turbo, run the following commands:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
💻 Usage Examples
Basic Usage
The basic pipeline setup is identical to the Quick Start snippet above; the examples below reuse the `pipe` and `sample` objects defined there.
Transcribing a Local Audio File
result = pipe("audio.mp3")
Transcribing Multiple Audio Files in Parallel
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
Using Decoding Strategies
Whisper's long-form decoding heuristics (temperature fallback plus compression-ratio, log-probability, and no-speech thresholds) can be tuned through `generate_kwargs`:
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures, tried in order
    "logprob_threshold": -1.0,  # below this average log-probability, retry with a higher temperature
    "no_speech_threshold": 0.6,  # above this no-speech probability, a segment may be treated as silence
    "return_timestamps": True,
}
result = pipe(sample, generate_kwargs=generate_kwargs)
Specifying the Source Audio Language
Whisper predicts the language of the source audio automatically. If it is known ahead of time, it can be passed as an argument:
result = pipe(sample, generate_kwargs={"language": "english"})
Performing Speech Translation
result = pipe(sample, generate_kwargs={"task": "translate"})
Predicting Sentence-Level Timestamps
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
Predicting Word-Level Timestamps
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
Combining Arguments
result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])
Using the Model + Processor API Directly
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)
gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}
pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)
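To keep Whisper's timestamp tokens in the decoded text rather than stripping them, a small variant of the decode call (flipping the flag used above):
# Variant: retain the <|x.xx|> timestamp tokens in the transcription.
pred_text_ts = processor.batch_decode(
    pred_ids, skip_special_tokens=True, decode_with_timestamps=True
)
print(pred_text_ts)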
🔧 Technical Details
Long-Form Transcription Algorithms
Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
- Sequential: Uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other.
- Chunked: Splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries.
The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and speed is less of a consideration.
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate.
Conversely, the chunked algorithm should be used when:
- Transcription speed is the most important factor.
- You are transcribing a single long audio file.
Additional Speed & Memory Improvements
Chunked Long-Form
By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, also pass the argument `batch_size`:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Torch compile
The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.
Note: `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm
torch.set_float32_matmul_precision("high")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())
print(result["text"])
Flash Attention 2
We recommend using Flash-Attention 2 if your GPU supports it and you are not using torch.compile. To do so, first install Flash Attention:
pip install flash-attn --no-build-isolation
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
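To guard against unsupported hardware, availability can be checked first. Here is a small sketch using the `is_flash_attn_2_available` utility from Transformers; the fallback to SDPA is an assumption, not part of the original card:
from transformers.utils import is_flash_attn_2_available

# Prefer Flash Attention 2 when the GPU and installed packages support it;
# otherwise fall back to PyTorch SDPA (assumed fallback, see note above).
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True,
    attn_implementation=attn_implementation,
)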
Torch Scaled Dot-Product Attention (SDPA)
If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater. To check whether you have a compatible PyTorch version, run the following Python code snippet:
from transformers.utils import is_torch_sdpa_available
print(is_torch_sdpa_available())
If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it returns `False`, you need to upgrade your PyTorch version according to the official instructions.
Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying `attn_implementation="sdpa"` as follows:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
📄 License
This model is licensed under the Apache 2.0 license.