# Whisper

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation. Trained on a large amount of labeled data, it handles speech recognition and translation tasks effectively and generalizes well in zero-shot scenarios.
## Quick Start

First, you need to install the necessary libraries to run the Whisper large-v3-turbo model:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
Here is a basic example of using the model to transcribe audio:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
## Features

- Multilingual Support: Supports a wide range of languages, including en, zh, de, es, ru, and more.
- High Performance: Trained on >5M hours of labeled data, showing strong generalization ability.
- Multiple Decoding Strategies: Compatible with various decoding strategies, such as temperature fallback and conditioning on previous tokens.
- Automatic Language Prediction: Automatically predicts the language of the source audio; the language and task can also be set explicitly, as in the sketch below.
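A minimal sketch of these options, assuming the `pipe` object and `sample` from the Quick Start example above (the same settings are covered in more detail under Usage Examples):

```python
# Language detection happens automatically when no language is specified.
result_auto = pipe(sample)

# The source language and the task can also be set explicitly via generate_kwargs.
result_english = pipe(sample, generate_kwargs={"language": "english"})
result_translated = pipe(sample, generate_kwargs={"task": "translate"})
```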
## Installation

To install the necessary libraries for running the model, use the following commands:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
## Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
### Advanced Usage

#### Transcribing a Local Audio File

```python
result = pipe("audio.mp3")
```

#### Transcribing Multiple Audio Files in Parallel

```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
#### Enabling Decoding Heuristics

```python
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
```
#### Specifying the Source Audio Language

```python
result = pipe(sample, generate_kwargs={"language": "english"})
```

#### Performing Speech Translation

```python
result = pipe(sample, generate_kwargs={"task": "translate"})
```
#### Predicting Timestamps

```python
# Sentence-level timestamps
result = pipe(sample, return_timestamps=True)
print(result["chunks"])

# Word-level timestamps
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```
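Each entry in `result["chunks"]` is a dict with a `timestamp` tuple of start and end times in seconds and the corresponding `text`, so the predicted segments can be iterated over directly (a minimal sketch, assuming `result` from either call above):

```python
# Print each predicted segment with its start/end time in seconds.
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start}s -> {end}s]{chunk['text']}")
```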
#### Using Model + Processor API Directly

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```
## Technical Details

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. There are English-only and multilingual versions. The English-only models are trained for English speech recognition, while the multilingual models are trained for both multilingual speech recognition and speech translation.

Whisper checkpoints come in five configurations of different model sizes. The smallest four are available in both English-only and multilingual versions, and the largest checkpoints are multilingual only. All pre-trained checkpoints are available on the Hugging Face Hub.
| Size | Parameters | English-only | Multilingual |
|------|------------|--------------|--------------|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | x | ✓ |
| large-v2 | 1550 M | x | ✓ |
| large-v3 | 1550 M | x | ✓ |
| large-v3-turbo | 809 M | x | ✓ |
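Any of these checkpoints can be loaded by swapping the repository id. A minimal sketch, assuming the Hub naming pattern `openai/whisper-<size>`, with a `.en` suffix for the English-only variants:

```python
from transformers import AutoModelForSpeechSeq2Seq

# Multilingual and English-only variants of the smallest checkpoint.
multilingual_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")
english_only_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en")
```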
## Documentation

### Additional Speed & Memory Improvements

#### Chunked Long-Form

Whisper has a 30-second receptive field. For audio longer than this, one of two long-form algorithms can be used: sequential or chunked. The sequential algorithm is preferable when transcription accuracy is the priority or when transcribing batches of long audio files; the chunked algorithm is preferable when transcription speed is the priority or when transcribing a single long audio file. Transformers uses the sequential algorithm by default; to enable the chunked algorithm, pass the `chunk_length_s` parameter to the pipeline, as in the example below.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
#### Torch compile

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])
```
#### Flash Attention 2

If your GPU supports it and you are not using `torch.compile`, we recommend using Flash Attention 2. First, install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then use the following code:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2"
)
```
#### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, use PyTorch scaled dot-product attention (SDPA). Check whether your PyTorch version is compatible:

```python
from transformers.utils import is_torch_sdpa_available

print(is_torch_sdpa_available())
```

If it returns `True`, SDPA is activated by default. If it returns `False`, upgrade your PyTorch version. You can also request SDPA explicitly:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa"
)
```
### Fine-Tuning

The pre-trained Whisper model can be fine-tuned for better performance on specific languages and tasks. Refer to the blog post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper) for a step-by-step guide.
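As a rough illustration of the setup described in that guide, the sketch below pins the target language and task on the generation config and builds typical training arguments; the language choice, output path, and hyperparameters are purely illustrative, and dataset preparation plus the `Seq2SeqTrainer` setup follow the blog post:

```python
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model_id = "openai/whisper-large-v3-turbo"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)

# Pin the language and task so the fine-tuned model does not have to re-learn them
# (Hindi transcription is just an example target).
model.generation_config.language = "hindi"
model.generation_config.task = "transcribe"

# Illustrative starting hyperparameters - adjust for your dataset and hardware.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-finetuned",  # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)
```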
### Evaluated Use

The primary intended users are AI researchers, but the model can also be useful for developers, especially for English speech recognition. Caution is needed when using the model: for example, it should not be used to transcribe recordings made without consent or for subjective classification.
### Performance and Limitations

The models show improved robustness and near-state-of-the-art accuracy, but because of their weakly supervised training they can hallucinate text that was never spoken. Performance varies across languages, and the models are prone to generating repetitive text.
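The decoding heuristics shown under Advanced Usage can reduce, though not eliminate, repetition and hallucination. A minimal sketch, assuming the `pipe` object and `sample` from the usage examples:

```python
# Temperature fallback combined with compression-ratio, log-prob, and
# no-speech thresholds helps suppress repetitive or hallucinated output.
result = pipe(
    sample,
    generate_kwargs={
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        "compression_ratio_threshold": 1.35,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6,
        "condition_on_prev_tokens": False,
    },
)
```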
## License

The model uses the MIT license.