# 🎤 Whisper Large-v3: Advanced Automatic Speech Recognition
This project offers an advanced automatic speech recognition solution based on the Whisper large-v3 model. It supports a wide range of languages and provides efficient and accurate speech transcription and translation capabilities.
## 🚀 Quick Start
## ✨ Features

- Multilingual Support: Handles a wide range of languages, including English (en), Chinese (zh), German (de), Spanish (es), and Russian (ru).
- High Performance: Shows improved performance across a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2.
- Efficient Training: Trained on 5 million hours of audio (1 million hours weakly labeled plus 4 million hours pseudo-labeled), enabling strong generalization in zero-shot settings.
- Flexible Usage: Supports both speech transcription and translation, along with various decoding strategies.
## 📦 Installation
To run the Whisper large-v3 model, first install the necessary libraries:

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
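As a quick sanity check that the installation worked, something along these lines can help (a minimal sketch; the versions you see will differ):

```python
import torch
import transformers
import datasets

# Confirm the installed versions and whether a GPU is visible.
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```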
## 💻 Usage Examples
### Basic Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Wrap the model, tokenizer, and feature extractor in an ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a long-form LibriSpeech validation set.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
### Advanced Usage

#### Transcribe a Local Audio File

```python
result = pipe("audio.mp3")
```

#### Transcribe Multiple Audio Files in Parallel

```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
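When a list of files is passed, the pipeline returns a list of result dictionaries, one per file, in the same order as the inputs.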
#### Enable Decoding Strategies

```python
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
```
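These values enable Whisper's temperature fallback: decoding starts greedily at temperature 0.0, and a segment is retried at the next higher temperature whenever its output fails the `compression_ratio_threshold` or `logprob_threshold` checks, while `no_speech_threshold` is used to flag segments that are likely silence.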
#### Specify Source Audio Language

```python
result = pipe(sample, generate_kwargs={"language": "english"})
```
#### Perform Speech Translation

```python
result = pipe(sample, generate_kwargs={"task": "translate"})
```
#### Predict Timestamps

Sentence-level timestamps:

```python
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```

Word-level timestamps:

```python
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```
### Using the Model + Processor API Directly

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Resample the audio column to the 16 kHz rate the feature extractor expects.
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```
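Here `batch_decode` returns a list with one transcription string per input sequence; with `skip_special_tokens=True` and `decode_with_timestamps=False`, both the special tokens and the timestamp markers are stripped from the decoded text.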
## 🔧 Technical Details
### Model Architecture

Whisper large-v3 has the same architecture as the previous large and large-v2 models, with two minor differences:

- The spectrogram input uses 128 Mel frequency bins instead of 80.
- A new language token was added for Cantonese (`<|yue|>`).
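Both differences can be inspected from the processor itself; the sketch below (assuming the processor from the earlier examples) is one way to check:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# feature_size is the number of Mel bins: 128 for large-v3 (vs. 80 for large-v2).
print(processor.feature_extractor.feature_size)

# The Cantonese language token is part of the large-v3 vocabulary.
print(processor.tokenizer.convert_tokens_to_ids("<|yue|>"))
```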
### Training Data

The model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. It was trained for 2.0 epochs over this mixture dataset.
### Performance Improvement

The large-v3 model shows improved performance across a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2.
## 📄 License

This project is licensed under the Apache-2.0 license.
## Additional Information
### Additional Speed & Memory Improvements
#### Chunked Long-Form

You can enable the chunked long-form algorithm to transcribe long audio files more efficiently: pass the `chunk_length_s` parameter to the pipeline and set the `batch_size` for batching.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # split the audio into 30-second chunks
    batch_size=16,      # transcribe 16 chunks in parallel
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
#### Torch compile

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.

⚠️ Important Note: `torch.compile` is currently not fully stable and may have some compatibility issues.
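As a rough sketch of how this can be wired up, following the common static-cache pattern for compiled generation (treat the exact flags as assumptions to verify against the current transformers docs):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Use a static cache so the decoder's forward pass has fixed shapes,
# then compile it; the first few calls are slow while kernels compile.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")  # run a few warm-up passes before timing
print(result["text"])
```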
### See Our Collection

- All TTS Models: Check out our collection for all our TTS model uploads.
### Unsloth Dynamic 2.0

- Superior Performance: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quants.