Whisper Large V3.w4a16
This is the quantized version of openai/whisper-large-v3, employing INT4 weight quantization and FP16 activation quantization, suitable for vLLM inference.
Downloads 20
Release Time : 2/14/2025
Model Overview
This model is a quantized version of Whisper-large-v3, primarily used for speech recognition tasks to convert audio into text.
Model Features
Efficient Quantization
Utilizes INT4 weight quantization and FP16 activation quantization, significantly reducing model size and memory usage.
vLLM Compatibility
Optimized for vLLM >= 0.5.2, enabling efficient inference.
High Accuracy Retention
Maintains recognition accuracy close to the original model after quantization.
Model Capabilities
Speech Recognition
Audio to Text
English Transcription
Use Cases
Speech Transcription
Meeting Minutes
Automatically convert meeting recordings into text transcripts
WER (Word Error Rate) approximately 12.95%
Podcast Transcription
Convert podcast audio content into searchable text
đ whisper-large-v3-quantized.w4a16
A quantized version of openai/whisper-large-v3, optimized for efficient audio-to-text transcription.
đ Quick Start
This is a quantized version of the openai/whisper-large-v3 model. It's optimized for efficient inference using the vLLM backend.
⨠Features
- Model Architecture: whisper-large-v3
- Input: Audio-Text
- Output: Text
- Model Optimizations:
- Weight quantization: INT4
- Activation quantization: FP16
- Release Date: 1/31/2025
- Version: 1.0
- Model Developers: Neural Magic
đĻ Installation
This model can be used with the vLLM backend for efficient deployment.
đģ Usage Examples
Basic Usage
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams
# prepare model
llm = LLM(
model="neuralmagic/whisper-large-v3.w4a16",
max_model_len=448,
max_num_seqs=400,
limit_mm_per_prompt={"audio": 1},
)
# prepare inputs
inputs = { # Test explicit encoder/decoder prompt
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": AudioAsset("winning_call").audio_and_sample_rate,
},
},
"decoder_prompt": "<|startoftranscript|>",
}
# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
Advanced Usage
This model was created with llm-compressor by running the code snippet below as part a multimodal announcement blog.
import torch
from datasets import load_dataset
from transformers import WhisperProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration
# Select model and load it.
MODEL_ID = "openai/whisper-large-v3"
model = TraceableWhisperForConditionalGeneration.from_pretrained(
MODEL_ID,
device_map="auto",
torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(MODEL_ID)
# Configure processor the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")
# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"
# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
# Load dataset and preprocess.
ds = load_dataset(
DATASET_ID,
DATASET_SUBSET,
split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
trust_remote_code=True,
)
def preprocess(example):
return {
"array": example["audio"]["array"],
"sampling_rate": example["audio"]["sampling_rate"],
"text": " " + example["text"].capitalize(),
}
ds = ds.map(preprocess, remove_columns=ds.column_names)
# Process inputs.
def process(sample):
inputs = processor(
audio=sample["array"],
sampling_rate=sample["sampling_rate"],
text=sample["text"],
add_special_tokens=True,
return_tensors="pt",
)
inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
inputs["decoder_input_ids"] = inputs["labels"]
del inputs["labels"]
return inputs
ds = ds.map(process, remove_columns=ds.column_names)
# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
assert len(batch) == 1
return {key: torch.tensor(value) for key, value in batch[0].items()}
# Recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
# Apply algorithms.
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
data_collator=data_collator,
)
# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
sample_features = next(iter(ds))["input_features"]
sample_decoder_ids = [processor.tokenizer.prefix_tokens]
sample_input = {
"input_features": torch.tensor(sample_features).to(model.device),
"decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
}
output = model.generate(**sample_input, language="en")
print(processor.batch_decode(output, skip_special_tokens=True))
print("==========================================\n\n")
# that's where you have a lot of windows in the south no actually that's passive solar
# and passive solar is something that was developed and designed in the 1960s and 70s
# and it was a great thing for what it was at the time but it's not a passive house
# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
đ Documentation
Evaluation
Base Model
Total Test Time: 94.4606 seconds
Total Requests: 511
Successful Requests: 511
Average Latency: 53.3529 seconds
Median Latency: 52.7258 seconds
95th Percentile Latency: 86.5851 seconds
Estimated req_Throughput: 5.41 requests/s
Estimated Throughput: 100.79 tok/s
WER: 12.660815197787665
W4A16
Total Test Time: 106.2064 seconds
Total Requests: 511
Successful Requests: 511
Average Latency: 59.7467 seconds
Median Latency: 58.3930 seconds
95th Percentile Latency: 97.4831 seconds
Estimated req_Throughput: 4.81 requests/s
Estimated Throughput: 89.35 tok/s
WER: 12.949380786341228
BibTeX entry and citation info
@misc{radford2022whisper,
doi = {10.48550/ARXIV.2212.04356},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
đ License
This model is licensed under the Apache 2.0 license.
Voice Activity Detection
MIT
Voice activity detection model based on pyannote.audio 2.1, used to identify speech activity segments in audio
Speech Recognition
V
pyannote
7.7M
181
Wav2vec2 Large Xlsr 53 Portuguese
Apache-2.0
This is a fine-tuned XLSR-53 large model for Portuguese speech recognition tasks, trained on the Common Voice 6.1 dataset, supporting Portuguese speech-to-text conversion.
Speech Recognition Other
W
jonatasgrosman
4.9M
32
Whisper Large V3
Apache-2.0
Whisper is an advanced automatic speech recognition (ASR) and speech translation model proposed by OpenAI, trained on over 5 million hours of labeled data, with strong cross-dataset and cross-domain generalization capabilities.
Speech Recognition Supports Multiple Languages
W
openai
4.6M
4,321
Whisper Large V3 Turbo
MIT
Whisper is a state-of-the-art automatic speech recognition (ASR) and speech translation model developed by OpenAI, trained on over 5 million hours of labeled data, demonstrating strong generalization capabilities in zero-shot settings.
Speech Recognition
Transformers Supports Multiple Languages

W
openai
4.0M
2,317
Wav2vec2 Large Xlsr 53 Russian
Apache-2.0
A Russian speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampled audio input
Speech Recognition Other
W
jonatasgrosman
3.9M
54
Wav2vec2 Large Xlsr 53 Chinese Zh Cn
Apache-2.0
A Chinese speech recognition model fine-tuned based on facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampling rate audio input.
Speech Recognition Chinese
W
jonatasgrosman
3.8M
110
Wav2vec2 Large Xlsr 53 Dutch
Apache-2.0
A Dutch speech recognition model fine-tuned based on facebook/wav2vec2-large-xlsr-53, trained on the Common Voice and CSS10 datasets, supporting 16kHz audio input.
Speech Recognition Other
W
jonatasgrosman
3.0M
12
Wav2vec2 Large Xlsr 53 Japanese
Apache-2.0
Japanese speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampling rate audio input
Speech Recognition Japanese
W
jonatasgrosman
2.9M
33
Mms 300m 1130 Forced Aligner
A text-to-audio forced alignment tool based on Hugging Face pre-trained models, supporting multiple languages with high memory efficiency
Speech Recognition
Transformers Supports Multiple Languages

M
MahmoudAshraf
2.5M
50
Wav2vec2 Large Xlsr 53 Arabic
Apache-2.0
Arabic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on Common Voice and Arabic speech corpus
Speech Recognition Arabic
W
jonatasgrosman
2.3M
37
Featured Recommended AI Models
Š 2025AIbase