🚀 INT4 Whisper large ONNX Model
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models generalize well to many datasets and domains without fine-tuning. This repository contains the INT4 weight-only quantization of the Whisper large model in ONNX format, generated with the weight-only quantization method of Intel® Neural Compressor and powered by Intel® Extension for Transformers.
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Model Authors - Company | Intel |
| Date | October 8, 2023 |
| Version | 1 |
| Model Type | Speech Recognition |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | Community Tab |
Intended Use
| Property | Details |
|---|---|
| Primary intended uses | You can use the raw model for automatic speech recognition inference |
| Primary intended users | Anyone doing automatic speech recognition inference |
| Out-of-scope uses | In most cases, this model will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
📦 Installation
Export to ONNX Model
The FP32 model is exported with openai/whisper-large:

```bash
optimum-cli export onnx --model openai/whisper-large whisper-large-with-past/ --task automatic-speech-recognition-with-past --opset 13
```
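This step is not part of the original card, but as an optional sanity check (assuming the `whisper-large-with-past/` export directory above and a LibriSpeech validation sample), the exported FP32 model can be loaded with Optimum and used to transcribe a short clip before quantization:

```python
from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
# Load the ONNX encoder/decoder models exported by optimum-cli above.
model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper-large-with-past/")

sample = load_dataset("librispeech_asr", "clean", split="validation")[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"],
                           return_tensors="pt").input_features
# Transcribe one utterance to confirm the export works end to end.
print(processor.decode(model.generate(input_features)[0], skip_special_tokens=True))
```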
Install ONNX Runtime
Install `onnxruntime>=1.16.0` to support the `MatMulFpQ4` operator, e.g. `pip install "onnxruntime>=1.16.0"`.
💻 Usage Examples
Run Quantization
Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization.
The weight-only quantization configuration is shown below:
| dtype | group_size | scheme | algorithm |
|---|---|---|---|
| INT4 | 32 | sym | RTN |
We provide the key code below; for the complete script, please refer to the whisper example.
```python
import os

from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # RTN INT4 weight-only quantization: 4-bit symmetric weights, group size 32.
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-large-with-past", model),
        config,
        calib_dataloader=dataloader)  # calibration dataloader (see the complete example)
    q_model.save(os.path.join("/path/to/whisper-large-onnx-int4", model))
```
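The snippet above references a `dataloader` that is not defined here; the complete whisper example builds calibration dataloaders for all three sub-models from real audio data. As an illustrative sketch only (the class name, the `input_features` key, and the LibriSpeech split are assumptions, and this covers just the encoder model), Neural Compressor accepts a user-defined dataloader that exposes a `batch_size` attribute and yields `(inputs, label)` pairs:

```python
from datasets import load_dataset
from transformers import WhisperProcessor

class CalibDataloader:
    """Minimal calibration dataloader sketch for the encoder model."""
    def __init__(self, batch_size=1, num_samples=8):
        self.batch_size = batch_size  # required by Neural Compressor
        processor = WhisperProcessor.from_pretrained("openai/whisper-large")
        dataset = load_dataset("librispeech_asr", "clean", split="validation")
        self.samples = []
        for example in dataset.select(range(num_samples)):
            audio = example["audio"]
            features = processor(audio["array"],
                                 sampling_rate=audio["sampling_rate"],
                                 return_tensors="np").input_features
            # Inputs are keyed by the ONNX model's input names.
            self.samples.append(({"input_features": features}, 0))

    def __iter__(self):
        yield from self.samples

dataloader = CalibDataloader()
```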
Evaluation
Operator Statistics
The table below shows the operator statistics of the INT4 ONNX model:
| Model | Op Type | Total | INT4 weight | FP32 weight |
|---|---|---|---|---|
| encoder_model | MatMul | 256 | 192 | 64 |
| decoder_model | MatMul | 449 | 321 | 128 |
| decoder_with_past_model | MatMul | 385 | 257 | 128 |
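The counts above are reported by Intel® Neural Compressor. As a rough cross-check (an assumption, not the original tooling), and assuming the quantized matrix multiplications appear as `MatMulFpQ4` nodes (as implied by the ONNX Runtime requirement above) while the remaining ones stay as `MatMul`, the node types can be tallied with the `onnx` package:

```python
import os
from collections import Counter

import onnx

model_path = "whisper-large-onnx-int4"
for name in ["encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"]:
    # load_external_data=False keeps the load light; only the graph structure is needed.
    graph = onnx.load(os.path.join(model_path, name), load_external_data=False).graph
    counts = Counter(node.op_type for node in graph.node)
    print(f"{name}: MatMulFpQ4 (INT4) = {counts['MatMulFpQ4']}, MatMul (FP32) = {counts['MatMul']}")
```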
Evaluation of WER
Evaluate the model on the librispeech_asr dataset with the code below:
```python
import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-large'
model_path = 'whisper-large-onnx-int4'
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)

wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Load the three INT4 ONNX sub-models into ONNX Runtime sessions.
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
    os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"],
                               return_tensors="pt").input_features
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```
📊 Metrics (Model Performance)
| Model | Model Size (GB) | WER (%) |
|---|---|---|
| FP32 | 8.8 | 3.04 |
| INT4 | 1.9 | 3.05 |
📄 License
This model is licensed under the Apache 2.0 license.