🚀 INT4 Whisper large ONNX Model
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models generalize well to many datasets and domains without fine-tuning. This repository contains the INT4 weight-only quantization of the Whisper large model in ONNX format, generated with the weight-only quantization method of Intel® Neural Compressor and powered by Intel® Extension for Transformers.
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Model Authors - Company | Intel |
| Date | October 8, 2023 |
| Version | 1 |
| Model Type | Speech Recognition |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | Community Tab |
Intended Use
| Property | Details |
|---|---|
| Primary intended uses | You can use the raw model for automatic speech recognition inference |
| Primary intended users | Anyone doing automatic speech recognition inference |
| Out-of-scope uses | In most cases, this model will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |
📦 Installation
Export to ONNX Model
The FP32 model is exported with openai/whisper-large:

```bash
optimum-cli export onnx --model openai/whisper-large whisper-large-with-past/ --task automatic-speech-recognition-with-past --opset 13
```
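This step is not part of the original card, but as an optional sanity check (assuming the `whisper-large-with-past/` export directory above and a LibriSpeech validation sample), the exported FP32 model can be loaded with Optimum and used to transcribe a short clip before quantization:

```python
from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
# Load the ONNX encoder/decoder models exported by optimum-cli above.
model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper-large-with-past/")

sample = load_dataset("librispeech_asr", "clean", split="validation")[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"],
                           return_tensors="pt").input_features
# Transcribe one utterance to confirm the export works end to end.
print(processor.decode(model.generate(input_features)[0], skip_special_tokens=True))
```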
Install ONNX Runtime
Install `onnxruntime>=1.16.0` to support the `MatMulFpQ4` operator, e.g. `pip install "onnxruntime>=1.16.0"`.
💻 Usage Examples
Run Quantization
Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization.
The weight-only quantization configuration is shown below:
| dtype | group_size | scheme | algorithm |
|---|---|---|---|
| INT4 | 32 | sym | RTN |
We provide the key code below; for the complete script, please refer to the whisper example.
```python
import os

from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # RTN INT4 weight-only quantization: 4-bit symmetric weights, group size 32.
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-large-with-past", model),
        config,
        calib_dataloader=dataloader)  # calibration dataloader (see the complete example)
    q_model.save(os.path.join("/path/to/whisper-large-onnx-int4", model))
```
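The snippet above references a `dataloader` that is not defined here; the complete whisper example builds calibration dataloaders for all three sub-models from real audio data. As an illustrative sketch only (the class name, the `input_features` key, and the LibriSpeech split are assumptions, and this covers just the encoder model), Neural Compressor accepts a user-defined dataloader that exposes a `batch_size` attribute and yields `(inputs, label)` pairs:

```python
from datasets import load_dataset
from transformers import WhisperProcessor

class CalibDataloader:
    """Minimal calibration dataloader sketch for the encoder model."""
    def __init__(self, batch_size=1, num_samples=8):
        self.batch_size = batch_size  # required by Neural Compressor
        processor = WhisperProcessor.from_pretrained("openai/whisper-large")
        dataset = load_dataset("librispeech_asr", "clean", split="validation")
        self.samples = []
        for example in dataset.select(range(num_samples)):
            audio = example["audio"]
            features = processor(audio["array"],
                                 sampling_rate=audio["sampling_rate"],
                                 return_tensors="np").input_features
            # Inputs are keyed by the ONNX model's input names.
            self.samples.append(({"input_features": features}, 0))

    def __iter__(self):
        yield from self.samples

dataloader = CalibDataloader()
```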
Evaluation
Operator Statistics
The table below shows the operator statistics of the INT4 ONNX model:
| Model | Op Type | Total | INT4 weight | FP32 weight |
|---|---|---|---|---|
| encoder_model | MatMul | 256 | 192 | 64 |
| decoder_model | MatMul | 449 | 321 | 128 |
| decoder_with_past_model | MatMul | 385 | 257 | 128 |
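The counts above are reported by Intel® Neural Compressor. As a rough cross-check (an assumption, not the original tooling), and assuming the quantized matrix multiplications appear as `MatMulFpQ4` nodes (as implied by the ONNX Runtime requirement above) while the remaining ones stay as `MatMul`, the node types can be tallied with the `onnx` package:

```python
import os
from collections import Counter

import onnx

model_path = "whisper-large-onnx-int4"
for name in ["encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"]:
    # load_external_data=False keeps the load light; only the graph structure is needed.
    graph = onnx.load(os.path.join(model_path, name), load_external_data=False).graph
    counts = Counter(node.op_type for node in graph.node)
    print(f"{name}: MatMulFpQ4 (INT4) = {counts['MatMulFpQ4']}, MatMul (FP32) = {counts['MatMul']}")
```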
Evaluation of WER
Evaluate the model on the librispeech_asr dataset with the code below:
```python
import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-large'
model_path = 'whisper-large-onnx-int4'
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)

wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Load the three INT4 ONNX sub-models into ONNX Runtime sessions.
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
    os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"],
                               return_tensors="pt").input_features
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```
📊 Metrics (Model Performance)
| Model | Model Size (GB) | WER (%) |
|---|---|---|
| FP32 | 8.8 | 3.04 |
| INT4 | 1.9 | 3.05 |
📄 License
This model is licensed under the Apache 2.0 license.