Open-source Whisper-large-onnx-int4-inc model: free automatic speech recognition and translation

Whisper Large Onnx Int4 Inc

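Developed by Intel

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. This repository provides an INT4 weight-only quantized version of the Whisper large model in ONNX format, powered by Intel® Neural Compressor and Intel® Extension for Transformers.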

Speech Recognition

Transformers

Open-source license: Apache-2.0 #INT4 quantization #Multi-domain ASR #Low-resource inference

Downloads: 44

Release date: October 8, 2023

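Model Overview

Whisper is a pre-trained model trained on 680,000 hours of labeled data, and it generalizes well to many datasets and domains without fine-tuning. This is the INT4-quantized version, suited to automatic speech recognition inference.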

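Model Features

INT4 quantization

The weights are quantized to INT4, which substantially reduces the model size (from 8.8 GB to 1.9 GB) while maintaining high accuracy.

ONNX format

The model is provided in ONNX format, which makes it easy to deploy and run inference on a variety of platforms.

High accuracy

The quantized model achieves a word error rate (WER) of 3.05% on the librispeech_asr dataset, nearly identical to the FP32 version (3.04%).

No fine-tuning required

The model generalizes well and adapts to many datasets and domains without fine-tuning.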

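Model Capabilities

Automatic speech recognition

Speech translation

Use Cases

Speech recognition

Speech to text

Converts audio content to text for scenarios such as meeting minutes and subtitle generation.

Word error rate: 3.05% (librispeech_asr)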

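🚀 INT4 Whisper large ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, Whisper models generalize well to many datasets and domains without fine-tuning. This repository holds the INT4 weight-only quantized Whisper large model in ONNX format, produced with Intel® Neural Compressor and Intel® Extension for Transformers.

This INT4 ONNX model is generated by the weight-only quantization method of Intel® Neural Compressor.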

📚 Detailed Documentation

Model Details

Property	Details
Model author / company	Intel
Date	October 8, 2023
Version	1
Type	Speech recognition
Paper or other resources	-
License	Apache 2.0
Questions or comments	Community tab

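Intended Use

Intended use	Description
Primary intended uses	The raw model can be used for automatic speech recognition inference
Primary intended users	Anyone running automatic speech recognition inference
Out-of-scope uses	In most cases this model needs to be fine-tuned for your particular task. It must not be used to intentionally create hostile or alienating environments for people.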

📦 Installation

Export to ONNX Model

The FP32 model is exported from openai/whisper-large with the following command.

optimum-cli export onnx --model openai/whisper-large whisper-large-with-past/ --task automatic-speech-recognition-with-past --opset 13
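
Once the export finishes, it can help to confirm that the output directory contains the three ONNX files that the quantization step on this page iterates over. The following is a minimal sketch using only the Python standard library. The helper name `missing_export_files` is hypothetical, and the file list is taken from the quantization snippet on this page, so other Optimum versions may write additional files.

```python
from pathlib import Path

# File names taken from the quantization snippet on this page.
EXPECTED_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_export_files(export_dir):
    """Return the expected ONNX file names that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Directory name matches the output path used in the export command.
    missing = missing_export_files("whisper-large-with-past")
    print("Missing files:", missing or "none")
```

An empty list means all three files are present and the directory can be passed to the quantization step.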

Install ONNX Runtime

Install onnxruntime>=1.16.0, which is required for the MatMulFpQ4 operator.
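
The version requirement can be checked before loading the model. The sketch below is an illustration, not part of the original instructions: it reads the installed version through `importlib.metadata` and compares it with 1.16.0. The helpers `parse_version` and `meets_minimum` are hypothetical names, and the distribution name `onnxruntime` is an assumption (GPU builds are published under a different package name).

```python
from importlib import metadata

MIN_ORT_VERSION = (1, 16, 0)  # minimum version stated on this page


def parse_version(text):
    """Turn a version string such as '1.16.3' into a tuple of ints (1, 16, 3)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_text, minimum=MIN_ORT_VERSION):
    """Return True when version_text is at least the required minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        print("onnxruntime is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old for MatMulFpQ4"
        print(f"onnxruntime {installed}: {status}")
```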

💻 Usage Examples

Run Quantization

Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization.

The weight-only quantization configuration is as follows.

dtype	Group size	Scheme	Algorithm
INT4	32	sym	RTN
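
To make the settings in the table concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest (RTN) quantization with 4 bits and a group size of 32. It is an illustration under stated assumptions, not the Intel® Neural Compressor implementation: the scale formula (largest absolute value in the group divided by 7) is one common convention, the function names are hypothetical, and the real library may compute scales and pack values differently.

```python
def quantize_rtn_sym(weights, bits=4, group_size=32):
    """Quantize a flat list of floats group by group with a symmetric scale.

    Assumed convention: scale = max(|w|) / qmax, zero point fixed at 0.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer level and clamp to the INT4 range.
        q_values.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return q_values, scales


def dequantize(q_values, scales, group_size=32):
    """Map integer levels back to floats using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


if __name__ == "__main__":
    weights = [i / 10 - 3.2 for i in range(64)]  # two groups of 32 values
    q, s = quantize_rtn_sym(weights)
    restored = dequantize(q, s)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"groups: {len(s)}, largest reconstruction error: {worst:.4f}")
```

Each group of 32 weights shares one scale, so the rounding error of any weight is bounded by half of its group's scale. Smaller groups track the weight distribution more closely at the cost of storing more scales.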

The key code is shown below. For the complete script, see the whisper example.

import os

from neural_compressor import quantization, PostTrainingQuantConfig

# The three ONNX files produced by the export step
model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # Weight-only INT4 settings: RTN algorithm, symmetric scheme, group size 32
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    # `dataloader` is not defined in this excerpt; the complete script constructs it
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-large-with-past", model), # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-large-onnx-int4", model)) # INT4 model path

Evaluation

Operator Statistics

The table below shows the operator statistics of the INT4 ONNX model.

Model	Operator type	Total	INT4 weights	FP32 weights
encoder_model	MatMul	256	192	64
decoder_model	MatMul	449	321	128
decoder_with_past_model	MatMul	385	257	128
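
The share of MatMul operators that ended up with INT4 weights follows directly from the table. The short sketch below only repeats that arithmetic. The numbers are copied from the table and the helper name `int4_share` is hypothetical.

```python
# (model, total MatMul, INT4-weight MatMul, FP32-weight MatMul), copied from the table
OPERATOR_STATS = [
    ("encoder_model", 256, 192, 64),
    ("decoder_model", 449, 321, 128),
    ("decoder_with_past_model", 385, 257, 128),
]


def int4_share(total, int4):
    """Percentage of MatMul operators whose weights are stored as INT4."""
    return 100.0 * int4 / total


for name, total, int4, fp32 in OPERATOR_STATS:
    # Sanity check: the two weight categories add up to the total.
    assert int4 + fp32 == total
    print(f"{name}: {int4_share(total, int4):.1f}% of MatMul operators use INT4 weights")
```

Roughly two thirds to three quarters of the MatMul operators in each file carry INT4 weights, and the remainder keep FP32 weights.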

WER Evaluation

The following code evaluates the model on the librispeech_asr dataset.

import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-large'
model_path = 'whisper-large-onnx-int4'

# The processor and config come from the FP32 checkpoint; only the weights are INT4
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Create ONNX Runtime sessions for the three INT4 model files.
# The load_model and constructor signatures below may differ between optimum versions.
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize reference and prediction the same way before scoring
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
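
For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word-level substitutions, deletions and insertions divided by the number of reference words. The sketch below implements that textbook definition in pure Python. It is an illustration only: the function names are hypothetical, and the `wer` metric loaded in the evaluation code may be implemented differently.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total edit distance over all pairs divided by total reference words."""
    errors = 0
    ref_word_count = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        ref_word_count += len(ref_words)
    return errors / ref_word_count


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sit on mat"]
    print(f"WER: {word_error_rate(refs, hyps) * 100:.2f}%")
```

Errors are summed over the whole test set before dividing, so long utterances weigh more than short ones. This is why the evaluation loop collects all references and predictions first and computes the score once at the end.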