Whisper-large-v2-onnx-int4-inc: an open-source speech recognition model for high-accuracy, free speech recognition and translation

Whisper Large V2 Onnx Int4 Inc

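Developed by Intel

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680,000 hours of labeled data and generalizes well. This repository contains the INT4 weight-only quantized version of the Whisper large v2 model in ONNX format.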

Speech recognition

Transformers

Open-source license: Apache-2.0 #low-precision quantization #multilingual ASR #efficient inference

Downloads: 19

Release date: October 8, 2023

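Model overview

Whisper is a strong automatic speech recognition and speech translation model that adapts to many datasets and domains without fine-tuning. This model is the INT4 weight-only quantized version, produced with Intel® Neural Compressor.

Model features

INT4 weight-only quantization

Only the weights are quantized to INT4, which greatly reduces model size while maintaining high recognition accuracy.

Strong generalization

Trained on 680,000 hours of labeled data, the model adapts to many datasets and domains without fine-tuning.

ONNX format

The model is provided in ONNX format, which makes deployment and inference straightforward across a range of platforms.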

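Model capabilities

Automatic speech recognition

Speech translation

Use cases

Speech recognition

Speech to text

Converts spoken content into text, for scenarios such as meeting minutes and subtitle generation.

Word error rate as low as 2.99%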

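🚀 INT4 Whisper large-v2 ONNX model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, Whisper models generalize well to many datasets and domains without fine-tuning. This repository holds the INT4 weight-only quantized Whisper large v2 model in ONNX format, built with Intel® Neural Compressor and Intel® Extension for Transformers.

This INT4 ONNX model was generated with the weight-only quantization method of Intel® Neural Compressor.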

🚀 Quick start

Model details

Model detail	Description
Model author - company	Intel
Date	October 8, 2023
Version	1
Type	Speech recognition
Paper or other resources	-
License	Apache 2.0
Questions or comments	Community tab

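Intended use

Intended use	Description
Primary intended uses	The raw model can be used for automatic speech recognition inference
Primary intended users	Anyone running automatic speech recognition inference
Out-of-scope uses	In most cases, this model will need to be fine-tuned for your particular task. It must not be used to intentionally create hostile or alienating environments for people.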

Exporting to an ONNX model

The FP32 model is exported from openai/whisper-large-v2:

optimum-cli export onnx --model openai/whisper-large-v2 whisper-large-v2-with-past/ --task automatic-speech-recognition-with-past --opset 13

Installing ONNX Runtime

Install onnxruntime>=1.16.0 to get support for the MatMulFpQ4 operator.
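
Before loading the INT4 model, it can help to confirm that the installed ONNX Runtime meets this minimum. The following is a minimal sketch that uses only the Python standard library. The helper names `parse_version`, `meets_minimum`, and `check_onnxruntime` are hypothetical, and the parser is simplified: it reads only the leading digits of the first three version components, so a pre-release such as `1.16.0rc1` is treated like `1.16.0`.

```python
from importlib import metadata


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as "1.16" to three components
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="1.16.0"):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_onnxruntime(minimum="1.16.0"):
    """True if an onnxruntime distribution is installed and recent enough."""
    try:
        installed = metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    print("onnxruntime >= 1.16.0:", check_onnxruntime())
```

The check looks up the distribution named `onnxruntime` only; a differently named build (for example a GPU package) would need its own distribution name passed to `metadata.version`.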

Running quantization

Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization.

The weight-only quantization configuration is as follows:

Data type	Group size	Scheme	Algorithm
INT4	32	sym	RTN
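
To make these settings concrete, here is a minimal pure-Python sketch of round-to-nearest (RTN) symmetric quantization to 4 bits with a group size of 32. It is an illustration under stated assumptions, not Intel® Neural Compressor's implementation: the scale is assumed to map the largest absolute weight in each group to 7, and the helper names `quantize_group_rtn_sym`, `quantize_rtn_sym`, and `dequantize` are hypothetical. The real tool may differ in details such as rounding, clipping, and how scales are stored.

```python
def quantize_group_rtn_sym(weights, bits=4):
    """Quantize one group of floats to signed integers with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Assumption: symmetric scheme, no zero point, largest magnitude maps to qmax
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to nearest (Python rounds exact ties to even), then clip to the INT4 range
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_rtn_sym(weights, group_size=32, bits=4):
    """Split a flat weight list into groups and quantize each group independently."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        groups.append(quantize_group_rtn_sym(group, bits))
    return groups


def dequantize(groups):
    """Reconstruct approximate float weights from (integers, scale) pairs."""
    restored = []
    for quantized, scale in groups:
        restored.extend(value * scale for value in quantized)
    return restored


if __name__ == "__main__":
    # Synthetic weights, for illustration only
    weights = [((i * 37) % 101 - 50) / 25.0 for i in range(64)]
    groups = quantize_rtn_sym(weights)
    restored = dequantize(groups)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"groups: {len(groups)}, worst absolute error: {worst:.4f}")
```

Because each group of 32 weights has its own scale, a single outlier affects the precision of only its own group. Under the scale assumption above, the reconstruction error of any weight is at most half of its group's scale.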

The key code is shown below. For the complete script, see the whisper example.

import os

from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

# The three ONNX graphs produced by the export step
model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # Weight-only INT4 quantization: RTN algorithm, symmetric scheme, group size 32
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    # `dataloader` is not defined in this excerpt; it must be created beforehand (see the complete script)
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-large-v2-with-past", model), # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-large-v2-onnx-int4", model)) # INT4 model path

Evaluation

Operator statistics

The table below shows the operator statistics of the INT4 ONNX model.

Model	Operator type	Total	INT4 weights	FP32 weights
encoder_model	MatMul	256	192	64
decoder_model	MatMul	449	321	128
decoder_with_past_model	MatMul	385	257	128
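
The share of MatMul operators that received INT4 weights can be worked out directly from these rows. The short sketch below uses the numbers from the table; the helper name `int4_share` is hypothetical, and the sanity check assumes that the INT4 and FP32 columns add up to the total, which holds for all three rows.

```python
# (total MatMul, INT4 weights, FP32 weights), copied from the table above
stats = {
    "encoder_model": (256, 192, 64),
    "decoder_model": (449, 321, 128),
    "decoder_with_past_model": (385, 257, 128),
}


def int4_share(total, int4, fp32):
    """Fraction of MatMul operators whose weights were quantized to INT4."""
    if int4 + fp32 != total:
        raise ValueError("INT4 and FP32 counts do not add up to the total")
    return int4 / total


if __name__ == "__main__":
    for name, counts in stats.items():
        print(f"{name}: {int4_share(*counts):.1%} of MatMul operators use INT4 weights")
```

Roughly two thirds to three quarters of the MatMul operators in each graph carry INT4 weights; the rest stay in FP32.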

WER evaluation

Use the following code to evaluate the model on the librispeech_asr dataset.

import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-large-v2'
model_path = 'whisper-large-v2-onnx-int4'

# The processor (feature extractor and tokenizer) and config come from the original checkpoint
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Load the three INT4 ONNX graphs as ONNX Runtime sessions
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize both reference and prediction so casing and punctuation do not count as errors
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

# Corpus-level word error rate, reported as a percentage
wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
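
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the number of reference words. The following pure-Python sketch shows that computation over a small corpus. It is a stand-in for illustration and not the code behind `load("wer")`; the function names `word_edit_distance` and `corpus_wer` are hypothetical, and splitting on whitespace is an assumption about tokenization.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitute, insert, delete cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words across all utterances."""
    errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        errors += word_edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return errors / total_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "good morning"]
    hyps = ["the cat sat on a mat", "good morning"]
    print(f"WER: {corpus_wer(refs, hyps) * 100:.2f}%")  # one substitution over eight words
```

Errors are pooled over the whole corpus before dividing, so long utterances weigh more than short ones; averaging per-utterance rates would give a different number.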