whisper-large-v3.w4a16: Open-Source Speech Transcription Model

Whisper Large V3.w4a16

Developed by nm-testing

A quantized version of openai/whisper-large-v3 that uses INT4 weights and keeps activations in FP16, intended for inference with vLLM.

Speech Recognition

Transformers

English | Open-source license: Apache-2.0 | #speech-to-text #INT4 quantization #low-resource deployment

Downloads: 20

Release date: 2/14/2025

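Model Overview

This model is a quantized version of Whisper-large-v3. It is used mainly for speech recognition, that is, converting audio into text.

Model Features

Efficient quantization

Weights are quantized to INT4 while activations stay in FP16, which substantially reduces model size and memory usage.

vLLM compatible

Targets vLLM >= 0.5.2 for efficient inference.

Accuracy preserved

After quantization, recognition accuracy stays close to the original model (WER of about 12.95% versus about 12.66% for the base model in the evaluation on this page).

Model Capabilities

Speech recognition

Speech-to-text conversion

English transcription

Use Cases

Audio transcription

Meeting minutes

Automatically converts meeting recordings into a written record.

WER (word error rate): about 12.95%

Podcast transcription

Converts podcast audio into searchable text.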
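
The WER figure quoted above is a word-level edit distance divided by the number of reference words. A minimal sketch of that metric in plain Python follows. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions made for illustration; the evaluation reported on this page may have normalized text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one dropped word ("the")
# against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {example_wer:.2%}")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the 12.95% above.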

🚀 whisper-large-v3-quantized.w4a16

This is a speech recognition model that converts audio data into text. Its weights are quantized to make inference more efficient.

🚀 Quick Start

This model is a quantized version of openai/whisper-large-v3. The sections below cover the model overview, deployment, how the model was created, and the evaluation results.

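✨ Key Features

Quantization-based optimization: weights are quantized to INT4 and activations are kept in FP16, which makes inference more efficient.
Fast inference with vLLM: the model runs on the vLLM backend.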
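
To give a rough sense of what INT4 weights save, the sketch below estimates weight storage for FP16 versus INT4. Everything numeric here is an assumption rather than a measurement: the parameter count of about 1.55 billion, the group size of 128 (suggested only by the "G128" suffix of the save directory in the quantization script), and one FP16 scale per group. Because the recipe quantizes only `Linear` layers and ignores `lm_head`, the INT4 figure is an optimistic lower bound on the real checkpoint size.

```python
# Assumed values for illustration only (not taken from a measured checkpoint).
PARAMS = 1.55e9   # approximate parameter count of whisper-large-v3
GROUP_SIZE = 128  # assumed quantization group size
SCALE_BITS = 16   # assumed: one FP16 scale stored per group


def weight_bytes(n_params, bits_per_weight, group_size=None, scale_bits=SCALE_BITS):
    """Estimate storage for weights, plus per-group scales if grouped."""
    total_bits = n_params * bits_per_weight
    if group_size is not None:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8


fp16_gb = weight_bytes(PARAMS, 16) / 1e9
int4_gb = weight_bytes(PARAMS, 4, GROUP_SIZE) / 1e9

print(f"FP16 weights: {fp16_gb:.2f} GB")
print(f"INT4 weights: {int4_gb:.2f} GB (if every weight were quantized)")
print(f"Ratio       : {fp16_gb / int4_gb:.2f}x")
```

The estimate covers weights only; activations, the KV cache, and runtime buffers are not included.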

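📦 Installation

Install the required libraries with the command below. The quantization script later on this page additionally imports llmcompressor, which this command does not install (pip install llmcompressor).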

pip install vllm transformers datasets torch

💻 Usage Examples

Basic usage

from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-large-v3.w4a16",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs
inputs = {  # Test explicit encoder/decoder prompt
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
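
To transcribe several clips in one call, the prompt dictionary used above can be produced by a small helper. This is a sketch under stated assumptions: `build_whisper_prompt` and `build_batch` are hypothetical names, the dictionary layout simply mirrors the example above, and the dummy silent clips stand in for real audio arrays.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_whisper_prompt(audio: Any, sample_rate: int) -> Dict[str, Any]:
    """Build one encoder/decoder prompt in the layout used in the example above."""
    return {
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {"audio": (audio, sample_rate)},
        },
        "decoder_prompt": "<|startoftranscript|>",
    }


def build_batch(clips: Iterable[Tuple[Any, int]]) -> List[Dict[str, Any]]:
    """Turn (audio, sample_rate) pairs into a list of prompts."""
    return [build_whisper_prompt(audio, rate) for audio, rate in clips]


# Dummy clips (plain lists of zeros) used only to show the structure;
# real inputs would be audio sample arrays with their sample rates.
clips = [([0.0] * 16000, 16000), ([0.0] * 8000, 16000)]
prompts = build_batch(clips)
print(len(prompts), prompts[0]["decoder_prompt"])

# With the `llm` object from the example above, the list can be passed
# to generate() in a single call:
# outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
```

Each prompt carries exactly one audio item, which matches the `limit_mm_per_prompt={"audio": 1}` setting in the example above.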

Advanced usage (reproducing the quantization with llm-compressor)

import torch
from datasets import load_dataset
from transformers import WhisperProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# Select model and load it.
MODEL_ID = "openai/whisper-large-v3"

model = TraceableWhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(MODEL_ID)

# Configure the processor for the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")

# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)


def preprocess(example):
    return {
        "array": example["audio"]["array"],
        "sampling_rate": example["audio"]["sampling_rate"],
        "text": " " + example["text"].capitalize(),
    }


ds = ds.map(preprocess, remove_columns=ds.column_names)


# Process inputs.
def process(sample):
    inputs = processor(
        audio=sample["array"],
        sampling_rate=sample["sampling_rate"],
        text=sample["text"],
        add_special_tokens=True,
        return_tensors="pt",
    )

    inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
    inputs["decoder_input_ids"] = inputs["labels"]
    del inputs["labels"]

    return inputs


ds = ds.map(process, remove_columns=ds.column_names)


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


# Recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
sample_features = next(iter(ds))["input_features"]
sample_decoder_ids = [processor.tokenizer.prefix_tokens]
sample_input = {
    "input_features": torch.tensor(sample_features).to(model.device),
    "decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
}

output = model.generate(**sample_input, language="en")
print(processor.batch_decode(output, skip_special_tokens=True))
print("==========================================\n\n")
# that's where you have a lot of windows in the south no actually that's passive solar
# and passive solar is something that was developed and designed in the 1960s and 70s
# and it was a great thing for what it was at the time but it's not a passive house

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

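📚 Documentation

Model Overview

Attribute	Details
Model type	whisper-large-v3
Input	Audio-Text
Output	Text
Model optimizations	Weight quantization (INT4); activations kept in FP16
Release date	January 31, 2025
Version	1.0
Model developers	Neural Magic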

Model Optimizations

This model was obtained by quantizing the weights of openai/whisper-large-v3 to the INT4 data type. It supports inference with vLLM >= 0.5.2.
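
As an illustration of what storing weights as INT4 means, the sketch below applies plain round-to-nearest, symmetric, group-wise quantization to one row of weights. It is not the GPTQ algorithm used to produce this model (GPTQ additionally compensates rounding error using calibration data), and the group size of 128 and the helper names are assumptions for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128):
    """Quantize a weight row group by group; each group gets its own scale."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate weights from integer codes and scales."""
    return [code * scale for codes, scale in groups for code in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_err:.6f}")
```

With round-to-nearest, the reconstruction error of each weight is at most half of its group's scale, which is why smaller groups (and therefore tighter scales) tend to preserve accuracy better at the cost of storing more scales.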

Deployment

This model can be deployed efficiently using the vLLM backend.

Creation

This model was created with llm-compressor. See the "Usage Examples" section for the full code.

Evaluation

Base model

Total Test Time: 94.4606 seconds
Total Requests: 511
Successful Requests: 511
Average Latency: 53.3529 seconds
Median Latency: 52.7258 seconds
95th Percentile Latency: 86.5851 seconds
Estimated req_Throughput: 5.41 requests/s
Estimated Throughput: 100.79 tok/s
WER: 12.660815197787665

W4A16

Total Test Time: 106.2064 seconds
Total Requests: 511
Successful Requests: 511
Average Latency: 59.7467 seconds
Median Latency: 58.3930 seconds
95th Percentile Latency: 97.4831 seconds
Estimated req_Throughput: 4.81 requests/s
Estimated Throughput: 89.35 tok/s
WER: 12.949380786341228
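
The two result blocks above can be compared directly. The short sketch below copies the reported figures into dictionaries (the key names are chosen here for illustration) and computes the differences. In this particular run, the W4A16 model shows a WER about 0.29 points higher and a token throughput roughly 11% lower than the base model.

```python
# Figures copied from the two result blocks above.
base = {
    "wer": 12.660815197787665,
    "req_per_s": 5.41,
    "tok_per_s": 100.79,
    "avg_latency_s": 53.3529,
}
w4a16 = {
    "wer": 12.949380786341228,
    "req_per_s": 4.81,
    "tok_per_s": 89.35,
    "avg_latency_s": 59.7467,
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100


wer_delta = w4a16["wer"] - base["wer"]
print(f"WER difference (points): {wer_delta:+.2f}")
for key in ("req_per_s", "tok_per_s", "avg_latency_s"):
    print(f"{key}: {relative_change(w4a16[key], base[key]):+.1f}%")
```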

BibTeX entry and citation info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}