whisper-large-v3.w4a16: open-source speech transcription model

Whisper Large V3.w4a16

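Developed by nm-testing

A quantized version of openai/whisper-large-v3 with INT4 weights and FP16 activations, intended for inference with vLLM.

Speech recognition

Transformers

English

License: Apache-2.0 #audio-to-text #INT4-quantization #low-resource-deployment

Downloads: 20

Published: 2/14/2025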

Model overview

This model is a quantized version of Whisper-large-v3. It is used mainly for speech recognition, converting audio to text.

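Model features

Efficient quantization

Weights are quantized to INT4 while activations stay in FP16, which reduces model size and memory usage.

vLLM compatible

Targets vLLM >= 0.5.2 for efficient inference.

Accuracy preserved

After quantization, recognition accuracy stays close to the original model: WER is about 12.95% for this model versus about 12.66% for the base model in the evaluation reported on this page.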

Model capabilities

Speech recognition

Audio-to-text

English transcription

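Use cases

Speech transcription

Meeting minutes

Automatically converts meeting recordings into written records.

WER (word error rate) of about 12.95% in the evaluation reported on this page.

Podcast transcription

Converts podcast audio into searchable text.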

🚀 whisper-large-v3-quantized.w4a16

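This is a quantized version of openai/whisper-large-v3 for audio-to-text tasks. Quantizing the weights shrinks the model's memory footprint, which makes it suited to deployment with vLLM. In the evaluation reported in this card, its throughput was somewhat lower than the base model's (4.81 versus 5.41 requests/s).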

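🚀 Quick start

Model overview

Model architecture: whisper-large-v3
- Input: audio-text
- Output: text
Model optimizations:
- Weight quantization: INT4
- Activations: FP16 (not quantized)
Release date: January 31, 2025
Version: 1.0
Model developers: Neural Magic

This model is a quantized version of openai/whisper-large-v3. With its weights quantized to the INT4 data type, it can be run for inference with vLLM >= 0.5.2.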
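As a rough illustration of what INT4 weights mean for memory, the sketch below estimates weight storage at FP16 and at 4 bits with one FP16 scale per group. The parameter count (about 1.55 billion for Whisper large) and the group size of 128 (suggested by the `-W4A16-G128` suffix of the save directory in the creation script) are assumptions, not figures stated in this card. The estimate also treats every parameter as quantized, whereas the recipe only targets Linear layers, so the real checkpoint will be larger than the INT4 figure.

```python
def bits_per_weight(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Average storage per weight: the code itself plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


def weight_gib(num_params: float, bits: float) -> float:
    """Convert a parameter count and a per-weight bit width to GiB."""
    return num_params * bits / 8 / 2**30


# Assumption: approximate parameter count of Whisper large; not stated in this card.
ASSUMED_PARAMS = 1.55e9
# Assumption: group size inferred from the "-G128" suffix of the save directory.
ASSUMED_GROUP_SIZE = 128

fp16_gib = weight_gib(ASSUMED_PARAMS, 16)
int4_gib = weight_gib(ASSUMED_PARAMS, bits_per_weight(4, ASSUMED_GROUP_SIZE))

print(f"FP16 weights: {fp16_gib:.2f} GiB")
print(f"INT4 weights (group size {ASSUMED_GROUP_SIZE}): {int4_gib:.2f} GiB")
print(f"Reduction factor: {fp16_gib / int4_gib:.2f}x")
```

Under these assumptions the per-weight cost is 4.125 bits rather than 4, because each group of 128 weights also stores one 16-bit scale.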

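Model optimization

This model was obtained by quantizing the weights of openai/whisper-large-v3 to the INT4 data type. The creation script in this card applies GPTQ through llm-compressor to the Linear layers and leaves lm_head unquantized. The result can be run for inference with vLLM >= 0.5.2.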
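To make "INT4 weights" concrete, here is a minimal sketch of group-wise symmetric quantization using plain round-to-nearest. It is not the GPTQ algorithm used to create this model: GPTQ additionally uses calibration data to compensate for rounding error. The function names and the group size of 128 are illustrative assumptions.

```python
import math
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map one group of weights to signed integer codes sharing a single scale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Deterministic toy weights standing in for one row of a Linear layer.
row = [0.05 * math.sin(i) for i in range(256)]
groups = quantize_row(row)

restored = [w for codes, scale in groups for w in dequantize_group(codes, scale)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.6f}")
```

With round-to-nearest, the error on each weight is bounded by half of its group's scale, which is why smaller groups (more scales) trade extra storage for accuracy.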

📦 Installation

No installation steps are provided.

💻 Usage examples

Basic usage

from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-large-v3.w4a16",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs
inputs = {  # Test explicit encoder/decoder prompt
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
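The example above uses a bundled test asset. For your own recordings, the same prompt structure can be built by a small helper such as the hypothetical `build_whisper_prompt` below. It assumes the audio is supplied as a `(waveform, sample_rate)` pair, which is the shape the `audio_and_sample_rate` attribute appears to provide; how you load the waveform (for example with an audio library of your choice) is left open.

```python
from typing import Any, Dict, Sequence


def build_whisper_prompt(
    waveform: Sequence[float],
    sample_rate: float,
    decoder_prompt: str = "<|startoftranscript|>",
) -> Dict[str, Any]:
    """Build an explicit encoder/decoder prompt in the same layout as the example above.

    Hypothetical helper: `waveform` is assumed to be a mono signal and is passed
    through unchanged together with its sample rate.
    """
    return {
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {"audio": (waveform, sample_rate)},
        },
        "decoder_prompt": decoder_prompt,
    }


# One second of silence at 16 kHz as a stand-in for real audio.
prompt = build_whisper_prompt([0.0] * 16000, 16000)
print(sorted(prompt.keys()))

# With a loaded model, several recordings could then be submitted together, e.g.:
#   outputs = llm.generate([prompt_a, prompt_b], SamplingParams(temperature=0.0, max_tokens=64))
```

Keeping the prompt construction in one place makes it easier to transcribe a batch of files with a single `generate` call.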

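📚 Detailed documentation

Deployment

Using vLLM

This model can be deployed efficiently with the vLLM backend, as in the basic usage example above.

Creation

This model was created with llm-compressor using the following code: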

import torch
from datasets import load_dataset
from transformers import WhisperProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# Select model and load it.
MODEL_ID = "openai/whisper-large-v3"

model = TraceableWhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(MODEL_ID)

# Configure the processor for the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")

# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)


def preprocess(example):
    return {
        "array": example["audio"]["array"],
        "sampling_rate": example["audio"]["sampling_rate"],
        "text": " " + example["text"].capitalize(),
    }


ds = ds.map(preprocess, remove_columns=ds.column_names)


# Process inputs.
def process(sample):
    inputs = processor(
        audio=sample["array"],
        sampling_rate=sample["sampling_rate"],
        text=sample["text"],
        add_special_tokens=True,
        return_tensors="pt",
    )

    inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
    inputs["decoder_input_ids"] = inputs["labels"]
    del inputs["labels"]

    return inputs


ds = ds.map(process, remove_columns=ds.column_names)


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


# Recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
sample_features = next(iter(ds))["input_features"]
sample_decoder_ids = [processor.tokenizer.prefix_tokens]
sample_input = {
    "input_features": torch.tensor(sample_features).to(model.device),
    "decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
}

output = model.generate(**sample_input, language="en")
print(processor.batch_decode(output, skip_special_tokens=True))
print("==========================================\n\n")
# that's where you have a lot of windows in the south no actually that's passive solar
# and passive solar is something that was developed and designed in the 1960s and 70s
# and it was a great thing for what it was at the time but it's not a passive house

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

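Evaluation

Base model

Total test time: 94.4606 s
Total requests: 511
Successful requests: 511
Mean latency: 53.3529 s
Median latency: 52.7258 s
95th-percentile latency: 86.5851 s
Estimated request throughput: 5.41 requests/s
Estimated throughput: 100.79 tokens/s
Word error rate (WER): 12.660815197787665%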
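For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal single-utterance version. The evaluation harness behind the numbers in this card is not described, so any text normalization or corpus-level aggregation it applies is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Multiplying the result by 100 gives a percentage comparable in form to the WER figures listed in this section.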

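W4A16

Total test time: 106.2064 s
Total requests: 511
Successful requests: 511
Mean latency: 59.7467 s
Median latency: 58.3930 s
95th-percentile latency: 97.4831 s
Estimated request throughput: 4.81 requests/s
Estimated throughput: 89.35 tokens/s
Word error rate (WER): 12.949380786341228%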
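The two result lists can be compared directly. The short script below copies the reported figures, checks that the request throughput equals successful requests divided by total test time, and prints the relative change of each metric for W4A16 against the base model. The dictionary keys and the `relative_change` helper are illustrative names.

```python
# Figures copied from the two result lists in this section.
base = {
    "total_time_s": 94.4606,
    "requests": 511,
    "req_per_s": 5.41,
    "tok_per_s": 100.79,
    "mean_latency_s": 53.3529,
    "wer": 12.660815197787665,
}
w4a16 = {
    "total_time_s": 106.2064,
    "requests": 511,
    "req_per_s": 4.81,
    "tok_per_s": 89.35,
    "mean_latency_s": 59.7467,
    "wer": 12.949380786341228,
}


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100


# Sanity check: reported request throughput should match requests / total time.
for name, run in (("base", base), ("w4a16", w4a16)):
    derived = run["requests"] / run["total_time_s"]
    print(f"{name}: derived {derived:.2f} req/s, reported {run['req_per_s']} req/s")

for key in ("req_per_s", "tok_per_s", "mean_latency_s", "wer"):
    print(f"{key}: {relative_change(w4a16[key], base[key]):+.2f}%")

print(f"absolute WER difference: {w4a16['wer'] - base['wer']:.4f} points")
```

On these figures the quantized model's WER is about 0.29 points higher than the base model's and its request throughput is about 11% lower, so in this benchmark the benefit of W4A16 lies in memory footprint rather than speed.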

BibTeX entry and citation info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

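📄 License

This model is released under the Apache 2.0 license.

Property	Details
Model type	whisper-large-v3
Training data	Not specified
Release date	January 31, 2025
Version	1.0
Model developers	Neural Magic
Base model	openai/whisper-large-v3
Library name	transformers
Weight quantization	INT4
Activation precision	FP16
Inference framework	vLLM >= 0.5.2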