whisper-large-v3.w4a16: open-source speech transcription model

Whisper Large V3.w4a16

Developed by nm-testing

A quantized version of openai/whisper-large-v3 with INT4 weights and FP16 activations, intended for inference with vLLM.

Speech Recognition

Transformers

Language: English  License: Apache-2.0  #audio-to-text #INT4-quantization #low-resource-deployment

Downloads: 20

Published: 2/14/2025

Model Introduction

This model is a quantized version of Whisper-large-v3. It is used for speech recognition, converting audio to text.

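Model Features

Efficient quantization

Weights are quantized to INT4 while activations stay in FP16, so each quantized weight takes a quarter of the bits of an FP16 weight, reducing model size and memory footprint

vLLM compatible

Built for inference with vLLM >= 0.5.2

Accuracy preserved

After quantization, recognition accuracy stays close to the original model (WER of about 12.95% versus about 12.66% for the base model in the evaluation reported on this page)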

Model Capabilities

Speech recognition

Audio-to-text conversion

English transcription

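Use Cases

Speech transcription

Meeting notes

Automatically converts meeting recordings into written records

WER (word error rate): about 12.95%

Podcast transcription

Converts podcast audio into searchable text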

🚀 whisper-large-v3-quantized.w4a16

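This is a quantized version of openai/whisper-large-v3 for audio-to-text tasks. Quantizing the weights to INT4 reduces the model's memory footprint, and the model is suited to deployment with vLLM. In the evaluation reported on this page, the quantized model's throughput and WER were both slightly worse than the base model's.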

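🚀 Quick Start

Model Overview

Model architecture: whisper-large-v3
- Input: audio-text
- Output: text
Model optimizations:
- Weight quantization: INT4
- Activation precision: FP16
Release date: January 31, 2025
Version: 1.0
Model developers: Neural Magic

This model is a quantized version of openai/whisper-large-v3. Its weights are quantized to the INT4 data type, and it can be run with vLLM >= 0.5.2.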

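Model Optimizations

This model was obtained by quantizing the weights of openai/whisper-large-v3 to the INT4 data type. According to the creation script on this page, the GPTQ recipe with the W4A16 scheme is applied to the Linear layers, with lm_head excluded. The result can be run with vLLM >= 0.5.2.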
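
As a rough illustration of why INT4 weights shrink the model, the sketch below estimates weight storage at 16 bits versus 4 bits per weight. It assumes one FP16 scale per group of 128 weights (the group size is inferred from the `W4A16-G128` save directory name in the creation script) and no stored zero points. `QUANTIZED_PARAMS` is a hypothetical round number, not a measured count, and layers that stay unquantized are ignored.

```python
from typing import Optional


def weight_bytes(num_params: int, bits_per_weight: int,
                 group_size: Optional[int] = None, scale_bits: int = 16) -> float:
    """Approximate storage in bytes for `num_params` weights.

    If `group_size` is given, one scale of `scale_bits` bits is added per group.
    """
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8


# Hypothetical number of weights that live in quantized Linear layers.
QUANTIZED_PARAMS = 1_500_000_000

fp16_gb = weight_bytes(QUANTIZED_PARAMS, 16) / 1e9
int4_gb = weight_bytes(QUANTIZED_PARAMS, 4, group_size=128) / 1e9
print(f"FP16: {fp16_gb:.2f} GB, INT4 (g=128): {int4_gb:.2f} GB, "
      f"ratio: {fp16_gb / int4_gb:.2f}x")
```

Under these assumptions the per-group scales add 0.125 bits per weight, so the reduction is a little under 4x rather than exactly 4x.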

📦 Installation

No installation steps are provided.

💻 Usage Examples

Basic usage

from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-large-v3.w4a16",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs
inputs = {  # Test explicit encoder/decoder prompt
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
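
To transcribe your own audio instead of the bundled `winning_call` asset, the request dictionary above could be built by a small helper. This is only a sketch: `build_transcription_request` is a hypothetical name, and it assumes vLLM accepts a `(waveform, sample_rate)` pair under the `"audio"` key, as the `audio_and_sample_rate` attribute in the example suggests.

```python
import numpy as np


def build_transcription_request(waveform: np.ndarray, sample_rate: int) -> dict:
    """Wrap a mono waveform in the explicit encoder/decoder prompt used above."""
    if waveform.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    return {
        "encoder_prompt": {
            "prompt": "",
            # Assumption: a (waveform, sample_rate) pair is accepted here.
            "multi_modal_data": {"audio": (waveform, sample_rate)},
        },
        "decoder_prompt": "<|startoftranscript|>",
    }


# One second of silence at 16 kHz stands in for real audio.
request = build_transcription_request(np.zeros(16_000, dtype=np.float32), 16_000)
```

The returned dictionary has the same shape as `inputs` in the example above, so it could be passed to `llm.generate` in its place.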

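📚 Detailed Documentation

Deployment

Using vLLM

This model can be deployed efficiently with the vLLM backend; see the basic usage example above.

Creation

This model was created with llm-compressor by running the following code: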

import torch
from datasets import load_dataset
from transformers import WhisperProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# Select model and load it.
MODEL_ID = "openai/whisper-large-v3"

model = TraceableWhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(MODEL_ID)

# Configure the processor for the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")

# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)


def preprocess(example):
    return {
        "array": example["audio"]["array"],
        "sampling_rate": example["audio"]["sampling_rate"],
        "text": " " + example["text"].capitalize(),
    }


ds = ds.map(preprocess, remove_columns=ds.column_names)


# Process inputs.
def process(sample):
    inputs = processor(
        audio=sample["array"],
        sampling_rate=sample["sampling_rate"],
        text=sample["text"],
        add_special_tokens=True,
        return_tensors="pt",
    )

    inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
    inputs["decoder_input_ids"] = inputs["labels"]
    del inputs["labels"]

    return inputs


ds = ds.map(process, remove_columns=ds.column_names)


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


# Recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
sample_features = next(iter(ds))["input_features"]
sample_decoder_ids = [processor.tokenizer.prefix_tokens]
sample_input = {
    "input_features": torch.tensor(sample_features).to(model.device),
    "decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
}

output = model.generate(**sample_input, language="en")
print(processor.batch_decode(output, skip_special_tokens=True))
print("==========================================\n\n")
# that's where you have a lot of windows in the south no actually that's passive solar
# and passive solar is something that was developed and designed in the 1960s and 70s
# and it was a great thing for what it was at the time but it's not a passive house

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
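
To make the W4A16 weight format more concrete, here is a minimal NumPy sketch of symmetric, group-wise INT4 quantization. It uses plain round-to-nearest, not GPTQ: the `GPTQModifier` in the script above additionally uses the calibration data to compensate for quantization error, which this sketch does not attempt. The group size of 128 is an assumption taken from the `G128` suffix in `SAVE_DIR`, and the function names are hypothetical.

```python
import numpy as np


def quantize_int4_groupwise(weights: np.ndarray, group_size: int = 128):
    """Symmetric round-to-nearest INT4 quantization with one scale per group."""
    out_features, in_features = weights.shape
    assert in_features % group_size == 0, "in_features must divide into groups"
    groups = weights.reshape(out_features, in_features // group_size, group_size)
    # Map the largest magnitude in each group to the positive INT4 limit (7).
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales


def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Recover an approximate float weight matrix from INT4 codes and scales."""
    return (q.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scales = quantize_int4_groupwise(w)
w_hat = dequantize(q, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max reconstruction error: {max_err:.4f}")
```

With round-to-nearest, each weight is off by at most half of its group's scale, which is why smaller groups (more scales) generally give a closer reconstruction at the cost of extra storage.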

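Evaluation

Base model

Total test time: 94.4606 s
Total requests: 511
Successful requests: 511
Average latency: 53.3529 s
Median latency: 52.7258 s
95th percentile latency: 86.5851 s
Estimated request throughput: 5.41 requests/s
Estimated throughput: 100.79 tokens/s
Word error rate (WER): 12.660815197787665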

W4A16

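Total test time: 106.2064 s
Total requests: 511
Successful requests: 511
Average latency: 59.7467 s
Median latency: 58.3930 s
95th percentile latency: 97.4831 s
Estimated request throughput: 4.81 requests/s
Estimated throughput: 89.35 tokens/s
Word error rate (WER): 12.949380786341228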
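
For readers unfamiliar with the metric, the sketch below computes WER in its standard form (word-level edit distance divided by the number of reference words) and then compares the two runs above. The source does not describe the text normalization or tooling used in the evaluation, so this is an illustration of the definition rather than a reproduction of the benchmark. The function names and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_change(base: float, new: float) -> float:
    """Change of `new` relative to `base`, as a fraction."""
    return (new - base) / base


# Toy example: one of six reference words is missing from the hypothesis.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Figures copied from the two evaluation runs above.
wer_change = relative_change(12.660815197787665, 12.949380786341228)
throughput_change = relative_change(100.79, 89.35)
print(f"WER change: {wer_change:+.1%}, throughput change: {throughput_change:+.1%}")
```

On the reported figures, the W4A16 model's WER is about 2.3% higher in relative terms and its token throughput is about 11% lower than the base model's.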

BibTeX entry and citation info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

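📄 License

This model is released under the Apache 2.0 license.

Attribute	Details
Model type	whisper-large-v3
Training data	Not specified
Release date	January 31, 2025
Version	1.0
Model developers	Neural Magic
Base model	openai/whisper-large-v3
Library name	transformers
Weight quantization	INT4
Activation precision	FP16
Inference framework	vLLM >= 0.5.2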