pixtral-12b-FP8-dynamic open-source model: lower memory footprint, multilingual, for commercial and research use

Pixtral 12b FP8 Dynamic

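Developed by RedHatAI

pixtral-12b-FP8-dynamic is a quantized version of mistral-community/pixtral-12b. Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements by approximately 50%. The model is intended for commercial and research use in multiple languages.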

Image-Text-to-Text

Safetensors

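Multilingual | License: Apache-2.0 #FP8 quantization #Multimodal inference #Efficient deployment

Downloads: 87.31k

Released: 10/10/2024

Model introduction

This is a multimodal model that accepts text and image input and produces text output. It is intended for commercial and research use in multiple languages and is particularly suited to assistant-like chat.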

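Model features

FP8 quantization

Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements by approximately 50%.

Multilingual support

Supports English, German, French, Italian, Portuguese, Hindi, Spanish, Thai, and other languages.

Efficient inference

Runs on the vLLM backend for efficient, faster inference.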

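Model capabilities

Text generation

Image analysis

Multimodal understanding

Use cases

Business assistant

Multilingual customer service

Supports customer-service scenarios that require generating and understanding text in multiple languages.

Research

Multimodal research

Supports research on multimodal understanding and generation that processes text and images jointly.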

🚀 pixtral-12b-FP8-dynamic

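This is a quantized version of mistral-community/pixtral-12b, with weights and activations quantized to the FP8 data type. It is intended for commercial and research use in multiple languages and can be served efficiently with vLLM.

🚀 Quick start

The model can be deployed efficiently with the vLLM backend, as in the following example: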

from vllm import LLM, SamplingParams

# Initialize the LLM
model_name = "neuralmagic/pixtral-12b-FP8-dynamic"
llm = LLM(model=model_name, max_model_len=10000)

# Create the prompt
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

# Set up sampling parameters
sampling_params = SamplingParams(temperature=0.2, max_tokens=100)

# Generate the response
outputs = llm.chat(messages, sampling_params=sampling_params)

# Print the generated text
for output in outputs:
    print(output.outputs[0].text)
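
The example above fetches the image from a public URL. For an image that only exists on local disk, one common approach is to embed it as a base64 data URL in the same `image_url` field. The following is a minimal sketch of that idea; the helper names `image_to_data_url` and `build_messages` are hypothetical, and it assumes the installed vLLM version accepts `data:` URLs in chat messages.

```python
import base64
import mimetypes


def image_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    # Infer the MIME type from the file extension, e.g. ".jpg" -> "image/jpeg".
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_messages(prompt: str, image_url: str) -> list:
    """Build the same single-turn message structure used in the example above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


# Usage (the file name is a placeholder):
#   with open("photo.jpg", "rb") as f:
#       data_url = image_to_data_url(f.read(), "photo.jpg")
#   messages = build_messages("Describe the image.", data_url)
#   outputs = llm.chat(messages, sampling_params=sampling_params)
```

The resulting `messages` list can be passed to `llm.chat` exactly as in the URL-based example.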

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for details.

vllm serve neuralmagic/pixtral-12b-FP8-dynamic
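
Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address `http://localhost:8000` and exposes the OpenAI-style `/v1/chat/completions` route; the function names `build_chat_request` and `send_chat_request` are hypothetical. Adjust the address if the server was started with a different host or port.

```python
import json
import urllib.request

# Assumptions: default `vllm serve` address, and the served model name
# equals the repository id passed on the command line.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "neuralmagic/pixtral-12b-FP8-dynamic"


def build_chat_request(prompt, image_url, max_tokens=100, temperature=0.2):
    """Build an OpenAI-style chat completion payload with one text and one image part."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_chat_request(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server; the image URL is a placeholder):
#   payload = build_chat_request("Describe the image.", "https://example.com/image.jpg")
#   print(send_chat_request(payload))
```

Building the payload separately from sending it keeps the request easy to inspect or log before any network call is made.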

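✨ Key features

Multilingual support: supports English, German, French, Italian, Portuguese, Hindi, Spanish, Thai, and other languages.
Model optimization: quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements by approximately 50%.
Efficient inference: runs on the vLLM backend.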



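📚 Detailed documentation

Model overview

Model architecture: Pixtral (Llava)
- Input: text/image
- Output: text
Model optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Intended use cases: commercial and research use in multiple languages. Like mistralai/Pixtral-12B-2409, this model is intended for assistant-like chat.
Out-of-scope: use in any manner that violates applicable laws or regulations (including trade compliance laws), and use in languages other than English.
Release date: November 1, 2024
Version: 1.0
License: Apache 2.0
Model developer: Neural Magic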

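Model optimization

This model was obtained by quantizing the weights and activations of mistral-community/pixtral-12b to the FP8 data type. It is ready for inference with vLLM built from source. The optimization reduces the number of bits per parameter from 16 to 8, which lowers disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks are quantized. Weights use symmetric per-channel quantization, in which one linear scale per output dimension maps between the original values and their FP8 representation. Activations are quantized dynamically, with one scale per token computed at inference time. The quantization was performed with LLM Compressor.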
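
To make the scheme concrete, the following pure-Python sketch simulates symmetric FP8 quantization with one scale per weight output channel and one scale per activation token. It is an illustration under stated assumptions, not the LLM Compressor or vLLM implementation: it assumes the E4M3 variant of FP8 (3 mantissa bits, largest finite value 448), ignores subnormals and the minimum exponent, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format (assumed variant)


def round_to_e4m3(value: float) -> float:
    """Round to the nearest value with a 3-bit mantissa, clamped to +/-448.

    Simplification: subnormals and the minimum exponent are ignored.
    """
    if value == 0.0:
        return 0.0
    clamped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    # frexp gives clamped = mantissa * 2**exponent with 0.5 <= |mantissa| < 1.
    mantissa, exponent = math.frexp(clamped)
    # Steps of 1/16 here correspond to 3 explicit mantissa bits after the leading 1.
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row.

    Returns (quantized_rows, scales) such that row ~= quantized_row * scale.
    """
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def fp8_linear(activations, weights):
    """Compute y = x @ W^T from FP8-rounded operands and their scales.

    activations: [tokens][in_features]; weights: [out_features][in_features].
    """
    q_x, x_scales = quantize_rows(activations)  # per token, computed at run time
    q_w, w_scales = quantize_rows(weights)      # per output channel, computed once
    return [
        [
            sum(a * b for a, b in zip(x_row, w_row)) * x_scale * w_scale
            for w_row, w_scale in zip(q_w, w_scales)
        ]
        for x_row, x_scale in zip(q_x, x_scales)
    ]
```

Because the weight scales depend only on the weights, they can be stored with the checkpoint, while the activation scales must be recomputed for every token, which is what "dynamic" refers to.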

Creation

This model was created by applying LLM Compressor with the following code:

from transformers import AutoProcessor, LlavaForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

MODEL_ID = "mistral-community/pixtral-12b"

# Load model.
model_class = wrap_hf_model_class(LlavaForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per channel via ptq
#   * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")
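
The `ignore` list in the recipe above decides which modules stay in their original precision. The sketch below illustrates how such a list could be evaluated. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries must match exactly; this is an assumption about the matching rule, not a copy of the library's code. The module names in the usage comments are invented for illustration, and real names depend on the model class and the transformers version.

```python
import re

# Same patterns as in the recipe above.
IGNORE = ["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if a module name is excluded from quantization.

    Assumed rule: "re:"-prefixed entries are regexes anchored at the start
    of the name; all other entries require an exact match.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[len("re:"):], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Usage with hypothetical module names:
#   is_ignored("language_model.lm_head")                            # skipped
#   is_ignored("multi_modal_projector.linear_1")                    # skipped
#   is_ignored("language_model.model.layers.0.self_attn.q_proj")    # quantized
```

Under this rule the output head, the multimodal projector, and the vision modules are left unquantized, and the linear layers of the language model's transformer blocks are quantized.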

Evaluation

Multimodal benchmarks

Benchmark	pixtral-12b	pixtral-12b-FP8-dynamic
MMMU (CoT)	49.44	51.11
MathVista (CoT)	58.1	59.4
ChartQA (CoT)	82.64	82.68
DocVQA (ANLS)	89.36	89.35

Text benchmarks

Benchmark	pixtral-12b	pixtral-12b-FP8-dynamic
MMLU (5-shot)	69.27	68.96
MATH (0-shot)	43.82	43.27
HumanEval (pass@1)	77.80	76.4
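
A convenient way to read these tables is accuracy recovery, the quantized score expressed as a percentage of the baseline score. The short sketch below computes it from the numbers reported above; the benchmark labels and the `recovery_percent` helper are naming choices made for this illustration.

```python
# (baseline, quantized) scores copied from the two tables above.
SCORES = {
    "MMMU": (49.44, 51.11),
    "MathVista": (58.1, 59.4),
    "ChartQA": (82.64, 82.68),
    "DocVQA": (89.36, 89.35),
    "MMLU": (69.27, 68.96),
    "MATH": (43.82, 43.27),
    "HumanEval": (77.80, 76.4),
}


def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return 100.0 * quantized / baseline


RECOVERY = {name: recovery_percent(b, q) for name, (b, q) in SCORES.items()}

for name, value in RECOVERY.items():
    print(f"{name:<10} {value:6.2f}%")
```

On the reported numbers, every benchmark recovers more than 98% of the baseline score.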

Reproduction

To be determined

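🔧 Technical details

The model is produced by quantizing the weights and activations of mistral-community/pixtral-12b to the FP8 data type with LLM Compressor. Only the linear operators within the transformer blocks are quantized: weights use symmetric per-channel quantization, and activations are quantized dynamically per token. This reduces the number of bits per parameter from 16 to 8 and lowers disk size and GPU memory requirements by approximately 50%.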
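
The 50% figure follows directly from halving the bits per parameter. The back-of-the-envelope sketch below shows the arithmetic; the parameter count of 12 billion is a rounded assumption taken from the model name, and the estimate covers weights only, not activations, the KV cache, or runtime overhead.

```python
NUM_PARAMS = 12e9  # assumption: rounded from the "12b" in the model name


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


BF16_GIB = weight_memory_gib(NUM_PARAMS, 16)
FP8_GIB = weight_memory_gib(NUM_PARAMS, 8)

print(f"16-bit weights: {BF16_GIB:.1f} GiB")
print(f" 8-bit weights: {FP8_GIB:.1f} GiB")
print(f"ratio: {FP8_GIB / BF16_GIB:.2f}")
```

In practice the saving is slightly below one half, because modules excluded from quantization (the output head, the multimodal projector, and the vision modules in the recipe shown earlier) keep their 16-bit weights, and the per-channel scales add a small amount of extra storage.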