Llama-3.2-11B-Vision-Instruct-FP8-dynamic open-source model: multilingual support, suited to commercial chat assistants

Llama 3.2 11B Vision Instruct FP8 Dynamic

Developed by RedHatAI

A quantized version of Llama-3.2-11B-Vision-Instruct, intended for commercial and research use in multiple languages and for assistant-like chat.

Image-to-Text

Safetensors

Multilingual support

#FP8 quantization #Multimodal assistant #Commercial and research use

Downloads: 2,295

Release date: September 25, 2024

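Model Overview

The model is optimized with FP8 quantization of both weights and activations. It is intended for commercial and research use in multiple languages and is particularly suited to assistant-like chat applications.

Model Features

FP8 quantization

Weights and activations are quantized to FP8, reducing disk size and GPU memory requirements by about 50%.

Multimodal support

Accepts text and image input for multimodal tasks.

Efficient inference

Deploys on the vLLM backend for fast inference.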

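Model Capabilities

Text generation

Image understanding

Multimodal interaction

Use Cases

Assistant applications

Image description generation

Generates descriptive text or poetry from an input image.

Produces natural-language descriptions that match the image content.

Multimodal chat

Holds interactive conversations that combine image and text input.

Understands and responds to dialogue that refers to image content.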

🚀 Llama-3.2-11B-Vision-Instruct-FP8-dynamic

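This is a quantized version of Llama-3.2-11B-Vision-Instruct, intended for commercial and research use in multiple languages and for assistant-like chat.

🚀 Quick Start

The model can be deployed efficiently with the vLLM backend, as in the following example: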

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Initialize the LLM
model_name = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"
llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True)

# Load the image
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")

# Create the prompt
question = "If I had to write a haiku for this one, it would be: "
prompt = f"<|image|><|begin_of_text|>{question}"

# Set up sampling parameters
sampling_params = SamplingParams(temperature=0.2, max_tokens=30)

# Generate the response
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

# Print the generated text
print(outputs[0].outputs[0].text)

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.

vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
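
Once the server is running, it can be queried like any OpenAI-style chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000` and exposes `/v1/chat/completions`; the helper names `build_chat_request` and `send_chat_request` and the image URL are hypothetical and not part of the original model card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port
MODEL = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"


def build_chat_request(question: str, image_url: str, max_tokens: int = 30) -> dict:
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "temperature": 0.2,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


def send_chat_request(payload: dict) -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Build (but do not send) a request; the image URL is a placeholder.
payload = build_chat_request(
    "Write a haiku about this image.",
    "https://example.com/cherry_blossom.jpg",
)
print(json.dumps(payload, indent=2))
```

With the server from the command above running, `send_chat_request(payload)` should return the generated text. The request is built separately from the network call so the payload can be inspected or logged before anything is sent.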

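✨ Key Features

Model architecture: Meta-Llama-3.2, with text/image input and text output.
Model optimizations:
- Weight quantization: FP8.
- Activation quantization: FP8.
Intended use cases: commercial and research use in multiple languages. Like Llama-3.2-11B-Vision-Instruct, this model is intended for assistant-like chat.
Out of scope: use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English.
Release date: September 25, 2024
Version: 1.0
License: llama3.2
Model developer: Neural Magic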

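📚 Detailed Documentation

Model Optimizations

This model was obtained by quantizing the weights and activations of Llama-3.2-11B-Vision-Instruct to the FP8 data type, and is ready for inference with vLLM built from source. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.

Only the weights and activations of the linear operators within the Transformer blocks are quantized. Weights use symmetric per-channel quantization, in which one linear scale per output dimension maps between the original values and their FP8 representation. Activations are also quantized symmetrically, with scales computed dynamically per token. LLM Compressor is used for quantization.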
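
To make the scheme concrete, here is a small pure-Python sketch of symmetric quantization with one scale per row. It assumes the FP8 E4M3 format (3 mantissa bits, largest finite value 448) and simulates the rounding in software; the function names are hypothetical, and this is an illustration of the idea rather than the kernels LLM Compressor or vLLM actually run. For a weight matrix the rows stand for output channels and the scales are computed once; for activations the rows stand for tokens and the scales are recomputed on every forward pass.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the assumed E4M3 format


def round_to_fp8_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate out-of-range values
    _, exp = math.frexp(abs(x))        # abs(x) = m * 2**exp with 0.5 <= m < 1
    exponent = max(exp - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)       # 3 mantissa bits: 8 steps per power of two
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row; returns (values, scales)."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        quantized.append([round_to_fp8_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range using the row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Rows are output channels: scales are fixed after quantization (static).
weights = [[0.02, -0.5, 0.25], [1.5, -3.0, 0.75]]
w_q, w_scales = quantize_rows(weights)

# Rows are tokens: scales are computed from the live values (dynamic).
activations = [[10.0, -0.1, 2.0]]
a_q, a_scales = quantize_rows(activations)

w_hat = dequantize_rows(w_q, w_scales)
max_err = max(abs(a - b) for ra, rb in zip(weights, w_hat) for a, b in zip(ra, rb))
print("weight scales:", w_scales)
print("max weight reconstruction error:", max_err)
```

Because the scale is derived from each row's largest magnitude, that element always maps to plus or minus 448, and the rest of the row shares the same scale. A row with one large outlier therefore loses precision on its small values, which is one reason per-channel and per-token scales are preferred over a single scale for the whole tensor.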

Model Creation

This model was created by applying LLM Compressor, as shown in the code below:

from transformers import AutoProcessor, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load model.
model_class = wrap_hf_model_class(MllamaForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per channel via ptq
#   * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")
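
The `ignore` list in the recipe above keeps the output head, the multimodal projector, and the vision tower in their original precision. The sketch below illustrates how such patterns could select modules. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries must match exactly; the `is_ignored` helper and the module names are hypothetical, chosen only for illustration, and may not match the names in a given transformers version.

```python
import re

IGNORE = ["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if module_name matches any entry of the ignore list.

    Assumption: "re:" entries are regexes anchored at the start of the name;
    any other entry must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.lm_head",
    "multi_modal_projector",
    "vision_model.transformer.layers.0.mlp.fc1",
]
for name in names:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions only the language model's linear layers inside the Transformer blocks are quantized, which agrees with the description in the Model Optimizations section.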