Llama-3.2-11B-Vision-Instruct-FP8-dynamic open-source model: multilingual, suited to commercial chat assistants

Llama 3.2 11B Vision Instruct FP8 Dynamic

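Developed by RedHatAI

A quantized model based on Llama-3.2-11B-Vision-Instruct, intended for commercial and research use in multiple languages and for assistant-like chat.

Image-to-Text

Safetensors

Supports multiple languages #FP8 quantization #Multimodal assistant #Commercial and research use

Downloads: 2,295

Release date: 9/25/2024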

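Model Overview

The model is optimized with FP8 weight quantization and FP8 activation quantization. It is intended for commercial and research use in multiple languages, and is particularly suited to assistant-like chat applications.

Model Features

FP8 quantization

Weights and activations are quantized to FP8, reducing disk size and GPU memory requirements by approximately 50%.

Multimodal support

Accepts both text and image input and can handle multimodal tasks.

Efficient inference

Can be deployed efficiently with the vLLM backend for fast inference.

Model Capabilities

Text generation

Image understanding

Multimodal interaction

Use Cases

Assistant applications

Image captioning

Generates descriptive text or poetry from an input image.

Produces natural-language descriptions that match the image content.

Multimodal chat

Combines image and text input in an interactive conversation.

Understands and responds to dialogue that refers to image content.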

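🚀 Llama-3.2-11B-Vision-Instruct-FP8-dynamic

This is a quantized version of Llama-3.2-11B-Vision-Instruct, intended for commercial and research use in multiple languages and for assistant-like chat.

🚀 Quick Start

The model can be deployed efficiently with the vLLM backend, as shown in the following example: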

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Initialize the LLM
model_name = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"
llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True)

# Load the image
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")

# Create the prompt
question = "If I had to write a haiku for this one, it would be: "
prompt = f"<|image|><|begin_of_text|>{question}"

# Set up sampling parameters
sampling_params = SamplingParams(temperature=0.2, max_tokens=30)

# Generate the response
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

# Print the generated text
print(outputs[0].outputs[0].text)

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.

vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
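
Once the server is running, it can be queried like any OpenAI-style chat endpoint. The following is a minimal sketch rather than an official client: it assumes the server runs locally on vLLM's default port 8000, uses only the Python standard library, and the image URL and the helper names `build_chat_request` and `send_chat_request` are hypothetical and chosen for illustration.

```python
import json
import urllib.request

# Assumption: the server was started locally with the `vllm serve` command
# above and listens on vLLM's default port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"


def build_chat_request(question: str, image_url: str, max_tokens: int = 30) -> dict:
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "temperature": 0.2,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Hypothetical image URL, used only for illustration.
payload = build_chat_request(
    "If I had to write a haiku for this one, it would be: ",
    "https://example.com/cherry_blossom.jpg",
)
print(json.dumps(payload, indent=2))
```

With the server up, `send_chat_request(payload)` would return the generated text. The call is left out of the sketch so that it runs without a live server.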

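✨ Key Features

Model architecture: Meta-Llama-3.2; input: text/image; output: text.
Model optimizations:
- Weight quantization: FP8.
- Activation quantization: FP8.
Intended use cases: commercial and research use in multiple languages. Like Llama-3.2-11B-Vision-Instruct, this model is intended for assistant-like chat.
Out of scope: use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English.
Release date: September 25, 2024
Version: 1.0
License: llama3.2
Model developers: Neural Magic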

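📚 Detailed Documentation

Model Optimizations

This model was obtained by quantizing the weights and activations of Llama-3.2-11B-Vision-Instruct to the FP8 data type, and is ready for inference with vLLM built from source. The optimization reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within Transformer blocks are quantized. Weights use symmetric per-channel quantization, in which one linear scale per output dimension maps values to their FP8 representation. Activations are also quantized symmetrically, with scales computed dynamically per token. LLM Compressor is used for quantization.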
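
To make the scheme concrete, here is a small pure-Python sketch of symmetric quantization with one scale per weight output channel and one dynamically computed scale per activation token. It is an illustration under stated assumptions, not LLM Compressor's or vLLM's implementation: it assumes the FP8 E4M3 format (maximum magnitude 448), simulates FP8 rounding in ordinary floats, and the names `round_to_e4m3`, `quantize_rows`, and `fp8_linear` are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format (assumed format)


def round_to_e4m3(x: float) -> float:
    """Simulate rounding to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, exponent = math.frexp(x)    # |x| = m * 2**exponent with 0.5 <= m < 1
    exponent = max(exponent, -5)   # magnitudes below 2**-6 are subnormal
    step = 2.0 ** (exponent - 4)   # 1 implicit + 3 stored mantissa bits
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row (no zero point)."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def fp8_linear(x, weight):
    """Linear layer on simulated FP8 values, rescaled per token and per channel."""
    w_q, w_scales = quantize_rows(weight)  # rows are output channels; done offline in practice
    x_q, x_scales = quantize_rows(x)       # rows are tokens; scales computed at run time
    return [
        [
            sum(a * b for a, b in zip(x_row, w_row)) * x_scale * w_scale
            for w_row, w_scale in zip(w_q, w_scales)
        ]
        for x_row, x_scale in zip(x_q, x_scales)
    ]


weight = [[0.12, -0.50, 0.33], [2.00, 0.25, -1.10]]  # 2 output channels, 3 inputs
tokens = [[1.0, 2.0, -3.0], [0.01, 0.02, 0.005]]     # 2 tokens with very different ranges

approx = fp8_linear(tokens, weight)
exact = [[sum(a * b for a, b in zip(t, w)) for w in weight] for t in tokens]
for approx_row, exact_row in zip(approx, exact):
    print([round(v, 4) for v in approx_row], [round(v, 4) for v in exact_row])
```

Because every token gets its own scale, the second token (whose values are about 100 times smaller than the first) still uses the full FP8 range instead of being crushed toward zero by a shared scale.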

Model Creation

This model was created by applying LLM Compressor, as shown in the following code example:

from transformers import AutoProcessor, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load model.
model_class = wrap_hf_model_class(MllamaForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per channel via ptq
#   * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")
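
The `ignore` list in the recipe above keeps the language-model head, the multimodal projector, and the vision encoder in their original precision. The sketch below shows how such patterns could select modules. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name (as Python's `re.match` does) and that other entries must equal the name exactly. The `is_ignored` helper is a stand-in written for this example, not LLM Compressor's own matcher, and the module names are hypothetical.

```python
import re

# Same patterns as in the recipe above.
IGNORE = ["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if the module name matches any ignore entry (assumed semantics)."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: the pattern is anchored at the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.lm_head",
    "multi_modal_projector",
    "vision_model.transformer.layers.0.mlp.fc1",
]
for name in names:
    status = "kept in original precision" if is_ignored(name) else "quantized to FP8"
    print(f"{name} -> {status}")
```

Under these assumptions only the first name, a linear layer inside the language model, would be quantized.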