Llama-3.2-11B-Vision-Instruct-FP8-dynamic
A quantized version of Llama-3.2-11B-Vision-Instruct, optimized for efficient inference with vLLM.
🚀 Quick Start
This model can be deployed efficiently using the vLLM backend. See the "💻 Usage Examples" section for detailed code examples.
✨ Features
- Model Architecture: Meta-Llama-3.2. It takes text or image as input and generates text as output.
- Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similar to Llama-3.2-11B-Vision-Instruct, it is designed for assistant-like chat.
- Out-of-scope: Use in any way that violates applicable laws or regulations (including trade compliance laws) and use in languages other than English.
- Release Date: 9/25/2024
- Version: 1.0
- License(s): llama3.2
- Model Developers: Neural Magic
📦 Installation
No specific installation steps are provided in the original README. To use this model, make sure vLLM is installed (e.g., via pip install vllm).
💻 Usage Examples
Basic Usage
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Load the quantized checkpoint with vLLM.
model_name = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"
llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True)

# Prepare an example image and a prompt; <|image|> marks where the image is inserted.
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
question = "If I had to write a haiku for this one, it would be: "
prompt = f"<|image|><|begin_of_text|>{question}"

# Generate a short completion conditioned on the image and text.
sampling_params = SamplingParams(temperature=0.2, max_tokens=30)
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Advanced Usage
vLLM also supports OpenAI-compatible serving. You can start a server with the following command:
vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
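Once the server is running, it can be queried with any OpenAI-compatible client. Below is a minimal sketch using the openai Python package; it assumes vLLM's default address (http://localhost:8000/v1), and the image URL is only a placeholder.

from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
                {"type": "text", "text": "If I had to write a haiku for this one, it would be: "},
            ],
        }
    ],
    max_tokens=30,
    temperature=0.2,
)
print(response.choices[0].message.content)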
📚 Documentation
Model Optimizations
This model was obtained by quantizing the weights and activations of Llama-3.2-11B-Vision-Instruct to FP8 data type, ready for inference with vLLM built from source. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
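As a rough back-of-the-envelope check of that figure (assuming approximately 10.7B total parameters, and ignoring that the vision tower, lm_head, and embeddings remain at 16 bits):

num_params = 10.7e9                              # approximate total parameter count (assumption)
print(f"BF16: ~{num_params * 2 / 1e9:.0f} GB")   # 16 bits/param -> ~21 GB
print(f"FP8:  ~{num_params * 1 / 1e9:.0f} GB")   # 8 bits/param  -> ~11 GB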
Only the weights and activations of the linear operators within the transformer blocks are quantized. Symmetric per-channel quantization is applied: a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis. LLM Compressor is used for quantization.
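The snippet below is a minimal numerical sketch of these two schemes, for illustration only: it is not the kernel used by LLM Compressor or vLLM, and it assumes a PyTorch build with float8_e4m3fn support.

import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def quantize_weight_per_channel(weight: torch.Tensor):
    # Symmetric per-channel scaling: one scale per output dimension (row of the weight matrix).
    scale = (weight.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX).clamp(min=1e-12)
    return (weight / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic per-token scaling: one scale per token, computed at runtime.
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX).clamp(min=1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale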
Creation
This model was created by applying LLM Compressor, as presented in the code snippet below:
from transformers import AutoProcessor, MllamaForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the original model and processor.
model_class = wrap_hf_model_class(MllamaForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization recipe: FP8 dynamic quantization of all Linear modules,
# skipping the lm_head, the multimodal projector, and the vision tower.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

# Apply quantization and save the compressed checkpoint.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# Confirm the quantized model still generates sensible text.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")
🔧 Technical Details
- Model Architecture: Meta-Llama-3.2.
- Input: Text/Image
- Output: Text
- Quantization:
- Weight quantization: FP8
- Activation quantization: FP8
- Only the weights and activations of the linear operators within the transformer blocks are quantized. Symmetric per-channel quantization is applied, and activations are quantized on a per-token dynamic basis. LLM Compressor is used for quantization.
📄 License
This model is licensed under the Llama 3.2 Community License (llama3.2).