Llama-3.2-11B-Vision-Instruct-FP8-dynamic
A quantized version of Llama-3.2-11B-Vision-Instruct, optimized for efficient inference with vLLM.
🚀 Quick Start
This model can be deployed efficiently using the vLLM backend. See the "💻 Usage Examples" section for detailed code examples.
✨ Features
- Model Architecture: Meta-Llama-3.2. It takes text or image as input and generates text as output.
- Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similar to Llama-3.2-11B-Vision-Instruct, it is designed for assistant-like chat.
- Out-of-scope: Use in any way that violates applicable laws or regulations (including trade compliance laws) and use in languages other than English.
- Release Date: 9/25/2024
- Version: 1.0
- License(s): llama3.2
- Model Developers: Neural Magic
📦 Installation
No specific installation steps are provided in the original README. To use this model, make sure vLLM is installed (e.g., via pip install vllm).
💻 Usage Examples
Basic Usage
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Load the quantized checkpoint with vLLM.
model_name = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"
llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True)

# Prepare an example image and a prompt; <|image|> marks where the image is inserted.
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
question = "If I had to write a haiku for this one, it would be: "
prompt = f"<|image|><|begin_of_text|>{question}"

# Generate a short completion conditioned on the image and text.
sampling_params = SamplingParams(temperature=0.2, max_tokens=30)
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Advanced Usage
vLLM also supports OpenAI-compatible serving. You can start a server with the following command:
vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
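Once the server is running, it can be queried with any OpenAI-compatible client. Below is a minimal sketch using the openai Python package; it assumes vLLM's default address (http://localhost:8000/v1), and the image URL is only a placeholder.

from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
                {"type": "text", "text": "If I had to write a haiku for this one, it would be: "},
            ],
        }
    ],
    max_tokens=30,
    temperature=0.2,
)
print(response.choices[0].message.content)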
📚 Documentation
Model Optimizations
This model was obtained by quantizing the weights and activations of Llama-3.2-11B-Vision-Instruct to FP8 data type, ready for inference with vLLM built from source. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
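As a rough back-of-the-envelope check of that figure (assuming approximately 10.7B total parameters, and ignoring that the vision tower, lm_head, and embeddings remain at 16 bits):

num_params = 10.7e9                              # approximate total parameter count (assumption)
print(f"BF16: ~{num_params * 2 / 1e9:.0f} GB")   # 16 bits/param -> ~21 GB
print(f"FP8:  ~{num_params * 1 / 1e9:.0f} GB")   # 8 bits/param  -> ~11 GB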
Only the weights and activations of the linear operators within the transformer blocks are quantized. Symmetric per-channel quantization is applied: a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis. LLM Compressor is used for quantization.
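The snippet below is a minimal numerical sketch of these two schemes, for illustration only: it is not the kernel used by LLM Compressor or vLLM, and it assumes a PyTorch build with float8_e4m3fn support.

import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def quantize_weight_per_channel(weight: torch.Tensor):
    # Symmetric per-channel scaling: one scale per output dimension (row of the weight matrix).
    scale = (weight.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX).clamp(min=1e-12)
    return (weight / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic per-token scaling: one scale per token, computed at runtime.
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX).clamp(min=1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale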
Creation
This model was created by applying LLM Compressor, as presented in the code snippet below:
from transformers import AutoProcessor, MllamaForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the original model and processor.
model_class = wrap_hf_model_class(MllamaForConditionalGeneration)
model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization recipe: FP8 dynamic quantization of all Linear modules,
# skipping the lm_head, the multimodal projector, and the vision tower.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

# Apply quantization and save the compressed checkpoint.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# Confirm the quantized model still generates sensible text.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")
🔧 Technical Details
- Model Architecture: Meta-Llama-3.2.
- Input: Text/Image
- Output: Text
- Quantization:
- Weight quantization: FP8
- Activation quantization: FP8
- Only the weights and activations of the linear operators within the transformer blocks are quantized. Symmetric per-channel quantization is applied, and activations are quantized on a per-token dynamic basis. LLM Compressor is used for quantization.
📄 License
This model is licensed under the Llama 3.2 Community License (llama3.2).