# Qwen2.5-VL-3B-Instruct-FP8-Dynamic
A quantized version of Qwen/Qwen2.5-VL-3B-Instruct, optimized for efficient inference.
## Quick Start
This is a quantized version of Qwen/Qwen2.5-VL-3B-Instruct with weights and activations quantized to the FP8 data type. It is ready for inference with vLLM >= 0.5.2.
## Features
- Model Architecture: Based on Qwen2.5-VL-3B-Instruct, supporting vision-text input and text output.
- Model Optimizations: Both weights and activations are quantized to FP8 (see the sketch after this list for intuition).
- Release Date: 2/24/2025
- Version: 1.0
- Model Developers: Neural Magic
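For intuition, FP8 dynamic quantization stores weights with static FP8 scales and rescales each activation token at runtime so its values fit the FP8 E4M3 range (maximum magnitude 448). The NumPy sketch below is purely illustrative (it rescales and clips but omits rounding to the FP8 grid); it is not the llm-compressor or vLLM implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def fake_quantize_per_token(x: np.ndarray):
    """Schematic per-token dynamic quantization of activations.

    x: [num_tokens, hidden_dim] activations.
    Returns the rescaled/clipped values and the per-token scales.
    """
    # One scale per token, computed at runtime from that token's max magnitude.
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)  # guard against all-zero tokens
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

# Dequantization is simply q * scale; weight scales are computed once at compression time.
```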
## Installation
This model can be deployed efficiently with the vLLM backend. Ensure vLLM is installed (e.g., `pip install vllm`) and follow the deployment steps below.
## Usage Examples
### Basic Usage

```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# Load the FP8-quantized checkpoint.
llm = LLM(
    model="neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# Qwen2.5-VL chat format: the image placeholder sits between <|vision_start|> and <|vision_end|>.
question = "What is the content of this image?"
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
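For example, once a server for this checkpoint is running, it can be queried with the OpenAI Python client. This is a minimal sketch; the serve command in the comment, the local port, and the image URL are assumptions for illustration, not values taken from this card.

```python
# Start the server first, e.g.:
#   vllm serve neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic --max-model-len 4096
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default; the API key is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the content of this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},  # placeholder URL
        ],
    }],
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].message.content)
```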
## Documentation
### Creation
This model was created with llm-compressor by running the code snippet below, as part of a multimodal announcement blog.

#### Model Creation Code
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor

from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import (
    TraceableQwen2_5_VLForConditionalGeneration,
)
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

# Load the base model and its processor.
model = TraceableQwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# FP8 dynamic quantization of Linear layers, excluding the modules listed in `ignore`.
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        sequential_targets=["MistralDecoderLayer"],
        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    ),
]

SAVE_DIR = f"{model_id.split('/')[1]}-FP8-Dynamic"

# Apply the recipe and write the compressed checkpoint.
oneshot(
    model=model,
    recipe=recipe,
    trust_remote_code_model=True,
    output_dir=SAVE_DIR,
)
```
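The snippet above saves only the compressed model weights. If the processor/tokenizer files are also needed next to the checkpoint (as in the published repository), they would typically be written with a one-line addition such as the following; this is an assumption on our part, not part of the original snippet.

```python
# Hypothetical addition: persist the processor/tokenizer files alongside the quantized weights.
processor.save_pretrained(SAVE_DIR)
```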
### Evaluation
The model was evaluated using mistral-evals for vision-related tasks and lm_evaluation_harness for select text-based benchmarks.
#### Evaluation Commands

**Vision Tasks**
- vqav2
- docvqa
- mathvista
- mmmu
- chartqa

```
vllm serve neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7
```

```
python -m eval.run eval_vllm \
  --model_name neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic \
  --url http://0.0.0.0:8000 \
  --output_dir ~/tmp \
  --eval_name <vision_task_name>
```
**Text-based Tasks**

*MMLU*

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path output_dir
```

*MGSM*

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
  --tasks mgsm_cot_native \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path output_dir
```
### Accuracy

| Category | Metric | Qwen/Qwen2.5-VL-3B-Instruct | nm-testing/Qwen2.5-VL-3B-Instruct-FP8-Dynamic | Recovery (%) |
|----------|--------|-----------------------------|-----------------------------------------------|--------------|
| Vision | MMMU (val, CoT)<br>explicit_prompt_relaxed_correctness | 44.56 | 45.78 | 102.74% |
| Vision | VQAv2 (val)<br>vqa_match | 75.94 | 76.22 | 100.37% |
| Vision | DocVQA (val)<br>anls | 92.53 | 92.40 | 99.86% |
| Vision | ChartQA (test, CoT)<br>anywhere_in_answer_relaxed_correctness | 81.20 | 80.72 | 99.41% |
| Vision | Mathvista (testmini, CoT)<br>explicit_prompt_relaxed_correctness | 54.15 | 53.25 | 98.34% |
| Vision | Average Score | 69.28 | 69.67 | 100.56% |
| Text | MGSM (CoT) | 43.69 | 43.14 | 98.74% |
| Text | MMLU (5-shot) | 65.32 | 65.03 | 99.56% |
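Recovery is the quantized model's score expressed as a percentage of the baseline score; for example, MMLU: 65.03 / 65.32 × 100 ≈ 99.56%.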
## Technical Details
### Inference Performance

This model achieves up to 1.10x speedup in single-stream deployment and up to 1.32x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.

#### Benchmarking Command
```
guidellm --model neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic --target "http://localhost:8000/v1" --data-type emulated --data prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>,images=<num_images>,width=<image_width>,height=<image_height> --max-seconds 120 --backend aiohttp_server
```
#### Single-stream performance (measured with vLLM version 0.7.2)

| Hardware | Model | Average Cost Reduction | Document Visual Question Answering<br>1680W x 2240H<br>64/128<br>Latency (s) | Document Visual Question Answering<br>1680W x 2240H<br>64/128<br>Queries Per Dollar | Visual Reasoning<br>640W x 480H<br>128/128<br>Latency (s) | Visual Reasoning<br>640W x 480H<br>128/128<br>Queries Per Dollar | Image Captioning<br>480W x 360H<br>0/128<br>Latency (s) | Image Captioning<br>480W x 360H<br>0/128<br>Queries Per Dollar |
|---|---|---|---|---|---|---|---|---|
| A6000x1 | Qwen/Qwen2.5-VL-3B-Instruct | | 3.1 | 1454 | 1.8 | 2546 | 1.7 | 2610 |
| A6000x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w8a8 | 1.27 | 2.6 | 1708 | 1.3 | 3340 | 1.3 | 3459 |
| A6000x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 | 1.57 | 2.4 | 1886 | 1.0 | 4409 | 1.0 | 4409 |
| A100x1 | Qwen/Qwen2.5-VL-3B-Instruct | | 2.2 | 920 | 1.3 | 1603 | 1.2 | 1636 |
| A100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w8a8 | 1.09 | 2.1 | 975 | 1.2 | 1743 | 1.1 | 1814 |
| A100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 | 1.20 | 2.0 | 1011 | 1.0 | 2015 | 1.0 | 2012 |
| H100x1 | Qwen/Qwen2.5-VL-3B-Instruct | | 1.5 | 740 | 0.9 | 1221 | 0.9 | 1276 |
| H100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic | 1.06 | 1.4 | 768 | 0.9 | 1276 | 0.8 | 1399 |
| H100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 | 1.24 | 0.9 | 1219 | 0.9 | 1270 | 0.8 | 1304 |
**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
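Average Cost Reduction appears to be the mean, over the three use cases, of the quantized model's queries per dollar divided by the baseline model's queries per dollar on the same hardware; for example, the H100 FP8-Dynamic row gives (768/740 + 1276/1221 + 1399/1276) / 3 ≈ 1.06.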
#### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

| Hardware | Model | Average Cost Reduction | Document Visual Question Answering<br>1680W x 2240H<br>64/128<br>Maximum throughput (QPS) | Document Visual Question Answering<br>1680W x 2240H<br>64/128<br>Queries Per Dollar | Visual Reasoning<br>640W x 480H<br>128/128<br>Maximum throughput (QPS) | Visual Reasoning<br>640W x 480H<br>128/128<br>Queries Per Dollar | Image Captioning<br>480W x 360H<br>0/128<br>Maximum throughput (QPS) | Image Captioning<br>480W x 360H<br>0/128<br>Queries Per Dollar |
|---|---|---|---|---|---|---|---|---|
| A6000x1 | Qwen/Qwen2.5-VL-3B-Instruct | | 0.5 | 2405 | 2.6 | 11889 | 2.9 | 12909 |
| A6000x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w8a8 | 1.26 | 0.6 | 2725 | 3.4 | 15162 | 3.9 | 17673 |
| A6000x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 | 1.39 | 0.6 | 2548 | 3.9 | 17437 | 4.7 | 21223 |
| A100x1 | Qwen/Qwen2.5-VL-3B-Instruct | | 0.8 | 1663 | 3.9 | 7899 | 4.4 | 8924 |
| A100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w8a8 | 1.06 | 0.9 | 1734 | 4.2 | 8488 | 4.7 | 9548 |
| A100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 | 1.10 | 0.9 | 1775 | 4.2 | 8540 | 5.1 | 10318 |
| H100x1 | Qwen/Qwen2.5-VL-3B-Instruct | | 1.1 | 1188 | 4.3 | 4656 | 4.3 | 4676 |
| H100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-FP8-Dynamic | 1.15 | 1.4 | 1570 | 4.3 | 4676 | 4.8 | 5220 |
| H100x1 | neuralmagic/Qwen2.5-VL-3B-Instruct-quantized.w4a16 | 1.96 | 4.2 | 4598 | 4.1 | 4505 | 4.4 | 4838 |
**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
**QPS: Queries per second.
**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
## License
This project is licensed under the Apache-2.0 License.