🚀 Pixtral-Large-Instruct-2411-hf-quantized.w4a16
This is a quantized version of the neuralmagic/Pixtral-Large-Instruct-2411-hf model, offering efficient deployment and optimized performance for vision-text tasks.
🚀 Quick Start
This model is a quantized variant of neuralmagic/Pixtral-Large-Instruct-2411-hf. It can be efficiently deployed using the vLLM backend.
✨ Features
- Model Architecture: based on neuralmagic/Pixtral-Large-Instruct-2411-hf; takes vision-text as input and generates text as output.
- Model Optimizations (see the config-inspection sketch after this list):
  - Weight quantization: INT4
  - Activation quantization: FP16
- Multilingual Support: English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Russian, and Korean.
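If you want to confirm the quantization scheme of the downloaded checkpoint, the minimal sketch below can help. It is not from the original README and assumes the scheme is recorded in the checkpoint's config.json under quantization_config, which is how compressed-tensors checkpoints are typically annotated.
from transformers import AutoConfig

# Hypothetical check (not from the original README): print the quantization
# metadata stored with the checkpoint; an INT4 (W4A16) scheme is expected.
config = AutoConfig.from_pretrained(
    "neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16"
)
print(config.quantization_config)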
📦 Installation
The original README does not list explicit installation steps. To use this model, make sure the required dependencies are installed, most importantly vLLM (for example via pip install vllm); reproducing the quantization additionally requires llmcompressor and compressed-tensors.
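As a quick sanity check (not part of the original README), you can confirm that vLLM imports cleanly before loading the model:
# Sanity check, not from the original README: verify the vLLM install
import vllm

print(vllm.__version__)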
💻 Usage Examples
Basic Usage
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# Load the quantized model with vLLM
llm = LLM(
    model="neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# Prepare a vision-text prompt with a sample image
question = "What is the content of this image?"
inputs = {
    "prompt": f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# Generate and print a sample response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
Advanced Usage
This model can also be used for model creation and evaluation.
Model Creation
import requests
import torch
from PIL import Image
from transformers import AutoProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
    ActivationOrdering,
    QuantizationScheme,
)

# Load the full-precision model and processor
model_id = "neuralmagic/Pixtral-Large-Instruct-2411-hf"
model = TraceableLlavaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Calibration dataset configuration
DATASET_ID = "flickr30k"
DATASET_SPLIT = {"calibration": "test[:512]"}
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
dampening_frac = 0.01

# Single-sample collator for the multimodal calibration data
def data_collator(batch):
    assert len(batch) == 1
    return {
        "input_ids": torch.LongTensor(batch[0]["input_ids"]),
        "attention_mask": torch.tensor(batch[0]["attention_mask"]),
        "pixel_values": torch.tensor(batch[0]["pixel_values"]),
    }

# GPTQ recipe: INT4 group-wise (group_size=128) weight quantization of the
# language model's Linear layers; lm_head, vision tower, and projector are skipped
recipe = GPTQModifier(
    targets="Linear",
    config_groups={
        "config_group": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=128,
                symmetric=True,
                dynamic=False,
                actorder=ActivationOrdering.WEIGHT,
            ),
        ),
    },
    sequential_targets=["MistralDecoderLayer"],
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    update_size=NUM_CALIBRATION_SAMPLES,
    dampening_frac=dampening_frac,
)

# Apply one-shot quantization and write the compressed model to SAVE_DIR
SAVE_DIR = f"{model_id.split('/')[1]}-quantized.w4a16"
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=DATASET_ID,
    splits=DATASET_SPLIT,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    output_dir=SAVE_DIR,
)
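A reasonable follow-up, not shown in the original README, is to save the processor next to the quantized weights so that the output directory can be loaded on its own. This continues from the script above and reuses its processor and SAVE_DIR variables.
# Continues the script above: store the processor with the quantized checkpoint
processor.save_pretrained(SAVE_DIR)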
Model Evaluation
The model was evaluated using mistral-evals for vision-related tasks and using lm_evaluation_harness for select text-based benchmarks.
Vision Tasks
vllm serve neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16 \
  --tensor_parallel_size <n> \
  --max_model_len 25000 \
  --trust_remote_code \
  --max_num_seqs 8 \
  --gpu_memory_utilization 0.9 \
  --dtype float16 \
  --limit_mm_per_prompt image=7

python -m eval.run eval_vllm \
  --model_name neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16 \
  --url http://0.0.0.0:8000 \
  --output_dir ~/tmp \
  --eval_name <vision_task_name>
Text-based Tasks - MMLU
lm_eval \
--model vllm \
--model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size auto \
--output_path output_dir
Text-based Tasks - MGSM
lm_eval \
--model vllm \
--model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
--tasks mgsm_cot_native \
--apply_chat_template \
--num_fewshot 0 \
--batch_size auto \
--output_path output_dir
📚 Documentation
Model Overview
| Property | Details |
|----------|---------|
| Model Type | neuralmagic/Pixtral-Large-Instruct-2411-hf |
| Input | Vision-Text |
| Output | Text |
| Weight quantization | INT4 |
| Activation quantization | FP16 |
| Release Date | 2/24/2025 |
| Version | 1.0 |
| Model Developers | Neural Magic |
Accuracy
| Category | Metric | neuralmagic/Pixtral-Large-Instruct-2411-hf | neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16 | Recovery (%) |
|----------|--------|--------------------------------------------|-------------------------------------------------------------|--------------|
| Vision | MMMU (val, CoT) explicit_prompt_relaxed_correctness | 63.56 | 60.56 | 95.28% |
| Vision | VQAv2 (val) vqa_match | ... | ... | ... |
| ... | ... | ... | ... | ... |
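Recovery appears to be the quantized score expressed as a percentage of the unquantized baseline. For the MMMU row above, a quick illustrative check reproduces the reported value:
# Illustrative recovery calculation for the MMMU (val, CoT) row
baseline, quantized = 63.56, 60.56
print(f"Recovery: {100 * quantized / baseline:.2f}%")  # -> 95.28%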
📄 License
This model is licensed under the Mistral AI Research License.
⚠️ Important Note
If you want to use a Mistral Model, a Derivative or an Output for any purpose that is not expressly authorized under this Agreement, you must request a license from Mistral AI, which Mistral AI may grant to you in Mistral AI's sole discretion. To discuss such a license, please contact Mistral AI via the website contact form: https://mistral.ai/contact/.
💡 Usage Tip
This model is intended for research purposes only. For more information on your rights and how your data is handled, please see Mistral AI's privacy policy.