Mistral-Small-3.1-24B-Instruct-2503-GPTQ Open Source Model - Reduce Memory Requirements, Easily Deploy and Use

Mistral Small 3.1 24B Instruct 2503 GPTQ 4b 128g

Developed by ISTA-DASLab

This model is an INT4 quantized version of Mistral-Small-3.1-24B-Instruct-2503, using the GPTQ algorithm to reduce weights from 16-bit to 4-bit, significantly decreasing disk size and GPU memory requirements.

Large Language Model

Safetensors

Open Source License: Apache-2.0

#INT4 quantization #instruction fine-tuning #multimodal reasoning

Downloads 21.89k

Release Time: 3/20/2025

Model Overview

This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503, primarily designed for text generation tasks, and it supports multimodal input (image + text). After quantization it retains about 98.8% of the original model's average score on the OpenLLM v1 benchmarks (recovery of 0.9878 in the evaluation table).

Model Features

Efficient Quantization

Uses INT4 weight quantization, reducing disk space and GPU memory requirements by approximately 75%

High Performance Retention

Retains about 98.8% of the original model's average OpenLLM v1 score after quantization

Multimodal Support

Supports joint input processing of images and text

Efficient Inference

The optimized model is suitable for deployment in resource-constrained environments

Model Capabilities

Text generation

Image caption generation

Multimodal understanding

Instruction following

Use Cases

Content generation

Image caption generation

Generate detailed descriptions based on input images

Can produce accurate and detailed image captions

Intelligent assistant

Multimodal dialogue

Process complex dialogues containing images and text

Capable of understanding and responding to complex queries involving visual information

🚀 Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g

This model is obtained by quantizing the weights of Mistral-Small-3.1-24B-Instruct-2503 to INT4, significantly reducing disk size and GPU memory requirements.

🚀 Quick Start

This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503 in which the bit width of the language-model weights is reduced from 16 to 4 bits.
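The effect of going from 16 to 4 bits can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope calculation, not a measurement: it assumes roughly 24 billion quantized weights (taken from the "24B" in the model name, which also counts parts that are not quantized), a group size of 128 (the "128g" in the model name), and one 16-bit scale stored per group. The function names are hypothetical.

```python
def effective_bits(bits_per_weight, group_size=None, scale_bits=16):
    """Bits stored per weight, including one scale shared by each group."""
    overhead = scale_bits / group_size if group_size else 0.0
    return bits_per_weight + overhead


def weight_gib(n_weights, bits):
    """Storage for n_weights at the given bits per weight, in GiB."""
    return n_weights * bits / 8 / 2**30


# ASSUMPTION: about 24e9 weights, read off the model name. The real count of
# quantized weights is smaller because some modules keep their original precision.
N_WEIGHTS = 24e9

bf16_gib = weight_gib(N_WEIGHTS, effective_bits(16))
# ASSUMPTION: one 16-bit scale per group of 128 weights, no zero-point
# (the scheme is symmetric).
int4_gib = weight_gib(N_WEIGHTS, effective_bits(4, group_size=128))
reduction = 1 - int4_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
print(f"reduction:      {reduction:.1%}")
```

Under these assumptions the per-group scales add 0.125 bits per weight, so the reduction comes out slightly under 75%. Activations and the KV cache are not included in this estimate.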

✨ Features

- Quantization Optimization: The weights of the linear layers inside the language_model transformer blocks are quantized to the INT4 data type, reducing disk size and GPU memory requirements by approximately 75%.
- Precision Preservation: The vision model and the multimodal projection are kept in their original precision.
- Quantization Scheme: A symmetric per-group quantization scheme with a group size of 128 is used, with the GPTQ algorithm applied to choose the quantized values.
- Checkpoint Format: The model checkpoint is saved in the compressed_tensors format.
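To make "symmetric per-group quantization with a group size of 128" concrete, here is a minimal pure-Python sketch. It uses plain round-to-nearest, not GPTQ: GPTQ additionally adjusts the remaining weights to compensate for the rounding error using calibration data, which is omitted here. The helper names, the max-abs scale rule, and the random test weights are assumptions for illustration only.

```python
import random

GROUP_SIZE = 128       # group size stated in the feature list
QMIN, QMAX = -8, 7     # signed 4-bit integer range


def quantize_group(weights):
    """Symmetric round-to-nearest quantization of one group (no zero-point)."""
    max_abs = max(abs(w) for w in weights)
    # ASSUMPTION: the scale maps the largest magnitude onto QMAX.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from 4-bit codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split one weight row into consecutive groups and quantize each."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # hypothetical weights
groups = quantize_row(row)
restored = [w for codes, scale in groups for w in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.6f}")
```

Because each group of 128 weights has its own scale, an outlier only degrades the resolution of its own group rather than the whole row, which is the usual motivation for per-group over per-channel scales.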

📚 Documentation

Evaluation

This model was evaluated on the OpenLLM v1 benchmarks, and the model outputs were generated with the vLLM engine.

Model	ArcC	GSM8k	Hellaswag	MMLU	TruthfulQA-mc2	Winogrande	Average	Recovery
Mistral-Small-3.1-24B-Instruct-2503	0.7125	0.8848	0.8576	0.8107	0.6409	0.8398	0.7910	1.0000
Mistral-Small-3.1-24B-Instruct-2503-INT4 (this)	0.7073	0.8711	0.8530	0.8062	0.6252	0.8256	0.7814	0.9878
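The Average and Recovery columns can be rechecked from the per-task scores. The sketch below assumes that Average is the unweighted mean of the six tasks and that Recovery is the quantized average divided by the baseline average; the table does not state this, but both assumptions agree with the reported numbers. The per-task ratio is an extra illustration that is not part of the table.

```python
BASELINE = {
    "ArcC": 0.7125, "GSM8k": 0.8848, "Hellaswag": 0.8576,
    "MMLU": 0.8107, "TruthfulQA-mc2": 0.6409, "Winogrande": 0.8398,
}
INT4 = {
    "ArcC": 0.7073, "GSM8k": 0.8711, "Hellaswag": 0.8530,
    "MMLU": 0.8062, "TruthfulQA-mc2": 0.6252, "Winogrande": 0.8256,
}


def average(scores):
    """Unweighted mean over all tasks."""
    return sum(scores.values()) / len(scores)


baseline_avg = average(BASELINE)
int4_avg = average(INT4)
recovery = int4_avg / baseline_avg

# Per-task ratio of quantized to baseline score.
per_task = {task: INT4[task] / BASELINE[task] for task in BASELINE}
worst_task = min(per_task, key=per_task.get)

print(f"baseline average: {baseline_avg:.4f}")
print(f"INT4 average:     {int4_avg:.4f}")
print(f"recovery:         {recovery:.4f}")
print(f"largest relative drop: {worst_task} ({per_task[worst_task]:.4f})")
```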

Reproduction

The results were obtained using the following commands:

MODEL=ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80"

lm_eval \
  --model vllm \
  --model_args "$MODEL_ARGS" \
  --tasks openllm \
  --batch_size auto

Usage

Package Update:
- To use the model in transformers, install the Mistral-3 release tag: pip install git+https://github.com/huggingface/transformers@v4.49.0-Mistral-3
- To use the model in vLLM, install vllm>=0.8.0.

💻 Usage Examples

Basic Usage

# pip install accelerate compressed-tensors
# (compressed-tensors is needed to load the compressed_tensors checkpoint format)

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g"

model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

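# Apply the chat template; the processor fetches the image from its URL.
# .to(...) moves tensors to the model device and casts floating-point
# tensors (the pixel values) to bfloat16.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Remember the prompt length so only newly generated tokens are decoded.
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    # Greedy decoding, limited to 100 new tokens.
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)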
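Since the model can also be served with vLLM (version 0.8.0 or later, as noted under Usage), a roughly equivalent offline call might look like the sketch below. It is untested against this checkpoint: the helper names are hypothetical, and the values for max_model_len, gpu_memory_utilization, and limit_mm_per_prompt are assumptions that may need adjusting for your GPU. vLLM is imported inside the function so that the message builder can be used on its own.

```python
MODEL_ID = "ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g"
IMAGE_URL = (
    "https://huggingface.co/datasets/huggingface/"
    "documentation-images/resolve/main/bee.jpg"
)


def build_messages(image_url, prompt):
    """Build OpenAI-style chat messages with one image part and one text part."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        },
    ]


def describe_image(image_url=IMAGE_URL, prompt="Describe this image in detail."):
    """Run one chat request through vLLM's offline LLM.chat API."""
    # Imported lazily: requires vllm>=0.8.0 and a GPU with enough memory.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL_ID,
        max_model_len=8192,             # ASSUMPTION: room for image tokens
        gpu_memory_utilization=0.80,    # same value as the evaluation command
        limit_mm_per_prompt={"image": 1},
    )
    params = SamplingParams(temperature=0.0, max_tokens=100)  # greedy decoding
    outputs = llm.chat(build_messages(image_url, prompt), sampling_params=params)
    return outputs[0].outputs[0].text
```

Calling print(describe_image()) would download the checkpoint and run a single generation; nothing is executed until that call is made.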

📄 License

This model is licensed under the Apache-2.0 license.
