# 🚀 Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16
This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503. Quantization significantly reduces disk size and GPU memory requirements while maintaining high performance across a wide range of tasks.
## 🚀 Quick Start
You can quickly start using this model with the vLLM backend. Here is a simple example:
```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompts = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
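For example, after launching a local server with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16`, the endpoint can be queried with the OpenAI Python client. A minimal sketch, assuming vLLM's default host and port:

```python
from openai import OpenAI

# Assumes a local vLLM server on the default port; the api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```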
## ✨ Features

### Model Overview
- Model Architecture: Mistral3ForConditionalGeneration
  - Input: Text / Image
  - Output: Text
- Model Optimizations:
  - Weight quantization: INT4
- Intended Use Cases:
  - Ideal for fast-response conversational agents.
  - Suitable for low-latency function calling (see the sketch after this list).
  - Can be fine-tuned for subject matter experts.
  - Enables local inference for hobbyists and organizations handling sensitive data.
  - Capable of programming and math reasoning.
  - Good at long document understanding.
  - Supports visual understanding.
- Out-of-scope: Do not use in any manner that violates applicable laws or regulations (including trade compliance laws). Avoid using in languages not officially supported by the model.
- Release Date: 04/15/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)
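Since low-latency function calling is listed as an intended use case, here is a minimal, hypothetical sketch of passing a tool schema through the chat template. The `get_weather` function is an illustrative assumption, not part of this model card, and relies on recent transformers versions rendering typed, docstring-annotated Python functions into a tool schema:

```python
from transformers import AutoProcessor

# Hypothetical tool for illustration only; not part of the model card.
def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16"
processor = AutoProcessor.from_pretrained(model_id)

# The rendered prompt embeds the tool schema; generate with vLLM as in the
# Quick Start, and the model should emit a tool call when appropriate.
prompt = processor.apply_chat_template(
    [{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
```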
### Model Optimizations
This model was obtained by quantizing the weights of Mistral-Small-3.1-24B-Instruct-2503 to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
Only the weights of the linear operators within transformer blocks are quantized, using a symmetric per-group scheme with group size 128. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
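To make the ~75% figure concrete, here is a back-of-the-envelope sketch (an illustration only; it counts 4-bit weights plus one 16-bit scale per group of 128 and ignores unquantized layers such as embeddings and the vision tower):

```python
# Approximate weight storage for ~24B parameters, before and after W4A16.
params = 24e9
group_size = 128

bf16_gb = params * 2 / 1e9                                  # 16 bits per weight
int4_gb = (params * 0.5 + (params / group_size) * 2) / 1e9  # 4-bit weights + scales

print(f"BF16: {bf16_gb:.0f} GB, INT4: {int4_gb:.1f} GB "
      f"(~{100 * (1 - int4_gb / bf16_gb):.0f}% smaller)")
```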
## 💻 Usage Examples

### Deployment Example
Deployment uses the same vLLM code as the Quick Start example above: point `model_id` at this repository and set `tensor_parallel_size` to the number of available GPUs.
### Creation Example

<details>
<summary>Creation details</summary>
```python
from transformers import AutoProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration
from datasets import load_dataset, interleave_datasets
from PIL import Image
import io

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    input = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))
        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )
    input = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave subsets
ds = interleave_datasets((dsv, dst))

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    sequential_targets=["MistralDecoderLayer"],
    dampening_frac=0.01,
    targets="Linear",
    scheme="W4A16",
)

# Define data collator
def data_collator(batch):
    import torch

    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
    num_calibration_samples=num_text_samples + num_vision_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>
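As a quick sanity check, the saved checkpoint can be loaded back into vLLM directly from the local path (a minimal sketch, assuming the `save_path` produced by the script above):

```python
from vllm import LLM, SamplingParams

# Load the locally saved compressed-tensors checkpoint.
llm = LLM(model="Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16")
outputs = llm.generate(["What is W4A16 quantization?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```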
### Evaluation Example
The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP. Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus). [vLLM](https://docs.vllm.ai/en/stable/) is used as the engine in all cases.
<details>
<summary>Evaluation details</summary>
**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**ARC Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**GSM8k**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```

**MMLU-Pro**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**MMMU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto
```

**ChartQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto
```

**Coding**

The commands below can be used for MBPP by simply replacing the dataset name.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16_vllm_temp_0.2-sanitized
```
</details>
### Accuracy
| Category | Benchmark | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16<br>(this model) | Recovery |
| --- | --- | --- | --- | --- |
| **OpenLLM v1** | MMLU (5-shot) | 80.67 | 79.74 | 98.9% |
| | ARC Challenge (25-shot) | 72.78 | 72.18 | 99.2% |
| | GSM-8K (5-shot, strict-match) | 58.68 | 59.59 | 101.6% |
| | Hellaswag (10-shot) | 83.70 | 83.25 | 99.5% |
| | Winogrande (5-shot) | 83.74 | 83.43 | 99.6% |
| | TruthfulQA (0-shot, mc2) | 70.62 | 69.56 | 98.5% |
| | **Average** | **75.03** | **74.63** | **99.5%** |
| | MMLU-Pro (5-shot) | 67.25 | 66.56 | 99.0% |
| | GPQA CoT main (5-shot) | 42.63 | 47.10 | 110.5% |
| | GPQA CoT diamond (5-shot) | 45.96 | 44.95 | 97.8% |
| **Coding** | HumanEval pass@1 | 84.70 | 84.60 | 99.9% |
| | HumanEval+ pass@1 | 79.50 | 79.90 | 100.5% |
| | MBPP pass@1 | 71.10 | 70.10 | 98.6% |
| | MBPP+ pass@1 | 60.60 | 60.70 | 100.2% |
| **Vision** | MMMU (0-shot) | 52.11 | 50.11 | 96.2% |
| | ChartQA (0-shot) | 81.36 | 80.92 | 99.5% |
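The Recovery column is the quantized model's score expressed as a percentage of the baseline score, e.g. for the ARC Challenge row:

```python
# Recovery = quantized score / baseline score (ARC Challenge row from the table).
baseline, quantized = 72.78, 72.18
print(f"Recovery: {quantized / baseline * 100:.1f}%")  # -> Recovery: 99.2%
```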
## 📄 License
This model is licensed under the Apache 2.0 license.