Mistral Small 3.1 24B Instruct 2503 Quantized.w8a8

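Developed by RedHatAI

An INT8-quantized version of Mistral-Small-3.1-24B-Instruct-2503, optimized by Red Hat and Neural Magic for fast-response, low-latency use cases.

Text-to-Text

Safetensors

Multilingual

License: Apache-2.0

#INT8 quantization #Multimodal understanding #Low-latency inference

Downloads: 833

Release date: April 15, 2025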

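Model Overview

This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503. Quantizing both weights and activations to INT8 significantly reduces GPU memory requirements and improves compute efficiency.

Model Features

Efficient quantization

INT8 quantization reduces GPU memory requirements by about 50% and increases matrix-multiplication throughput by about 2x.

Multilingual support

Supports text generation and understanding in 24 languages.

Versatile applications

Suited to conversational agents, function calling, document understanding, visual understanding, and other tasks.

Fast response

The optimized model is particularly well suited to latency-sensitive applications.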

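Model Capabilities

Text generation

Multilingual processing

Conversational agents

Function calling

Long-document understanding

Visual understanding

Programming and mathematical reasoning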

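Use Cases

Conversational systems

Customer service chatbots

Deploy fast-responding customer service agents

Reduces response latency and improves the user experience

Developer tools

Code assistance

Helps developers write and debug code

Improves development efficiency

Content understanding

Long-document summarization

Quickly understand and summarize long documents

Improves information-processing efficiency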

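# 🚀 Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503 in which both activations and weights are quantized to the INT8 data type. This reduces GPU memory requirements and disk footprint while increasing compute throughput. It is suited to fast-response conversation, low-latency function calling, and similar scenarios.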

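Supported languages

English
French
German
Spanish
Portuguese
Italian
Japanese
Korean
Russian
Chinese
Arabic
Persian
Indonesian
Malay
Nepali
Polish
Romanian
Serbian
Swedish
Turkish
Ukrainian
Vietnamese
Hindi
Bengali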

License

This project is licensed under Apache-2.0.

Library name

vllm

Base model

mistralai/Mistral-Small-3.1-24B-Instruct-2503

Task type

Image-Text-to-Text

## 🚀 Quick Start

This model can be deployed efficiently with the vLLM backend, as in the following example:

from vllm import LLM, SamplingParams
from transformers import AutoProcessor

# The INT8 (W8A8) checkpoint described in this model card
model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

# Render the chat messages into a single prompt string
prompts = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

# First completion of the first (and only) prompt
generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
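As a rough sketch of the OpenAI-compatible route: assuming a server was started with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8` and listens on vLLM's default `http://localhost:8000`, a client can post to the `/v1/chat/completions` endpoint using only the standard library. The host, port, helper names, and sampling values below are illustrative assumptions and do not come from the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
# Assumption: a local vLLM server on its default port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, max_tokens=256, temperature=0.7, top_p=0.8):
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def chat(prompt, base_url=BASE_URL):
    """Send the request to a running server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Printing the payload needs no server; chat() does.
print(json.dumps(build_chat_request("Hello"), indent=2))
```

With the server running, `chat("Give me a short introduction to large language models.")` returns the generated text; without it, `urlopen` raises `URLError`.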

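## ✨ Key Features

### Model Overview

- Model architecture: Mistral3ForConditionalGeneration
  - Input: text / image
  - Output: text
- Model optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended use cases:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and mathematical reasoning.
  - Long-document understanding.
  - Visual understanding.
- Out of scope: use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages not officially supported by the model.
- Release date: April 15, 2025
- Version: 1.0
- Model developers: Red Hat (Neural Magic)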

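### Model Optimizations

This model was obtained by quantizing the activations and weights of Mistral-Small-3.1-24B-Instruct-2503 to the INT8 data type. This reduces the number of bits used to represent weights and activations from 16 to 8, which lowers GPU memory requirements (by approximately 50%) and increases matrix-multiplication compute throughput (by approximately 2x). Weight quantization also reduces disk space requirements by approximately 50%.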
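The 50% figure follows directly from the bytes stored per value. A back-of-the-envelope check, assuming roughly 24 billion parameters (taken from the model name) and ignoring the unquantized vision tower and `lm_head`, the quantization scales, and the KV cache:

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 24e9  # assumption: ~24B parameters, per the model name

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)
int8_gb = weight_footprint_gb(NUM_PARAMS, 8)
saving = 1 - int8_gb / bf16_gb

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"INT8 weights:   ~{int8_gb:.0f} GB")
print(f"Reduction:      ~{saving:.0%}")
```

Real checkpoints will deviate somewhat from this estimate because some modules stay in 16-bit precision.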

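Only the weights and activations of the linear operators within Transformer blocks are quantized. Weights use a symmetric, static, per-channel quantization scheme, while activations use a symmetric, dynamic, per-token scheme. Quantization applies a combination of the SmoothQuant and GPTQ algorithms, as implemented in the llm-compressor library.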
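To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization. Weights get one scale per output channel, computed once ahead of time (static); activations get one scale per token, computed at run time (dynamic). It illustrates the arithmetic only; it is not the llm-compressor or vLLM implementation, and the helper names are hypothetical.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_rows(matrix, qmax=127):
    """Quantize each row with its own symmetric scale (zero point is 0)."""
    q_rows, scales = [], []
    for row in matrix:
        scale = symmetric_scale(row, qmax)
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(x_q, x_scales, w_q, w_scales):
    """Integer dot products, rescaled by (token scale * channel scale)."""
    out = []
    for x_row, sx in zip(x_q, x_scales):
        out.append([
            sum(a * b for a, b in zip(x_row, w_row)) * sx * sw
            for w_row, sw in zip(w_q, w_scales)
        ])
    return out


# Weights: rows are output channels, quantized once offline (static).
weights = [[0.50, -0.25, 0.10], [2.00, 1.00, -4.00]]
w_q, w_scales = quantize_rows(weights)

# Activations: rows are tokens, quantized on the fly (dynamic).
activations = [[1.0, 2.0, 3.0], [-0.1, 0.0, 0.2]]
x_q, x_scales = quantize_rows(activations)

approx = int8_linear(x_q, x_scales, w_q, w_scales)
exact = [[sum(a * b for a, b in zip(x, w)) for w in weights] for x in activations]
print("approx:", approx)
print("exact: ", exact)
```

Because each row carries its own scale, a token or channel with large values does not reduce the resolution available to the others.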

## 📦 Model Creation

This model was created with llm-compressor. The code snippet below shows how:

<details>
  <summary>Creation details</summary>

```python
from transformers import AutoProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration
from datasets import load_dataset, interleave_datasets
from PIL import Image
import io

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    inputs = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**inputs, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))

        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )

    inputs = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**inputs, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave subsets
ds = interleave_datasets((dsv, dst))

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Define data collator (calibration runs one sample per batch)
def data_collator(batch):
    import torch
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
    num_calibration_samples=num_text_samples + num_vision_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

</details>
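The `smoothing_strength=0.8` setting in the recipe corresponds to the migration strength α in SmoothQuant, which shifts activation outliers into the weights before quantization. The sketch below applies the published per-input-channel rescaling to toy data in pure Python. It is an illustration under that assumption, not the llm-compressor code path, and all names and values are hypothetical.

```python
def smoothing_scales(acts, weights, alpha=0.8):
    """s_j = max|X[:, j]|**alpha / max|W[:, j]|**(1 - alpha) per input channel j."""
    n_in = len(acts[0])
    scales = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(row[j]) for row in weights)
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul_xwT(x, w):
    """Compute X @ W^T, where rows of W are output channels."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


# Toy data: input channel 1 of the activations carries an outlier.
acts = [[0.5, 40.0, -1.0], [-0.2, -60.0, 0.8]]
weights = [[0.3, 0.1, -0.2], [-0.4, 0.2, 0.5]]

s = smoothing_scales(acts, weights, alpha=0.8)
smoothed_acts = [[v / sj for v, sj in zip(row, s)] for row in acts]
smoothed_weights = [[v * sj for v, sj in zip(row, s)] for row in weights]

# The layer output is unchanged, but the activation range shrinks.
before = matmul_xwT(acts, weights)
after = matmul_xwT(smoothed_acts, smoothed_weights)
print("activation range before:", max(abs(v) for r in acts for v in r))
print("activation range after: ", max(abs(v) for r in smoothed_acts for v in r))
```

Dividing activations and multiplying weights by the same per-channel factor leaves the product intact, so the only effect is to make the activations easier to represent in 8 bits.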

## 📚 Model Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-pro, GPQA, HumanEval, and MBPP. Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and coding tasks with a fork of [evalplus](https://github.com/neuralmagic/evalplus). All evaluations used [vLLM](https://docs.vllm.ai/en/stable/) as the inference engine.
<details>
  <summary>Evaluation details</summary>

### Evaluation commands for non-coding tasks
**MMLU**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks mmlu \
      --num_fewshot 5 \
      --apply_chat_template \
      --fewshot_as_multiturn \
      --batch_size auto


**ARC Challenge**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks arc_challenge \
      --num_fewshot 25 \
      --apply_chat_template \
      --fewshot_as_multiturn \
      --batch_size auto


**GSM8k**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks gsm8k \
      --num_fewshot 8 \
      --apply_chat_template \
      --fewshot_as_multiturn \
      --batch_size auto


**Hellaswag**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks hellaswag \
      --num_fewshot 10 \
      --apply_chat_template \
      --fewshot_as_multiturn \
      --batch_size auto


**Winogrande**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks winogrande \
      --num_fewshot 5 \
      --apply_chat_template \
      --fewshot_as_multiturn \
      --batch_size auto


**TruthfulQA**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks truthfulqa \
      --num_fewshot 0 \
      --apply_chat_template \
      --batch_size auto


**MMLU-pro**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks mmlu_pro \
      --num_fewshot 5 \
      --apply_chat_template \
      --fewshot_as_multiturn \
      --batch_size auto


**MMMU**

    lm_eval \
      --model vllm \
      --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunk_prefill=True,tensor_parallel_size=2 \
      --tasks mmmu \
      --apply_chat_template \
      --batch_size auto


**ChartQA**


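### Evaluation commands for coding tasks
The commands below target HumanEval. The same commands can be used for MBPP by replacing the dataset name.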

#### Code generation

    python3 codegen/generate.py \
      --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
      --bs 16 \
      --temperature 0.2 \
      --n_samples 50 \
      --root "." \
      --dataset humaneval


#### Code sanitization

    python3 evalplus/sanitize.py \
      humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2


#### Code evaluation

    evalplus.evaluate \
      --dataset humaneval \
      --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized

</details>
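With `--n_samples 50`, each problem receives 50 generations, and pass@1 is estimated from how many of them pass the tests. Below is a minimal sketch of the standard unbiased pass@k estimator, assuming `n` samples per problem of which `c` are correct. Whether the evalplus fork computes its score exactly this way is an assumption, and the per-problem counts are made up for illustration; they are not results from this evaluation.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts of correct samples out of n = 50 for three problems.
correct_counts = [50, 25, 0]
scores = [pass_at_k(50, c, 1) for c in correct_counts]
mean_pass_at_1 = sum(scores) / len(scores)
print(mean_pass_at_1)
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems.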

### Accuracy
| Category | Benchmark | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 (this model) | Recovery |
| ---- | ---- | ---- | ---- | ---- |
| <strong>OpenLLM v1</strong> | MMLU (5-shot) | 80.67 | 80.40 | 99.7% |
| <strong>OpenLLM v1</strong> | ARC Challenge (25-shot) | 72.78 | 73.46 | 100.9% |
| <strong>OpenLLM v1</strong> | GSM-8K (5-shot, strict-match) | 58.68 | 61.18 | 104.3% |
| <strong>OpenLLM v1</strong> | Hellaswag (10-shot) | 83.70 | 82.26 | 98.3% |
| <strong>OpenLLM v1</strong> | Winogrande (5-shot) | 83.74 | 80.90 | 96.6% |
| <strong>OpenLLM v1</strong> | TruthfulQA (0-shot, mc2) | 70.62 | 69.15 | 97.9% |
| <strong>OpenLLM v1</strong> | <strong>Average</strong> | <strong>75.03</strong> | <strong>74.56</strong> | <strong>99.4%</strong> |
|  | MMLU-Pro (5-shot) | 67.25 | 66.54 | 98.9% |
|  | GPQA CoT main (5-shot) | 42.63 | 44.64 | 104.7% |
|  | GPQA CoT diamond (5-shot) | 45.96 | 41.92 | 91.2% |
| <strong>Coding</strong> | HumanEval pass@1 | 84.70 | 84.20 | 99.4% |
| <strong>Coding</strong> | HumanEval+ pass@1 | 79.50 | 81.00 | 101.9% |
| <strong>Coding</strong> | MBPP pass@1 | 71.10 | 72.10 | 101.4% |
| <strong>Coding</strong> | MBPP+ pass@1 | 60.60 | 62.10 | 102.5% |
| <strong>Vision</strong> | MMMU (0-shot) | 52.11 | 53.11 | 101.9% |
| <strong>Vision</strong> | ChartQA (0-shot) | 81.36 | 82.36 | 101.2% |
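
The recovery column expresses the quantized score as a percentage of the baseline score. A short sketch that recomputes it for three rows of the table (the function name is illustrative):

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# (baseline, quantized) pairs copied from the accuracy table.
rows = {
    "MMLU (5-shot)": (80.67, 80.40),
    "ARC Challenge (25-shot)": (72.78, 73.46),
    "Hellaswag (10-shot)": (83.70, 82.26),
}
for name, (base, quant) in rows.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that benchmark, which can happen within normal evaluation noise.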

## 📄 License
This project is licensed under Apache-2.0.