Mistral Small 3.1 24B Instruct 2503 Quantized.w8a8

Developed by RedHatAI

An INT8-quantized version of Mistral-Small-3.1-24B-Instruct-2503, optimized by Red Hat and Neural Magic for fast-response, low-latency scenarios.

Safetensors

Multilingual | License: Apache-2.0 | #INT8-quantization #multimodal-understanding #low-latency-inference

Downloads: 833

Release date: 4/15/2025

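Model Overview

This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503. Quantizing the weights and activations to INT8 substantially reduces GPU memory requirements and improves compute efficiency.

Model Features

Efficient quantization

INT8 quantization reduces GPU memory requirements by approximately 50% and increases compute throughput by approximately 2x.

Multilingual support

Supports text generation and understanding in 24 languages.

Multi-purpose

Suited to conversational agents, function calling, document understanding, visual understanding, and other tasks.

Fast response

The optimized model is particularly well suited to applications that require low latency.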

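Model Capabilities

Text generation

Multilingual processing

Conversational agents

Function calling

Long document understanding

Visual understanding

Programming and math reasoning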

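Use Cases

Conversational systems

Customer service chatbots

Deploy fast-response customer service agents

Reduce response latency and improve the user experience

Developer tools

Code assistance

Help developers with programming and debugging

Improve development efficiency

Content understanding

Long document summarization

Quickly understand and summarize long documents

Improve information-processing efficiency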

🚀 Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

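This model is a quantized version of Mistral-Small-3.1-24B-Instruct-2503, with activations and weights quantized to the INT8 data type. Quantization reduces GPU memory requirements and disk footprint while increasing compute throughput. The model suits fast-response conversation, low-latency function calling, and similar scenarios.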

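Supported Languages

English
French
German
Spanish
Portuguese
Italian
Japanese
Korean
Russian
Chinese
Arabic
Persian
Indonesian
Malay
Nepali
Polish
Romanian
Serbian
Swedish
Turkish
Ukrainian
Vietnamese
Hindi
Bengali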

License

This project is licensed under Apache-2.0.

Library name

vllm

Base model

mistralai/Mistral-Small-3.1-24B-Instruct-2503

Task type

Image-Text-to-Text

🚀 Quick Start

This model can be deployed efficiently using the vLLM backend, as shown in the example below:

from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

prompts = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
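As a rough sketch of the OpenAI-compatible route: assuming a server has been started with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8` and is listening on vLLM's default `http://localhost:8000`, a chat request can be sent with the Python standard library alone. The helper names `build_chat_payload` and `send_chat_request`, the host, and the port are illustrative assumptions, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"


def build_chat_payload(prompt, temperature=0.7, top_p=0.8, max_tokens=256):
    """Build an OpenAI-style chat completion request body (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text.

    Assumes a vLLM server is already running at base_url (an assumption).
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no server; print it to inspect the request body.
payload = build_chat_payload("Give me a short introduction to large language models.")
print(json.dumps(payload, indent=2))
```

With a server running, `print(send_chat_request(payload))` would print the model's reply. The sampling values mirror the offline example above.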

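✨ Key Features

Model Overview

Model architecture: Mistral3ForConditionalGeneration
- Input: text / image
- Output: text
Model optimizations:
- Activation quantization: INT8
- Weight quantization: INT8
Intended use cases:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.
Out of scope: use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages not officially supported by the model.
Release date: April 15, 2025
Version: 1.0
Model developers: Red Hat (Neural Magic)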

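Model Optimizations

This model was obtained by quantizing the activations and weights of Mistral-Small-3.1-24B-Instruct-2503 to the INT8 data type. This reduces the number of bits used to represent weights and activations from 16 to 8, cutting GPU memory requirements by approximately 50% and increasing matrix-multiplication compute throughput by approximately 2x. Weight quantization also reduces disk space requirements by approximately 50%.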
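The 50% figure follows directly from the bit widths. Here is a back-of-the-envelope sketch, assuming roughly 24 billion parameters (taken from the "24B" in the model name, not an exact count) and counting weight storage only. Layers left unquantized, the KV cache, and runtime activations are ignored, so real savings will differ somewhat.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 24e9  # assumption: "24B" from the model name

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit baseline
int8_gb = weight_memory_gb(NUM_PARAMS, 8)   # INT8 weights
reduction = 1 - int8_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"INT8 weights:   {int8_gb:.0f} GB")
print(f"Reduction:      {reduction:.0%}")
```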

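Only the weights and activations of the linear operators within the transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. Quantization applies a combination of the SmoothQuant and GPTQ algorithms, as implemented in the llm-compressor library.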
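To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization: one static scale per output channel for the weights, and one scale per token computed on the fly for the activations. It illustrates the scheme only. It is not the llm-compressor implementation, it omits SmoothQuant and GPTQ, and all function names are hypothetical.

```python
def symmetric_int8_scale(values):
    """Scale that maps the largest magnitude in `values` to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize_row(values, scale):
    """Round to the nearest integer and clamp to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def quantize_weights_per_channel(weight):
    """Static per-channel: one scale per output channel (row), computed once offline."""
    scales = [symmetric_int8_scale(row) for row in weight]
    return [quantize_row(row, s) for row, s in zip(weight, scales)], scales


def quantize_activations_per_token(activations):
    """Dynamic per-token: one scale per token (row), computed at inference time."""
    scales = [symmetric_int8_scale(row) for row in activations]
    return [quantize_row(row, s) for row, s in zip(activations, scales)], scales


def int8_linear(activations, weight):
    """Integer dot products, rescaled by the token scale and the channel scale."""
    qx, sx = quantize_activations_per_token(activations)
    qw, sw = quantize_weights_per_channel(weight)
    return [
        [sum(a * w for a, w in zip(qx_row, qw_row)) * sx_t * sw_o
         for qw_row, sw_o in zip(qw, sw)]
        for qx_row, sx_t in zip(qx, sx)
    ]


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]       # 2 output channels x 3 inputs
activations = [[1.0, 2.0, -3.0], [0.01, 0.02, 0.0]]  # 2 tokens x 3 features

approx = int8_linear(activations, weight)
exact = [[sum(a * w for a, w in zip(x, row)) for row in weight] for x in activations]
print("INT8 result: ", approx)
print("Float result:", exact)
```

The second token has much smaller magnitudes than the first. Because each token gets its own scale, it still uses the full INT8 range instead of being rounded toward zero.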

📦 Model Creation

This model was created with llm-compressor by running the code snippet below:

<details>
  <summary>Creation details</summary>

```python
from transformers import AutoProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration
from datasets import load_dataset, interleave_datasets
from PIL import Image
import io
import torch

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    inputs = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**inputs, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))

        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )

    inputs = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**inputs, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave subsets
ds = interleave_datasets((dsv, dst))

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Define data collator (one sample per batch; None fields are skipped)
def data_collator(batch):
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
    num_calibration_samples=num_text_samples + num_vision_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>
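The `smoothing_strength=0.8` setting in the recipe plays the role of the migration strength α in SmoothQuant, which shifts activation outliers into the weights before quantization. Below is a minimal sketch of that rescaling, assuming the formulation from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1-α). It is illustrative pure Python with hypothetical names and made-up values, not the llm-compressor code.

```python
def smoothing_scales(activations, weight, alpha=0.8):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

    Assumes no channel is all zeros in either the activations or the weights.
    """
    scales = []
    for j in range(len(weight)):
        act_max = max(abs(row[j]) for row in activations)
        weight_max = max(abs(w) for w in weight[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


def apply_smoothing(activations, weight, scales):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    smoothed_x = [[x / s for x, s in zip(row, scales)] for row in activations]
    smoothed_w = [[w * s for w in row] for row, s in zip(weight, scales)]
    return smoothed_x, smoothed_w


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Input channel 1 carries an activation outlier (values near 50).
X = [[0.5, 50.0, -1.0], [-0.2, -40.0, 0.8]]   # tokens x input channels
W = [[0.3, -0.1], [0.02, 0.01], [-0.5, 0.4]]  # input channels x output channels

s = smoothing_scales(X, W, alpha=0.8)
X_s, W_s = apply_smoothing(X, W, s)

print("activation max per channel before:", [max(abs(r[j]) for r in X) for j in range(3)])
print("activation max per channel after: ", [max(abs(r[j]) for r in X_s) for j in range(3)])
```

The product `X_s @ W_s` equals `X @ W` up to floating-point error, so the layer output is unchanged while the activation range becomes easier to quantize.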

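## 📚 Model Evaluation
This model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-pro, GPQA, HumanEval, and MBPP. Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus). [vLLM](https://docs.vllm.ai/en/stable/) was used as the inference engine in all cases.
<details>
  <summary>Evaluation details</summary>

### Evaluation commands for non-coding tasks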
**MMLU**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```


**ARC Challenge**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```


**GSM8k**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```


**Hellaswag**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```


**Winogrande**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```


**TruthfulQA**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```


**MMLU-pro**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```


**MMMU**

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmmu \
  --apply_chat_template \
  --batch_size auto
```


**ChartQA**


### Evaluation commands for coding tasks
The commands below target HumanEval. To evaluate MBPP, use the same commands and replace the dataset name.

#### Generation

```bash
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

#### Sanitization

```bash
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2
```

#### Evaluation

```bash
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized
```

</details>
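The coding results are reported as pass@1 from 50 samples per problem. As a point of reference, here is the widely used unbiased pass@k estimator from the HumanEval paper, 1 - C(n-c, k)/C(n, k), where n is the number of samples and c the number that pass. Whether evalplus computes its score exactly this way is an assumption here, and the per-problem counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical correct-sample counts for four problems, n = 50 samples each.
n_samples = 50
correct_counts = [50, 42, 0, 17]

scores = [pass_at_k(n_samples, c, k=1) for c in correct_counts]
mean_pass_at_1 = sum(scores) / len(scores)
print(f"pass@1 = {100 * mean_pass_at_1:.2f}")
```

For k = 1 the estimator reduces to c / n per problem, so the benchmark score is the average fraction of passing samples.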

### Accuracy
| Category | Benchmark | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 (this model) | Recovery |
| ---- | ---- | ---- | ---- | ---- |
| <strong>OpenLLM v1</strong> | MMLU (5-shot) | 80.67 | 80.40 | 99.7% |
| <strong>OpenLLM v1</strong> | ARC Challenge (25-shot) | 72.78 | 73.46 | 100.9% |
| <strong>OpenLLM v1</strong> | GSM-8K (5-shot, strict-match) | 58.68 | 61.18 | 104.3% |
| <strong>OpenLLM v1</strong> | Hellaswag (10-shot) | 83.70 | 82.26 | 98.3% |
| <strong>OpenLLM v1</strong> | Winogrande (5-shot) | 83.74 | 80.90 | 96.6% |
| <strong>OpenLLM v1</strong> | TruthfulQA (0-shot, mc2) | 70.62 | 69.15 | 97.9% |
| <strong>OpenLLM v1</strong> | <strong>Average</strong> | <strong>75.03</strong> | <strong>74.56</strong> | <strong>99.4%</strong> |
|  | MMLU-Pro (5-shot) | 67.25 | 66.54 | 98.9% |
|  | GPQA CoT main (5-shot) | 42.63 | 44.64 | 104.7% |
|  | GPQA CoT diamond (5-shot) | 45.96 | 41.92 | 91.2% |
| <strong>Coding</strong> | HumanEval pass@1 | 84.70 | 84.20 | 99.4% |
| <strong>Coding</strong> | HumanEval+ pass@1 | 79.50 | 81.00 | 101.9% |
| <strong>Coding</strong> | MBPP pass@1 | 71.10 | 72.10 | 101.4% |
| <strong>Coding</strong> | MBPP+ pass@1 | 60.60 | 62.10 | 100.7% |
| <strong>Vision</strong> | MMMU (0-shot) | 52.11 | 53.11 | 101.9% |
| <strong>Vision</strong> | ChartQA (0-shot) | 81.36 | 82.36 | 101.2% |
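
The recovery column expresses the quantized model's score as a percentage of the baseline score. A minimal sketch of that calculation, using three rows copied from the table (the function name `recovery_pct` is illustrative):

```python
def recovery_pct(baseline, quantized):
    """Quantized score as a percentage of the baseline score, to one decimal."""
    return round(100 * quantized / baseline, 1)


# (baseline, quantized) pairs taken from the accuracy table.
rows = {
    "MMLU (5-shot)": (80.67, 80.40),
    "Winogrande (5-shot)": (83.74, 80.90),
    "GPQA CoT diamond (5-shot)": (45.96, 41.92),
}

for name, (baseline, quantized) in rows.items():
    print(f"{name}: {recovery_pct(baseline, quantized)}%")
```

A value above 100% means the quantized model scored higher than the baseline on that benchmark.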

## 📄 License
This project is licensed under Apache-2.0.
