Mistral-Small-24B-Instruct-2501-quantized open-source model - low-memory, high-throughput instruction-tuned model

Mistral Small 24B Instruct 2501 Quantized.w8a8

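Developed by RedHatAI

An INT8-quantized version of the 24B-parameter Mistral instruction-tuned model that substantially reduces GPU memory requirements and increases compute throughput.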

Large language model

Safetensors

Multilingual support | License: Apache-2.0 | Tags: #multilingual-chat #low-latency-inference #INT8-quantization

Downloads: 158

Released: March 3, 2025

Model Summary

A quantized version of Mistral-Small-24B-Instruct-2501 that supports multilingual text generation and chat, suited to low-latency inference.

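Model Features

Efficient quantization

Uses the W8A8 scheme (8-bit weights and 8-bit activations), which cuts GPU memory and disk usage by about 50% and raises matrix-multiplication throughput by about 2x.

Multilingual support

Supports text generation and understanding in 24 languages.

Low-latency inference

The optimized model is well suited to chat scenarios that need fast responses.

Enterprise deployment support

Provides deployment options across the Red Hat stack.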

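Model Capabilities

Multilingual text generation

Instruction following

Long-document understanding

Programming assistance

Mathematical reasoning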

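Use Cases

Conversational systems

Customer-service bots

Build low-latency, multilingual customer-service chat systems.

Developer assistance

Code generation

Help developers generate and optimize code snippets.

Education

Math problem solving

Explain and solve math problems.

GSM8K evaluation score: 90.00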

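🚀 Mistral-Small-24B-Instruct-2501 quantized model (w8a8)

This project provides a quantized version of Mistral-Small-24B-Instruct-2501. The optimization substantially reduces GPU memory requirements and increases compute throughput, which makes the model suitable for a range of natural language processing scenarios.

Language Support

The following languages are supported:

English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Persian, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, Bengali

License

Released under the Apache-2.0 license.

Library Name

vllm

Base Model

mistralai/Mistral-Small-24B-Instruct-2501

Task Type

image-text-to-text

🚀 Quick Start

This model can be deployed efficiently with the vLLM backend, as shown in the example below: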

from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

prompts = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the vLLM documentation for more details.
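As a rough sketch of what a client for the OpenAI-compatible endpoint might look like, the snippet below builds a chat completion request using only the standard library. The base URL `http://localhost:8000` and the helper names `build_chat_request` and `send_chat_request` are assumptions for illustration; the sampling values mirror the offline example above.

```python
import json
import urllib.request

# Assumption: a vLLM OpenAI-compatible server is reachable at this address.
BASE_URL = "http://localhost:8000"
MODEL_ID = "RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8"


def build_chat_request(prompt, max_tokens=256, temperature=0.7, top_p=0.8):
    """Build the JSON body for a /v1/chat/completions call (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_chat_request(body, base_url=BASE_URL):
    """POST the body to the server and return the text of the first reply."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


if __name__ == "__main__":
    body = build_chat_request("Give me a short introduction to large language model.")
    print(json.dumps(body, indent=2))
    # Uncomment once a server is running:
    # print(send_chat_request(body))
```

Keeping request construction separate from the network call makes the payload easy to inspect before any server is involved.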

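✨ Key Features

Model Overview

Model architecture: Mistral3ForConditionalGeneration
- Input: text/image
- Output: text
Model optimizations:
- Activation quantization: INT8
- Weight quantization: INT8
Intended use cases:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject-matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long-document understanding.
- Visual understanding.
Out of scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages not officially supported by the model.
Release date: March 3, 2025
Version: 1.0
Model developers: Red Hat (Neural Magic)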

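Model Optimization Details

This model was obtained by quantizing the activations and weights of Mistral-Small-24B-Instruct-2501 to the INT8 data type. The optimization reduces the number of bits used to represent weights and activations from 16 to 8, which lowers GPU memory requirements by approximately 50% and increases matrix-multiplication compute throughput by approximately 2x. Weight quantization also reduces disk space requirements by approximately 50%.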
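The 50% figure follows from simple arithmetic on the storage width. A minimal sketch, assuming a nominal 24 billion parameters (the exact count differs, and unquantized layers plus quantization scales add some overhead that this ignores):

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal "24B" taken from the model name, not an exact count.
NUM_PARAMS = 24e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int8_gb = weight_memory_gb(NUM_PARAMS, 8)
saving = 1 - int8_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.0f} GB")   # 48 GB
print(f"INT8 weights:   {int8_gb:.0f} GB")   # 24 GB
print(f"Reduction:      {saving:.0%}")       # 50%
```

This covers weights only; the KV cache and runtime buffers also consume GPU memory when serving.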

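Only the weights and activations of the linear operators inside the transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, while activations are quantized with a symmetric dynamic per-token scheme. Quantization applies a combination of the SmoothQuant and GPTQ algorithms, as implemented in the llm-compressor library.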
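To make the two schemes concrete, here is a small pure-Python illustration of symmetric quantization with one scale per row. It is only a sketch of the idea, not the llm-compressor or vLLM implementation: real kernels run INT8 matrix multiplications on the GPU, and GPTQ chooses the rounded weights more carefully than the plain rounding used here. The function names and sample values are made up.

```python
def quantize_rows_symmetric(matrix, num_bits=8):
    """Quantize each row with its own symmetric scale (zero point fixed at 0)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(value) for value in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append(
            [max(-qmax, min(qmax, round(value / scale))) for value in row]
        )
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map integer codes back to approximate real values."""
    return [[q * scale for q in row] for row, scale in zip(quantized, scales)]


# Weights: one row per output channel. Scales are computed once, offline ("static").
weights = [[0.50, -1.00, 0.25], [0.02, 0.01, -0.04]]
w_q, w_scales = quantize_rows_symmetric(weights)

# Activations: one row per token. Scales are recomputed for every input ("dynamic").
activations = [[3.0, -6.0, 1.5], [0.1, 0.2, -0.4]]
a_q, a_scales = quantize_rows_symmetric(activations)

print(w_q, w_scales)
print(a_q, a_scales)
```

Giving every channel or token its own scale keeps a row with small values from being crushed by a large outlier elsewhere in the tensor.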

📦 Installation Guide

Deploy on Red Hat AI Inference Server

podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
 --ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
--name=vllm \
registry.access.redhat.com/rhaiis/rh-vllm-cuda \
vllm serve \
--tensor-parallel-size 8 \
--max-model-len 32768  \
--enforce-eager --model RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8

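See the Red Hat AI Inference Server documentation for more details.

Deploy on Red Hat Enterprise Linux AI

# Download the model from the Red Hat registry via docker
# Note: the model is downloaded to ~/.cache/instructlab/models unless --model-dir is specified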
ilab model download --repository docker://registry.redhat.io/rhelai1/mistral-small-24b-instruct-2501-quantized-w8a8:1.5

# Serve the model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/mistral-small-24b-instruct-2501-quantized-w8a8

# Chat with the model
ilab model chat --model ~/.cache/instructlab/models/mistral-small-24b-instruct-2501-quantized-w8a8

See the Red Hat Enterprise Linux AI documentation for more details.

Deploy on Red Hat OpenShift AI

# Set up the vllm server with a ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

# Attach the model to the vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: mistral-small-24b-instruct-2501-quantized-w8a8 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: mistral-small-24b-instruct-2501-quantized-w8a8         # specify the model name. This value is used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'  # this is model specific
          memory: 8Gi  # this is model specific
          nvidia.com/gpu: '1'  # this is accelerator specific
        requests:  # the same applies to this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime  # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-mistral-small-24b-instruct-2501-quantized-w8a8:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

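# Make sure you are first in the project where you want to deploy the model
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml

# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \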
        -H "Content-Type: application/json" \
        -d '{
    "model": "mistral-small-24b-instruct-2501-quantized-w8a8",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'

See the Red Hat OpenShift AI documentation for more details.
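The curl request above sets `"stream": true`, so the server replies with server-sent events rather than a single JSON body. The sketch below shows one way to reassemble such a stream in Python, assuming the usual OpenAI-style format of `data: {...}` lines terminated by `data: [DONE]`. The sample lines are hand-written for illustration, not captured server output, and `parse_stream` is a hypothetical helper.

```python
import json


def parse_stream(lines):
    """Collect generated text and the usage block from SSE lines."""
    text_parts, usage = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
        # With include_usage set, a late chunk carries token counts.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(text_parts), usage


# Hand-written sample lines (an assumption about the wire format).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Bees"}}]}',
    'data: {"choices": [{"delta": {"content": " fly."}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 15, "completion_tokens": 2, "total_tokens": 17}}',
    "data: [DONE]",
]

text, usage = parse_stream(sample)
print(text)
print(usage)
```

In a real client the `lines` argument would be the decoded lines of the HTTP response body, read incrementally.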

🔧 Technical Details

Model Creation

This model was created with llm-compressor using the code below:

from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load the model
model_stub = "mistralai/Mistral-Small-24B-Instruct-2501"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Process the calibration data
def preprocess_text(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False, add_generation_prompt=False)
    return tokenizer(text, padding=False, max_length=max_seq_len, truncation=True)

ds = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_samples))
ds = ds.map(preprocess_text, remove_columns=ds.column_names)

# Configure the quantization algorithms and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.9,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        ignore=["lm_head"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.1,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
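The recipe above sets `smoothing_strength=0.9`. As background, the scaling rule from the SmoothQuant paper divides each activation channel j by a factor s_j = max|X_j|^α / max|W_j|^(1-α) and multiplies the matching weight row by the same factor. This leaves the layer output unchanged while moving outliers from the activations into the weights. The sketch below illustrates that rule in plain Python, assuming `smoothing_strength` plays the role of α; the statistics are made-up numbers and the code is not llm-compressor's implementation.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha):
    """Per-input-channel scales following the SmoothQuant scaling rule."""
    return [
        a ** alpha / w ** (1 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def matvec(vec, mat):
    """Compute vec @ mat for a vector and a matrix stored as mat[j][k]."""
    return [
        sum(vec[j] * mat[j][k] for j in range(len(vec)))
        for k in range(len(mat[0]))
    ]


# Hypothetical calibration statistics: channel 0 has an activation outlier.
act_absmax = [40.0, 0.5]
weight_absmax = [0.2, 0.4]
scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.9)

x = [20.0, -0.25]               # one token's activations
w = [[0.1, -0.2], [0.4, 0.3]]   # w[j][k]: input channel j -> output k

# Divide activations and multiply weights by the same per-channel factor.
x_smooth = [x_j / s for x_j, s in zip(x, scales)]
w_smooth = [[w_jk * s for w_jk in row] for row, s in zip(w, scales)]

original = matvec(x, w)
smoothed = matvec(x_smooth, w_smooth)
print(original)
print(smoothed)  # matches `original` up to floating-point error
```

A larger α shifts more of the quantization difficulty onto the weights, which tolerate it better because their scales are fixed offline.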

Model Evaluation

This model was evaluated on the OpenLLM Leaderboard V1 and V2 using the following commands:

OpenLLM Leaderboard V1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

OpenLLM Leaderboard V2

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

Accuracy

OpenLLM Leaderboard V1 evaluation scores

Metric	mistralai/Mistral-Small-24B-Instruct-2501	nm-testing/Mistral-Small-24B-Instruct-2501-quantized.w8a8
ARC-Challenge (Acc-Norm, 25-shot)	72.18	68.86
GSM8K (Strict-Match, 5-shot)	90.14	90.00
HellaSwag (Acc-Norm, 10-shot)	85.05	85.06
MMLU (Acc, 5-shot)	80.69	80.25
TruthfulQA (MC2, 0-shot)	65.55	65.69
Winogrande (Acc, 5-shot)	83.11	81.69
Average score	79.45	78.59
Recovery (%)	100.00	98.92
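The last two rows can be reproduced from the per-task scores. The short check below assumes that recovery is the quantized average divided by the baseline average, expressed as a percentage, which is consistent with the numbers in the table.

```python
# Per-task scores copied from the table above.
baseline = {
    "ARC-Challenge": 72.18,
    "GSM8K": 90.14,
    "HellaSwag": 85.05,
    "MMLU": 80.69,
    "TruthfulQA": 65.55,
    "Winogrande": 83.11,
}
quantized = {
    "ARC-Challenge": 68.86,
    "GSM8K": 90.00,
    "HellaSwag": 85.06,
    "MMLU": 80.25,
    "TruthfulQA": 65.69,
    "Winogrande": 81.69,
}


def average(scores):
    """Unweighted mean over all tasks."""
    return sum(scores.values()) / len(scores)


base_avg = average(baseline)
quant_avg = average(quantized)
recovery = quant_avg / base_avg * 100

print(f"Baseline average:  {base_avg:.2f}")   # 79.45
print(f"Quantized average: {quant_avg:.2f}")  # 78.59
print(f"Recovery:          {recovery:.2f}%")  # 98.92%
```

Most of the gap comes from ARC-Challenge and Winogrande; the other four tasks differ from the baseline by less than half a point.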