# 🚀 Mistral-Small-24B-Instruct-2501
Mistral-Small-24B-Instruct-2501 is an outstanding small large language model: with 24B parameters, it delivers performance competitive with much larger models. It can be deployed locally and suits many scenarios, such as fast-response conversational agents and low-latency function calling.
## 🚀 Quick Start
The Mistral-Small-24B-Instruct-2501 model can be used with the following frameworks:

- vllm: see details here
- transformers: see details here
## ✨ Key Features
- Multilingual: supports dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.
- Agent-centric: offers best-in-class agentic capabilities with native function calling and JSON output.
- Advanced reasoning: state-of-the-art conversational and reasoning capabilities.
- Apache 2.0 license: an open license allowing use and modification for both commercial and non-commercial purposes.
- Context window: a 32k context window.
- System prompts: maintains strong adherence and support for system prompts.
- Tokenizer: uses a Tekken tokenizer with a 131k vocabulary.
## 📦 Installation
### vLLM
We recommend using the vLLM library to implement production-ready inference pipelines.
Note 1: We recommend a relatively low temperature, e.g. `temperature=0.15`.

Note 2: Make sure to add a system prompt to better tailor the model to your needs. If you use the model as a general assistant, we recommend the following system prompt (a request sketch combining both notes follows it):
system_prompt = """You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
Your knowledge base was last updated on 2023-10-01. The current date is 2025-01-30.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""
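Putting the two notes together, here is a minimal request sketch that reuses the `system_prompt` above and the recommended temperature; the `<your-server>` placeholder matches the serving examples below, and the user message is only illustrative:

```py
import requests
import json

# Reuses system_prompt from the block above; endpoint placeholder as in the examples below.
data = {
    "model": "mistralai/Mistral-Small-24B-Instruct-2501",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Plan a weekend in Paris."},
    ],
    "temperature": 0.15,  # the recommended low temperature
}
response = requests.post(
    "http://<your-server>:8000/v1/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": "Bearer token"},
    data=json.dumps(data),
)
print(response.json()["choices"][0]["message"]["content"])
```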
Installation steps:

Make sure to install `vLLM >= 0.6.4`:

```
pip install --upgrade vllm
```

Also make sure to install `mistral_common >= 1.5.2`:

```
pip install --upgrade mistral_common
```
You can also use a ready-to-go Docker image or get one from Docker Hub.
#### Server deployment
We recommend using Mistral-Small-24B-Instruct-2501 in a server/client setting.
1. Spin up a server:

```
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
```
Note: Running Mistral-Small-24B-Instruct-2501 on a GPU requires approximately 55 GB of GPU RAM in bf16 or fp16.

2. You can test the client with a simple Python snippet:
```py
import requests
import json

url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
model = "mistralai/Mistral-Small-24B-Instruct-2501"
messages = [
{
"role": "system",
"content": "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
},
{
"role": "user",
"content": "Give me 5 non-formal ways to say 'See you later' in French."
},
]
data = {"model": model, "messages": messages}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
# Sure, here are five non-formal ways to say "See you later" in French:
#
# 1. À plus tard
# 2. À plus
# 3. Salut
# 4. À toute
# 5. Bisous
#
# ```
# /\_/\
# ( o.o )
# > ^ <
# ```
```
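Equivalently, you can talk to the server with the `openai` Python package instead of raw `requests`; a minimal sketch, assuming `pip install openai` (the API key is a dummy, since vLLM does not check it unless configured to):

```py
from openai import OpenAI

# Point the OpenAI client at the vLLM server started above.
client = OpenAI(base_url="http://<your-server>:8000/v1", api_key="token")

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    messages=[{"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."}],
    temperature=0.15,
)
print(resp.choices[0].message.content)
```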
### Ollama
Ollama can run this model locally on macOS, Windows, and Linux; a client sketch follows the quantization options below.
- 4-bit quantization (default): `ollama run mistral-small`
- 8-bit quantization: `ollama run mistral-small:24b-instruct-2501-q8_0`
- FP16: `ollama run mistral-small:24b-instruct-2501-fp16`
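Ollama also exposes an OpenAI-compatible endpoint on port 11434, so the same kind of client code works against a local instance; a minimal sketch, assuming the default port and the 4-bit `mistral-small` tag pulled above:

```py
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",  # Ollama's OpenAI-compatible endpoint (default port)
    json={
        "model": "mistral-small",
        "messages": [{"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."}],
        "temperature": 0.15,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```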
## 💻 Usage Examples
### vLLM
#### Function calling
Mistral-Small-24B-Instruct-2501 excels at function/tool calling tasks via vLLM. An example:
<details>
<summary>Example</summary>

```py
import requests
import json
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta

url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "mistralai/Mistral-Small-24B-Instruct-2501"

def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)

SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "The state abbreviation, e.g. 'CA' for California",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit for temperature",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Could you please make the below article more concise?\n\nOpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership.",
    },
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "bbc5b7ede",
                "type": "function",
                "function": {
                    "name": "rewrite",
                    "arguments": '{"text": "OpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership."}',
                },
            }
        ],
    },
    {
        "role": "tool",
        "content": '{"action":"rewrite","outcome":"OpenAI is a FOR-profit company."}',
        "tool_call_id": "bbc5b7ede",
        "name": "rewrite",
    },
    {
        "role": "assistant",
        "content": "---\n\nOpenAI is a FOR-profit company.",
    },
    {
        "role": "user",
        "content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?",
    },
]

data = {"model": model, "messages": messages, "tools": tools}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["tool_calls"])
# [{'id': '8PdihwL6d', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}'}}]
```
</details>
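Continuing the example above, the returned tool call can be executed locally and its result appended as a `tool` message (mirroring the `rewrite` turn in the conversation) so the model can produce a final answer. A minimal sketch; `get_current_weather` here is a hypothetical stub, not part of the model card:

```py
# Hypothetical stub standing in for a real weather API.
def get_current_weather(city: str, state: str, unit: str) -> str:
    return json.dumps({"city": city, "state": state, "temperature": 85, "unit": unit})

tool_call = response.json()["choices"][0]["message"]["tool_calls"][0]
args = json.loads(tool_call["function"]["arguments"])

# Append the assistant's tool call and the tool result, then ask again.
messages += [
    {"role": "assistant", "content": "", "tool_calls": [tool_call]},
    {
        "role": "tool",
        "content": get_current_weather(**args),
        "tool_call_id": tool_call["id"],
        "name": tool_call["function"]["name"],
    },
]
data = {"model": model, "messages": messages, "tools": tools}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
```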
#### Offline usage
```py
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Mistral-Small-24B-Instruct-2501"
SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."
messages = [
{
"role": "system",
"content": SYSTEM_PROMPT
},
{
"role": "user",
"content": user_prompt
},
]
# note that running this model on GPU requires over 60 GB of GPU RAM
llm = LLM(model=model_name, tokenizer_mode="mistral", tensor_parallel_size=8)
sampling_params = SamplingParams(max_tokens=512, temperature=0.15)
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
# Sure, here are five non-formal ways to say "See you later" in French:
#
# 1. À plus tard
# 2. À plus
# 3. Salut
# 4. À toute
# 5. Bisous
#
# ```
# /\_/\
# ( o.o )
# > ^ <
# ```
```
### Transformers
If you want to use Hugging Face's transformers library to generate text with this model, you can use the following code:
```py
from transformers import pipeline
import torch

messages = [
    {"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."},
]
chatbot = pipeline("text-generation", model="mistralai/Mistral-Small-24B-Instruct-2501", max_new_tokens=256, torch_dtype=torch.bfloat16)
chatbot(messages)
```
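For finer control over generation, here is an equivalent lower-level sketch using `AutoModelForCausalLM` and the tokenizer's chat template; the sampling settings follow the recommendations above and are assumptions, not requirements:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."},
]

# Render the chat template, then generate with the recommended low temperature.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.15)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```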
## 📚 Detailed Documentation
### Benchmark results
#### Human-evaluated benchmarks
| Category | Gemma-2-27B | Qwen-2.5-32B | Llama-3.3-70B | Gpt4o-mini |
|---|---|---|---|---|
| Mistral is better | 0.536 | 0.496 | 0.192 | 0.200 |
| Mistral is slightly better | 0.196 | 0.184 | 0.164 | 0.204 |
| Ties | 0.052 | 0.060 | 0.236 | 0.160 |
| Other is slightly better | 0.060 | 0.088 | 0.112 | 0.124 |
| Other is better | 0.156 | 0.172 | 0.296 | 0.312 |
Note:

- Side-by-side evaluations were conducted with an external third-party vendor, on a set of over 1k proprietary coding and generalist prompts.
- Evaluators were asked to select their preferred model response from anonymized generations produced by Mistral Small 3 and another model.
- We are aware that in some cases the human-judgment benchmarks differ substantially from publicly available benchmarks, but we have taken extra care to verify a fair evaluation, and we are confident that the benchmarks above are valid.
#### Publicly available benchmarks
##### Reasoning & Knowledge
| Evaluation | mistral-small-24B-instruct-2501 | gemma-2-27b | llama-3.3-70b | qwen2.5-32b | gpt-4o-mini-2024-07-18 |
|---|---|---|---|---|---|
| mmlu_pro_5shot_cot_instruct | 0.663 | 0.536 | 0.666 | 0.683 | 0.617 |
| gpqa_main_cot_5shot_instruct | 0.453 | 0.344 | 0.531 | 0.404 | 0.377 |
##### Math & Coding
| Evaluation | mistral-small-24B-instruct-2501 | gemma-2-27b | llama-3.3-70b | qwen2.5-32b | gpt-4o-mini-2024-07-18 |
|---|---|---|---|---|---|
| humaneval_instruct_pass@1 | 0.848 | 0.732 | 0.854 | 0.909 | 0.890 |
| math_instruct | 0.706 | 0.535 | 0.743 | 0.819 | 0.761 |
##### Instruction following
| Evaluation | mistral-small-24B-instruct-2501 | gemma-2-27b | llama-3.3-70b | qwen2.5-32b | gpt-4o-mini-2024-07-18 |
|---|---|---|---|---|---|
| mtbench_dev | 8.35 | 7.86 | 7.96 | 8.26 | 8.33 |
| wildbench | 52.27 | 48.21 | 50.04 | 52.73 | 56.13 |
| arena_hard | 0.873 | 0.788 | 0.840 | 0.860 | 0.897 |
| ifeval | 0.829 | 0.8065 | 0.8835 | 0.8401 | 0.8499 |
Note:

- Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline, so numbers may vary slightly from previously reported performance (Qwen2.5-32B-Instruct, Llama-3.3-70B-Instruct, Gemma-2-27B-IT).
- Judge-based evals such as Wildbench, Arena hard and MTBench were based on gpt-4o-2024-05-13.
### Basic Instruct Template (V7-Tekken)
```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]
```
`<system_prompt>`, `<user message>`, and `<assistant response>` are placeholders.

Please make sure to use mistral-common as the source of truth.
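For illustration only (per the note above, mistral-common remains the source of truth for tokenization), a naive single-turn rendering of this template could look like:

```py
# Illustrative only; real prompts should be built with mistral-common.
def render_v7_tekken(system_prompt: str, user_message: str) -> str:
    return f"<s>[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT][INST]{user_message}[/INST]"

print(render_v7_tekken("You are Mistral Small 3.", "Hello!"))
# <s>[SYSTEM_PROMPT]You are Mistral Small 3.[/SYSTEM_PROMPT][INST]Hello![/INST]
```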
## 🔧 Technical Details
AWQ quantization: performed by stelterlab in INT4 GEMM with AutoAWQ (by casper-hansen, https://github.com/casper-hansen/AutoAWQ/). The original weights were provided by Mistral AI.
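A hedged sketch of loading such an AWQ checkpoint with transformers; the repo id below is a placeholder (not taken from this card), and autoawq must be installed:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the actual AWQ checkpoint.
awq_repo = "<awq-quantized-repo>"

tokenizer = AutoTokenizer.from_pretrained(awq_repo)
model = AutoModelForCausalLM.from_pretrained(awq_repo, device_map="auto")
```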
## 📄 License
This model is released under the Apache 2.0 license, which allows use and modification for both commercial and non-commercial purposes.

Additionally, if you want to learn about how we process your personal data, please read our Privacy Policy. You can learn more about Mistral Small in our blog post. The model was developed by the Mistral AI team.



