🚀 Mistral-Small-24B-Instruct-2501
Mistral-Small-24B-Instruct-2501 is an excellent small large language model: at 24B parameters it rivals much larger models in performance. It supports local deployment and suits many scenarios, such as fast-response conversational agents and low-latency function calling.
🚀 Quick Start
The Mistral-Small-24B-Instruct-2501 model can be used with the following frameworks:
- vllm: see details here
- transformers: see details here
✨ Main Features
- Multilingual: supports dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.
- Agent-centric: offers best-in-class agentic capabilities with native function calling and JSON output.
- Advanced reasoning: state-of-the-art conversational and reasoning capabilities.
- Apache 2.0 License: an open license allowing usage and modification for both commercial and non-commercial purposes.
- Context window: a 32k context window.
- System prompts: maintains strong adherence and support for system prompts.
- Tokenizer: uses the Tekken tokenizer with a 131k vocabulary.
📦 Installation
vLLM
The vLLM library is recommended for building production-ready inference pipelines.
Note 1: it is recommended to use a relatively low temperature, e.g. temperature=0.15.
Note 2: make sure to add a system prompt so the model better fits your needs. If you use the model as a general-purpose assistant, the following system prompt is recommended:
```py
system_prompt = """You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
Your knowledge base was last updated on 2023-10-01. The current date is 2025-01-30.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")"""
```
Installation steps:
1. Make sure you have vLLM >= 0.6.4 installed:
```
pip install --upgrade vllm
```
2. Also make sure you have mistral_common >= 1.5.2 installed:
```
pip install --upgrade mistral_common
```
You can also use a ready-made Docker image, available on Docker Hub.
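For the Docker route, here is a minimal sketch assuming the stock vllm/vllm-openai image; pick a tag that ships vLLM >= 0.6.4, and pass a Hugging Face token if the weights require authentication:
```
docker run --gpus all -p 8000:8000 \
    -e "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-Small-24B-Instruct-2501 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --tool-call-parser mistral --enable-auto-tool-choice
```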
Server Deployment
Mistral-Small-24B-Instruct-2501 is best used in a server/client setting.
1. Spin up a server:
```
vllm serve mistralai/Mistral-Small-24B-Instruct-2501 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
```
Note: running Mistral-Small-24B-Instruct-2501 on GPU requires about 55 GB of GPU RAM in bf16 or fp16.
2. You can test the client with the following simple Python snippet:
```py
import requests
import json
from datetime import datetime, timedelta
url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
model = "mistralai/Mistral-Small-24B-Instruct-2501"
messages = [
{
"role": "system",
"content": "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
},
{
"role": "user",
"content": "Give me 5 non-formal ways to say 'See you later' in French."
},
]
data = {"model": model, "messages": messages}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
# Sure, here are five non-formal ways to say "See you later" in French:
#
# 1. À plus tard
# 2. À plus
# 3. Salut
# 4. À toute
# 5. Bisous
#
# ```
# /\_/\
# ( o.o )
# > ^ <
# ```
```
Ollama
Ollama can run this model locally on macOS, Windows, and Linux.
- 4-bit quantization (default):
```
ollama run mistral-small
```
- 8-bit quantization:
```
ollama run mistral-small:24b-instruct-2501-q8_0
```
- FP16:
```
ollama run mistral-small:24b-instruct-2501-fp16
```
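Once one of the tags above is pulled, Ollama also exposes a local REST API (default port 11434) that you can call programmatically. A minimal sketch, assuming the default 4-bit `mistral-small` tag:
```py
import json
import requests

# Ollama's local chat endpoint (default port 11434); adjust if you changed it.
url = "http://localhost:11434/api/chat"

payload = {
    "model": "mistral-small",  # assumes the default 4-bit tag pulled above
    "messages": [
        {"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."}
    ],
    "stream": False,  # return one JSON object instead of a token stream
}

response = requests.post(url, data=json.dumps(payload))
print(response.json()["message"]["content"])
```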
💻 Usage Examples
vLLM
Function Calling
Mistral-Small-24B-Instruct-2501 excels at function/tool-calling tasks via vLLM. Example:
```py
import requests
import json
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta

url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "mistralai/Mistral-Small-24B-Instruct-2501"

def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)

SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "The state abbreviation, e.g. 'CA' for California",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit for temperature",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Could you please make the below article more concise?\n\nOpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership.",
    },
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "bbc5b7ede",
                "type": "function",
                "function": {
                    "name": "rewrite",
                    "arguments": '{"text": "OpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership."}',
                },
            }
        ],
    },
    {
        "role": "tool",
        "content": '{"action":"rewrite","outcome":"OpenAI is a FOR-profit company."}',
        "tool_call_id": "bbc5b7ede",
        "name": "rewrite",
    },
    {
        "role": "assistant",
        "content": "---\n\nOpenAI is a FOR-profit company.",
    },
    {
        "role": "user",
        "content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?",
    },
]

data = {"model": model, "messages": messages, "tools": tools}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["tool_calls"])
# [{'id': '8PdihwL6d', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}'}}]
```
#### Offline Usage
```py
from vllm import LLM
from vllm.sampling_params import SamplingParams
from datetime import datetime, timedelta

# checkpoint to load
model_name = "mistralai/Mistral-Small-24B-Instruct-2501"
SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."
messages = [
{
"role": "system",
"content": SYSTEM_PROMPT
},
{
"role": "user",
"content": user_prompt
},
]
# note that running this model on GPU requires over 60 GB of GPU RAM
llm = LLM(model=model_name, tokenizer_mode="mistral", tensor_parallel_size=8)
sampling_params = SamplingParams(max_tokens=512, temperature=0.15)
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
# Sure, here are five non-formal ways to say "See you later" in French:
#
# 1. À plus tard
# 2. À plus
# 3. Salut
# 4. À toute
# 5. Bisous
#
# ```
# /\_/\
# ( o.o )
# > ^ <
# ```
```
Transformers
If you want to use Hugging Face's transformers library to generate text, you can use the following code:
```py
from transformers import pipeline
import torch
messages = [
{"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."},
]
chatbot = pipeline("text-generation", model="mistralai/Mistral-Small-24B-Instruct-2501", max_new_tokens=256, torch_dtype=torch.bfloat16)
chatbot(messages)
```
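If you need more control over generation (dtype, device placement, sampling), here is a minimal sketch using AutoTokenizer / AutoModelForCausalLM and the tokenizer's chat template; the parameter choices (max_new_tokens, temperature) are illustrative, not prescribed by this card:
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Give me 5 non-formal ways to say 'See you later' in French."},
]

# Render the chat template into input ids, then sample a completion.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, temperature=0.15, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```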
📚 Documentation
Benchmark Results
Human-Evaluated Benchmarks

| Category | Gemma-2-27B | Qwen-2.5-32B | Llama-3.3-70B | Gpt4o-mini |
|---|---|---|---|---|
| Mistral is better | 0.536 | 0.496 | 0.192 | 0.200 |
| Mistral is slightly better | 0.196 | 0.184 | 0.164 | 0.204 |
| Ties | 0.052 | 0.060 | 0.236 | 0.160 |
| Other is slightly better | 0.060 | 0.088 | 0.112 | 0.124 |
| Other is better | 0.156 | 0.172 | 0.296 | 0.312 |
Note:
- Side-by-side evaluations were conducted with an external third-party vendor, on a set of over 1k proprietary coding and generalist prompts.
- Evaluators were asked to pick the model response they preferred from anonymized generations produced by Mistral Small 3 and another model.
- We are aware that in some cases the human-judged benchmarks differ substantially from publicly available benchmarks, but we have taken extra care to verify a fair evaluation and are confident that the results above are valid.
Publicly Available Benchmarks
Reasoning & Knowledge

| Metric | mistral-small-24B-instruct-2501 | gemma-2-27b | llama-3.3-70b | qwen2.5-32b | gpt-4o-mini-2024-07-18 |
|---|---|---|---|---|---|
| mmlu_pro_5shot_cot_instruct | 0.663 | 0.536 | 0.666 | 0.683 | 0.617 |
| gpqa_main_cot_5shot_instruct | 0.453 | 0.344 | 0.531 | 0.404 | 0.377 |
Math & Coding

| Metric | mistral-small-24B-instruct-2501 | gemma-2-27b | llama-3.3-70b | qwen2.5-32b | gpt-4o-mini-2024-07-18 |
|---|---|---|---|---|---|
| humaneval_instruct_pass@1 | 0.848 | 0.732 | 0.854 | 0.909 | 0.890 |
| math_instruct | 0.706 | 0.535 | 0.743 | 0.819 | 0.761 |
Instruction Following

| Metric | mistral-small-24B-instruct-2501 | gemma-2-27b | llama-3.3-70b | qwen2.5-32b | gpt-4o-mini-2024-07-18 |
|---|---|---|---|---|---|
| mtbench_dev | 8.35 | 7.86 | 7.96 | 8.26 | 8.33 |
| wildbench | 52.27 | 48.21 | 50.04 | 52.73 | 56.13 |
| arena_hard | 0.873 | 0.788 | 0.840 | 0.860 | 0.897 |
| ifeval | 0.829 | 0.8065 | 0.8835 | 0.8401 | 0.8499 |
Note:
- Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline, so numbers may vary slightly from previously reported results (Qwen2.5-32B-Instruct, Llama-3.3-70B-Instruct, Gemma-2-27B-IT).
- Judge-based evaluations such as Wildbench, Arena hard and MTBench were based on gpt-4o-2024-05-13.
Basic Instruct Template (V7-Tekken)
```
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]
```
Here <system prompt>, <user message> and <assistant response> are placeholders.
Please make sure to use mistral-common as the source of truth.
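To make the template concrete, here is a sketch in plain Python that renders a chat history into the V7-Tekken layout above. The `render_v7_tekken` helper is hypothetical and for intuition only; mistral-common remains the source of truth for actual tokenization:
```py
# Hypothetical helper mimicking the V7-Tekken layout shown above.
# For real use, rely on mistral-common.
def render_v7_tekken(system_prompt: str, turns: list[tuple[str, str | None]]) -> str:
    out = "<s>"
    if system_prompt:
        out += f"[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
    for user_message, assistant_response in turns:
        out += f"[INST]{user_message}[/INST]"
        # A completed assistant turn is closed with </s>; a trailing None
        # leaves the prompt open for the model to generate the next response.
        if assistant_response is not None:
            out += f"{assistant_response}</s>"
    return out

print(render_v7_tekken("Answer briefly.", [("Hi!", "Hello!"), ("How are you?", None)]))
# <s>[SYSTEM_PROMPT]Answer briefly.[/SYSTEM_PROMPT][INST]Hi![/INST]Hello!</s>[INST]How are you?[/INST]
```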
🔧 Technical Details
AWQ quantization: performed by stelterlab in INT4 GEMM with AutoAWQ (developed by casper-hansen, https://github.com/casper-hansen/AutoAWQ/). Original weights provided by Mistral AI.
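AWQ INT4 checkpoints can typically be loaded straight through transformers, which detects the AWQ quantization config in the checkpoint (the autoawq package must be installed). A minimal sketch; the repo id below is a hypothetical placeholder for wherever the quantized weights are hosted:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id for the AWQ weights; substitute the actual repository.
quantized_id = "<awq-repo>/Mistral-Small-24B-Instruct-2501-AWQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_id)
# transformers reads the AWQ quantization config from the checkpoint and
# dispatches to the autoawq kernels; no extra quantization arguments needed.
model = AutoModelForCausalLM.from_pretrained(quantized_id, device_map="auto")
```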
📄 License
This model is released under the Apache 2.0 license, which allows usage and modification for both commercial and non-commercial purposes.
In addition, if you want to learn more about how we process your personal data, please read our Privacy Policy. You can learn more about Mistral Small in our blog post. The model was developed by the Mistral AI team.



