🚀 Mistral-Small-3.1-24B-Instruct-2503 GGUF Models
The Mistral-Small-3.1-24B-Instruct-2503 GGUF models are high-performance language models built on the Mistral architecture, with advanced capabilities such as multilingual support and vision understanding, suitable for a wide range of natural language processing and visual analysis tasks.
🚀 Quick Start
Install Dependencies
To use this model, the vLLM library is recommended for building a production-ready inference pipeline. Make sure to install vLLM >= 0.8.1:
pip install vllm --upgrade
This should automatically install mistral_common >= 1.5.4. You can verify it with:
python -c "import mistral_common; print(mistral_common.__version__)"
You can also use a ready-made Docker image or pull one from Docker Hub.
Launch the Server
Mistral-Small-3.1-24B-Instruct-2503 is best used in a server/client setting:
- Start the server:
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --limit_mm_per_prompt 'image=10' --tensor-parallel-size 2
Note: Running Mistral-Small-3.1-24B-Instruct-2503 on GPU requires about 55 GB of GPU memory in bf16 or fp16 (the 24B parameters alone take roughly 48 GB at 2 bytes each, before activations and the KV cache).
- Send a request to the server from a simple Python client:
import requests
import json
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta
url = "http://<your-server-url>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
model = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
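# Download the model's system prompt from the Hugging Face repo and fill in
# its name/date placeholders.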
def load_system_prompt(repo_id: str, filename: str) -> str:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
today = datetime.today().strftime("%Y-%m-%d")
yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
model_name = repo_id.split("/")[-1]
return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/europe.png"
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Which of the depicted countries has the best food? Which the second and third and fourth? Name the country, its color on the map and one its city that is visible on the map, but is not the capital. Make absolutely sure to only name a city that can be seen on the map.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
data = {"model": model, "messages": messages, "temperature": 0.15}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
# Determining the "best" food is highly subjective and depends on personal preferences. However, based on general popularity and recognition, here are some countries known for their cuisine:
# 1. **Italy** - Color: Light Green - City: Milan
# - Italian cuisine is renowned worldwide for its pasta, pizza, and various regional specialties.
# 2. **France** - Color: Brown - City: Lyon
# - French cuisine is celebrated for its sophistication, including dishes like coq au vin, bouillabaisse, and pastries like croissants and éclairs.
# 3. **Spain** - Color: Yellow - City: Bilbao
# - Spanish cuisine offers a variety of flavors, from paella and tapas to jamón ibérico and churros.
# 4. **Greece** - Not visible on the map
# - Greek cuisine is known for dishes like moussaka, souvlaki, and baklava. Unfortunately, Greece is not visible on the provided map, so I cannot name a city.
# Since Greece is not visible on the map, I'll replace it with another country known for its good food:
# 4. **Turkey** - Color: Light Green (east part of the map) - City: Istanbul
# - Turkish cuisine is diverse and includes dishes like kebabs, meze, and baklava.
✨ Key Features
- Vision understanding: the model has vision capabilities, allowing it to analyze images and provide insights based on visual content.
- Multilingual: supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- Agent-centric: best-in-class agentic capabilities with native function calling and JSON output.
- Advanced reasoning: state-of-the-art conversational and reasoning capabilities.
- Apache 2.0 license: an open license allowing usage and modification for both commercial and non-commercial purposes.
- Large context window: a 128k context window.
- System prompts: strong adherence to and support for system prompts.
- Tokenizer: uses a Tekken tokenizer with a 131k vocabulary.
📦 Installation
vLLM installation is covered in the Quick Start section above: run pip install vllm --upgrade (vLLM >= 0.8.1, which also installs mistral_common >= 1.5.4), or use a ready-made Docker image from Docker Hub.
💻 Usage Examples
Basic Usage
See the Python client snippet in the Quick Start section above; it demonstrates basic vision-chat usage.
Advanced Usage: Function Calling
import requests
import json
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta
url = "http://<your-url>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
model = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
def load_system_prompt(repo_id: str, filename: str) -> str:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
today = datetime.today().strftime("%Y-%m-%d")
yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
model_name = repo_id.split("/")[-1]
return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
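# OpenAI-style function schemas describing the tools the model may call.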
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city to find the weather for, e.g. 'San Francisco'",
},
"state": {
"type": "string",
"description": "The state abbreviation, e.g. 'CA' for California",
},
"unit": {
"type": "string",
"description": "The unit for temperature",
"enum": ["celsius", "fahrenheit"],
},
},
"required": ["city", "state", "unit"],
},
},
},
{
"type": "function",
"function": {
"name": "rewrite",
"description": "Rewrite a given text for improved clarity",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The input text to rewrite",
}
},
},
},
},
]
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": "Could you please make the below article more concise?\n\nOpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership.",
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "bbc5b7ede",
"type": "function",
"function": {
"name": "rewrite",
"arguments": '{"text": "OpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership."}',
},
}
],
},
{
"role": "tool",
"content": '{"action":"rewrite","outcome":"OpenAI is a FOR-profit company."}',
"tool_call_id": "bbc5b7ede",
"name": "rewrite",
},
{
"role": "assistant",
"content": "---\n\nOpenAI is a FOR-profit company.",
},
{
"role": "user",
"content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?",
},
]
data = {"model": model, "messages": messages, "tools": tools, "temperature": 0.15}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["tool_calls"])
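If the request succeeds, the model should answer with a tool call rather than plain text; an illustrative result (the call ID and exact formatting will differ) might look like:
# [{'id': '...', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}'}}]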
📚 Documentation
Model Generation Details
These models were generated with llama.cpp at commit 92ecdcc0.
Ultra-Low-Bit Quantization (1-2 bit)
The latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. It applies layer-specific strategies that preserve accuracy while keeping memory use extremely low.
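The layer-specific strategy essentially keeps precision-critical tensors (embeddings and the output head, as the file list further below also shows) at a higher bit-width while quantizing everything else aggressively. A purely illustrative Python sketch of such a policy follows; the tensor-name prefixes follow llama.cpp's GGUF naming, but this is a toy under stated assumptions, not the actual DynamicGate implementation:
def pick_quant_type(tensor_name: str, low_bit_type: str = "IQ2_S") -> str:
    # Keep embeddings and the output head at Q8_0; everything else gets the
    # requested ultra-low-bit type (illustrative policy only).
    if tensor_name.startswith(("token_embd", "output")):
        return "Q8_0"
    return low_bit_type

for name in ["token_embd.weight", "blk.0.attn_q.weight", "output.weight"]:
    print(f"{name} -> {pick_quant_type(name)}")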
Benchmark Context
All tests were run on Llama-3-8B-Instruct using a standard perplexity evaluation pipeline, a 2048-token context window, and the same prompt set.
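For reference, the perplexity figures below are the usual exponentiated mean negative log-likelihood over the evaluation tokens. A minimal sketch of that computation, assuming per-token log-probabilities are already available:
import math

def perplexity(token_logprobs):
    # PPL = exp(-(1/N) * sum_i log p(token_i | context)); lower is better.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-2.1, -0.3, -1.7, -0.9]))  # toy values, ~3.49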
Quantization Performance Comparison (Llama-3-8B)
Quantization | Standard PPL | DynamicGate PPL | PPL Change | Standard Size | DG Size | Size Change | Standard Speed | DG Speed |
---|---|---|---|---|---|---|---|---|
IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
Choosing the Right Model Format
Choosing the right model format depends on your hardware capabilities and memory constraints:
Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
---|---|---|---|---|
BF16 | Highest | High | BF16-accelerated GPU/CPU | High-speed inference with reduced memory |
F16 | High | High | FP16-capable devices | GPU inference when BF16 is unavailable |
Q4_K | Medium-low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
Q6_K | Medium | Moderate | CPUs with more memory | Better accuracy among quantized models |
Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Highest accuracy among quantized models |
IQ3_XS | Very low | Very low | Ultra-low-memory devices | Extreme memory efficiency at lower accuracy |
Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize it for ARM devices |
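The table can also be read as a simple decision rule; a hedged sketch (the thresholds and device checks here are assumptions to adapt to your setup):
def choose_format(supports_bf16: bool, supports_fp16: bool, low_memory: bool, on_arm: bool) -> str:
    # Encodes the selection table above; real choices also depend on
    # model size versus available RAM/VRAM.
    if on_arm:
        return "q4_0"  # llama.cpp has ARM-optimized Q4_0 kernels
    if low_memory:
        return "q4_k"  # good accuracy/memory trade-off for constrained devices
    if supports_bf16:
        return "bf16"
    if supports_fp16:
        return "f16"
    return "q8_0"      # highest-accuracy quantized fallback

print(choose_format(supports_bf16=False, supports_fp16=True, low_memory=False, on_arm=False))  # f16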
Included Files and Details
Filename | Details |
---|---|
Mistral-Small-3.1-24B-Instruct-2503-bf16.gguf | Weights in BF16; useful for requantizing to other formats, and best when your device supports BF16 acceleration. |
Mistral-Small-3.1-24B-Instruct-2503-f16.gguf | Weights in F16; use when your device supports FP16, especially if BF16 is unavailable. |
Mistral-Small-3.1-24B-Instruct-2503-bf16-q8_0.gguf | Output and embedding layers kept in BF16, all other layers quantized to Q8_0; use when your device supports BF16 and you want a quantized version. |
Mistral-Small-3.1-24B-Instruct-2503-f16-q8_0.gguf | Output and embedding layers kept in F16, all other layers quantized to Q8_0. |
Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf | Output and embedding layers quantized to Q8_0, all other layers to Q4_K; good for CPU inference with limited memory. |
Mistral-Small-3.1-24B-Instruct-2503-q4_k_s.gguf | Smallest Q4_K variant; uses less memory at the cost of accuracy, for very-low-memory setups. |
Mistral-Small-3.1-24B-Instruct-2503-q6_k.gguf | Output and embedding layers quantized to Q8_0, all other layers to Q6_K. |
Mistral-Small-3.1-24B-Instruct-2503-q8_0.gguf | Fully Q8-quantized model; more accurate, but needs more memory. |
Mistral-Small-3.1-24B-Instruct-2503-iq3_xs.gguf | IQ3_XS quantization, optimized for extreme memory efficiency; for ultra-low-memory devices. |
Mistral-Small-3.1-24B-Instruct-2503-iq3_m.gguf | IQ3_M quantization, with a medium block size for better accuracy; for low-memory devices. |
Mistral-Small-3.1-24B-Instruct-2503-q4_0.gguf | Pure Q4_0 quantization, optimized for ARM devices and low-memory environments; prefer IQ4_NL for better accuracy. |
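To try one of these files locally, here is a minimal sketch using the llama-cpp-python bindings (the local path and parameter values are assumptions; tune n_ctx and n_gpu_layers to your hardware):
from llama_cpp import Llama

# Load a quantized GGUF file (hypothetical local path).
llm = Llama(
    model_path="./Mistral-Small-3.1-24B-Instruct-2503-q4_k.gguf",
    n_ctx=8192,       # context length; the model supports up to 128k if memory allows
    n_gpu_layers=-1,  # offload all layers to GPU; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what GGUF is in one sentence."},
    ],
    temperature=0.15,  # low temperature, as recommended for this model
)
print(out["choices"][0]["message"]["content"])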
How to Help Test
If you find these models useful, please click "Like". You can also help test the AI-powered network monitoring assistant by choosing an AI assistant type:
- TurboLLM (GPT-4o-mini)
- HugLLM (Hugging Face open-source models)
- TestLLM (experimental CPU-only model)
What Is Being Tested
Testing the limits of small open-source models for AI network monitoring, including function calling, model size, and task-handling capability.
Example Commands
"Give me info on my websites SSL certificate"
"Check if my server is using quantum safe encyption for communication"
"Run a comprehensive security audit on my server"
"Create a cmd processor to .. (what ever you want)"
🔧 Technical Details
Basic Instruct Template (V7-Tekken)
<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]
<system_prompt>
、<user message>
和<assistant response>
为占位符。请确保使用mistral-common作为参考。
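To make the template concrete, here is a small sketch that assembles a single-turn prompt by hand (illustrative only; in practice, let mistral-common or vLLM's --tokenizer_mode mistral apply the template for you):
def build_v7_tekken_prompt(system_prompt: str, user_message: str) -> str:
    # First turn of the V7-Tekken template shown above.
    return (
        "<s>"
        f"[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT]"
        f"[INST]{user_message}[/INST]"
    )

print(build_v7_tekken_prompt("You are a helpful assistant.", "Hello!"))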
Benchmark Results
Pretrain Evals
Model | MMLU (5-shot) | MMLU Pro (5-shot CoT) | TriviaQA | GPQA Main (5-shot CoT) | MMMU |
---|---|---|---|---|---|
Small 3.1 24B Base | 81.01% | 56.03% | 80.50% | 37.50% | 59.27% |
Gemma 3 27B PT | 78.60% | 52.20% | 81.30% | 24.30% | 56.10% |
Instruct Evals - Text
Model | MMLU | MMLU Pro (5-shot CoT) | MATH | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT) | MBPP | HumanEval | SimpleQA (TotalAcc) |
---|---|---|---|---|---|---|---|---|
Small 3.1 24B Instruct | 80.62% | 66.76% | 69.30% | 44.42% | 45.96% | 74.71% | 88.41% | 10.43% |
Gemma 3 27B IT | 76.90% | 67.50% | 89.00% | 36.83% | 42.40% | 74.40% | 87.80% | 10.00% |
GPT4o Mini | 82.00% | 61.70% | 70.20% | 40.20% | 39.39% | 84.82% | 87.20% | 9.50% |
Claude 3.5 Haiku | 77.60% | 65.00% | 69.20% | 37.05% | 41.60% | 85.60% | 88.10% | 8.02% |
Cohere Aya-Vision 32B | 72.14% | 47.16% | 41.98% | 34.38% | 33.84% | 70.43% | 62.20% | 7.65% |
Instruct Evals - Vision
Model | MMMU | MMMU PRO | Mathvista | ChartQA | DocVQA | AI2D | MM MT Bench |
---|---|---|---|---|---|---|---|
Small 3.1 24B Instruct | 64.00% | 49.25% | 68.91% | 86.24% | 94.08% | 93.72% | 7.3 |
Gemma 3 27B IT | 64.90% | 48.38% | 67.60% | 76.00% | 86.60% | 84.50% | 7 |
GPT4o Mini | 59.40% | 37.60% | 56.70% | 76.80% | 86.70% | 88.10% | 6.6 |
Claude 3.5 Haiku | 60.50% | 45.03% | 61.60% | 87.20% | 90.00% | 92.10% | 6.5 |
Cohere Aya-Vision 32B | 48.20% | 31.50% | 50.10% | 63.04% | 72.40% | 82.57% | 4.1 |
Multilingual Evals
Model | Average | European | East Asian | Middle Eastern |
---|---|---|---|---|
Small 3.1 24B Instruct | 71.18% | 75.30% | 69.17% | 69.08% |
Gemma 3 27B IT | 70.19% | 74.14% | 65.65% | 70.76% |
GPT4o Mini | 70.36% | 74.21% | 65.96% | 70.90% |
Claude 3.5 Haiku | 70.16% | 73.45% | 67.05% | 70.00% |
Cohere Aya-Vision 32B | 62.15% | 64.70% | 57.61% | 64.12% |
Long Context Evals
Model | LongBench v2 | RULER 32K | RULER 128K |
---|---|---|---|
Small 3.1 24B Instruct | 37.18% | 93.96% | 81.20% |
Gemma 3 27B IT | 34.59% | 91.10% | 66.00% |
GPT4o Mini | 29.30% | 90.20% | 65.80% |
Claude 3.5 Haiku | 35.19% | 92.60% | 91.90% |
📄 License
This model is released under the Apache 2.0 license.
⚠️ Important Note
A relatively low temperature is recommended, such as temperature=0.15. Make sure to add a system prompt to the model to fit your needs.
💡 Usage Tip
The Transformers implementation of this model has not been thoroughly tested (it has only been "vibe-checked"), so use the vLLM library to guarantee 100% correct behavior.



