# 🚀 Zero-Mistral-24B
Zero-Mistral-24B is an improved text-only version of mistralai/Mistral-Small-3.1-24B-Instruct-2503, adapted mainly for Russian and English. The vision features of the original Mistral model have been removed. At the SFT stage it was trained primarily on the Big Russian Dataset and a proprietary dataset from Shkolkovo.online. The model has good math skills and some reasoning ability, and it retains the original Mistral's long-context support of up to 128k tokens.
## ✨ Features
- Language Adaptation: Adapted for both Russian and English, making it suitable for a wider range of users.
- Feature Removal: Removed vision features from the original Mistral model, focusing solely on text processing.
- Training Data: Trained on high-quality datasets, including the Big Russian Dataset and a proprietary dataset.
- Math and Reasoning: Demonstrates good math skills and reasoning abilities.
- Long Context: Preserves the long-context capabilities of up to 128k tokens.
## 📦 Installation

### vLLM Installation
Make sure you install vLLM >= 0.8.4:

```bash
pip install --upgrade vllm
```
Also make sure you have mistral_common >= 1.5.4 installed:

```bash
pip install --upgrade mistral_common
```
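To confirm what is actually installed, a quick check with the standard library (a minimal sketch):

```python
# Print the installed versions of the two requirements above.
from importlib.metadata import version

print(version("vllm"))            # expect >= 0.8.4
print(version("mistral_common"))  # expect >= 1.5.4
```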
You can also make use of a ready-to-go Docker image or one from Docker Hub.
## 💻 Usage Examples

### Recommended System Prompts
```python
prompts = {
    "generic": "You are a virtual assistant. You answer people's questions, help and support them. You are created to be helpful, harmless, and honest. You answer in the language the question was asked in or as the user requests.",
    "think": """You are a virtual assistant. You answer people's questions, help and support them. You are created to be helpful, harmless, and honest. You answer in the language the question was asked in or as the user requests.
Answer in the following format:
<think>Reasoning: ...</think>
...""",
    "task": "You are a virtual assistant. You answer people's questions, help and support them. You are created to be helpful, harmless, and honest. You answer in the language the question was asked in or as the user requests. Solve the task according to the instructions below. Don't apologize and don't build a dialogue.",
    "task_think": """You are a virtual assistant. You answer people's questions, help and support them. You are created to be helpful, harmless, and honest. You answer in the language the question was asked in or as the user requests. Solve the task according to the instructions below. Don't apologize and don't build a dialogue.
Answer in the following format:
<think>Reasoning: ...</think>
...""",
    "english_generic": """You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
Your knowledge base was last updated on 2023-10-01. The current date is 2025-01-30.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?")
""",
    "english_think": """You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
Your knowledge base was last updated on 2023-10-01. The current date is 2025-01-30.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?")
Answer in the following format:
<think>Reasoning: ...</think>
""",
}
```
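A prompt from this dict goes into the system turn of a chat request. A minimal sketch (the user question is illustrative):

```python
# Pair a recommended system prompt with a user turn; the question is illustrative.
messages = [
    {"role": "system", "content": prompts["think"]},
    {"role": "user", "content": "How many prime numbers are there between 1 and 20?"},
]
```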
### vLLM Server Usage
- Spin up a server:

```bash
vllm serve ZeroAgency/Zero-Mistral-24B --enable-prefix-caching --dtype bfloat16 --max-model-len 32768 --tool-call-parser mistral --enable-auto-tool-choice
```

Note: running Zero-Mistral-24B on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
- To query the server, you can use a simple Python snippet:
```python
import requests
import json

url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
model = "ZeroAgency/Zero-Mistral-24B"

messages = [
    {
        "role": "system",
        "content": """You are a virtual assistant. You answer people's questions, help and support them. You are created to be helpful, harmless, and honest. You answer in the language the question was asked in or as the user requests. Solve the task according to the instructions below. Don't apologize and don't build a dialogue.
Answer in the following format:
<think>Reasoning: ...</think>
...""",
    },
    {  # Task from https://3.shkolkovo.online/catalog/2552/93150
        "role": "user",
        "content": """The first worker makes 9 more parts per hour than the second worker. The first worker completes an order of 216 parts 4 hours faster than the second worker who completes the same order. How many parts does the first worker make per hour?""",
    },
]

data = {"model": model, "messages": messages}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
# <think> Let x be the number of parts the second worker makes per hour. Then the first worker makes x + 9 parts per hour. Let's make a table: First worker Second worker Number of parts per hour x + 9 x Number of hours 216 : (x + 9) 216 : x Difference in number of hours 4 216 : (x + 9) − 216 : x = 4 216x − 216(x + 9) = 4x(x + 9) 216x − 216x − 1944 = 4x^2 + 36x 1944 = 4x^2 + 36x 4x^2 + 36x − 1944 = 0 D = 36^2 + 4 · 4 · 1944 = 1296 + 31104 = 32400 = 180^2 x1 = −36 + 180 : 8 = 144 : 8 = 18 x2 = −36 − 180 : 8 < 0 - not suitable for the problem. Then the first worker makes 18 + 9 = 27 parts per hour. </think>
# 27
```
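Responses generated with the `*_think` prompts wrap the model's reasoning in `<think>...</think>` tags, as in the output above. A minimal sketch of splitting the reasoning from the final answer, reusing the `response` object from the snippet above:

```python
import re

content = response.json()["choices"][0]["message"]["content"]

# Separate the <think>...</think> reasoning block from the final answer.
match = re.search(r"<think>(.*?)</think>\s*(.*)", content, re.DOTALL)
if match:
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    print("Answer:", answer)  # "27" for the task above
else:
    print(content)  # no reasoning block found
```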
### vLLM Offline Usage
```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Note that running this model on GPU requires over 60 GB of GPU RAM.
llm = LLM(model="ZeroAgency/Zero-Mistral-24B", tokenizer_mode="mistral", tensor_parallel_size=8)
```
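From here, generation uses vLLM's offline chat API. A minimal sketch continuing the snippet above, with illustrative sampling values and an illustrative question rather than recommended ones:

```python
# A minimal sketch of offline generation; sampling values are illustrative.
sampling_params = SamplingParams(max_tokens=1024, temperature=0.15)

messages = [
    {"role": "system", "content": "You are a virtual assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```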
## 📚 Documentation

### Model Details

#### Model Description
| Property | Details |
|---|---|
| Developed by | ZeroAgency.ru |
| Funded by | ZeroAgency.ru and Shkolkovo.online |
| Shared by | Alexander Kozhevnikov (developer) |
| Model Type | LLM |
| Language(s) (NLP) | Russian, English |
| License | MIT |
| Finetuned from model | mistralai/Mistral-Small-3.1-24B-Instruct-2503 |
### Model versions

- Merged 16-bit: the original 16-bit merged version for transformers.
- GGUF: assorted GGUF quantizations (BF16, F16, Q8_0, Q6_K, Q4_K_M, IQ4_XS, etc.).
### Benchmarks for the main 16-bit merged version

#### MERA
MERA score: 0.623
| Task | Result | Metric |
|---|---|---|
| LCS | 0.194 | Accuracy |
| RCB | 0.607 / 0.592 | Avg. F1 / Accuracy |
| USE | 0.452 | Grade Norm |
| RWSD | 0.55 | Accuracy |
| PARus | 0.942 | Accuracy |
| ruTiE | 0.868 | Accuracy |
| MultiQ | 0.781 / 0.629 | F1-score / EM |
| CheGeKa | 0.397 / 0.322 | F1 / EM |
| ruModAr | 0.971 | EM |
| MaMuRAMu | 0.832 | Accuracy |
| ruMultiAr | 0.354 | EM |
| ruCodeEval | 0 / 0 / 0 | pass@k ¯\_(ツ)_/¯ |
| MathLogicQA | 0.613 | Accuracy |
| ruWorldTree | 0.987 / 0.987 | Avg. F1 / Accuracy |
| ruOpenBookQA | 0.913 / 0.913 | Avg. F1 / Accuracy |
#### Open Task Evaluation

| Task | Result | Metric |
|---|---|---|
| BPS | 0.981 | Accuracy |
| ruMMLU | 0.778 | Accuracy |
| SimpleAr | 0.997 | EM |
| ruHumanEval | 0.006 / 0.006 / 0.006 | pass@k ¯\_(ツ)_/¯ |
| ruHHH | 0.916 | Accuracy |
| ruHateSpeech | 0.834 | Accuracy |
| ruDetox | 0.341 / 0.843 / 0.624 / 0.66 | Overall average score (J) / Meaning preservation (SIM) / Naturalness (FL) / Style transfer accuracy (STA) |
| ruEthics | [[0.386, 0.399, 0.41, 0.333, 0.327], [0.421, 0.427, 0.452, 0.375, 0.363], [0.653, 0.65, 0.697, 0.596, 0.573]] | 5 MCC |
## 📄 License
The model is released under the MIT license.

