Devstral-Small-2505 GGUF Models
A set of models for software engineering tasks with advanced quantization and various deployment options.
Quick Start
This README provides details about the Devstral-Small-2505 GGUF models, including their generation, quantization methods, model format selection, and usage instructions.
Features
Model Generation
Ultra-Low-Bit Quantization
- Introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit).
- Uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
- Delivers benchmark-proven improvements on Llama-3-8B.
Model Format Selection
- Offers multiple model formats (BF16, F16, Quantized Models, Very Low-Bit Quantization) to suit different hardware capabilities and memory constraints.
Agentic Coding
- Designed for agentic coding tasks, making it suitable for software engineering agents.
Long Context Window
- Supports a 128k-token context window, allowing long inputs to be processed.
Open License
- Licensed under Apache 2.0, enabling both commercial and non-commercial use.
Installation
Prerequisites
- Ensure you have the necessary hardware and software requirements for the chosen model format.
- For API usage, create a Mistral account and obtain an API key.
Deployment
- API: Create a Mistral account and obtain an API key, then run the Docker commands in the Usage Examples below to start the OpenHands container.
- Local Inference: Run the model locally with LM Studio, or with providers such as vLLM, mistral-inference, transformers, or Ollama, following each provider's instructions (a minimal Ollama example is sketched below).
- OpenHands: Launch an OpenAI-compatible server (e.g., vLLM or Ollama), then use OpenHands to interact with the model; see the Usage Examples below for launching OpenHands and connecting it to the server.
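As one quick local-inference path, the model can be pulled through Ollama. This is a minimal sketch and assumes the model is published in the Ollama library under the `devstral` tag; check your provider's documentation for the exact name:

```bash
# Pull and chat with the model via Ollama (tag name assumed to be "devstral").
ollama run devstral
```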
Usage Examples
API Usage
```bash
export MISTRAL_API_KEY=<MY_KEY>

docker pull docker.all-hands.dev/all-hands-ai/runtime:0.39-nikolaik

mkdir -p ~/.openhands-state && echo '{"language":"en","agent":"CodeActAgent","max_iterations":null,"security_analyzer":null,"confirmation_mode":false,"llm_model":"mistral/devstral-small-2505","llm_api_key":"'$MISTRAL_API_KEY'","remote_runtime_resource_factor":null,"github_token":null,"enable_default_condenser":true}' > ~/.openhands-state/settings.json

docker run -it --rm --pull=always \
    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.39-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands-state:/.openhands-state \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
    docker.all-hands.dev/all-hands-ai/openhands:0.39
```
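Once the container is running, the OpenHands UI is available on the port mapped above (http://localhost:3000). To follow startup logs for the container started by the command above:

```bash
# Tail the logs of the OpenHands container (named via --name above).
docker logs -f openhands-app
```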
Local Inference
First launch an OpenAI-compatible server that serves the model, for example with vLLM:

```bash
vllm serve mistralai/Devstral-Small-2505 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2
```

OpenHands Usage
With the server running, launch OpenHands and connect it to the model:

```bash
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.38-nikolaik

docker run -it --rm --pull=always \
    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.38-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands-state:/.openhands-state \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
    docker.all-hands.dev/all-hands-ai/openhands:0.38
```
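Before connecting OpenHands, it can be worth confirming that the server answers requests. A minimal check, assuming vLLM's default OpenAI-compatible port (8000) and no API key configured:

```bash
# Send one chat-completion request to the locally served model.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Devstral-Small-2505",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}]
      }'
```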
Documentation
Ultra-Low-Bit Quantization
- Precision-Adaptive Quantization: Our latest method uses layer-specific strategies for ultra-low-bit models (1-2 bit), with proven improvements on Llama-3-8B.
- Benchmark Context: All tests were conducted on Llama-3-8B-Instruct using a standard perplexity evaluation pipeline, a 2048-token context window, and the same prompt set across all quantizations.
- Method:
  - Dynamic Precision Allocation: the first and last 25% of layers use IQ4_XS (selected layers), while the middle 50% use IQ2_XXS/IQ3_S for efficiency.
  - Critical Component Protection: embedding and output layers use Q5_K, reducing error propagation by 38% compared to standard 1-2-bit quantization (a llama-quantize sketch approximating this appears after the list below).
- Quantization Performance Comparison (Llama-3-8B):
| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
- When to Use These Models:
  - Fitting models into GPU VRAM
  - Memory-constrained deployments
  - CPU and edge devices where 1-2-bit errors can be tolerated
  - Research into ultra-low-bit quantization
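A minimal sketch of how the layer-protection idea above can be approximated with stock llama.cpp tooling. It assumes a recent llama.cpp build whose `llama-quantize` supports the `--token-embedding-type` and `--output-tensor-type` overrides, uses a hypothetical importance-matrix file, and is not necessarily how these GGUF files were actually produced:

```bash
# Quantize most tensors to IQ2_XXS while keeping the token-embedding and
# output tensors at Q5_K, approximating the "Critical Component Protection"
# described above. The imatrix file name is hypothetical; it would be
# generated first with llama-imatrix on a calibration corpus.
./llama-quantize \
    --imatrix devstral-imatrix.dat \
    --token-embedding-type q5_K \
    --output-tensor-type q5_K \
    Devstral-Small-2505-bf16.gguf \
    Devstral-Small-2505-iq2_xxs.gguf \
    IQ2_XXS
```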
Choosing the Right Model Format
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, low accuracy |
| Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
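If you are unsure whether your hardware qualifies for the BF16 row above, one quick check on Linux (assuming an x86 CPU, and an NVIDIA GPU with a driver recent enough to report `compute_cap`; compute capability 8.0 or higher implies BF16 support):

```bash
# CPU: look for BF16-capable instruction-set flags.
grep -o 'avx512_bf16\|amx_bf16' /proc/cpuinfo | sort -u

# NVIDIA GPU: compute capability 8.0+ supports BF16.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```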
Included Files & Details
| File Name | Details |
|-----------|---------|
| Devstral-Small-2505-bf16.gguf | Model weights in BF16. Use for requantization or if your device supports BF16 acceleration. |
| Devstral-Small-2505-f16.gguf | Model weights in F16. Use if your device supports FP16, especially if BF16 is not available. |
| Devstral-Small-2505-bf16-q8_0.gguf | Output & embeddings in BF16, other layers quantized to Q8_0. Use if your device supports BF16 and you want a quantized version. |
| Devstral-Small-2505-f16-q8_0.gguf | Output & embeddings in F16, other layers quantized to Q8_0. |
| Devstral-Small-2505-q4_k.gguf | Output & embeddings quantized to Q8_0, other layers quantized to Q4_K. Good for CPU inference with limited memory. |
| Devstral-Small-2505-q4_k_s.gguf | Smallest Q4_K variant, using less memory at the cost of accuracy. Best for very low-memory setups. |
| Devstral-Small-2505-q6_k.gguf | Output & embeddings quantized to Q8_0, other layers quantized to Q6_K. |
| Devstral-Small-2505-q8_0.gguf | Fully Q8_0-quantized model for better accuracy. Requires more memory but offers higher precision. |
| Devstral-Small-2505-iq3_xs.gguf | IQ3_XS quantization, optimized for extreme memory efficiency. Best for ultra-low-memory devices. |
| Devstral-Small-2505-iq3_m.gguf | IQ3_M quantization, offering a medium block size for better accuracy. Suitable for low-memory devices. |
| Devstral-Small-2505-q4_0.gguf | Pure Q4_0 quantization, optimized for ARM devices. Best for low-memory environments; prefer IQ4_NL for better accuracy. |
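As a quick sanity check of any of the files above, a minimal llama.cpp invocation might look like this. It is a sketch assuming a recent llama.cpp build providing the `llama-cli` binary and that the GGUF file has already been downloaded into the working directory; the prompt, context size, and token count are illustrative:

```bash
# Run a short coding prompt against the Q4_K file.
./llama-cli \
    -m Devstral-Small-2505-q4_k.gguf \
    -c 8192 \
    -n 256 \
    -p "Write a Python function that reverses a linked list."
```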
Technical Details
Model Architecture
- Devstral is an agentic LLM for software engineering tasks, built through a collaboration between Mistral AI and All Hands AI.
- It is fine-tuned from Mistral-Small-3.1 and supports a context window of up to 128k tokens.
Benchmark Results
- SWE-Bench: Devstral achieves a score of 46.8% on SWE-Bench Verified, outperforming prior open-source SoTA by 6%.
| Model | Scaffold | SWE-Bench Verified (%) |
|------------------|--------------------|------------------------|
| Devstral | OpenHands Scaffold | 46.8 |
| GPT-4.1-mini | OpenAI Scaffold | 23.6 |
| Claude 3.5 Haiku | Anthropic Scaffold | 40.6 |
| SWE-smith-LM 32B | SWE-agent Scaffold | 40.2 |
License
This project is licensed under the Apache-2.0 license.