Devstral-Small-2505 GGUF Models
A set of models for software engineering tasks with advanced quantization and various deployment options.
Quick Start
This README provides details about the Devstral-Small-2505 GGUF models, including their generation, quantization methods, model format selection, and usage instructions.
Features
Model Generation
Ultra-Low-Bit Quantization
- Introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit).
- Uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
- Delivers benchmark-proven improvements on Llama-3-8B.
Model Format Selection
- Offers multiple model formats (BF16, F16, Quantized Models, Very Low-Bit Quantization) to suit different hardware capabilities and memory constraints.
Agentic Coding
- Designed for agentic coding tasks, making it suitable for software engineering agents.
Long Context Window
- Supports a 128k-token context window, allowing long inputs to be processed.
Open License
- Licensed under Apache 2.0, enabling both commercial and non-commercial use.
Installation
Prerequisites
- Ensure you have the necessary hardware and software requirements for the chosen model format.
- For API usage, create a Mistral account and obtain an API key.
Deployment
- API: Create a Mistral account and obtain an API key, then run the Docker commands in the Usage Examples below to start the OpenHands container.
- Local Inference: Run the model locally with LM Studio, or with providers such as vLLM, mistral-inference, transformers, or Ollama, following each provider's instructions (a minimal Ollama example is sketched below).
- OpenHands: Launch an OpenAI-compatible server (e.g., vLLM or Ollama), then use OpenHands to interact with the model; see the Usage Examples below for launching OpenHands and connecting it to the server.
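As one quick local-inference path, the model can be pulled through Ollama. This is a minimal sketch and assumes the model is published in the Ollama library under the `devstral` tag; check your provider's documentation for the exact name:

```bash
# Pull and chat with the model via Ollama (tag name assumed to be "devstral").
ollama run devstral
```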
Usage Examples
API Usage
```bash
export MISTRAL_API_KEY=<MY_KEY>

docker pull docker.all-hands.dev/all-hands-ai/runtime:0.39-nikolaik

mkdir -p ~/.openhands-state && echo '{"language":"en","agent":"CodeActAgent","max_iterations":null,"security_analyzer":null,"confirmation_mode":false,"llm_model":"mistral/devstral-small-2505","llm_api_key":"'$MISTRAL_API_KEY'","remote_runtime_resource_factor":null,"github_token":null,"enable_default_condenser":true}' > ~/.openhands-state/settings.json

docker run -it --rm --pull=always \
    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.39-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands-state:/.openhands-state \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
    docker.all-hands.dev/all-hands-ai/openhands:0.39
```
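Once the container is running, the OpenHands UI is available on the port mapped above (http://localhost:3000). To follow startup logs for the container started by the command above:

```bash
# Tail the logs of the OpenHands container (named via --name above).
docker logs -f openhands-app
```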
Local Inference
First launch an OpenAI-compatible server that serves the model, for example with vLLM:

```bash
vllm serve mistralai/Devstral-Small-2505 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2
```

OpenHands Usage
With the server running, launch OpenHands and connect it to the model:

```bash
docker pull docker.all-hands.dev/all-hands-ai/runtime:0.38-nikolaik

docker run -it --rm --pull=always \
    -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.38-nikolaik \
    -e LOG_ALL_EVENTS=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ~/.openhands-state:/.openhands-state \
    -p 3000:3000 \
    --add-host host.docker.internal:host-gateway \
    --name openhands-app \
    docker.all-hands.dev/all-hands-ai/openhands:0.38
```
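Before connecting OpenHands, it can be worth confirming that the server answers requests. A minimal check, assuming vLLM's default OpenAI-compatible port (8000) and no API key configured:

```bash
# Send one chat-completion request to the locally served model.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Devstral-Small-2505",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}]
      }'
```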
Documentation
Ultra-Low-Bit Quantization
- Precision-Adaptive Quantization: Our latest method uses layer-specific strategies for ultra-low-bit models (1-2 bit), with proven improvements on Llama-3-8B.
- Benchmark Context: All tests were conducted on Llama-3-8B-Instruct using a standard perplexity evaluation pipeline, a 2048-token context window, and the same prompt set across all quantizations.
- Method:
  - Dynamic Precision Allocation: the first and last 25% of layers use IQ4_XS (selected layers), while the middle 50% use IQ2_XXS/IQ3_S for efficiency.
  - Critical Component Protection: embedding and output layers use Q5_K, reducing error propagation by 38% compared to standard 1-2-bit quantization (a llama-quantize sketch approximating this appears after the list below).
- Quantization Performance Comparison (Llama-3-8B):
| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
- When to Use These Models:
  - Fitting models into GPU VRAM
  - Memory-constrained deployments
  - CPU and edge devices where 1-2-bit errors can be tolerated
  - Research into ultra-low-bit quantization
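A minimal sketch of how the layer-protection idea above can be approximated with stock llama.cpp tooling. It assumes a recent llama.cpp build whose `llama-quantize` supports the `--token-embedding-type` and `--output-tensor-type` overrides, uses a hypothetical importance-matrix file, and is not necessarily how these GGUF files were actually produced:

```bash
# Quantize most tensors to IQ2_XXS while keeping the token-embedding and
# output tensors at Q5_K, approximating the "Critical Component Protection"
# described above. The imatrix file name is hypothetical; it would be
# generated first with llama-imatrix on a calibration corpus.
./llama-quantize \
    --imatrix devstral-imatrix.dat \
    --token-embedding-type q5_K \
    --output-tensor-type q5_K \
    Devstral-Small-2505-bf16.gguf \
    Devstral-Small-2505-iq2_xxs.gguf \
    IQ2_XXS
```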
Choosing the Right Model Format
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, low accuracy |
| Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
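If you are unsure whether your hardware qualifies for the BF16 row above, one quick check on Linux (assuming an x86 CPU, and an NVIDIA GPU with a driver recent enough to report `compute_cap`; compute capability 8.0 or higher implies BF16 support):

```bash
# CPU: look for BF16-capable instruction-set flags.
grep -o 'avx512_bf16\|amx_bf16' /proc/cpuinfo | sort -u

# NVIDIA GPU: compute capability 8.0+ supports BF16.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```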
Included Files & Details
| File Name | Details |
|-----------|---------|
| Devstral-Small-2505-bf16.gguf | Model weights in BF16. Use for requantization or if your device supports BF16 acceleration. |
| Devstral-Small-2505-f16.gguf | Model weights in F16. Use if your device supports FP16, especially if BF16 is not available. |
| Devstral-Small-2505-bf16-q8_0.gguf | Output & embeddings in BF16, other layers quantized to Q8_0. Use if your device supports BF16 and you want a quantized version. |
| Devstral-Small-2505-f16-q8_0.gguf | Output & embeddings in F16, other layers quantized to Q8_0. |
| Devstral-Small-2505-q4_k.gguf | Output & embeddings quantized to Q8_0, other layers quantized to Q4_K. Good for CPU inference with limited memory. |
| Devstral-Small-2505-q4_k_s.gguf | Smallest Q4_K variant, using less memory at the cost of accuracy. Best for very low-memory setups. |
| Devstral-Small-2505-q6_k.gguf | Output & embeddings quantized to Q8_0, other layers quantized to Q6_K. |
| Devstral-Small-2505-q8_0.gguf | Fully Q8_0-quantized model for better accuracy. Requires more memory but offers higher precision. |
| Devstral-Small-2505-iq3_xs.gguf | IQ3_XS quantization, optimized for extreme memory efficiency. Best for ultra-low-memory devices. |
| Devstral-Small-2505-iq3_m.gguf | IQ3_M quantization, offering a medium block size for better accuracy. Suitable for low-memory devices. |
| Devstral-Small-2505-q4_0.gguf | Pure Q4_0 quantization, optimized for ARM devices. Best for low-memory environments; prefer IQ4_NL for better accuracy. |
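As a quick sanity check of any of the files above, a minimal llama.cpp invocation might look like this. It is a sketch assuming a recent llama.cpp build providing the `llama-cli` binary and that the GGUF file has already been downloaded into the working directory; the prompt, context size, and token count are illustrative:

```bash
# Run a short coding prompt against the Q4_K file.
./llama-cli \
    -m Devstral-Small-2505-q4_k.gguf \
    -c 8192 \
    -n 256 \
    -p "Write a Python function that reverses a linked list."
```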
Technical Details
Model Architecture
- Devstral is an agentic LLM for software engineering tasks, built through a collaboration between Mistral AI and All Hands AI.
- It is fine-tuned from Mistral-Small-3.1 and supports a context window of up to 128k tokens.
Benchmark Results
- SWE-Bench: Devstral achieves a score of 46.8% on SWE-Bench Verified, outperforming prior open-source SoTA by 6%.
| Model | Scaffold | SWE-Bench Verified (%) |
|------------------|--------------------|------------------------|
| Devstral | OpenHands Scaffold | 46.8 |
| GPT-4.1-mini | OpenAI Scaffold | 23.6 |
| Claude 3.5 Haiku | Anthropic Scaffold | 40.6 |
| SWE-smith-LM 32B | SWE-agent Scaffold | 40.2 |
License
This project is licensed under the Apache-2.0 license.