
Devstral-Small-2505 W4A16 GPTQ

Developed by mratsim
This is a 4-bit GPTQ quantization of the mistralai/Devstral-Small-2505 model, optimized for consumer-grade hardware.
Downloads: 557
Release date: 5/25/2025

Model Overview

This model uses the asymmetric GPTQ method to quantize weights to 4 bits while activations stay in 16-bit precision (W4A16). It was calibrated on 2048 samples with a maximum sequence length of 4096 and is suitable for text generation tasks.
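As a rough illustration of what asymmetric 4-bit weight quantization means, here is a minimal NumPy sketch of the round-to-grid step only. Real GPTQ additionally reorders columns and compensates rounding error using second-order (Hessian) information, which is omitted here; the function and group size below are illustrative, not the actual quantization pipeline used for this model.

```python
import numpy as np

def quantize_4bit_asym(w: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit per-group quantization of a weight tensor.

    Each group of `group_size` values gets its own scale and zero-point,
    so the 16 available levels (0..15) cover that group's min..max range.
    """
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0          # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)         # asymmetric zero-point
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    # At inference the 4-bit weights are expanded back to 16/32-bit on
    # the fly, while activations stay in fp16/bf16 -- hence "W4A16".
    return (q.astype(np.float32) - zero) * scale
```

The asymmetric variant (separate zero-point per group) wastes none of the 16 levels when a group's values are not centered on zero, which is why it is often preferred over symmetric quantization at 4 bits.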

Model Features

4-bit GPTQ quantization
The model's weights are quantized to 4 bits (activations remain 16-bit) using the asymmetric GPTQ method, significantly reducing hardware requirements.
Optimized calibration strategy
Calibrated using 2048 samples with a maximum sequence length of 4096, reducing the risk of overfitting to the calibration set and improving convergence.
Consumer-grade hardware adaptation
Specifically optimized to run on consumer-grade GPUs (e.g., cards with 32 GB of VRAM)
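The VRAM saving can be estimated with simple arithmetic. The sketch below assumes roughly 24B parameters (taken from the upstream Devstral Small model; this figure and the 5% overhead allowance for scales, zero-points, and unquantized layers are assumptions, not numbers from this card):

```python
def weight_footprint_gib(n_params: float, bits: int, overhead: float = 0.05) -> float:
    """Approximate size of the stored weights in GiB.

    `overhead` loosely accounts for per-group scales/zeros and layers
    typically left unquantized (embeddings, norms); 5% is a rough guess.
    """
    return n_params * bits / 8 / 2**30 * (1 + overhead)

fp16_gib = weight_footprint_gib(24e9, 16, overhead=0.0)  # ~45 GiB
w4_gib = weight_footprint_gib(24e9, 4)                   # ~12 GiB
```

At 4-bit weights the model fits in a 32 GB card with headroom left for the fp16 KV cache, whereas the fp16 weights alone would not fit.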

Model Capabilities

Text generation
Long-sequence processing (up to 94,000 tokens)
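A GPTQ checkpoint like this is typically served with an OpenAI-compatible inference engine such as vLLM. A minimal sketch follows; the repository id is an assumption inferred from this page's title, the context length is the figure stated above, and exact flag support may vary with your vLLM version:

```shell
# Sketch: serve the quantized model with vLLM (verify the model id on
# Hugging Face before use; it is assumed here, not confirmed).
vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --max-model-len 94000 \
  --gpu-memory-utilization 0.90
```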

Use Cases

Code-related tasks
Code generation
Trained on the OpenCodeInstruct dataset, suitable for code generation tasks