QwQ-32B INT8 W8A8

Developed by ospatch
INT8 quantized version of QwQ-32B, optimized by reducing the bit-width of weights and activations
Downloads 590
Release Time: 3/13/2025

Model Overview

An INT8 quantized version of QwQ-32B, optimized to reduce GPU memory requirements and increase computational throughput; suitable for text generation tasks

Model Features

INT8 Quantization
Both weights and activations are quantized to INT8, reducing GPU memory and disk-space requirements
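To illustrate what W8A8 means in practice, here is a minimal sketch of symmetric per-tensor INT8 quantization applied to both a weight tile and an activation tile, with the matmul accumulated in integers and rescaled once at the end. This is a simplified illustration, not the exact scheme used for this model (production quantizers typically use per-channel scales and calibration data).

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for a weight tile
a = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for activations

qw, sw = quantize_int8(w)  # 1 byte per weight instead of 2 (FP16) or 4 (FP32)
qa, sa = quantize_int8(a)

# W8A8 matmul: accumulate in INT32, then apply both scales once at the end
y_int8 = (qa.astype(np.int32) @ qw.astype(np.int32)).astype(np.float32) * (sa * sw)
y_fp32 = a @ w
err = np.max(np.abs(y_int8 - y_fp32))  # small quantization error
```

Storing weights as one byte each is what roughly halves the memory footprint relative to an FP16 checkpoint.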
Efficient Computation
Quantization improves matrix multiplication throughput by approximately 2x
vLLM Compatibility
Supports deployment via the vLLM Docker image, which exposes an OpenAI-compatible API
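A deployment sketch along the lines the card describes is shown below. The model repo id `ospatch/QwQ-32B-INT8-W8A8` is an assumption for illustration (the card does not state the exact id), and flags such as GPU count depend on your hardware.

```shell
# Serve the quantized model with the official vLLM OpenAI-compatible image
# (repo id is a hypothetical placeholder).
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model ospatch/QwQ-32B-INT8-W8A8

# Query the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ospatch/QwQ-32B-INT8-W8A8",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the API mirrors OpenAI's, existing OpenAI client libraries can point at `http://localhost:8000/v1` without code changes.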

Model Capabilities

Text Generation

Use Cases

Natural Language Processing
Text Generation
Used for generating coherent text content