QwQ-32B-FP8-dynamic
An FP8-quantized version of Qwen/QwQ-32B, optimized for efficient deployment with vLLM.
Quick Start
This model can be deployed efficiently using the vLLM backend. Check out the usage examples below for details.
Features
- Model Architecture: Qwen2ForCausalLM, taking text as input and outputting text.
- Model Optimizations (see the config check after this list):
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: March 6, 2025
- Version: 1.0
- Model Developers: Neural Magic
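A quick way to confirm the architecture and quantization settings listed above is to inspect the published config. This is a small sketch; it assumes the checkpoint carries the compressed-tensors style `quantization_config` that LLM Compressor writes out.

```python
from transformers import AutoConfig

# Fetch only the config; no model weights are downloaded.
config = AutoConfig.from_pretrained("neuralmagic/QwQ-32B-FP8-dynamic")
print(config.architectures)                          # ['Qwen2ForCausalLM']
print(getattr(config, "quantization_config", None))  # FP8 weight/activation scheme details
```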
Installation
The deployment example below requires vLLM (which installs Transformers as a dependency): `pip install vllm`. Reproducing the quantization additionally requires LLM Compressor: `pip install llmcompressor`.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/QwQ-32B-FP8-dynamic"

# The tokenizer is only used to build chat prompts and to expose the EOS token id.
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Load the FP8 checkpoint with vLLM; tensor_parallel_size shards it across GPUs.
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template to each conversation and generate from the resulting token ids.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
Advanced Usage: Reproducing the FP8 Quantization with LLM Compressor
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

# Load the original checkpoint in its native precision.
model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# FP8 dynamic recipe: quantize every Linear layer except the output head.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply the recipe; the FP8_DYNAMIC scheme requires no calibration data.
oneshot(
    model=model,
    recipe=recipe,
)

# Save the quantized model and tokenizer.
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
Documentation
Model Optimizations
This model was obtained by quantizing the weights and activations of Qwen/QwQ-32B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, while activations are quantized with a symmetric dynamic per-token scheme. LLM Compressor is used for quantization.
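As an illustration of the scheme described above, here is a minimal sketch of symmetric per-channel weight quantization and dynamic symmetric per-token activation quantization. It is not the code path used by vLLM or LLM Compressor, and it assumes the FP8 E4M3 format, whose largest representable value is 448.

```python
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(weight: torch.Tensor):
    """Static symmetric quantization with one scale per output channel (row)."""
    scale = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX
    q = (weight / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def quantize_activations_per_token(x: torch.Tensor):
    """Dynamic symmetric quantization with one scale per token, computed at runtime."""
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

w = torch.randn(8, 16)   # toy Linear weight: [out_features, in_features]
x = torch.randn(4, 16)   # toy activations:   [tokens, in_features]
qw, w_scale = quantize_weight_per_channel(w)
qx, x_scale = quantize_activations_per_token(x)
print(qw.dtype, w_scale.shape, x_scale.shape)  # torch.float8_e4m3fn, [8, 1], [4, 1]
```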
Accuracy
| Category | Metric | Qwen/QwQ-32B | neuralmagic/QwQ-32B-FP8-dynamic | Recovery |
|----------|--------|--------------|---------------------------------|----------|
| Reasoning | AIME 2024 (pass@1) | 78.66 | 79.40 | 100.94% |
| | MATH-500 (pass@1) | 97.39 | 97.44 | 100.05% |
| | GPQA Diamond (pass@1) | 64.72 | 63.21 | 97.66% |
| | Average Score | 80.25 | 80.05 | 99.75% |
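The Recovery column appears to be the quantized score expressed as a percentage of the baseline score; this is an assumed definition, but it matches the numbers in the table. For example, for the AIME 2024 row:

```python
# Assumed recovery definition, consistent with the table: quantized / baseline * 100.
baseline, quantized = 78.66, 79.40           # AIME 2024 (pass@1) scores from the table
print(f"{quantized / baseline * 100:.2f}%")  # 100.94%
```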
Technical Details
The model is quantized with LLM Compressor. Weights use a static symmetric per-channel FP8 scheme, while activations use a dynamic symmetric per-token FP8 scheme whose scales are computed at runtime (the "dynamic" in the model name).
License
This project is licensed under the MIT license.