QwQ-32B-FP8-dynamic Open Source Model: Dynamic FP8 Quantization Cuts Storage and Memory by About 50% While Retaining 99.75% of Original Accuracy


Developed by nm-testing

An FP8-quantized version of QwQ-32B. Dynamic FP8 quantization of weights and activations reduces storage and memory requirements by approximately 50% while retaining 99.75% of the original model's average accuracy.

Large Language Model

Transformers

Open Source License: MIT

Tags: FP8 quantization, Inference optimization, Large language model

Downloads 3,895

Release Time: 3/5/2025

Model Overview

An FP8-quantized version of Qwen/QwQ-32B for efficient inference deployment, intended for use with the vLLM backend.

Model Features

FP8 Dynamic Quantization

Both weights and activations use FP8 quantization, reducing storage and memory requirements by approximately 50%

High Accuracy Retention

Comprehensive tests show retention of 99.75% of the original model accuracy, with some test metrics even showing improvement

vLLM Optimization

Designed to run on the vLLM inference framework, including tensor-parallel inference across multiple GPUs

Quantization Scheme Optimization

Weights use per-channel symmetric quantization, while activations use per-token symmetric quantization

Model Capabilities

Chinese text generation

Multi-turn dialogue

Complex reasoning

Knowledge Q&A

Use Cases

Intelligent dialogue

Personalized role-playing

Simulates specific character styles in dialogue, such as a pirate tone

Achieves stylized expression while maintaining semantic accuracy

Educational assistance

Mathematical problem solving

Solves math problems at high-school level and above

Achieves 97.44% pass@1 accuracy on the MATH-500 benchmark

Professional consultation

Professional domain Q&A

Answers expert-level questions from the GPQA Diamond benchmark

Scores 63.21% (pass@1) on GPQA Diamond, compared with 64.72% for the unquantized model

🚀 QwQ-32B-FP8-dynamic

A quantized version of Qwen/QwQ-32B, offering efficient deployment with reduced resource requirements.

🚀 Quick Start

This document provides an overview of the QwQ-32B-FP8-dynamic model, including its architecture, optimizations, usage examples, creation process, and accuracy metrics.

✨ Features

Model Architecture: Based on Qwen2ForCausalLM, taking text as input and outputting text.
Model Optimizations: Quantized weights and activations to FP8 data type, reducing disk size and GPU memory requirements by approximately 50%.
Efficient Deployment: Can be deployed efficiently using the vLLM backend.

📦 Installation

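The original model card lists no dedicated installation steps. The usage examples below require the vllm and transformers Python packages; recreating the quantized model additionally requires llmcompressor.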

💻 Usage Examples

Basic Usage

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

number_gpus = 1
model_name = "neuralmagic/QwQ-32B-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
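As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat completion request using only the Python standard library. It assumes a vLLM server is already running locally on its default port (for example after `vllm serve neuralmagic/QwQ-32B-FP8-dynamic`); `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, not part of the original model card.

```python
import json
import urllib.request

# Assumption: a vLLM OpenAI-compatible server is listening on localhost:8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "neuralmagic/QwQ-32B-FP8-dynamic"


def build_chat_request(prompt: str, temperature: float = 0.6,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the server and return the first reply's text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (only works while the server is up):
# print(chat("Who are you? Please respond in pirate speak!"))
```

The sampling values mirror the offline example above; any OpenAI-compatible client library pointed at the same base URL should work equally well.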

Advanced Usage

This model was created with llm-compressor by running the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    torch_dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

📚 Documentation

Model Overview

Model Architecture: Qwen2ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Release Date: 3/6/2025
Version: 1.0
Model Developers: Neural Magic

Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/QwQ-32B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
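The 50% figure follows from simple arithmetic on bits per parameter. The sketch below works it through for the weights alone; the parameter count is an assumption taken from the "32B" in the model name, not an exact figure.

```python
# Assumption: nominal parameter count read off the "32B" in the model name.
NOMINAL_PARAMS = 32e9


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_footprint_gb(NOMINAL_PARAMS, 16)
fp8_gb = weight_footprint_gb(NOMINAL_PARAMS, 8)
reduction = 1 - fp8_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"FP8 weights:    {fp8_gb:.0f} GB")
print(f"Reduction:      {reduction:.0%}")
```

The real saving is only approximately 50% because layers left unquantized (such as `lm_head` in the creation recipe) stay at 16 bits, and the quantization scales add a small overhead.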

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme. LLM Compressor is used for quantization.
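To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric quantization with one scale per row. It is an emulation for illustration, not the kernel used by LLM Compressor or vLLM, and it assumes the E4M3 variant of FP8 (largest finite value 448). For a weight matrix of shape [out_features, in_features] each row is an output channel; for an activation matrix of shape [tokens, hidden] each row is a token, and its scale is computed at inference time. The helper names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def round_to_e4m3(x: float) -> float:
    """Emulate rounding a float to the nearest FP8 E4M3 value."""
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(abs(x))      # abs(x) = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)           # unbiased exponent, floored at the subnormal range
    step = 2.0 ** (exp - 3)          # spacing of values with 3 mantissa bits
    q = round(x / step) * step
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q))


def quantize_rows(matrix):
    """Symmetric quantization (no zero point), one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Per-channel: each row is one output channel of a weight matrix.
weights = [[0.02, -0.5, 0.13], [1.5, 0.001, -0.75]]
w_q, w_scales = quantize_rows(weights)

# Per-token: each row is one token's activations, scaled dynamically.
activations = [[3.0, -12.0, 0.5], [0.01, 0.02, -0.04]]
a_q, a_scales = quantize_rows(activations)

print(dequantize_rows(w_q, w_scales))
print(dequantize_rows(a_q, a_scales))
```

Because every row gets its own scale, a single large value only affects the resolution of its own channel or token rather than the whole tensor.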

Accuracy

Category	Metric	Qwen/QwQ-32B	neuralmagic/QwQ-32B-FP8-dynamic	Recovery
Reasoning	AIME 2024 (pass@1)	78.66	79.40	100.94%
	MATH-500 (pass@1)	97.39	97.44	100.05%
	GPQA Diamond (pass@1)	64.72	63.21	97.66%
	Average Score	80.25	80.05	99.75%
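
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The sketch below recomputes it from the table under that assumption; the `recovery` helper is illustrative.

```python
# Scores copied from the accuracy table: (baseline, quantized).
scores = {
    "AIME 2024": (78.66, 79.40),
    "MATH-500": (97.39, 97.44),
    "GPQA Diamond": (64.72, 63.21),
    "Average Score": (80.25, 80.05),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


for metric, (baseline, quantized) in scores.items():
    print(f"{metric}: {recovery(baseline, quantized):.2f}%")
```

Values recomputed from the rounded scores can differ from the published column in the last digit, presumably because the table was derived from unrounded results.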

📄 License

This project is licensed under the MIT license.
