
Qwen3-32B-quantized.w4a16

Developed by RedHatAI
An INT4-quantized version of Qwen3-32B that reduces disk and GPU memory requirements by roughly 75% through weight-only quantization while maintaining accuracy close to the original model
Downloads 2,213
Release Time: 5/5/2025

Model Overview

A quantized version of Qwen3-32B, suitable for text generation, function calling, and multilingual tasks, with support for efficient inference
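
The roughly 75% memory reduction quoted above follows from simple byte arithmetic. The sketch below is a back-of-the-envelope estimate that counts only weight storage and ignores the KV cache, activations, and quantization metadata such as scales and zero-points.

```python
# Rough weight-memory arithmetic behind the ~75% figure quoted above.
# Counts weights only; KV cache, activations, and quantization scales are ignored.
params = 32e9                      # ~32B parameters
bf16_gb = params * 2 / 1e9         # 16-bit weights: 2 bytes/param -> ~64 GB
int4_gb = params * 0.5 / 1e9       # 4-bit weights: 0.5 bytes/param -> ~16 GB
print(f"BF16 ≈ {bf16_gb:.0f} GB, INT4 ≈ {int4_gb:.0f} GB, "
      f"saving ≈ {1 - int4_gb / bf16_gb:.0%}")
```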

Model Features

Efficient quantization
Uses INT4 weight-only quantization (4-bit weights, 16-bit activations) to reduce disk and GPU memory requirements by roughly 75%
High performance retention
The quantized model retains over 99% of the original model's scores across multiple benchmarks
Multilingual support
Supports instruction following and translation tasks in multiple languages
Efficient inference
Optimized for deployment on efficient inference frameworks such as vLLM (see the sketch after this list)
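
A minimal offline-inference sketch with vLLM. The repository id below is an assumption taken from the listing title; confirm the exact name on Hugging Face before use.

```python
from vllm import LLM, SamplingParams

# Assumed repo id; verify the published name before running.
llm = LLM(model="RedHatAI/Qwen3-32B-quantized.w4a16", max_model_len=4096)
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

outputs = llm.generate(
    ["Summarize the benefits of INT4 weight-only quantization."],
    sampling,
)
print(outputs[0].outputs[0].text)
```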

Model Capabilities

Text generation
Function calling (see the sketch after this list)
Multilingual instruction following
Translation
Domain fine-tuning
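
To illustrate the function-calling capability, here is a sketch of a tool-calling request against a vLLM OpenAI-compatible server. The server flags, the repository id, and the get_weather tool are illustrative assumptions, not part of this listing.

```python
# Hypothetical tool-calling request against a local vLLM OpenAI-compatible server,
# e.g. started with:
#   vllm serve RedHatAI/Qwen3-32B-quantized.w4a16 \
#     --enable-auto-tool-choice --tool-call-parser hermes
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool schema; not shipped with the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="RedHatAI/Qwen3-32B-quantized.w4a16",  # assumed repo id
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```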

Use Cases

General reasoning
Knowledge Q&A
Answers various knowledge-based questions
Scores 80.36 on the MMLU benchmark
Mathematical reasoning
Solves mathematical problems
Scores 85.97 on the GSM-8K benchmark
Professional applications
Domain expert
Can be fine-tuned to serve as a domain expert
Code generation
Generates programming code