Qwen3-30B-A3B Quantized Open-Source Model: Reduces Memory Requirements by 75% While Maintaining High Performance

Qwen3 30B A3B Quantized.w4a16

Developed by RedHatAI

INT4 quantized version of Qwen3-30B-A3B, reducing disk and GPU memory requirements by 75% while maintaining high performance.

Large Language Model

Transformers

Open Source License: Apache-2.0 #INT4 quantization #Multilingual instruction #Long-text reasoning

Downloads 379

Release Time: 5/6/2025

Model Overview

Quantized model based on Qwen3-30B-A3B, suitable for inference, function calling, multilingual instruction following, and translation tasks.

Model Features

Efficient weight quantization

Adopts INT4 quantization scheme, reducing disk and GPU memory requirements by 75%.

High-performance inference

Stays close to the original model across benchmarks, with average recovery above 98% on both OpenLLM v1 and OpenLLM v2.

Multilingual support

Supports multilingual instruction following and translation tasks.

Optimized deployment

Deploys efficiently with the vLLM backend, which also provides OpenAI-compatible serving.

Model Capabilities

Text generation

Function calling

Multilingual instruction following

Translation

Use Cases

Natural language processing

Multilingual translation

Supports high-quality translation between multiple languages.

Instruction following

Capable of understanding and executing complex multilingual instructions.

Reasoning tasks

Mathematical reasoning

Excels in mathematical reasoning tasks.

Achieved 86.66 points in GSM-8K tasks

Logical reasoning

Maintains high performance in logical reasoning tasks.

Achieved 62.97 points in ARC Challenge tasks

🚀 Qwen3-30B-A3B-quantized.w4a16

This is a quantized version of the Qwen3-30B-A3B model, optimized for reduced disk space and GPU memory usage and suitable for a range of text-generation tasks.

🚀 Quick Start

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-30B-A3B-quantized.w4a16"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a single-turn chat and render it with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
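
As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat-completions request and posts it with the Python standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/Qwen3-30B-A3B-quantized.w4a16`) and is listening on `http://localhost:8000/v1`; the host, port, and the helper names `build_chat_request` and `chat` are illustrative assumptions, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen3-30B-A3B-quantized.w4a16"
# Assumption: a vLLM server is already running locally on port 8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions payload reusing the Quick Start sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the request to the server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Give me a short introduction to large language model.")` only works while the server is up; `build_chat_request` can be inspected offline.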

✨ Features

Model Overview

Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Weight quantization: INT4
Intended Use Cases:
- Reasoning.
- Function calling.
- Subject matter experts via fine-tuning.
- Multilingual instruction following.
- Translation.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
Release Date: 05/05/2025
Version: 1.0
Model Developers: RedHat (Neural Magic)

Model Optimizations

This model was obtained by quantizing the weights of Qwen3-30B-A3B to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
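
The 75% figure follows directly from the bit widths. The sketch below works through the arithmetic, assuming a nominal 30 billion parameters (taken from the model name, not an exact count) and ignoring quantization scales and any layers left unquantized, so real checkpoints will come out somewhat larger.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 30e9  # assumption: nominal parameter count from the model name

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
reduction = 1 - int4_gib / fp16_gib

print(f"16-bit: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB, reduction: {reduction:.0%}")
```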

Only the weights of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-group scheme with a group size of 128. Quantization uses the GPTQ algorithm, as implemented in the llm-compressor library.
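
To make "symmetric per-group" concrete, here is a minimal pure-Python sketch of the storage scheme: each group of 128 weights shares one scale and has no zero-point. It uses plain round-to-nearest rather than GPTQ (GPTQ additionally adjusts the remaining weights using calibration data to compensate for rounding error), and the scale convention `max_abs / 7` is one common choice assumed here, not necessarily the one llm-compressor uses.

```python
import random

GROUP_SIZE = 128
QMIN, QMAX = -8, 7  # signed 4-bit integer range


def quantize_group(weights):
    """Quantize one group symmetrically: a single scale, zero-point fixed at 0."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.uniform(-1.0, 1.0) for _ in range(256)]  # toy weights, not real model data
groups = quantize_row(row)
print(f"{len(groups)} groups, scales: {[round(s, 4) for _, s in groups]}")
```

Under the further assumption that each scale is stored in 16 bits, the cost is about 4 + 16/128 = 4.125 bits per quantized weight, which is why the reduction is described as approximately 75%.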

🔧 Technical Details

Creation

Creation details

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen3-30B-A3B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

def preprocess_fn(example):
  return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    ignore=["lm_head", "re:.*gate$"],  # keep the output head and modules whose names end in "gate" unquantized
    sequential_targets=["Qwen3DecoderLayer"],
    targets="Linear",
    scheme="W4A16",
    dampening_frac=0.01,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds, 
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
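
The `ignore` list in the recipe decides which linear layers stay in 16-bit. The sketch below shows how such a list could be evaluated, assuming the `re:` prefix marks a regular expression matched against the full module name and other entries are exact names. The helper `is_ignored` and the module names are hypothetical illustrations, not llm-compressor's actual implementation.

```python
import re

IGNORE = ["lm_head", "re:.*gate$"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if the module would be skipped by the ignore list."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Regex entry: strip the prefix and match from the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


# Hypothetical module names in the style of a mixture-of-experts decoder.
for name in [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.self_attn.q_proj",
]:
    print(f"{name}: {'kept in 16-bit' if is_ignored(name) else 'quantized'}")
```

Because the pattern is anchored with `$`, it matches names that end in `gate` but not `gate_proj`, so expert projections are still quantized.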

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2) using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using lighteval. vLLM was used for all evaluations.

Evaluation details

lm-evaluation-harness

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-30B-A3B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-30B-A3B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks mgsm \
  --apply_chat_template \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-30B-A3B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

lighteval

lighteval_model_arguments.yaml

model_parameters:
  model_name: RedHatAI/Qwen3-30B-A3B-quantized.w4a16
  dtype: auto
  gpu_memory_utilization: 0.9
  max_model_length: 40960
  generation_parameters:
    temperature: 0.6
    top_k: 20
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 32768
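
The two length settings above interact: assuming `max_model_length` bounds the prompt plus the generated tokens (as a vLLM context length does), the space left for the prompt is the difference between the two. A small sketch of that check, with `prompt_budget` as a hypothetical helper name:

```python
MAX_MODEL_LENGTH = 40960  # from lighteval_model_arguments.yaml
MAX_NEW_TOKENS = 32768    # from lighteval_model_arguments.yaml


def prompt_budget(max_model_length: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once the generation budget is reserved."""
    if max_new_tokens > max_model_length:
        raise ValueError("max_new_tokens exceeds the model context length")
    return max_model_length - max_new_tokens


print(prompt_budget(MAX_MODEL_LENGTH, MAX_NEW_TOKENS))
```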

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime24|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|math_500|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|gpqa:diamond|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "extended|lcb:codegeneration" \
  --use_chat_template = true

Accuracy

Category	Benchmark	Qwen3-30B-A3B	Qwen3-30B-A3B-quantized.w4a16 (this model)	Recovery
OpenLLM v1	MMLU (5-shot)	77.67	76.11	98.0%
	ARC Challenge (25-shot)	63.40	62.97	99.3%
	GSM-8K (5-shot, strict-match)	87.26	86.66	99.3%
	Hellaswag (10-shot)	54.33	54.76	100.8%
	Winogrande (5-shot)	66.77	64.33	96.3%
	TruthfulQA (0-shot, mc2)	56.27	54.76	97.3%
	Average	67.62	66.60	98.5%
OpenLLM v2	MMLU-Pro (5-shot)	47.45	45.38	95.6%
	IFEval (0-shot)	86.26	84.86	98.4%
	BBH (3-shot)	34.81	28.12	80.8%
	Math-lvl-5 (4-shot)	52.14	56.99	109.3%
	GPQA (0-shot)	0.31	0.60	---
	MuSR (0-shot)	8.09	9.05	---
	Average	38.18	37.50	98.2%
Multilingual	MGSM (0-shot)	32.27	33,890	104.8%
Reasoning (generation)	AIME 2024	78.33	78.54	100.3%
	AIME 2025	71.46	70.31	98.4%
	GPQA diamond	62.63	62.12	99.2%
	Math-lvl-5	97.60	97.20	99.6%
	LiveCodeBench	60.66	58.75	96.9%
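
The Recovery column is consistent with the quantized score expressed as a percentage of the baseline score. The sketch below reproduces a few rows under that assumption, using scores copied from the table; the function name `recovery` is illustrative.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score, to one decimal."""
    return round(100.0 * quantized / baseline, 1)


# (benchmark, baseline score, quantized score) taken from the table above
rows = [
    ("MMLU (5-shot)", 77.67, 76.11),
    ("ARC Challenge (25-shot)", 63.40, 62.97),
    ("Hellaswag (10-shot)", 54.33, 54.76),
    ("BBH (3-shot)", 34.81, 28.12),
]

for name, base, quant in rows:
    print(f"{name}: {recovery(base, quant)}%")
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark, which is within normal run-to-run variation for small differences.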

📄 License

This model is released under the Apache-2.0 license.
