Qwen3-30B-A3B-FP8-dynamic Open-source AI Model - Reducing Costs while Maintaining High Accuracy

Qwen3 30B A3B FP8 Dynamic

Developed by RedHatAI

Qwen3-30B-A3B-FP8-dynamic is an FP8 quantized version of the Qwen3-30B-A3B model, significantly reducing memory requirements and computational costs while maintaining the high accuracy of the original model.

Large Language Model

Transformers

Open Source License: Apache-2.0 #FP8 quantization #Multilingual instructions #Efficient inference

Downloads 187

Release Time: 5/3/2025

Model Overview

This model reduces memory usage and compute cost by quantizing weights and activations to the FP8 format, making it suitable for tasks such as reasoning, function calling, and multilingual instruction following.

Model Features

FP8 quantization

Both weights and activations use FP8 quantization, significantly reducing memory requirements and computational costs.

Efficient inference

Through quantization optimization, matrix multiplication throughput is improved by approximately 2x.

High accuracy retention

The quantized model recovers roughly 99% or more of the original model's score on most reported benchmarks (99.9% on the OpenLLM v1 average); the largest drop is on AIME 2025, at 95.8% recovery.

Multilingual support

Supports multilingual instruction following and translation tasks.

Model Capabilities

Text generation

Function calling

Multilingual instruction following

Translation

Domain fine-tuning

Use Cases

Natural language processing

Text generation

Generates high-quality natural language text

Scores 67.56 on the OpenLLM v1 average, versus 67.62 for the unquantized model

Multilingual translation

Supports translation tasks between multiple languages

Professional domain applications

Domain expert fine-tuning

Can be fine-tuned to become an expert model for specific domains

🚀 Qwen3-30B-A3B-FP8-dynamic

This is a text generation model based on the Qwen3 architecture, optimized through FP8 quantization.

🚀 Quick Start

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Single-turn chat; the chat template below converts it to a prompt string
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
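
As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat completion request with the Python standard library only. It assumes a server was started locally with `vllm serve RedHatAI/Qwen3-30B-A3B-FP8-dynamic` and is listening on vLLM's default port 8000. The helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen3-30B-A3B-FP8-dynamic"
# Assumption: a local server started with `vllm serve <MODEL_ID>` on the default port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_request("Give me a short introduction to large language model.")
    print(json.dumps(payload, indent=2))
    # Uncomment once a server is running:
    # print(send_chat_request(payload))
```

Any OpenAI-compatible client should work equally well; the standard library is used here only to keep the sketch dependency-free.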

✨ Features

Model Architecture: Qwen3MoeForCausalLM, taking text as input and outputting text.
Model Optimizations:
- Activation quantization: FP8
- Weight quantization: FP8
Intended Use Cases:
- Reasoning.
- Function calling.
- Subject matter experts via fine-tuning.
- Multilingual instruction following.
- Translation.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
Release Date: 05/05/2025
Version: 1.0
Model Developers: RedHat (Neural Magic)

Model Optimizations

This model was obtained by quantizing activations and weights of Qwen3-30B-A3B to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
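
The 50% figure follows from simple arithmetic on bits per parameter. The sketch below works it through under an assumed total of about 30.5 billion parameters; that count is an assumption, not a figure stated in this card. It covers weight storage only and ignores the KV cache, activations, quantization scales, and layers left unquantized.

```python
# Assumption: roughly 30.5 billion total parameters (illustrative figure only).
TOTAL_PARAMS = 30.5e9


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


bf16_gib = weight_memory_gib(TOTAL_PARAMS, 16)
fp8_gib = weight_memory_gib(TOTAL_PARAMS, 8)

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:    {fp8_gib:.1f} GiB")
print(f"Reduction:      {1 - fp8_gib / bf16_gib:.0%}")
```

In practice the saving is slightly under 50%, since some tensors stay in 16-bit and the per-channel scales add a small overhead.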

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
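
To make the two schemes concrete, here is a minimal pure-Python emulation. It is not the llm-compressor or vLLM implementation: real kernels operate on FP8 tensors on the GPU. The sketch assumes max-abs scaling into the FP8 E4M3 range (largest finite value 448), and the helpers `round_to_e4m3`, `quantize_rows`, and `fp8_linear` are hypothetical names. Weight scales are computed once per output channel (static), while activation scales are recomputed per token on every call (dynamic).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Emulate rounding a float to the nearest FP8 E4M3 value (illustrative)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    # Exponent of the binade containing x; -6 is the smallest normal exponent.
    exponent = max(math.floor(math.log2(abs(x))), -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per binade
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one max-abs scale per row."""
    quantized, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def fp8_linear(activations, q_weights, weight_scales):
    """Compute x @ W^T from quantized values, then rescale the result."""
    # Dynamic per-token: activation scales depend on the current input.
    q_x, x_scales = quantize_rows(activations)
    return [
        [
            sum(a * w for a, w in zip(x_row, w_row)) * s_x * s_w
            for w_row, s_w in zip(q_weights, weight_scales)
        ]
        for x_row, s_x in zip(q_x, x_scales)
    ]


# Static per-channel: weight scales are computed once, ahead of time.
# Each row of `weights` is one output channel.
weights = [[0.5, -1.0, 0.25], [2.0, 4.0, -8.0]]
q_weights, weight_scales = quantize_rows(weights)

activations = [[1.0, 2.0, -4.0]]  # a single token
print(fp8_linear(activations, q_weights, weight_scales))
```

The example values are chosen to be exactly representable after scaling, so the result matches the unquantized product (-2.5 and 42) up to floating-point error; arbitrary inputs would show a small rounding error instead.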

📚 Documentation

Creation

Creation details

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen3-30B-A3B"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

Evaluation

The command below runs the OpenLLM leaderboard tasks (version 1) using lm-evaluation-harness and vLLM. The accuracy table also reports OpenLLM v2, multilingual, and reasoning results.

Evaluation details

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-30B-A3B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

Accuracy

Category	Benchmark	Qwen3-30B-A3B	Qwen3-30B-A3B-FP8-dynamic (this model)	Recovery
OpenLLM v1	MMLU (5-shot)	77.67	77.49	99.8%
OpenLLM v1	ARC Challenge (25-shot)	63.40	63.65	100.4%
OpenLLM v1	GSM-8K (5-shot, strict-match)	87.26	86.73	99.4%
OpenLLM v1	Hellaswag (10-shot)	54.33	54.33	100.0%
OpenLLM v1	Winogrande (5-shot)	66.77	66.30	99.3%
OpenLLM v1	TruthfulQA (0-shot, mc2)	56.27	56.88	101.1%
OpenLLM v1	Average	67.62	67.56	99.9%
OpenLLM v2	MMLU-Pro (5-shot)	47.45	48.40	102.0%
OpenLLM v2	IFEval (0-shot)	86.26	86.08	99.8%
OpenLLM v2	BBH (3-shot)	34.81	34.70	99.7%
OpenLLM v2	Math-lvl-5 (4-shot)	52.14	59.39	113.9%
OpenLLM v2	GPQA (0-shot)	0.31	0.90	---
OpenLLM v2	MuSR (0-shot)	8.09	9.05	---
OpenLLM v2	Average	38.18	39.75	104.1%
Multilingual	MGSM (0-shot)	32.27	32.73	101.5%
Reasoning (generation)	AIME 2024	78.33	78.96	100.8%
Reasoning (generation)	AIME 2025	71.46	68.44	95.8%
Reasoning (generation)	GPQA diamond	62.63	62.63	100.0%
Reasoning (generation)	Math-lvl-5	97.60	95.80	98.2%
Reasoning (generation)	LiveCodeBench	60.66	60.89	100.4%
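
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The short check below reproduces three rows of the table under that assumption; the function name `recovery` is introduced here for illustration.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) pairs copied from the table above
rows = {
    "MMLU (5-shot)": (77.67, 77.49),
    "AIME 2025": (71.46, 68.44),
    "OpenLLM v2 average": (38.18, 39.75),
}

for name, (base, quant) in rows.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that benchmark, which is usually within run-to-run noise rather than a real improvement.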

📄 License

This project is licensed under the Apache-2.0 license.
