# Qwen3-235B-A22B-FP8-dynamic
A quantized large language model based on Qwen3-235B-A22B, optimized with FP8 quantization for efficient deployment.
## 🚀 Quick Start
This section provides a quick guide to using the Qwen3-235B-A22B-FP8-dynamic model.
### Deployment Example
The model can be efficiently deployed using the vLLM backend. Here is an example code snippet:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-235B-A22B-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# Render the chat template to a plain prompt string for offline generation
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. For more details, refer to the documentation.
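As a minimal sketch of that serving path (the exact flags depend on your vLLM version and hardware; the tensor-parallel size mirrors the offline example above):

```bash
# Start an OpenAI-compatible server for this model on 4 GPUs
vllm serve RedHatAI/Qwen3-235B-A22B-FP8-dynamic --tensor-parallel-size 4

# Query the chat completions endpoint from another shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Qwen3-235B-A22B-FP8-dynamic",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
      }'
```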
## ✨ Features
### Model Overview
- Model Architecture: Qwen3MoeForCausalLM
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Intended Use Cases:
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 05/05/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)
### Model Optimizations
This model was obtained by quantizing activations and weights of Qwen3-235B-A22B to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
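As a back-of-the-envelope check on the memory claim (weights only, ignoring activations, KV cache, and runtime overheads):

```python
# Rough weight-memory estimate for 235B parameters (illustrative, not measured).
params = 235e9
bf16_gib = params * 2 / 1024**3  # 2 bytes per parameter at 16-bit
fp8_gib = params * 1 / 1024**3   # 1 byte per parameter at 8-bit
print(f"BF16: ~{bf16_gib:.0f} GiB, FP8: ~{fp8_gib:.0f} GiB")  # ~438 GiB vs ~219 GiB
```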
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
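The scheme can be illustrated with a small, hypothetical sketch (this is not the kernel vLLM actually runs; `FP8_MAX` assumes the E4M3 format, whose largest finite value is 448):

```python
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(w: torch.Tensor):
    # Symmetric static per-channel: one scale per output channel, fixed offline.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token: one scale per token, recomputed at runtime.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

w = torch.randn(8, 16)   # weight of a linear operator (out_features x in_features)
x = torch.randn(4, 16)   # activations for a batch of 4 tokens
qw, w_scale = quantize_weight_per_channel(w)
qx, x_scale = quantize_activation_per_token(x)

# Dequantizing recovers an approximation of the original matmul
y_approx = (qx.float() * x_scale) @ (qw.float() * w_scale).T
```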
## 📦 Installation
No dedicated installation steps are provided in the original model card; the examples here only require vLLM and Transformers (e.g. `pip install vllm transformers`).
## 💻 Usage Examples
### Basic Usage
Basic usage is identical to the deployment example in the Quick Start section above.
### Advanced Usage
The original document provides no dedicated advanced usage example; for OpenAI-compatible serving, see the sketch in the Quick Start section above.
## 📚 Documentation
### Creation
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original model and tokenizer
model_stub = "Qwen/Qwen3-235B-A22B"
model_name = model_stub.split("/")[-1]
model = AutoModelForCausalLM.from_pretrained(model_stub)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization scheme: FP8-dynamic on all Linear layers
# except the output head
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save the compressed model and tokenizer to disk
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
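The resulting `save_path` directory can be loaded by vLLM in the same way as the hub checkpoint in the deployment example above, by passing the local path as `model`.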
### Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (version 1), using lm-evaluation-harness and vLLM.
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-235B-A22B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=4 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```
#### Accuracy
| Category | Benchmark | Qwen3-235B-A22B | Qwen3-235B-A22B-FP8-dynamic (this model) | Recovery |
|---|---|---|---|---|
| OpenLLM v1 | MMLU (5-shot) | 84.77 | 84.61 | 99.8% |
| OpenLLM v1 | ARC Challenge (25-shot) | 71.84 | 70.90 | 98.7% |
| OpenLLM v1 | GSM-8K (5-shot, strict-match) | 74.22 | 74.98 | 101.0% |
| OpenLLM v1 | Hellaswag (10-shot) | 76.56 | 76.10 | 99.4% |
| OpenLLM v1 | Winogrande (5-shot) | 73.95 | 75.06 | 101.5% |
| OpenLLM v1 | TruthfulQA (0-shot, mc2) | 61.18 | 60.93 | 99.6% |
| OpenLLM v1 | **Average** | **73.75** | **73.76** | **100.0%** |
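The Recovery column is the quantized score expressed as a percentage of the baseline score, e.g. for the MMLU row:

```python
# Recovery = quantized score / baseline score, illustrated with the MMLU row.
baseline, quantized = 84.77, 84.61
print(f"{quantized / baseline * 100:.1f}%")  # 99.8%
```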
## 🔧 Technical Details
The original document has no separate technical details section; see Model Optimizations above for the quantization scheme.
## 📄 License
The model is released under the `apache-2.0` license.