Qwen3-32B-FP8-dynamic Open-source and High-efficiency Language Model Reduces Memory Requirement and Improves Computational Efficiency

Qwen3 32B FP8 Dynamic

Developed by RedHatAI

An efficient language model based on Qwen3-32B with FP8 dynamic quantization, significantly reducing memory requirements and improving computational efficiency

Large Language Model

Transformers

Open Source License: Apache-2.0 #FP8 quantization #Multilingual instruction #Function call support

Downloads 917

Release Time : 5/2/2025

Model Overview

This model is obtained by quantizing the weights and activations of Qwen3-32B to the FP8 data type, which reduces GPU memory requirements by approximately 50% and increases matrix-multiply throughput by approximately 2x. It is suitable for reasoning, function calling, and multilingual instruction following.

Model Features

FP8 Quantization

Quantization of weights and activations to FP8 data type, significantly reducing memory requirements and improving computational efficiency

Efficient Deployment

Supports efficient deployment via vLLM backend, optimizing inference performance

High Accuracy Retention

The quantized model recovers between roughly 98% and 105% of the original model's score on each benchmark with a reported recovery, and 100.1% and 101.4% on the OpenLLM v1 and v2 averages

Model Capabilities

Text generation

Function calling

Multilingual instruction following

Translation

Reasoning

Use Cases

General AI Assistant

Knowledge Q&A

Answering various knowledge-based questions

Achieved a score of 80.89 in MMLU (5-shot) testing

Mathematical Reasoning

Solving math problems and logical reasoning

Achieved a score of 88.32 in GSM-8K testing

Professional Domain Applications

Medical Q&A

Answering medical-related questions

No medical benchmark is reported for this model; the 79.37 score listed here is from AIME 2024, a mathematics competition benchmark

Code Generation

Generating code based on descriptions

Achieved a score of 63.32 on LiveCodeBench

🚀 Qwen3-32B-FP8-dynamic

This model is an FP8-quantized version of Qwen3-32B. Quantization reduces GPU memory requirements and increases compute throughput.

✨ Features

Model Architecture: Based on Qwen3ForCausalLM, it takes text as input and outputs text.
Model Optimizations: Both weights and activations are quantized to FP8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
Intended Use Cases: Suitable for reasoning, function calling, subject matter experts via fine-tuning, multilingual instruction following, and translation.

💻 Usage Examples

Basic Usage

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-32B-FP8-dynamic"
number_gpus = 1

# Sampling settings used for generation
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the chat-formatted prompt from a single user message
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the quantized model and generate a completion
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

📚 Documentation

Model Overview

Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Activation quantization: FP8
- Weight quantization: FP8
Intended Use Cases:
- Reasoning.
- Function calling.
- Subject matter experts via fine-tuning.
- Multilingual instruction following.
- Translation.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
Release Date: 05/02/2025
Version: 1.0
Model Developers: RedHat (Neural Magic)

Model Optimizations

This model was obtained by quantizing activations and weights of [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
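The "approximately 50%" figure follows from the bit widths alone. The sketch below works through that arithmetic for the weights only. The parameter count of 32.8 billion is an assumption taken from the base model's public description, not from this document, and `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: roughly 32.8 billion parameters for Qwen3-32B.
NUM_PARAMS = 32.8e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
fp8_gib = weight_memory_gib(NUM_PARAMS, 8)

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:    {fp8_gib:.1f} GiB")
print(f"Ratio:          {fp8_gib / bf16_gib:.2f}")
```

In practice the saving is slightly below half, since layers excluded from quantization stay in 16 bits, and the KV cache and activations at runtime add to the total.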

Only weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
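To make the two schemes concrete, here is a minimal pure-Python simulation, not the llm-compressor or vLLM implementation. It assumes the FP8 format is E4M3 (largest finite value 448), which the text above does not state, and the helper names `fake_fp8`, `quantize_rows`, and `fp8_linear` are hypothetical. Weights get one scale per output channel, computed once ahead of time (static); activations get one scale per token, computed at runtime from the incoming values (dynamic). Both are symmetric, so there is no zero point.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # assumption: E4M3 format, largest finite value


def fake_fp8(value: float) -> float:
    """Round a float to the nearest E4M3-representable value (simulation)."""
    value = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    if value == 0.0:
        return 0.0
    # Exponent of the value, clamped to the smallest normal exponent (-6)
    exponent = max(math.floor(math.log2(abs(value))), -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits
    return round(value / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        # Map the row's largest magnitude onto the FP8 maximum
        scale = max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0
        quantized.append([fake_fp8(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def fp8_linear(x, w_q, w_scales):
    """y = x @ W^T with per-token activation scales computed on the fly."""
    x_q, x_scales = quantize_rows(x)  # dynamic: depends on this input
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * x_s * w_s
         for w_row, w_s in zip(w_q, w_scales)]
        for x_row, x_s in zip(x_q, x_scales)
    ]


random.seed(0)
weights = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]  # 8 output channels
tokens = [[random.gauss(0, 1) for _ in range(32)] for _ in range(4)]   # 4 tokens

w_q, w_scales = quantize_rows(weights)  # static: done once, offline
y_fp8 = fp8_linear(tokens, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(t, w)) for w in weights] for t in tokens]

err = math.sqrt(sum((a - b) ** 2 for ra, rb in zip(y_fp8, y_ref) for a, b in zip(ra, rb)))
ref = math.sqrt(sum(b ** 2 for rb in y_ref for b in rb))
print(f"relative error of the simulated FP8 linear layer: {err / ref:.4f}")
```

Because the activation scale is recomputed for every token, no calibration data is needed, which is consistent with the creation recipe not using a calibration dataset.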

Deployment

This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details.
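As an illustration of OpenAI-compatible serving, the sketch below builds a chat completion request using only the Python standard library. It assumes a server has already been started locally, for example with `vllm serve RedHatAI/Qwen3-32B-FP8-dynamic`, and that it listens on vLLM's default port 8000. The helper names `build_chat_request` and `chat` are hypothetical, and the sampling values simply mirror those in the usage example.

```python
import json
import urllib.request

# Assumption: a local vLLM server on the default port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "RedHatAI/Qwen3-32B-FP8-dynamic"


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON payload for the /chat/completions endpoint."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one user message and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Requires a running server:
# print(chat("Give me a short introduction to large language model."))
```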

Creation

Creation details

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen3-32B"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using lighteval. vLLM was used for all evaluations.

Evaluation details

lm-evaluation-harness

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-32B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-32B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mgsm \
  --apply_chat_template \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-32B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

lighteval

lighteval_model_arguments.yaml

model_parameters:
  model_name: RedHatAI/Qwen3-32B-FP8-dynamic
  dtype: auto
  gpu_memory_utilization: 0.9
  tensor_parallel_size: 2
  max_model_length: 40960
  generation_parameters:
    temperature: 0.6
    top_k: 20
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 32768

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime24|0|0" \
  --use_chat_template=true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0|0" \
  --use_chat_template=true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|math_500|0|0" \
  --use_chat_template=true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|gpqa:diamond|0|0" \
  --use_chat_template=true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "extended|lcb:codegeneration" \
  --use_chat_template=true

Accuracy

| Category | Benchmark | Qwen3-32B | Qwen3-32B-FP8-dynamic (this model) | Recovery |
|---|---|---|---|---|
| OpenLLM v1 | MMLU (5-shot) | 80.96 | 80.89 | 99.9% |
| | ARC Challenge (25-shot) | 69.03 | 68.00 | 98.5% |
| | GSM-8K (5-shot, strict-match) | 87.64 | 88.32 | 100.8% |
| | Hellaswag (10-shot) | 71.10 | 71.44 | 100.5% |
| | Winogrande (5-shot) | 69.77 | 69.85 | 100.1% |
| | TruthfulQA (0-shot, mc2) | 58.63 | 59.13 | 100.9% |
| | Average | 72.86 | 72.94 | 100.1% |
| OpenLLM v2 | MMLU-Pro (5-shot) | 54.24 | 54.78 | 101.0% |
| | IFEval (0-shot) | 86.23 | 86.23 | 100.0% |
| | BBH (3-shot) | 44.29 | 43.70 | 98.7% |
| | Math-lvl-5 (4-shot) | 54.61 | 57.26 | 104.9% |
| | GPQA (0-shot) | 5.53 | 5.46 | --- |
| | MuSR (0-shot) | 7.85 | 8.81 | --- |
| | Average | 42.13 | 42.71 | 101.4% |
| Multilingual | MGSM (0-shot) | 32.57 | | |
| Reasoning (generation) | AIME 2024 | 79.37 | 79.37 | 100.0% |
| | AIME 2025 | 71.77 | 70.42 | 98.1% |
| | GPQA diamond | 66.67 | 68.69 | 103.0% |
| | Math-lvl-5 | 96.20 | 96.40 | 100.2% |
| | LiveCodeBench | 62.45 | 63.32 | 101.4% |
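The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. That definition is an inference from the numbers rather than something the document states, and `recovery` is a hypothetical helper name. A short sketch that reproduces a few rows under that assumption:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, to one decimal."""
    return round(100.0 * quantized / baseline, 1)


# (baseline, quantized) pairs copied from the accuracy table
scores = {
    "MMLU (5-shot)": (80.96, 80.89),
    "ARC Challenge (25-shot)": (69.03, 68.00),
    "GSM-8K (5-shot, strict-match)": (87.64, 88.32),
    "AIME 2025": (71.77, 70.42),
}

for benchmark, (baseline, quantized) in scores.items():
    print(f"{benchmark}: {recovery(baseline, quantized)}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that benchmark.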

📄 License

The model is released under the Apache-2.0 license.
