Qwen3 8B FP8 Dynamic

Developed by RedHatAI

Qwen3-8B-FP8-dynamic is an FP8-quantized version of Qwen3-8B. Quantization roughly halves GPU memory requirements and disk usage while preserving the original model's accuracy.

Large Language Model

Transformers

Open Source License:Apache-2.0 #FP8 quantization #Multilingual generation #Efficient inference

Release Time: 5/2/2025

Model Overview

This model was obtained by quantizing the activations and weights of Qwen3-8B to the FP8 data type. It is suitable for tasks such as reasoning, function calling, and multilingual instruction following.

Model Features

FP8 quantization

FP8 quantization reduces GPU memory requirements and disk usage by approximately 50% each, and increases matrix-multiply compute throughput by approximately 2x.

Efficient inference

The quantized model preserves the accuracy of the original: average recovery is 101.0% on OpenLLM v1 and 103.0% on OpenLLM v2, and some tasks score slightly above the baseline.

Multilingual support

Supports multilingual instruction following and translation tasks, suitable for international application scenarios.

Model Capabilities

Text generation

Function calling

Multilingual instruction following

Translation

Use Cases

General AI assistant

Intelligent Q&A

Answers various user questions, providing accurate information and advice.

Achieved an average recovery rate of 101.0% in the OpenLLM v1 benchmark

Education

Math problem solving

Solves complex math problems, providing detailed solution steps.

Scored 51.90 in the Math-lvl-5 test

Business applications

Multilingual customer service

Provides multilingual customer support, understanding and responding to customer inquiries.

Scored 25.80 in the MGSM multilingual test

library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- FP8

Qwen3-8B-FP8-dynamic

Model Overview

Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Activation quantization: FP8
- Weight quantization: FP8
Intended Use Cases:
- Reasoning.
- Function calling.
- Subject matter experts via fine-tuning.
- Multilingual instruction following.
- Translation.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
Release Date: 05/02/2025
Version: 1.0
Model Developers: RedHat (Neural Magic)

Model Optimizations

This model was obtained by quantizing activations and weights of Qwen3-8B to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
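As a rough back-of-the-envelope check of the memory claim, the sketch below estimates weight storage at 16 and 8 bits per parameter. The parameter count of 8.2 billion is an assumption for illustration (this card does not state it), and the estimate ignores quantization scales, layers left unquantized such as `lm_head`, and the KV cache, so the real saving will be somewhat below 50%.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_param // 8


NUM_PARAMS = 8_200_000_000  # assumed parameter count, for illustration only

bf16_gib = weight_bytes(NUM_PARAMS, 16) / 2**30
fp8_gib = weight_bytes(NUM_PARAMS, 8) / 2**30

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f" 8-bit weights: {fp8_gib:.1f} GiB")
print(f"reduction:      {1 - fp8_gib / bf16_gib:.0%}")
```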

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
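To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric quantization with one scale per row. It is not the llm-compressor implementation. It assumes the FP8 E4M3 format (largest finite value 448) and assumes "per-channel" means one scale per output row of the weight matrix; the function names are hypothetical. For weights, the scales would be computed once and stored (static). For activations, the same routine would run on every forward pass with one row per token (dynamic).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed format)


def round_to_fp8_e4m3(x: float) -> float:
    """Simulate rounding a float to the nearest FP8 E4M3 value."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate out-of-range values
    _, exp = math.frexp(abs(x))   # abs(x) = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)        # floor(log2|x|), clamped to the smallest normal exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits give 8 steps per power of two
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization: one scale per row, no zero point."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_fp8_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map FP8 values back to the original range."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]


weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -4.0]]  # toy values
q_weights, w_scales = quantize_rows(weights)
print(dequantize_rows(q_weights, w_scales))
```

Because each row gets its own scale, a single large outlier only reduces the resolution of its own channel or token rather than of the whole tensor.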

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-8B-FP8-dynamic"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a single-turn chat and render it with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
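As an illustration of the OpenAI-compatible route, the sketch below posts a chat request to a locally running server using only the Python standard library. It assumes the server was started with `vllm serve RedHatAI/Qwen3-8B-FP8-dynamic` and is listening on the default local address `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen3-8B-FP8-dynamic"
BASE_URL = "http://localhost:8000/v1"  # assumed local server address


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a chat completion request."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """Send the request to the server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Give me a short introduction to large language model.")` returns the generated reply as a string.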

Creation

Creation details

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen3-8B"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.

Evaluation details

lm-evaluation-harness

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-8B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-8B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks mgsm \
  --apply_chat_template \
  --batch_size auto

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-8B-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

lighteval

lighteval_model_arguments.yaml

model_parameters:
  model_name: RedHatAI/Qwen3-8B-FP8-dynamic
  dtype: auto
  gpu_memory_utilization: 0.9
  max_model_length: 40960
  generation_parameters:
    temperature: 0.6
    top_k: 20
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 32768

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime24|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|math_500|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|gpqa:diamond|0|0" \
  --use_chat_template = true

lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "extended|lcb:codegeneration" \
  --use_chat_template = true

Accuracy

Category	Benchmark	Qwen3-8B	Qwen3-8B-FP8-dynamic (this model)	Recovery
OpenLLM v1	MMLU (5-shot)	71.95	72.30	100.5%
	ARC Challenge (25-shot)	61.69	61.60	99.9%
	GSM-8K (5-shot, strict-match)	75.97	80.52	106.0%
	Hellaswag (10-shot)	56.52	55.95	99.0%
	Winogrande (5-shot)	65.98	66.22	100.4%
	TruthfulQA (0-shot, mc2)	53.17	52.39	98.5%
	Average	64.21	64.83	101.0%
OpenLLM v2	MMLU-Pro (5-shot)	34.57	37.82	109.4%
	IFEval (0-shot)	84.77	84.56	99.8%
	BBH (3-shot)	25.47	27.20	106.8%
	Math-lvl-5 (4-shot)	51.05	51.90	101.7%
	GPQA (0-shot)	0.00	0.00	---
	MuSR (0-shot)	10.02	10.65	---
	Average	34.31	35.35	103.0%
Multilingual	MGSM (0-shot)	25.97	25.80	99.4%
Reasoning (generation)	AIME 2024	74.58	76.35	102.4%
	AIME 2025	65.21	63.75	97.8%
	GPQA diamond	58.59	61.11	104.3%
	Math-lvl-5	97.60	96.60	99.0%
	LiveCodeBench	56.27	56.60	100.6%
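
The Recovery column appears to be the quantized model's score expressed as a percentage of the baseline score. The card does not state the formula, so treat this as an assumption; the short sketch below reproduces the listed values for a few rows of the table.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# Scores copied from the table: (Qwen3-8B, Qwen3-8B-FP8-dynamic)
scores = {
    "MMLU (5-shot)": (71.95, 72.30),
    "GSM-8K (5-shot, strict-match)": (75.97, 80.52),
    "AIME 2025": (65.21, 63.75),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

A value above 100% means the quantized model scored higher than the baseline on that benchmark. The card does not say whether such differences exceed run-to-run variation.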
