
Qwen3 14B FP8 Dynamic

Developed by RedHatAI
Qwen3-14B-FP8-dynamic is an optimized large language model. Quantizing weights and activations to the FP8 data type substantially reduces GPU memory requirements and improves computational throughput.
Downloads: 167
Release date: 5/2/2025

Model Overview

This model is suited to scenarios such as reasoning, function calling, and multilingual instruction following, and relies on FP8 quantization to optimize performance and resource efficiency.
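
As an illustration, the snippet below is a minimal offline-inference sketch with vLLM. It assumes the checkpoint is published on Hugging Face as RedHatAI/Qwen3-14B-FP8-dynamic and that a vLLM build with FP8 (compressed-tensors) support is installed; the model ID and parameters are assumptions for illustration, not details taken from this page.

```python
# Minimal offline-inference sketch with vLLM (assumed setup: pip install vllm,
# a GPU with enough memory, and the Hugging Face ID below -- adjust as needed).
from vllm import LLM, SamplingParams

MODEL_ID = "RedHatAI/Qwen3-14B-FP8-dynamic"  # assumed checkpoint ID

llm = LLM(model=MODEL_ID)  # FP8 weights/activations are read from the checkpoint config
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Example prompt matching the "introduction to large language models" use case below.
prompts = ["Write a short introduction to large language models."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```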

Model Features

FP8 Quantization Optimization
Weights and activations are quantized to the FP8 data type, significantly reducing GPU memory and disk space requirements (a rough sizing sketch follows this list).
Efficient Computation
FP8 quantization roughly doubles matrix-multiplication throughput.
Suitable for Multiple Scenarios
Supports scenarios such as reasoning, function calling, and multilingual instruction following.
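
To make the memory claim concrete, the following back-of-the-envelope sketch compares weight storage at BF16 versus FP8 for a 14B-parameter model. It ignores activations, the KV cache, and quantization metadata, so the numbers are rough estimates rather than measured figures.

```python
# Rough weight-memory estimate: bytes per parameter times parameter count.
# Ignores KV cache, activations, and per-tensor scale metadata (assumption).
PARAMS = 14e9  # approximate parameter count for a 14B model

def weight_gib(bytes_per_param: float) -> float:
    """Return approximate weight storage in GiB."""
    return PARAMS * bytes_per_param / 1024**3

print(f"BF16 weights: ~{weight_gib(2):.1f} GiB")  # ~26 GiB
print(f"FP8  weights: ~{weight_gib(1):.1f} GiB")  # ~13 GiB, roughly half
```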

Model Capabilities

Text Generation
Instruction Following
Function Calling
Multilingual Translation
Reasoning Tasks

Use Cases

Natural Language Processing
Generating an introduction to large language models: given a prompt such as "Write a short introduction to large language models," the model produces text that meets the requirements.
Multilingual Application
Multilingual instruction following: the model understands and executes instructions given in multiple languages and responds to them accurately (see the serving sketch after this section).
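
For the multilingual instruction-following use case, a common pattern is to serve the model with vLLM's OpenAI-compatible server and query it with the openai client. The sketch below assumes the same Hugging Face ID as above and a server already started locally, for example with "vllm serve RedHatAI/Qwen3-14B-FP8-dynamic --port 8000" (an assumed command line; adjust to your install).

```python
# Query a locally running vLLM OpenAI-compatible server (assumed to be started with
# something like: vllm serve RedHatAI/Qwen3-14B-FP8-dynamic --port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

# A non-English instruction to exercise multilingual instruction following.
resp = client.chat.completions.create(
    model="RedHatAI/Qwen3-14B-FP8-dynamic",  # assumed served model name
    messages=[{"role": "user", "content": "Fasse in zwei Sätzen zusammen, was ein Sprachmodell ist."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```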