Qwen3-4B-FP8 Open Source Large Language Model - Supports Mode Switching, Excellent Reasoning and Agent Capabilities

Qwen3 4B FP8

Developed by Qwen

Qwen3-4B-FP8 is the latest large language model in the Qwen series, offering a 4-billion-parameter FP8 quantized version that supports switching between thinking and non-thinking modes, excelling in reasoning, instruction following, and agent capabilities.

Large Language Model

Transformers

Open Source License: Apache-2.0 #Mode Switching #Multilingual Reasoning #Agent Tool Integration

Downloads 23.95k

Release Time: 4/28/2025

Model Overview

A causal language model trained on large-scale data, supporting complex logical reasoning, mathematical calculations, programming, and multilingual tasks, with strong text generation and agent capabilities.

Model Features

Dual Mode Switching

Supports seamless switching between thinking mode (complex reasoning) and non-thinking mode (efficient dialogue), controlled via the enable_thinking parameter or /think, /no_think commands.

Enhanced Reasoning

Outperforms previous models in mathematics, code generation, and commonsense logical reasoning, especially suitable for tasks requiring step-by-step reasoning.

FP8 Quantization

Provides a fine-grained FP8 quantized version with a block size of 128, maintaining performance while reducing GPU memory requirements.

Extended Context Support

Natively supports 32,768 tokens, extendable to 131,072 tokens via YaRN.

Agent Integration

Optimized for tool calling, seamlessly integrates with the Qwen-Agent framework for complex agent tasks.

Model Capabilities

Complex Logical Reasoning

Mathematical Calculations

Code Generation

Multi-turn Dialogue

Multilingual Translation

Tool Calling

Creative Writing

Role-playing

Use Cases

Education & Research

Math Problem Solving

Solves math competition problems step-by-step with detailed derivations.

Excels in benchmarks like GSM8K.

Programming Tutorial

Generates executable code from natural language descriptions and explains implementation logic.

Supports multiple programming languages like Python.

Business Applications

Multilingual Customer Service

Handles customer inquiries in 100+ languages with localized responses.

Reduces manual workload for customer support.

Smart Assistant

Integrates external tools to complete complex tasks like booking and queries.

Automates workflows via Qwen-Agent.

Content Creation

Creative Writing

Generates literary works like poems and stories tailored to specific styles.

Produces natural, fluent, and creative outputs.

Role-playing

Maintains character consistency for multi-turn interactive dialogues.

Provides immersive interaction experiences.

🚀 Qwen3-4B-FP8

Qwen3-4B-FP8 is an FP8 version of the Qwen3-4B large language model, offering advanced reasoning, multilingual support, and agent capabilities.

🚀 Quick Start

The code of Qwen3 has been integrated into the latest Hugging Face transformers. We strongly recommend using the latest version of transformers.

If you use transformers<4.51.0, you'll encounter the following error:

KeyError: 'qwen3'

Here is a code snippet demonstrating how to use the model to generate content based on given inputs:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-FP8"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint:

SGLang:

python -m sglang.launch_server --model-path Qwen/Qwen3-4B-FP8 --reasoning-parser qwen3

vLLM:

vllm serve Qwen/Qwen3-4B-FP8 --enable-reasoning --reasoning-parser deepseek_r1

For local use, applications like Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

✨ Features

Qwen3 Highlights

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Through extensive training, Qwen3 achieves groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique Support for Seamless Mode Switching: It allows seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant Enhancement in Reasoning Capabilities: It surpasses previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) in mathematics, code generation, and commonsense logical reasoning.
- Superior Human Preference Alignment: It excels in creative writing, role-playing, multi-turn dialogues, and instruction following, delivering a more natural, engaging, and immersive conversational experience.
- Expertise in Agent Capabilities: It enables precise integration with external tools in both thinking and non-thinking modes and achieves leading performance among open-source models in complex agent-based tasks.
- Multilingual Support: It supports over 100 languages and dialects, with strong capabilities for multilingual instruction following and translation.

📦 Installation

To use Qwen3-4B-FP8, install a recent version of transformers (4.51.0 or later; older versions fail with KeyError: 'qwen3'). You can install or upgrade it using the following command:

pip install --upgrade transformers
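
To confirm that an installed version is new enough before loading the model, a small check like the following can help. This is a minimal sketch: `parse_version` and `supports_qwen3` are hypothetical helper names, the 4.51.0 threshold comes from the KeyError note in the Quick Start, and pre-release suffixes such as `.dev0` are simply ignored.

```python
import re

# Minimum transformers version with Qwen3 support, per the Quick Start note.
MIN_QWEN3_VERSION = (4, 51, 0)


def parse_version(version: str) -> tuple:
    """Return the first three numeric components, e.g. '4.51.0.dev0' -> (4, 51, 0)."""
    numbers = [int(part) for part in re.findall(r"\d+", version.split("+")[0])[:3]]
    while len(numbers) < 3:
        numbers.append(0)  # pad short versions such as '4.51'
    return tuple(numbers)


def supports_qwen3(version: str) -> bool:
    """True if the given transformers version string is at least 4.51.0."""
    return parse_version(version) >= MIN_QWEN3_VERSION
```

In practice you would pass `transformers.__version__` to `supports_qwen3` and upgrade if it returns False.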

💻 Usage Examples

Basic Usage

The above quick start code is a basic usage example, showing how to load the model and generate text based on given inputs.

Advanced Usage

Switching Between Thinking and Non-Thinking Modes

from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-4B-FP8"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="auto"
        )
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # Move the inputs to the same device as the model before generating
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}") 
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")

Agentic Use

from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-4B-FP8',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #         # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #         # Do not add: When the response has been separated by reasoning_content and content.
    #         'thought_in_content': True,
    #     },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

📚 Documentation

Model Overview

This repo contains the FP8 version of Qwen3-4B, with the following features:

Property	Details
Model Type	Causal Language Models
Training Stage	Pretraining & Post-training
Number of Parameters	4.0B
Number of Parameters (Non-Embedding)	3.6B
Number of Layers	36
Number of Attention Heads (GQA)	32 for Q and 8 for KV
Context Length	32,768 natively and 131,072 tokens with YaRN

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Note on FP8

For convenience and performance, we provide a fp8-quantized model checkpoint for Qwen3, named with -FP8 at the end. The quantization method is fine-grained fp8 quantization with a block size of 128. You can find more details in the quantization_config field in config.json.

You can use the Qwen3-4B-FP8 model with several inference frameworks, including transformers, sglang, and vllm, just like the original bfloat16 model. However, please note the following known issues:

transformers:
- Currently, there are issues with the "fine-grained fp8" method in transformers for distributed inference. You may need to set the environment variable CUDA_LAUNCH_BLOCKING=1 if multiple devices are used in inference.

Switching Between Thinking and Non-Thinking Mode

⚠️ Important Note

The enable_thinking switch is also available in APIs created by SGLang and vLLM. Please refer to our documentation for SGLang and vLLM users.
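
As an illustration of how the switch might be passed to an OpenAI-compatible endpoint, the sketch below builds a chat completion request with the standard library only. The `chat_template_kwargs` field name, the helper names, and the `http://localhost:8000/v1` base URL are assumptions for illustration; check the SGLang or vLLM documentation for the exact request format your server version expects.

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=True, model="Qwen/Qwen3-4B-FP8"):
    """Build a chat completion payload (hypothetical helper).

    Assumption: the server forwards `chat_template_kwargs` to the chat template,
    which is how `enable_thinking` would reach the tokenizer.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to an OpenAI-compatible server and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send_chat_request(build_chat_request("Hello", enable_thinking=False))` requires a running server; building the payload alone does not.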

`enable_thinking=True`

By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting enable_thinking=True or leaving it as the default value in tokenizer.apply_chat_template, the model will enter thinking mode.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)

In this mode, the model will generate thinking content wrapped in a <think>...</think> block, followed by the final response.
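
When you only have the decoded text rather than token IDs, the <think>...</think> block can be separated from the final response with plain string handling. The following is a minimal sketch; `split_thinking` is a hypothetical helper, and it splits on the last `</think>` tag, mirroring the reverse-index logic of the Quick Start code.

```python
def split_thinking(text: str):
    """Split decoded output into (thinking, answer).

    If no closing </think> tag is present, the whole text is treated as the answer.
    """
    head, tag, tail = text.rpartition("</think>")
    if not tag:
        return "", text.strip()
    # Drop the opening tag, if present, and surrounding whitespace.
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()
```

For example, `split_thinking("<think>\nstep 1\n</think>\n\n42")` separates the reasoning text `step 1` from the answer `42`.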

⚠️ Important Note

For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.

`enable_thinking=False`

We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)

In this mode, the model will not generate any thinking content and will not include a <think>...</think> block.

⚠️ Important Note

For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0. For more detailed guidance, please refer to the Best Practices section.
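
The two sets of recommended sampling parameters can be kept in one place and selected by mode. The sketch below is one way to do that; `sampling_kwargs` is a hypothetical helper, and it assumes the keyword names accepted by `model.generate` in recent transformers versions (`temperature`, `top_p`, `top_k`, `min_p`, `do_sample`). `do_sample=True` is set so that greedy decoding is not used.

```python
def sampling_kwargs(enable_thinking: bool) -> dict:
    """Return the sampling settings suggested above for the chosen mode."""
    if enable_thinking:
        params = {"temperature": 0.6, "top_p": 0.95}
    else:
        params = {"temperature": 0.7, "top_p": 0.8}
    # Shared settings for both modes; sampling must be on to avoid greedy decoding.
    params.update({"top_k": 20, "min_p": 0.0, "do_sample": True})
    return params
```

A call would then look like `model.generate(**model_inputs, max_new_tokens=32768, **sampling_kwargs(True))`, using the same value that was passed as `enable_thinking` to `apply_chat_template`.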

Processing Long Texts

Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the YaRN method.
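
The scaling factor follows from the two context lengths: 131,072 / 32,768 = 4. The sketch below computes it and assembles a `rope_scaling` entry of the kind used in a model's config.json. The helper name is hypothetical and the field names (`rope_type`, `factor`, `original_max_position_embeddings`) are an assumption based on the common transformers convention, so verify them against the documentation of your inference framework before use.

```python
NATIVE_CONTEXT = 32_768  # native context length stated above


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT):
    """Return a YaRN rope_scaling dict, or None if no scaling is needed."""
    if target_context <= native_context:
        return None  # the native window already covers this length
    return {
        "rope_type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }
```

Since the text recommends scaling only when the total length significantly exceeds the native limit, the helper returns None for shorter targets rather than enabling YaRN unconditionally.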

📄 License

This project is licensed under the Apache-2.0 license.

💡 Usage Tip

If you encounter significant endless repetitions, please refer to the Best Practices section for optimal sampling parameters, and set the presence_penalty to 1.5.
