🚀 Qwen3-0.6B
Qwen3-0.6B is a powerful causal language model in the Qwen series, offering advanced reasoning, instruction-following, and multilingual support capabilities.
🚀 Quick Start
The code for Qwen3 has been integrated into the latest Hugging Face `transformers`, and we recommend using the latest version.

With `transformers<4.51.0`, you will encounter the following error:

```
KeyError: 'qwen3'
```
The following code snippet demonstrates how to use the model to generate content based on given inputs:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse out the thinking content; 151668 is the token id of </think>
try:
    # Find the last occurrence of </think> in the output
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
For deployment, you can use `vllm>=0.8.5` or `sglang>=0.4.5.post2` to create an OpenAI-compatible API endpoint:
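For instance, launch commands along these lines should work; the reasoning-parser flags are assumptions that vary across releases, so verify them against the vLLM and SGLang docs for your installed versions:

```shell
# vLLM: serve the model and parse <think> blocks into a separate reasoning field
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1

# SGLang: launch an equivalent OpenAI-compatible server
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
```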
✨ Features
Qwen3 Highlights
- Unique seamless switching: Support seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Enhanced reasoning capabilities: Significantly surpass previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment: Excel in creative writing, role-playing, multi-turn dialogues, and instruction following, delivering a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities: Enable precise integration with external tools in both thinking and non-thinking modes and achieve leading performance among open-source models in complex agent-based tasks.
- Multilingual support: Support 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
📚 Documentation
Model Overview
| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Stage | Pretraining & Post-training |
| Number of Parameters | 0.6B |
| Number of Parameters (Non-Embedding) | 0.44B |
| Number of Layers | 28 |
| Number of Attention Heads (GQA) | 16 for Q and 8 for KV |
| Context Length | 32,768 tokens |
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
Switching Between Thinking and Non-Thinking Mode
⚠️ Important Note
The `enable_thinking` switch is also available in APIs created by vLLM and SGLang. Please refer to our documentation for more details.
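For example, with a vLLM-served endpoint the switch can typically be passed per request via `chat_template_kwargs`; the endpoint URL below is a placeholder, and this parameter pass-through should be verified against your server version:

```python
from openai import OpenAI

# Placeholder: a local OpenAI-compatible server (e.g. started with vLLM or SGLang)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # disable thinking for this request
)
print(response.choices[0].message.content)
```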
Basic Usage
`enable_thinking=True`
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting `enable_thinking=True` or leaving it as the default value in `tokenizer.apply_chat_template`, the model will engage its thinking mode.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```
In this mode, the model will generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
⚠️ Important Note
For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.
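As a minimal sketch, these presets map directly onto keyword arguments of `model.generate` in `transformers` (sampling must be enabled explicitly):

```python
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```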
`enable_thinking=False`
We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```
In this mode, the model will not generate any thinking content and will not include a `<think>...</think>` block.
⚠️ Important Note
For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`; these drop into the same `generate` call sketched above. For more detailed guidance, please refer to the Best Practices section.
Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input
We provide a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update the conversation history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```
⚠️ Important Note
For API compatibility, when `enable_thinking=True`, regardless of whether the user uses `/think` or `/no_think`, the model will always output a block wrapped in `<think>...</think>`. However, the content inside this block may be empty if thinking is disabled. When `enable_thinking=False`, the soft switches are not valid: regardless of any `/think` or `/no_think` tags input by the user, the model will not generate thinking content and will not include a `<think>...</think>` block.
Agentic Use
Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.
```python
from qwen_agent.agents import Assistant

llm_cfg = {
    'model': 'Qwen3-0.6B',
    # Placeholder: your own OpenAI-compatible endpoint (e.g. served by vLLM or SGLang)
    'model_server': 'http://localhost:8000/v1',
    'api_key': 'EMPTY',
}
```
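A minimal sketch of how the agent might then be assembled and run, following Qwen-Agent's `Assistant` interface; the MCP server entry and the user query are illustrative assumptions:

```python
# Define tools: an MCP configuration plus a built-in tool (illustrative)
tools = [
    {'mcpServers': {
        'time': {'command': 'uvx', 'args': ['mcp-server-time']},
    }},
    'code_interpreter',  # built-in Qwen-Agent tool
]

# Assemble the agent and stream a response
bot = Assistant(llm=llm_cfg, function_list=tools)
messages = [{'role': 'user', 'content': 'What time is it in UTC?'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```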
📄 License
This model is licensed under the Apache 2.0 License.