🚀 Qwen3-0.6B
Qwen3-0.6B is a powerful causal language model in the Qwen series, offering advanced reasoning, instruction-following, and multilingual support capabilities.
🚀 Quick Start
The code for Qwen3 has been integrated into the latest Hugging Face `transformers`, and we recommend using the latest version.

With `transformers<4.51.0`, you will encounter the following error:

```
KeyError: 'qwen3'
```
The following code snippet demonstrates how to use the model to generate content based on given inputs:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse out the thinking content; 151668 is the token id of </think>
try:
    # Find the last occurrence of </think> in the output
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
For deployment, you can use `vllm>=0.8.5` or `sglang>=0.4.5.post2` to create an OpenAI-compatible API endpoint:
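For instance, launch commands along these lines should work; the reasoning-parser flags are assumptions that vary across releases, so verify them against the vLLM and SGLang docs for your installed versions:

```shell
# vLLM: serve the model and parse <think> blocks into a separate reasoning field
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1

# SGLang: launch an equivalent OpenAI-compatible server
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
```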
✨ Features
Qwen3 Highlights
- Unique seamless switching: Support seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Enhanced reasoning capabilities: Significantly surpass previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment: Excel in creative writing, role-playing, multi-turn dialogues, and instruction following, delivering a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities: Enable precise integration with external tools in both thinking and non-thinking modes and achieve leading performance among open-source models in complex agent-based tasks.
- Multilingual support: Support 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
📚 Documentation
Model Overview
| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Stage | Pretraining & Post-training |
| Number of Parameters | 0.6B |
| Number of Parameters (Non-Embedding) | 0.44B |
| Number of Layers | 28 |
| Number of Attention Heads (GQA) | 16 for Q and 8 for KV |
| Context Length | 32,768 tokens |
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
Switching Between Thinking and Non-Thinking Mode
⚠️ Important Note
The `enable_thinking` switch is also available in APIs created by vLLM and SGLang. Please refer to our documentation for more details.
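For example, with a vLLM-served endpoint the switch can typically be passed per request via `chat_template_kwargs`; the endpoint URL below is a placeholder, and this parameter pass-through should be verified against your server version:

```python
from openai import OpenAI

# Placeholder: a local OpenAI-compatible server (e.g. started with vLLM or SGLang)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # disable thinking for this request
)
print(response.choices[0].message.content)
```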
Basic Usage
`enable_thinking=True`
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting `enable_thinking=True` or leaving it as the default value in `tokenizer.apply_chat_template`, the model will engage its thinking mode.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```
In this mode, the model will generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
⚠️ Important Note
For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.
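As a minimal sketch, these presets map directly onto keyword arguments of `model.generate` in `transformers` (sampling must be enabled explicitly):

```python
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```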
`enable_thinking=False`
We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```
In this mode, the model will not generate any thinking content and will not include a `<think>...</think>` block.
⚠️ Important Note
For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`; these drop into the same `generate` call sketched above. For more detailed guidance, please refer to the Best Practices section.
Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input
We provide a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update the conversation history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```
⚠️ Important Note
For API compatibility, when `enable_thinking=True`, regardless of whether the user uses `/think` or `/no_think`, the model will always output a block wrapped in `<think>...</think>`. However, the content inside this block may be empty if thinking is disabled. When `enable_thinking=False`, the soft switches are not valid: regardless of any `/think` or `/no_think` tags input by the user, the model will not generate thinking content and will not include a `<think>...</think>` block.
Agentic Use
Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.
```python
from qwen_agent.agents import Assistant

llm_cfg = {
    'model': 'Qwen3-0.6B',
    # Placeholder: your own OpenAI-compatible endpoint (e.g. served by vLLM or SGLang)
    'model_server': 'http://localhost:8000/v1',
    'api_key': 'EMPTY',
}
```
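A minimal sketch of how the agent might then be assembled and run, following Qwen-Agent's `Assistant` interface; the MCP server entry and the user query are illustrative assumptions:

```python
# Define tools: an MCP configuration plus a built-in tool (illustrative)
tools = [
    {'mcpServers': {
        'time': {'command': 'uvx', 'args': ['mcp-server-time']},
    }},
    'code_interpreter',  # built-in Qwen-Agent tool
]

# Assemble the agent and stream a response
bot = Assistant(llm=llm_cfg, function_list=tools)
messages = [{'role': 'user', 'content': 'What time is it in UTC?'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```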
📄 License
This model is licensed under the Apache 2.0 License.