Qwen3-30B-A3B - llamafile
Mozilla packaged the Qwen 3 models into executable weights (llamafiles), offering an easy and fast way to use the model on multiple systems.
🚀 Quick Start
To get started, you need both the Qwen 3 weights and the llamafile software. Both are included in a single file, which can be downloaded and run as follows:
wget https://huggingface.co/Mozilla/Qwen3-30B-A3B-llamafile/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
chmod +x Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
The default mode of operation for these llamafiles is our new command line chatbot interface.
✨ Features
Llamafile Features
- Mozilla packaged the Qwen 3 models into llamafiles, providing an easy and fast way to use the model on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD systems, on both AMD64 and ARM64.
- The default mode is a command-line chatbot interface; a web GUI (--server mode) and an advanced CLI mode (--cli flag) are also supported.
Qwen3 Features
- Seamless Mode Switching: Uniquely supports seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model.
- Enhanced Reasoning: Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Human Preference Alignment: Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following.
- Agent Capabilities: Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes, and achieving leading performance among open-source models in complex agent-based tasks.
- Multilingual Support: Supports 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
📦 Installation
Llamafile Installation
wget https://huggingface.co/Mozilla/Qwen3-30B-A3B-llamafile/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
chmod +x Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
Qwen3 Installation
The code for Qwen3-MoE has been merged into the latest Hugging Face transformers, and we recommend using the latest version of transformers; with versions earlier than 4.51.0, loading the model fails with KeyError: 'qwen3_moe'.
💻 Usage Examples
Llamafile Usage
Basic Usage
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
This runs the llamafile in the default command-line chatbot interface.
Advanced Usage - Web GUI
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile --server
This opens a tab with a chatbot and completion interface in your browser.
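In --server mode, llamafile also exposes an OpenAI-compatible HTTP endpoint. Here is a minimal client sketch, assuming the server's default listen address of localhost:8080 (an assumption; check the server's startup output for the actual address):
import json, urllib.request

# Sketch: query the llamafile server's OpenAI-compatible chat endpoint.
payload = {
    "model": "Qwen3-30B-A3B",  # illustrative name; the server hosts a single model
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])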
Advanced Usage - CLI Mode
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile --cli -p 'four score and seven' --log-disable
This is useful for shell scripting.
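For instance, a script can capture the model's completion with Python's subprocess module (a sketch, assuming the llamafile sits in the current directory; the flags are the same ones shown above):
import subprocess

# Sketch: drive the llamafile CLI mode from a script and capture its output.
result = subprocess.run(
    ["./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile",
     "--cli", "-p", "four score and seven", "--log-disable"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)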
Qwen3 Usage
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # toggles thinking mode; True is the default
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse out the thinking content; 151668 is the token id of </think>
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
Advanced Usage - Deployment
SGLang:
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --reasoning-parser qwen3
vLLM:
vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1
Advanced Usage - Switching Modes
Thinking Mode
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True
)
Non-Thinking Mode
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
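To see what the flag actually changes, you can render the same conversation both ways and compare the resulting prompt strings (a small sketch reusing tokenizer and messages from the basic usage example above):
# Render the same messages with thinking enabled and disabled.
for flag in (True, False):
    rendered = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=flag,
    )
    print(f"--- enable_thinking={flag} ---")
    print(rendered)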
Dynamic Mode Switching
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-30B-A3B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # update history so /think and /no_think apply turn by turn
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})
        return response

if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (thinking mode enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
The /think and /no_think tags act as soft switches; in multi-turn conversations, the model follows the most recent instruction.
📚 Documentation
Llamafile
If you have trouble using llamafile, see the "Gotchas" section of the README.
Qwen3
For more details about Qwen3, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
🔧 Technical Details
Llamafile
- Linux: To avoid run-detector errors, install the APE interpreter:
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
- Windows: There's a 4GB limit on executable sizes.
- GPU Acceleration: On GPUs with sufficient RAM, pass the -ngl 999 flag to use the system's NVIDIA or AMD GPU(s). On Windows, if you have an AMD GPU, install the ROCm SDK v6.1 and pass the flags --recompile --gpu amd the first time you run your llamafile.
Qwen3
Qwen3-30B-A3B features:

| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Stage | Pretraining & Post-training |
| Number of Parameters | 30.5B in total and 3.3B activated |
| Number of Parameters (Non-Embedding) | 29.9B |
| Number of Layers | 48 |
| Number of Attention Heads (GQA) | 32 for Q and 4 for KV |
| Number of Experts | 128 |
| Number of Activated Experts | 8 |
| Context Length | 32,768 natively and 131,072 tokens with YaRN |
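To reach the 131,072-token context, the upstream Qwen3 model cards describe enabling YaRN rope scaling in the model configuration. The following is a hedged sketch that overrides the config at load time; the rope_scaling keys follow the upstream Qwen documentation, so verify the exact spelling there before relying on this:
from transformers import AutoModelForCausalLM

# Sketch: enable YaRN scaling for long contexts (assumption: keys per Qwen docs).
# factor = 131072 / 32768 = 4.0
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)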
📄 License
The project uses the Apache-2.0 license. For more details, see LICENSE.
⚠️ Important Note
For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
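In transformers, these recommendations map onto standard sampling arguments to generate; a minimal thinking-mode sketch (the keyword names below are standard transformers generation parameters, not spellings taken from this card):
# Sketch: apply the recommended thinking-mode sampling settings.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,   # sampling on; greedy decoding is explicitly discouraged
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)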
💡 Usage Tip
The enable_thinking switch is also available in APIs created by SGLang and vLLM. Please refer to our documentation for SGLang and vLLM users.
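As an illustration, with an OpenAI-compatible client pointed at a vLLM or SGLang server, the flag can typically be passed through chat_template_kwargs; the request field and the port below follow the Qwen documentation and common server defaults, so treat them as assumptions:
from openai import OpenAI

# Sketch: disable thinking mode through an OpenAI-compatible endpoint.
# base_url assumes vLLM's default port 8000; SGLang defaults to 30000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Give me a one-line greeting."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)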