Qwen3-30B-A3B - llamafile
Mozilla packaged the Qwen 3 models into executable weights (llamafiles), offering an easy and fast way to use the model on multiple systems.
🚀 Quick Start
To get started, you need both the Qwen 3 weights and the llamafile software. Both are included in a single file, which can be downloaded and run as follows:
wget https://huggingface.co/Mozilla/Qwen3-30B-A3B-llamafile/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
chmod +x Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
The default mode of operation for these llamafiles is our new command line chatbot interface.
✨ Features
Llamafile Features
- Mozilla packaged the Qwen 3 models into llamafiles, providing an easy and fast way to use the model on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD systems, on both AMD64 and ARM64.
- The default mode is a command-line chatbot interface; a web GUI (--server mode) and an advanced CLI mode (--cli flag) are also supported.
Qwen3 Features
- Seamless Mode Switching: Uniquely supports seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model.
- Enhanced Reasoning: Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Human Preference Alignment: Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following.
- Agent Capabilities: Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes, and achieving leading performance among open-source models in complex agent-based tasks.
- Multilingual Support: Supports 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
📦 Installation
Llamafile Installation
wget https://huggingface.co/Mozilla/Qwen3-30B-A3B-llamafile/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
chmod +x Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
Qwen3 Installation
The code for Qwen3-MoE has been merged into the latest Hugging Face transformers, and we recommend using the latest version of transformers; with versions earlier than 4.51.0, loading the model fails with KeyError: 'qwen3_moe'.
💻 Usage Examples
Llamafile Usage
Basic Usage
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile
This runs the llamafile in the default command-line chatbot interface.
Advanced Usage - Web GUI
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile --server
This opens a tab with a chatbot and completion interface in your browser.
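In --server mode, llamafile also exposes an OpenAI-compatible HTTP endpoint. Here is a minimal client sketch, assuming the server's default listen address of localhost:8080 (an assumption; check the server's startup output for the actual address):
import json, urllib.request

# Sketch: query the llamafile server's OpenAI-compatible chat endpoint.
payload = {
    "model": "Qwen3-30B-A3B",  # illustrative name; the server hosts a single model
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])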
Advanced Usage - CLI Mode
./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile --cli -p 'four score and seven' --log-disable
This is useful for shell scripting.
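For instance, a script can capture the model's completion with Python's subprocess module (a sketch, assuming the llamafile sits in the current directory; the flags are the same ones shown above):
import subprocess

# Sketch: drive the llamafile CLI mode from a script and capture its output.
result = subprocess.run(
    ["./Qwen_Qwen3-30B-A3B-Q4_K_M.llamafile",
     "--cli", "-p", "four score and seven", "--log-disable"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)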
Qwen3 Usage
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # toggles thinking mode; True is the default
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse out the thinking content; 151668 is the token id of </think>
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
Advanced Usage - Deployment
SGLang:
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --reasoning-parser qwen3
vLLM:
vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1
Advanced Usage - Switching Modes
Thinking Mode
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True
)
Non-Thinking Mode
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
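To see what the flag actually changes, you can render the same conversation both ways and compare the resulting prompt strings (a small sketch reusing tokenizer and messages from the basic usage example above):
# Render the same messages with thinking enabled and disabled.
for flag in (True, False):
    rendered = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=flag,
    )
    print(f"--- enable_thinking={flag} ---")
    print(rendered)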
Dynamic Mode Switching
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-30B-A3B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # update history so /think and /no_think apply turn by turn
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})
        return response

if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (thinking mode enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
The /think and /no_think tags act as soft switches; in multi-turn conversations, the model follows the most recent instruction.
📚 Documentation
Llamafile
If you have trouble using llamafile, see the "Gotchas" section of the README.
Qwen3
For more details about Qwen3, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
🔧 Technical Details
Llamafile
- Linux: To avoid run-detector errors, install the APE interpreter:
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
- Windows: There's a 4GB limit on executable sizes.
- GPU Acceleration: On GPUs with sufficient RAM, pass the -ngl 999 flag to use the system's NVIDIA or AMD GPU(s). On Windows, if you have an AMD GPU, install the ROCm SDK v6.1 and pass the flags --recompile --gpu amd the first time you run your llamafile.
Qwen3
Qwen3-30B-A3B features:

| Property | Details |
|----------|---------|
| Model Type | Causal Language Models |
| Training Stage | Pretraining & Post-training |
| Number of Parameters | 30.5B in total and 3.3B activated |
| Number of Parameters (Non-Embedding) | 29.9B |
| Number of Layers | 48 |
| Number of Attention Heads (GQA) | 32 for Q and 4 for KV |
| Number of Experts | 128 |
| Number of Activated Experts | 8 |
| Context Length | 32,768 natively and 131,072 tokens with YaRN |
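To reach the 131,072-token context, the upstream Qwen3 model cards describe enabling YaRN rope scaling in the model configuration. The following is a hedged sketch that overrides the config at load time; the rope_scaling keys follow the upstream Qwen documentation, so verify the exact spelling there before relying on this:
from transformers import AutoModelForCausalLM

# Sketch: enable YaRN scaling for long contexts (assumption: keys per Qwen docs).
# factor = 131072 / 32768 = 4.0
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)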
📄 License
The project uses the Apache-2.0 license. For more details, see LICENSE.
⚠️ Important Note
For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
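In transformers, these recommendations map onto standard sampling arguments to generate; a minimal thinking-mode sketch (the keyword names below are standard transformers generation parameters, not spellings taken from this card):
# Sketch: apply the recommended thinking-mode sampling settings.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,   # sampling on; greedy decoding is explicitly discouraged
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)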
💡 Usage Tip
The enable_thinking switch is also available in APIs created by SGLang and vLLM. Please refer to our documentation for SGLang and vLLM users.
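As an illustration, with an OpenAI-compatible client pointed at a vLLM or SGLang server, the flag can typically be passed through chat_template_kwargs; the request field and the port below follow the Qwen documentation and common server defaults, so treat them as assumptions:
from openai import OpenAI

# Sketch: disable thinking mode through an OpenAI-compatible endpoint.
# base_url assumes vLLM's default port 8000; SGLang defaults to 30000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Give me a one-line greeting."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)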