🚀 Qwen3-128k-30B-A3B-NEO-MAX-Imatrix-gguf
This is a GGUF NEO Imatrix model based on Qwen's "Qwen3-30B-A3B" Mixture of Experts model, extended to 128k context, offering various quant versions with unique features for different use cases.
✨ Features
- Multi-language Support: Supports languages such as English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Farsi, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, and Bengali.
- Unique Quant Construction: All quants can run on GPU and/or CPU/RAM only due to the model's unique construction, and several quant sizes come in multiple versions with special features.
- Extended Context: Extended to 128k (131072) context (up from 32k/32768) using YaRN, as per Qwen's tech notes.
- Optimized Dataset: The NEO Imatrix dataset was developed in-house after testing and evaluating over 50 Imatrix datasets, allowing the creation of low-bit quants that remain usable.
- Expert Activation: 8 of the model's 128 experts are active per token in these quants; which experts fire is selected automatically by the model's router based on the prompt/input content (a minimal routing sketch follows this list).
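For intuition on the expert activation, below is a minimal sketch of Mixtral/Qwen3-style top-k MoE routing; the names, shapes, and weights are illustrative placeholders, not Qwen's actual implementation:

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, top_k=8):
    """Toy top-k MoE router: pick 8 of 128 experts per token.

    hidden:      (n_tokens, d_model) token activations
    gate_weight: (n_experts, d_model) router projection (illustrative)
    """
    logits = hidden @ gate_weight.T                  # (n_tokens, n_experts)
    top_vals, top_idx = logits.topk(top_k, dim=-1)   # best 8 experts per token
    weights = F.softmax(top_vals, dim=-1)            # mixing weights for those 8
    return top_idx, weights

# Each token is sent only to its 8 selected experts; their outputs are combined
# with the softmax weights, so only a fraction of the 30B parameters is active.
hidden = torch.randn(4, 2048)
gate = torch.randn(128, 2048)
idx, w = route_tokens(hidden, gate)
print(idx.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```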
📚 Documentation
Model Introduction

This is a GGUF NEO Imatrix model of Qwen's new "Qwen3-30B-A3B" Mixture of Experts model, extended to 128k context using YaRN as per Qwen's tech notes. The NEO Imatrix dataset was developed in-house after extensive testing and evaluation of over 50 Imatrix datasets. It allows the creation of quants as low as IQ1_M that remain usable, and improves "regular" sized quants as well.
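For instance, a quant can run fully on GPU, fully on CPU/RAM, or split between the two; a minimal loading sketch with the llama-cpp-python bindings (the file name is hypothetical) looks like this:

```python
from llama_cpp import Llama

# Load a NEO Imatrix quant at the full 128k context window.
llm = Llama(
    model_path="Qwen3-128k-30B-A3B-NEO-MAX-IQ4_XS.gguf",  # hypothetical file name
    n_ctx=131072,      # extended (YaRN) context; lower this to save RAM/VRAM
    n_gpu_layers=-1,   # -1 = all layers on GPU, 0 = CPU/RAM only
)

out = llm("Briefly explain what a Mixture of Experts model is.", max_tokens=256)
print(out["choices"][0]["text"])
```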
Quant Details
- IQ1_M MAX / IQ1_M MAX PLUS and Higher Quants:
  - IQ1_M MAX / IQ1_M MAX PLUS are designed to use the least VRAM/RAM while remaining usable. More informative prompts are suggested with these quants to compensate for low-bit losses.
  - IQ1_M MAX PLUS has additional optimizations compared to IQ1_M MAX.
  - IQ2s are stronger than IQ1_Ms.
  - Q2_K/Q2_K_S are faster for CPU/RAM-only usage but perform below IQ2s.
  - Q3_Ks are slightly faster for CPU/RAM-only usage but behind IQ3s in performance.
  - IQ3s and higher show a large performance improvement; IQ4_XS/IQ4_NL are the peak for NEO Imatrix effects.
  - Q4s have high performance, but IQ4_XS/IQ4_NL may outperform them.
  - Q5s have very high performance.
  - Q6 has peak performance with minimal NEO Imatrix effects.
  - Q8s have excellent performance.
- Specialized Quants:
  - Some quants come in multiple versions, such as MAX, MAX PLUS, MAX PLUS 2, MAX SUPER, and MAX ULTRA, each with different optimizations.
  - IQ1_M (Plus), all IQ2s, and all IQ3s have the output tensor at Q8, the embedding at IQ4_XS, and additional minor adjustments in some expert tensors (the sketch after this list shows how to inspect per-tensor types in a GGUF).
  - Q8 MAX PLUS has its output tensor at IQ4_XS instead of Q8.
  - MAX ULTRA details per quant:
    - Q6 ULTRA MAX: expert layers/tensors 0-7 and 46-47 at 16 bits (F16), plus a 16-bit (F16) output tensor, optimized for high-speed CPU/GPU operation.
    - Q8 ULTRA MAX: same modifications as Q6 ULTRA MAX.
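To verify these per-tensor types in a downloaded file, the `gguf` Python package maintained in the llama.cpp repo can read them directly; the file name below is hypothetical:

```python
from gguf import GGUFReader  # pip install gguf

# Inspect per-tensor quantization types in a GGUF file, e.g. to confirm
# that the output tensor is Q8_0 and the token embedding is IQ4_XS.
reader = GGUFReader("Qwen3-128k-30B-A3B-NEO-MAX-IQ3_M.gguf")  # hypothetical name

for tensor in reader.tensors:
    if tensor.name in ("output.weight", "token_embd.weight"):
        print(f"{tensor.name}: {tensor.tensor_type.name}")
```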
Speed Comparison (GPU vs CPU)
The rough speed chart below lists each quant's tokens per second (T/s) on CPU/RAM only, the quant's file size, and its tokens per second on GPU only (blank where not measured).
| Quant | T/s on CPU/RAM | Size of quant | T/s on GPU only |
|---|---|---|---|
| Q2_K_S | 29 T/s | 10 GB | 83 T/s |
| Q2_K | 27 T/s | 10.5 GB | 72 T/s |
| IQ1_M | 22 T/s | 7 GB | 87 T/s |
| IQ2_XXS | 21 T/s | 8 GB | 76 T/s |
| IQ2_M | 20 T/s | 10 GB | 80 T/s |
| Q4_0 | 20 T/s | 17 GB | |
| Q3_K_S | 18 T/s | 12.9 GB | 70 T/s |
| Q5_0 | 17 T/s | 21 GB | |
| IQ3_M | 15 T/s | 13 GB | 75 T/s |
| ... | ... | ... | ... |
| Q8_0 | 8 T/s | 30 GB | |
| BF16 | 4 T/s | 60 GB | |
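To reproduce numbers like these on your own hardware, a crude tokens-per-second measurement can be taken with llama-cpp-python (the file name is hypothetical; llama.cpp's llama-bench tool gives more rigorous numbers):

```python
import time
from llama_cpp import Llama

# n_gpu_layers=0 measures CPU/RAM-only speed; n_gpu_layers=-1 measures
# GPU-only speed (if the whole model fits in VRAM).
llm = Llama(model_path="Qwen3-128k-30B-A3B-NEO-MAX-Q2_K_S.gguf",  # hypothetical
            n_ctx=8192, n_gpu_layers=0)

start = time.perf_counter()
out = llm("Write a short paragraph about llamas.", max_tokens=200)
elapsed = time.perf_counter() - start  # note: includes prompt processing time

n_tokens = out["usage"]["completion_tokens"]  # tokens actually generated
print(f"{n_tokens / elapsed:.1f} T/s")
```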
Operating Notes (all quants)
- A minimum context of 8k-16k is suggested.
- Temperatures of 1+ and 2+ work better with smaller quants and/or "creative" usage.
- Temperatures of .5 to .7 are best for reasoning with quants larger than IQ2 (IQ1s/IQ2s benefit from a slightly higher temperature for reasoning).
- A repetition penalty of 1.1 is suggested for IQ1 and IQ2 quants.
- The system role (examples below) should be used with all quants.
- The model uses the "default" Jinja template (embedded in the GGUFs) and/or the ChatML template.
Recommended Settings (all quants) - for usage with "Think" / "Reasoning"
temp: .5 to .8 (or 1.5, 2, 2+); rep pen: 1.02 (range: 1.02 to 1.12); rep pen range: 64; top_k: 80; top_p: .95; min_p: .05. Temperatures of 1+, 2+, and 3+ result in deeper, richer thoughts and better output. Model behavior may change with other parameters and/or samplers activated.
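Translated into code, these settings look roughly like the sketch below, using the llama-cpp-python bindings (the file name is hypothetical; the rep pen range of 64 is set in your front end's sampler settings where available, as it is not exposed through this high-level call):

```python
from llama_cpp import Llama

# Hypothetical file name; any quant of this model loads the same way.
llm = Llama(model_path="Qwen3-128k-30B-A3B-NEO-MAX-IQ4_XS.gguf", n_ctx=16384)

response = llm.create_chat_completion(
    messages=[
        {"role": "system",
         "content": "You are a helpful, smart, kind, and efficient AI assistant. "
                    "You always fulfill the user's requests to the best of your ability."},
        {"role": "user", "content": "Outline a 3-step plan for testing a parser."},
    ],
    temperature=0.6,      # .5 to .8 for reasoning; 1.5-2+ for creative use
    top_k=80,
    top_p=0.95,
    min_p=0.05,
    repeat_penalty=1.02,  # 1.02 to 1.12; ~1.1 suggested for IQ1/IQ2 quants
    max_tokens=2048,
)
print(response["choices"][0]["message"]["content"])
```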
System Role / System Prompts
The System Role / System Prompt is "root access" to the model, controlling internal workings, including instruction following, output generation, and reasoning.
How to Set
Depending on your AI "app", you may need to copy/paste one of the prompts below into the "System Prompt" or "System Role" field.
- In LM Studio, set/activate "Power User" or "Developer" mode and copy/paste the prompt into the System Prompt box.
- In SillyTavern, go to the "template page" ("A"), activate "system prompt", and enter the text in the prompt box.
- In Ollama, see https://github.com/ollama/ollama/blob/main/README.md for setting the "system message"; a minimal API sketch follows this list.
- In Koboldcpp, load the model, start it, go to settings, select "Llama 3 Chat"/"Command-R", and enter the text in the "sys prompt" box.
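For Ollama, a minimal sketch of setting the system prompt per request against a local server follows; it assumes the GGUF has already been imported into Ollama under the hypothetical name "qwen3-neo" (Ollama's Modelfile SYSTEM directive can bake the prompt in instead):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3-neo",  # hypothetical local model name
        "messages": [
            {"role": "system",
             "content": "You are a helpful, smart, kind, and efficient AI assistant."},
            {"role": "user",
             "content": "Summarize the benefits of MoE models in two sentences."},
        ],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```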
Available System Prompts
- Generic assistant:
You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.
- Deep thinking [reasoning on]:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
- Multi-TIERED [reasoning on]:
You are a deep thinking AI composed of 4 AIs - Spock, Wordsmith, Jamet and Saten - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem using your skillsets and critical instructions.
- Multi-TIERED - CREATIVE [reasoning on]:
Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.
As a deep thinking AI composed of 4 AIs - Spock, Wordsmith, Jamet and Saten - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem using your skillsets and critical instructions.
Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Sym
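When reasoning is on, the model encloses its deliberation in <think> </think> tags, as the prompts above instruct. Below is a minimal sketch for separating the reasoning from the final answer; the helper name is mine, not part of the model or any library:

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer) on <think> tags.

    Hypothetical helper: assumes at most one well-formed <think>...</think> block.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>First I consider the edge cases...</think>The answer is 42."
thoughts, answer = split_think(raw)
print(answer)  # "The answer is 42."
```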
Additional Information
For additional benchmarks, operating notes, turning reasoning on/off, and tech notes regarding usage, please see Qwen's repo: https://huggingface.co/Qwen/Qwen3-30B-A3B
📄 License
This model is licensed under the Apache-2.0 license.