🚀 Qwen3-128k-30B-A3B-NEO-MAX-Imatrix-gguf
This is a GGUF NEO Imatrix model based on Qwen's "Qwen3-30B-A3B" Mixture of Experts model, extended to 128k context, offering various quant versions with unique features for different use cases.
✨ Features
- Multi-language Support: Supports languages such as English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Farsi, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, and Bengali.
- Unique Quant Construction: All quants can run on GPU and/or CPU/RAM only due to the model's unique construction, and several quant sizes come in multiple versions with special features.
- Extended Context: Extended to 128k (131072) context (up from 32k/32768) using YaRN, as per Qwen's tech notes.
- Optimized Dataset: The NEO Imatrix dataset was developed in-house after testing and evaluating over 50 Imatrix datasets, allowing the creation of low-bit quants that remain usable.
- Expert Activation: 8 of the model's 128 experts are active per token in these quants; which experts fire is selected automatically by the model's router based on the prompt/input content (a minimal routing sketch follows this list).
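For intuition on the expert activation, below is a minimal sketch of Mixtral/Qwen3-style top-k MoE routing; the names, shapes, and weights are illustrative placeholders, not Qwen's actual implementation:

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, top_k=8):
    """Toy top-k MoE router: pick 8 of 128 experts per token.

    hidden:      (n_tokens, d_model) token activations
    gate_weight: (n_experts, d_model) router projection (illustrative)
    """
    logits = hidden @ gate_weight.T                  # (n_tokens, n_experts)
    top_vals, top_idx = logits.topk(top_k, dim=-1)   # best 8 experts per token
    weights = F.softmax(top_vals, dim=-1)            # mixing weights for those 8
    return top_idx, weights

# Each token is sent only to its 8 selected experts; their outputs are combined
# with the softmax weights, so only a fraction of the 30B parameters is active.
hidden = torch.randn(4, 2048)
gate = torch.randn(128, 2048)
idx, w = route_tokens(hidden, gate)
print(idx.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```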
📚 Documentation
Model Introduction

This is a GGUF NEO Imatrix model of Qwen's new "Qwen3-30B-A3B" Mixture of Experts model, extended to 128k context using YaRN as per Qwen's tech notes. The NEO Imatrix dataset was developed in-house after extensive testing and evaluation of over 50 Imatrix datasets. It allows the creation of quants as low as IQ1_M that remain usable, and improves "regular" sized quants as well.
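For instance, a quant can run fully on GPU, fully on CPU/RAM, or split between the two; a minimal loading sketch with the llama-cpp-python bindings (the file name is hypothetical) looks like this:

```python
from llama_cpp import Llama

# Load a NEO Imatrix quant at the full 128k context window.
llm = Llama(
    model_path="Qwen3-128k-30B-A3B-NEO-MAX-IQ4_XS.gguf",  # hypothetical file name
    n_ctx=131072,      # extended (YaRN) context; lower this to save RAM/VRAM
    n_gpu_layers=-1,   # -1 = all layers on GPU, 0 = CPU/RAM only
)

out = llm("Briefly explain what a Mixture of Experts model is.", max_tokens=256)
print(out["choices"][0]["text"])
```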
Quant Details
- IQ1_M MAX / IQ1_M MAX PLUS and Higher Quants:
  - IQ1_M MAX / IQ1_M MAX PLUS are designed to use the least VRAM/RAM while remaining usable. More informative prompts are suggested with these quants to compensate for low-bit losses.
  - IQ1_M MAX PLUS has additional optimizations compared to IQ1_M MAX.
  - IQ2s are stronger than IQ1_Ms.
  - Q2_K/Q2_K_S are faster for CPU/RAM-only usage but perform below IQ2s.
  - Q3_Ks are slightly faster for CPU/RAM-only usage but behind IQ3s in performance.
  - IQ3s and higher show a large performance improvement; IQ4_XS/IQ4_NL are the peak for NEO Imatrix effects.
  - Q4s have high performance, but IQ4_XS/IQ4_NL may outperform them.
  - Q5s have very high performance.
  - Q6 has peak performance with minimal NEO Imatrix effects.
  - Q8s have excellent performance.
- Specialized Quants:
  - Some quants come in multiple versions, such as MAX, MAX PLUS, MAX PLUS 2, MAX SUPER, and MAX ULTRA, each with different optimizations.
  - IQ1_M (Plus), all IQ2s, and all IQ3s have the output tensor at Q8, the embedding at IQ4_XS, and additional minor adjustments in some expert tensors (the sketch after this list shows how to inspect per-tensor types in a GGUF).
  - Q8 MAX PLUS has its output tensor at IQ4_XS instead of Q8.
  - MAX ULTRA details per quant:
    - Q6 ULTRA MAX: expert layers/tensors 0-7 and 46-47 at 16 bits (F16), plus a 16-bit (F16) output tensor, optimized for high-speed CPU/GPU operation.
    - Q8 ULTRA MAX: same modifications as Q6 ULTRA MAX.
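To verify these per-tensor types in a downloaded file, the `gguf` Python package maintained in the llama.cpp repo can read them directly; the file name below is hypothetical:

```python
from gguf import GGUFReader  # pip install gguf

# Inspect per-tensor quantization types in a GGUF file, e.g. to confirm
# that the output tensor is Q8_0 and the token embedding is IQ4_XS.
reader = GGUFReader("Qwen3-128k-30B-A3B-NEO-MAX-IQ3_M.gguf")  # hypothetical name

for tensor in reader.tensors:
    if tensor.name in ("output.weight", "token_embd.weight"):
        print(f"{tensor.name}: {tensor.tensor_type.name}")
```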
Speed Comparison (GPU vs CPU)
The rough speed chart below lists each quant's tokens per second (T/s) on CPU/RAM only, the quant's file size, and its tokens per second on GPU only (blank where not measured).
| Quant | T/s on CPU/RAM | Size of quant | T/s on GPU only |
|---|---|---|---|
| Q2_K_S | 29 T/s | 10 GB | 83 T/s |
| Q2_K | 27 T/s | 10.5 GB | 72 T/s |
| IQ1_M | 22 T/s | 7 GB | 87 T/s |
| IQ2_XXS | 21 T/s | 8 GB | 76 T/s |
| IQ2_M | 20 T/s | 10 GB | 80 T/s |
| Q4_0 | 20 T/s | 17 GB | |
| Q3_K_S | 18 T/s | 12.9 GB | 70 T/s |
| Q5_0 | 17 T/s | 21 GB | |
| IQ3_M | 15 T/s | 13 GB | 75 T/s |
| ... | ... | ... | ... |
| Q8_0 | 8 T/s | 30 GB | |
| BF16 | 4 T/s | 60 GB | |
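To reproduce numbers like these on your own hardware, a crude tokens-per-second measurement can be taken with llama-cpp-python (the file name is hypothetical; llama.cpp's llama-bench tool gives more rigorous numbers):

```python
import time
from llama_cpp import Llama

# n_gpu_layers=0 measures CPU/RAM-only speed; n_gpu_layers=-1 measures
# GPU-only speed (if the whole model fits in VRAM).
llm = Llama(model_path="Qwen3-128k-30B-A3B-NEO-MAX-Q2_K_S.gguf",  # hypothetical
            n_ctx=8192, n_gpu_layers=0)

start = time.perf_counter()
out = llm("Write a short paragraph about llamas.", max_tokens=200)
elapsed = time.perf_counter() - start  # note: includes prompt processing time

n_tokens = out["usage"]["completion_tokens"]  # tokens actually generated
print(f"{n_tokens / elapsed:.1f} T/s")
```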
Operating Notes (all quants)
- A minimum context of 8k-16k is suggested.
- Temperatures of 1+ and 2+ work better with smaller quants and/or "creative" usage.
- Temperatures of .5 to .7 are best for reasoning with quants larger than IQ2 (IQ1s/IQ2s benefit from a slightly higher temperature for reasoning).
- A repetition penalty of 1.1 is suggested for IQ1 and IQ2 quants.
- The system role (examples below) should be used with all quants.
- The model uses the "default" Jinja template (embedded in the GGUFs) and/or the ChatML template.
Recommended Settings (all quants) - for usage with "Think" / "Reasoning"
temp: .5 to .8 (or 1.5, 2, 2+); rep pen: 1.02 (range: 1.02 to 1.12); rep pen range: 64; top_k: 80; top_p: .95; min_p: .05. Temperatures of 1+, 2+, and 3+ result in deeper, richer thoughts and better output. Model behavior may change with other parameters and/or samplers activated.
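Translated into code, these settings look roughly like the sketch below, using the llama-cpp-python bindings (the file name is hypothetical; the rep pen range of 64 is set in your front end's sampler settings where available, as it is not exposed through this high-level call):

```python
from llama_cpp import Llama

# Hypothetical file name; any quant of this model loads the same way.
llm = Llama(model_path="Qwen3-128k-30B-A3B-NEO-MAX-IQ4_XS.gguf", n_ctx=16384)

response = llm.create_chat_completion(
    messages=[
        {"role": "system",
         "content": "You are a helpful, smart, kind, and efficient AI assistant. "
                    "You always fulfill the user's requests to the best of your ability."},
        {"role": "user", "content": "Outline a 3-step plan for testing a parser."},
    ],
    temperature=0.6,      # .5 to .8 for reasoning; 1.5-2+ for creative use
    top_k=80,
    top_p=0.95,
    min_p=0.05,
    repeat_penalty=1.02,  # 1.02 to 1.12; ~1.1 suggested for IQ1/IQ2 quants
    max_tokens=2048,
)
print(response["choices"][0]["message"]["content"])
```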
System Role / System Prompts
The System Role / System Prompt is "root access" to the model, controlling internal workings, including instruction following, output generation, and reasoning.
How to Set
Depending on your AI "app", you may need to copy/paste one of the prompts below into the "System Prompt" or "System Role" field.
- In LM Studio, set/activate "Power User" or "Developer" mode and copy/paste the prompt into the System Prompt box.
- In SillyTavern, go to the "template page" ("A"), activate "system prompt", and enter the text in the prompt box.
- In Ollama, see https://github.com/ollama/ollama/blob/main/README.md for setting the "system message"; a minimal API sketch follows this list.
- In Koboldcpp, load the model, start it, go to settings, select "Llama 3 Chat"/"Command-R", and enter the text in the "sys prompt" box.
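For Ollama, a minimal sketch of setting the system prompt per request against a local server follows; it assumes the GGUF has already been imported into Ollama under the hypothetical name "qwen3-neo" (Ollama's Modelfile SYSTEM directive can bake the prompt in instead):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3-neo",  # hypothetical local model name
        "messages": [
            {"role": "system",
             "content": "You are a helpful, smart, kind, and efficient AI assistant."},
            {"role": "user",
             "content": "Summarize the benefits of MoE models in two sentences."},
        ],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```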
Available System Prompts
- Generic assistant:
You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.
- Deep thinking [reasoning on]:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
- Multi-TIERED [reasoning on]:
You are a deep thinking AI composed of 4 AIs - Spock, Wordsmith, Jamet and Saten - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem using your skillsets and critical instructions.
- Multi-TIERED - CREATIVE [reasoning on]:
Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.
As a deep thinking AI composed of 4 AIs - Spock, Wordsmith, Jamet and Saten - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem using your skillsets and critical instructions.
Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Sym
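When reasoning is on, the model encloses its deliberation in <think> </think> tags, as the prompts above instruct. Below is a minimal sketch for separating the reasoning from the final answer; the helper name is mine, not part of the model or any library:

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer) on <think> tags.

    Hypothetical helper: assumes at most one well-formed <think>...</think> block.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>First I consider the edge cases...</think>The answer is 42."
thoughts, answer = split_think(raw)
print(answer)  # "The answer is 42."
```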
Additional Information
For additional benchmarks, operating notes, turning reasoning on/off, and tech notes regarding usage, please see Qwen's repo: https://huggingface.co/Qwen/Qwen3-30B-A3B
📄 License
This model is licensed under the Apache-2.0 license.