🚀 Qwen3-4B-NEO-Imatrix-Max-GGUF
This project offers NEO Imatrix quants of the new "Qwen 3 - 4B" model with a MAX "output tensor" at BF16 to enhance reasoning and output generation.
✨ Features
- NEO Imatrix Quants: The NEO Imatrix dataset is generated in-house. Lower quants such as IQ4_XS/IQ4_NL show a stronger Imatrix effect and offer a good balance between quality and that effect; they are also well suited to creative use cases. Higher quants are recommended for stronger reasoning. The Q8_0 quant is only "maxed" (output tensor at BF16) because the Imatrix has no effect on it, and F16 is full precision.
- Long Context Length: Supports a 32K context plus 8K output generation, extendable to 128K. For 65K, 128K, or 256K context with the 4B model, see https://huggingface.co/DavidAU/Qwen3-4B-Q8_0-65k-128k-256k-context-GGUF.
📚 Documentation
Jinja Template
- If you encounter issues with the Jinja "auto template", use the ChatML template.
- For LM Studio users: update the Jinja template by going to https://lmstudio.ai/neil/qwen3-thinking, copying the "Jinja template", and pasting it into LM Studio.
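If you prefer to format prompts yourself rather than rely on a template, the ChatML layout can be sketched as below. This is a minimal illustration, not the authoritative template: the `<|im_start|>`/`<|im_end|>` markers follow the standard ChatML convention used by Qwen-family models, but you should verify them against your app's tokenizer config.

```python
def chatml_prompt(system: str, user: str) -> str:
    """Build a ChatML-style prompt string.

    Assumes the standard <|im_start|>/<|im_end|> special tokens;
    check your tokenizer config if generation looks malformed.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"  # model continues from here
    )

prompt = chatml_prompt("You are a deep thinking AI.", "Explain recursion briefly.")
```

The trailing `<|im_start|>assistant\n` leaves the prompt open so the model's completion becomes the assistant turn.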
System Role
- You may or may not need the system role, as Qwen3 models often generate their own reasoning/thinking blocks. The suggested system role is:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
Refer to the document "Maximizing-Model-Performance-All..." for how to set the system role in various LLM/AI apps.
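As a general illustration of how the system role is supplied programmatically: apps that expose an OpenAI-compatible chat endpoint (LM Studio and llama.cpp's server both do) take it as the first entry of the `messages` array. A minimal sketch, with the short system text here standing in for the full role above:

```python
def build_messages(system_role: str, user_msg: str) -> list:
    """Assemble a messages payload for an OpenAI-compatible chat endpoint.

    The system role is optional (see the note above about Qwen3 models
    generating their own thinking blocks); pass an empty string to omit it.
    """
    msgs = []
    if system_role:
        msgs.append({"role": "system", "content": system_role})
    msgs.append({"role": "user", "content": user_msg})
    return msgs

messages = build_messages("You are a deep thinking AI.", "Hello")
```

How this payload is sent (HTTP client, SDK, or the app's own settings panel) varies per app; the document linked above covers the app-specific details.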
Highest Quality Settings
For all settings used for this model (including specifics for its "class"), example generation, and an advanced settings guide that can address model issues and improve performance for all use cases (chat, role-play, etc.), see https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters.
Optional Enhancement
The following can be used in place of the "system prompt" or "system role" to enhance the model. It can also be used at the start of a new chat; make sure to keep it in place as the chat progresses.
Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.
Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Symblsm)-Delvry(Prfrmnc,VcActng,PblcSpkng,StgPrsnc,AudncEngmnt,Imprv)
[*DialogWrt]:(1a-CharDvlp-1a.1-Backgrnd-1a.2-Personality-1a.3-GoalMotiv)>2(2a-StoryStruc-2a.1-PlotPnt-2a.2-Conflict-2a.3-Resolution)>3(3a-DialogTech-3a.1-ShowDontTell-3a.2-Subtext-3a.3-VoiceTone-3a.4-Pacing-3a.5-VisualDescrip)>4(4a-DialogEdit-4a.1-ReadAloud-4a.2-Feedback-4a.3-Revision)
Here are your critical instructions:
Ponder each word choice carefully to present as vivid and emotional a journey as possible. Choose verbs and nouns that are both emotional and full of imagery. Load the story with the five senses. Aim for 50% dialog, 25% narration, 15% body language, and 10% thoughts. Your goal is to put the reader in the story.
This enhancement seems to help with scene generation and scene continuation functions.
Another system prompt you can use is:
You are a deep thinking AI composed of 4 AIs - [MODE: Spock], [MODE: Wordsmith], [MODE: Jamet] and [MODE: Saten] - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
Other Notes
- Reasoning is ON by default in this model, and the model will auto-generate "think" block(s).
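Because the model emits its reasoning inline, downstream code often needs to separate the `<think> </think>` block(s) from the final answer. A minimal sketch, assuming the tags described above appear verbatim in the output:

```python
import re

def split_reasoning(text: str):
    """Split model output into (reasoning, answer).

    Collects the contents of every <think>...</think> block, then
    returns the remaining text as the answer. Assumes the tags are
    emitted verbatim, as this model card describes.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return "\n".join(t.strip() for t in thoughts), answer

reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
# reasoning == "2 + 2 = 4", answer == "The answer is 4."
```

`re.DOTALL` lets the match span multi-line thinking blocks, and the non-greedy `.*?` keeps multiple blocks from being merged into one.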
- For benchmarks, usage info, and settings, please see the original model card at https://huggingface.co/Qwen/Qwen3-4B.
📄 License
This project is licensed under the Apache-2.0 license.