
Qwen3 30B A3B GGUF

Developed by ubergarm
A quantized version of Qwen3-30B-A3B that uses advanced nonlinear SotA quantization to deliver best-in-class quality within a given memory budget.
Downloads: 780
Release Date: 5/2/2025

Model Overview

This is a quantized version of the Qwen/Qwen3-30B-A3B model, designed for efficient inference. It supports conversational interaction and is well suited to text generation tasks.

Model Features

Advanced Nonlinear Quantization
Quantized with the ik_llama.cpp fork, which supports advanced nonlinear SotA quantization types, enabling high-quality inference.
Efficient Memory Usage
Runs with over 32k tokens of context on a single 24GB VRAM GPU, keeping memory consumption low (see the launch sketch after this list).
High-Performance Inference
Achieves over 1600 tok/sec prompt processing (PP) and 105 tok/sec token generation (TG) on a 3090TI FE with 24GB VRAM.
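
As a concrete starting point, the sketch below launches the quantized GGUF with ik_llama.cpp's llama-server at a 32k context, fully offloaded to a single GPU. It is a minimal sketch, assuming ik_llama.cpp has already been built locally and that its llama-server binary accepts the standard llama.cpp flags used here; the model path, host, and port are placeholders to adjust.

```python
# Minimal sketch: serve this GGUF quant with ik_llama.cpp's llama-server.
# Assumptions: ik_llama.cpp is built at ./ik_llama.cpp/build and its
# llama-server binary supports the standard llama.cpp flags below;
# MODEL_PATH is a placeholder for the downloaded GGUF file.
import subprocess

MODEL_PATH = "models/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B.gguf"  # placeholder path

cmd = [
    "./ik_llama.cpp/build/bin/llama-server",
    "--model", MODEL_PATH,
    "--ctx-size", "32768",    # 32k context, fits on a 24GB GPU per the card
    "--n-gpu-layers", "99",   # offload all layers to the GPU
    "--host", "127.0.0.1",
    "--port", "8080",
]

# Blocks until the server is stopped; point an OpenAI-compatible client
# at http://127.0.0.1:8080/v1 once it reports it is listening.
subprocess.run(cmd, check=True)
```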

Model Capabilities

Text Generation
Conversational Interaction
Long Context Processing

Use Cases

Text Generation
Dialogue Systems
Used to build efficient dialogue systems with long-context interaction, maintaining high-quality generation at up to 32k context (see the chat example below).
Content Creation
Assists in generating high-quality text content such as articles and stories.
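
For dialogue use, the sketch below sends a single chat turn to the locally running server. It is a minimal sketch, assuming llama-server exposes the usual llama.cpp OpenAI-compatible /v1/chat/completions endpoint; the host, port, model name, and prompt are placeholders.

```python
# Minimal sketch: one chat turn against the locally served quant.
# Assumption: the llama-server started above exposes the standard
# OpenAI-compatible /v1/chat/completions endpoint on port 8080.
import json
import urllib.request

payload = {
    "model": "Qwen3-30B-A3B",  # informational; the server uses its loaded model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of MoE models in two sentences."},
    ],
    "max_tokens": 256,
    "temperature": 0.6,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Parse the OpenAI-style response and print the assistant's reply.
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```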