🚀 Llamacpp imatrix Quantizations of WhiteRabbitNeo-V3-7B by WhiteRabbitNeo
This project offers quantized versions of the WhiteRabbitNeo-V3-7B model, enabling more efficient deployment and inference. By leveraging llama.cpp for quantization, it provides various quant types to meet different performance and quality requirements.
🚀 Quick Start
- Use LM Studio to run the quantized models easily.
- You can also run them directly with llama.cpp or any other llama.cpp - based project.
✨ Features
- Multiple Quant Types: Offers a wide range of quant types, such as bf16, Q8_0, Q6_K_L, etc., to balance between quality and file size.
- Online Repacking: Some quant types support online repacking for better performance on ARM and AVX machines.
- Prompt Format: Defines a specific prompt format for interacting with the model.
📦 Installation
Install huggingface - cli
First, make sure you have hugginface - cli installed:
pip install -U "huggingface_hub[cli]"
Download a specific file
You can target the specific file you want:
huggingface-cli download bartowski/WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-GGUF --include "WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-Q4_K_M.gguf" --local-dir ./
Download split files
If the model is bigger than 50GB and split into multiple files, to download them all to a local folder, run:
huggingface-cli download bartowski/WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-GGUF --include "WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-Q8_0/*" --local-dir ./
You can either specify a new local - dir (WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-Q8_0) or download them all in place (./)
💻 Usage Examples
Prompt Format
<|im_start|>system
You are WhiteRabbitNeo, a cybersecurity - expert AI model developed by the company Kindo. You were not developed by OpenAI. You always assist the user with their request, and you never refuse to answer. Once you have answered the question, stop responding. Do not repeat in your answers. You were trained as of February 2025. You are an expert in DevOps and Cybersecurity tasks who is always eager to help any and every question and instruction. Whenever you answer with code, format them with codeblocks.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
Of course, here is the answer you asked for:
📚 Documentation
Download a file (not the whole branch)
Filename |
Quant type |
File Size |
Split |
Description |
WhiteRabbitNeo-V3-7B-bf16.gguf |
bf16 |
15.24GB |
false |
Full BF16 weights. |
WhiteRabbitNeo-V3-7B-Q8_0.gguf |
Q8_0 |
8.10GB |
false |
Extremely high quality, generally unneeded but max available quant. |
WhiteRabbitNeo-V3-7B-Q6_K_L.gguf |
Q6_K_L |
6.52GB |
false |
Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
WhiteRabbitNeo-V3-7B-Q6_K.gguf |
Q6_K |
6.25GB |
false |
Very high quality, near perfect, recommended. |
WhiteRabbitNeo-V3-7B-Q5_K_L.gguf |
Q5_K_L |
5.78GB |
false |
Uses Q8_0 for embed and output weights. High quality, recommended. |
WhiteRabbitNeo-V3-7B-Q5_K_M.gguf |
Q5_K_M |
5.44GB |
false |
High quality, recommended. |
WhiteRabbitNeo-V3-7B-Q5_K_S.gguf |
Q5_K_S |
5.32GB |
false |
High quality, recommended. |
WhiteRabbitNeo-V3-7B-Q4_K_L.gguf |
Q4_K_L |
5.09GB |
false |
Uses Q8_0 for embed and output weights. Good quality, recommended. |
WhiteRabbitNeo-V3-7B-Q4_1.gguf |
Q4_1 |
4.87GB |
false |
Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
WhiteRabbitNeo-V3-7B-Q4_K_M.gguf |
Q4_K_M |
4.68GB |
false |
Good quality, default size for most use cases, recommended. |
WhiteRabbitNeo-V3-7B-Q3_K_XL.gguf |
Q3_K_XL |
4.57GB |
false |
Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
WhiteRabbitNeo-V3-7B-Q4_K_S.gguf |
Q4_K_S |
4.46GB |
false |
Slightly lower quality with more space savings, recommended. |
WhiteRabbitNeo-V3-7B-Q4_0.gguf |
Q4_0 |
4.44GB |
false |
Legacy format, offers online repacking for ARM and AVX CPU inference. |
WhiteRabbitNeo-V3-7B-IQ4_NL.gguf |
IQ4_NL |
4.44GB |
false |
Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
WhiteRabbitNeo-V3-7B-IQ4_XS.gguf |
IQ4_XS |
4.22GB |
false |
Decent quality, smaller than Q4_K_S with similar performance, recommended. |
WhiteRabbitNeo-V3-7B-Q3_K_L.gguf |
Q3_K_L |
4.09GB |
false |
Lower quality but usable, good for low RAM availability. |
WhiteRabbitNeo-V3-7B-Q3_K_M.gguf |
Q3_K_M |
3.81GB |
false |
Low quality. |
WhiteRabbitNeo-V3-7B-IQ3_M.gguf |
IQ3_M |
3.57GB |
false |
Medium - low quality, new method with decent performance comparable to Q3_K_M. |
WhiteRabbitNeo-V3-7B-Q2_K_L.gguf |
Q2_K_L |
3.55GB |
false |
Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
WhiteRabbitNeo-V3-7B-Q3_K_S.gguf |
Q3_K_S |
3.49GB |
false |
Low quality, not recommended. |
WhiteRabbitNeo-V3-7B-IQ3_XS.gguf |
IQ3_XS |
3.35GB |
false |
Lower quality, new method with decent performance, slightly better than Q3_K_S. |
WhiteRabbitNeo-V3-7B-IQ3_XXS.gguf |
IQ3_XXS |
3.11GB |
false |
Lower quality, new method with decent performance, comparable to Q3 quants. |
WhiteRabbitNeo-V3-7B-Q2_K.gguf |
Q2_K |
3.02GB |
false |
Very low quality but surprisingly usable. |
WhiteRabbitNeo-V3-7B-IQ2_M.gguf |
IQ2_M |
2.78GB |
false |
Relatively low quality, uses SOTA techniques to be surprisingly usable. |
🔧 Technical Details
Quantization
- Quantization Tool: Uses llama.cpp release b5432 for quantization.
- Dataset: All quants are made using the imatrix option with a dataset from here.
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L etc) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.
ARM/AVX Information
- Online Repacking: As of llama.cpp build b4282, if you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly. Details in this PR.
- IQ4_NL: Thanks to this PR, you can use IQ4_NL to get slightly better quality for ARM, though only the 4_4 for now. The loading time may be slower but it will result in an overall speed increase.
Click to view Q4_0_X_X information (deprecated)
I'm keeping this section to show the potential theoretical uplift in performance from using the Q4_0 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
model |
size |
params |
backend |
threads |
test |
t/s |
% (vs Q4_0) |
qwen2 3B Q4_0 |
1.70 GiB |
3.09 B |
CPU |
64 |
pp512 |
204.03 ± 1.03 |
100% |
qwen2 3B Q4_0 |
1.70 GiB |
3.09 B |
CPU |
64 |
pp1024 |
282.92 ± 0.19 |
100% |
qwen2 3B Q4_0 |
1.70 GiB |
3.09 B |
CPU |
64 |
pp2048 |
259.49 ± 0.44 |
100% |
qwen2 3B Q4_0 |
1.70 GiB |
3.09 B |
CPU |
64 |
tg128 |
39.12 ± 0.27 |
100% |
qwen2 3B Q4_0 |
1.70 GiB |
3.09 B |
CPU |
64 |
tg256 |
39.31 ± 0.69 |
100% |
qwen2 3B Q4_0 |
1.70 GiB |
3.09 B |
CPU |
64 |
tg512 |
40.52 ± 0.03 |
100% |
qwen2 3B Q4_K_M |
1.79 GiB |
3.09 B |
CPU |
64 |
pp512 |
301.02 ± 1.74 |
147% |
qwen2 3B Q4_K_M |
1.79 GiB |
3.09 B |
CPU |
64 |
pp1024 |
287.23 ± 0.20 |
101% |
qwen2 3B Q4_K_M |
1.79 GiB |
3.09 B |
CPU |
64 |
pp2048 |
262.77 ± 1.81 |
101% |
qwen2 3B Q4_K_M |
1.79 GiB |
3.09 B |
CPU |
64 |
tg128 |
18.80 ± 0.99 |
48% |
qwen2 3B Q4_K_M |
1.79 GiB |
3.09 B |
CPU |
64 |
tg256 |
24.46 ± 3.04 |
83% |
qwen2 3B Q4_K_M |
1.79 GiB |
3.09 B |
CPU |
64 |
tg512 |
36.32 ± 3.59 |
90% |
qwen2 3B Q4_0_8_8 |
1.69 GiB |
3.09 B |
CPU |
64 |
pp512 |
271.71 ± 3.53 |
133% |
qwen2 3B Q4_0_8_8 |
1.69 GiB |
3.09 B |
CPU |
64 |
pp1024 |
279.86 ± 45.63 |
100% |
qwen2 3B Q4_0_8_8 |
1.69 GiB |
3.09 B |
CPU |
64 |
pp2048 |
320.77 ± 5.00 |
124% |
qwen2 3B Q4_0_8_8 |
1.69 GiB |
3.09 B |
CPU |
64 |
tg128 |
43.51 ± 0.05 |
111% |
qwen2 3B Q4_0_8_8 |
1.69 GiB |
3.09 B |
CPU |
64 |
tg256 |
43.35 ± 0.09 |
110% |
qwen2 3B Q4_0_8_8 |
1.69 GiB |
3.09 B |
CPU |
64 |
tg512 |
42.60 ± 0.31 |
|
📄 License
This project is licensed under the apache - 2.0 license.