Llamacpp imatrix Quantizations of Lucy by Menlo
This project provides quantized versions of the Lucy model by Menlo using the llama.cpp library. It offers various quantization types to suit different hardware and performance requirements, enabling efficient text generation.
Quick Start
- You can run the quantized models in LM Studio.
- Run them directly with llama.cpp, or any other llama.cpp-based project (see the example below).
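As a concrete starting point, here is a minimal sketch of running one of these quants with llama.cpp's llama-cli binary. The file name, prompt, and GPU layer count are illustrative assumptions, not values taken from this card; substitute whichever quant you downloaded.

```bash
# Minimal llama.cpp run (assumes llama.cpp is built and llama-cli is on PATH):
#   -m    path to the downloaded GGUF file
#   -ngl  number of layers to offload to the GPU (if built with GPU support)
#   -p    prompt text
llama-cli -m ./Menlo_Lucy-Q4_K_M.gguf -ngl 99 \
  -p "Write a haiku about local inference."
```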
Features
- Multiple Quantization Types: Offers a wide range of quantization types, such as bf16, Q8_0, Q6_K_L, etc., to balance model quality and file size.
- Online Repacking: Some quantization types support online repacking, which can improve performance on ARM and AVX machines.
- Flexible Download Options: Allows downloading specific files or split files using the huggingface-cli.
Installation
Prerequisites
Make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Download a Specific File
huggingface-cli download bartowski/Menlo_Lucy-GGUF --include "Menlo_Lucy-Q4_K_M.gguf" --local-dir ./
Download Split Files
If the model is split into multiple files, run:
huggingface-cli download bartowski/Menlo_Lucy-GGUF --include "Menlo_Lucy-Q8_0/*" --local-dir ./
Usage Examples
Prompt Format
No chat template is specified, so the default is used below; this may be incorrect, so check the original model card for details.
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
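If you call llama.cpp directly rather than through a front end that applies the chat template for you, one option is to fill in the template yourself and pass the result as the prompt. The sketch below does this with llama-cli; the system and user messages are placeholders, and the model path is an assumption.

```bash
# Manually formatted prompt matching the template shown above.
llama-cli -m ./Menlo_Lucy-Q4_K_M.gguf -p "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Summarize what GGUF quantization is in two sentences.<|im_end|>
<|im_start|>assistant
"
```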
Documentation
Model Information
- Quantized By: bartowski
- Pipeline Tag: text-generation
- Base Model: Menlo/Lucy
- Base Model Relation: quantized
Downloadable Files
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Lucy-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. |
| Lucy-Q8_0.gguf | Q8_0 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. |
| Lucy-Q6_K_L.gguf | Q6_K_L | 1.49GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| Lucy-Q6_K.gguf | Q6_K | 1.42GB | false | Very high quality, near perfect, recommended. |
| Lucy-Q5_K_L.gguf | Q5_K_L | 1.33GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| Lucy-Q5_K_M.gguf | Q5_K_M | 1.26GB | false | High quality, recommended. |
| Lucy-Q5_K_S.gguf | Q5_K_S | 1.23GB | false | High quality, recommended. |
| Lucy-Q4_K_L.gguf | Q4_K_L | 1.18GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| Lucy-Q4_1.gguf | Q4_1 | 1.14GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| Lucy-Q4_K_M.gguf | Q4_K_M | 1.11GB | false | Good quality, default size for most use cases, recommended. |
| Lucy-Q3_K_XL.gguf | Q3_K_XL | 1.08GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Lucy-Q4_K_S.gguf | Q4_K_S | 1.06GB | false | Slightly lower quality with more space savings, recommended. |
| Lucy-Q4_0.gguf | Q4_0 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Lucy-IQ4_NL.gguf | IQ4_NL | 1.05GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Lucy-IQ4_XS.gguf | IQ4_XS | 1.01GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Lucy-Q3_K_L.gguf | Q3_K_L | 1.00GB | false | Lower quality but usable, good for low RAM availability. |
| Lucy-Q3_K_M.gguf | Q3_K_M | 0.94GB | false | Low quality. |
| Lucy-IQ3_M.gguf | IQ3_M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Lucy-Q3_K_S.gguf | Q3_K_S | 0.87GB | false | Low quality, not recommended. |
| Lucy-Q2_K_L.gguf | Q2_K_L | 0.85GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Lucy-IQ3_XS.gguf | IQ3_XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| Lucy-Q2_K.gguf | Q2_K | 0.78GB | false | Very low quality but surprisingly usable. |
| Lucy-IQ3_XXS.gguf | IQ3_XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method with the embedding and output weights quantized to Q8_0 instead of the usual default.
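For context, this is roughly how such a variant can be produced with llama.cpp's llama-quantize tool. The file names, the target type, and the use of an imatrix file here are illustrative assumptions, not the exact commands used to build this repository.

```bash
# Quantize to Q4_K_M while forcing the token-embedding and output tensors to Q8_0
# (the kind of override the _L/_XL variants above refer to).
llama-quantize \
  --imatrix imatrix.dat \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  Lucy-bf16.gguf Lucy-Q4_K_L.gguf Q4_K_M
```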
ARM/AVX Information
Previously, you would download Q4_0_4_4/4_8/8_8 files, whose weights were interleaved in memory to improve performance on ARM and AVX machines. Now there is "online repacking" of weights; details can be found in this PR. If you use Q4_0 and your hardware would benefit from repacking, it is done automatically at load time.
As of llama.cpp build b4282, you cannot run the Q4_0_X_X files and need to use Q4_0 instead.
Additionally, you can use IQ4_NL for slightly better quality. Thanks to this PR, it will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower, but it will result in an overall speed increase.
Which File to Choose
A great write-up with charts comparing the performance of various quant types is provided by Artefact2 here.
First, determine how big a model you can run by checking your available RAM and/or VRAM.
If you want the model to run as fast as possible, choose a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, and then choose a quant with a file size 1-2GB smaller than that total.
Next, decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, choose a K-quant (e.g., Q5_K_M). If you want more details, check the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and using cuBLAS (Nvidia) or rocBLAS (AMD), consider the I-quants (e.g., IQ3_M), which are newer and offer better performance for their size. However, I-quants on CPU are slower than their K-quant equivalents, so you need to balance speed and performance.
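As a quick way to apply the sizing rule above on a Linux machine with an NVIDIA GPU, the commands below report total VRAM and system RAM; subtract 1-2GB from the relevant figure and pick a quant from the table that fits. These commands are only one way to check and assume nvidia-smi and free are available.

```bash
# Total GPU VRAM (NVIDIA):
nvidia-smi --query-gpu=memory.total --format=csv,noheader
# Total system RAM:
free -h | awk '/^Mem:/ {print $2}'
```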
Technical Details
This project uses llama.cpp release b5924 for quantization. All quants are made using the imatrix option with a dataset from here.
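For readers who want to reproduce a similar pipeline, here is a minimal sketch of how imatrix-based quants are generally produced with llama.cpp's llama-imatrix and llama-quantize tools. The calibration file and output names are assumptions; the exact commands and dataset used for this repository are not reproduced here.

```bash
# 1) Compute an importance matrix from a calibration text file.
llama-imatrix -m Lucy-bf16.gguf -f calibration.txt -o imatrix.dat
# 2) Quantize with that importance matrix.
llama-quantize --imatrix imatrix.dat Lucy-bf16.gguf Lucy-Q4_K_M.gguf Q4_K_M
```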
License
No license information is provided in the original document.
Credits
- Thank you to kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output.
- Thank you to LM Studio for sponsoring the work.
If you want to support the work, visit the ko-fi page here: https://ko-fi.com/bartowski