Llamacpp imatrix Quantizations of Lucy by Menlo
This project provides quantized versions of the Lucy model by Menlo using the llama.cpp library. It offers various quantization types to suit different hardware and performance requirements, enabling efficient text generation.
Quick Start
- You can run the quantized models in LM Studio.
- Run them directly with llama.cpp, or any other llama.cpp-based project (see the example below).
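As a concrete starting point, here is a minimal sketch of running one of these quants with llama.cpp's llama-cli binary. The file name, prompt, and GPU layer count are illustrative assumptions, not values taken from this card; substitute whichever quant you downloaded.

```bash
# Minimal llama.cpp run (assumes llama.cpp is built and llama-cli is on PATH):
#   -m    path to the downloaded GGUF file
#   -ngl  number of layers to offload to the GPU (if built with GPU support)
#   -p    prompt text
llama-cli -m ./Menlo_Lucy-Q4_K_M.gguf -ngl 99 \
  -p "Write a haiku about local inference."
```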
Features
- Multiple Quantization Types: Offers a wide range of quantization types, such as bf16, Q8_0, Q6_K_L, etc., to balance model quality and file size.
- Online Repacking: Some quantization types support online repacking, which can improve performance on ARM and AVX machines.
- Flexible Download Options: Allows downloading specific files or split files using the huggingface-cli.
Installation
Prerequisites
Make sure you have huggingface-cli installed:
pip install -U "huggingface_hub[cli]"
Download a Specific File
huggingface-cli download bartowski/Menlo_Lucy-GGUF --include "Menlo_Lucy-Q4_K_M.gguf" --local-dir ./
Download Split Files
If the model is split into multiple files, run:
huggingface-cli download bartowski/Menlo_Lucy-GGUF --include "Menlo_Lucy-Q8_0/*" --local-dir ./
Usage Examples
Prompt Format
No chat template is specified, so the default is used below; this may be incorrect, so check the original model card for details.
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
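If you call llama.cpp directly rather than through a front end that applies the chat template for you, one option is to fill in the template yourself and pass the result as the prompt. The sketch below does this with llama-cli; the system and user messages are placeholders, and the model path is an assumption.

```bash
# Manually formatted prompt matching the template shown above.
llama-cli -m ./Menlo_Lucy-Q4_K_M.gguf -p "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Summarize what GGUF quantization is in two sentences.<|im_end|>
<|im_start|>assistant
"
```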
Documentation
Model Information
- Quantized By: bartowski
- Pipeline Tag: text-generation
- Base Model: Menlo/Lucy
- Base Model Relation: quantized
Downloadable Files
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Lucy-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. |
| Lucy-Q8_0.gguf | Q8_0 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. |
| Lucy-Q6_K_L.gguf | Q6_K_L | 1.49GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| Lucy-Q6_K.gguf | Q6_K | 1.42GB | false | Very high quality, near perfect, recommended. |
| Lucy-Q5_K_L.gguf | Q5_K_L | 1.33GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| Lucy-Q5_K_M.gguf | Q5_K_M | 1.26GB | false | High quality, recommended. |
| Lucy-Q5_K_S.gguf | Q5_K_S | 1.23GB | false | High quality, recommended. |
| Lucy-Q4_K_L.gguf | Q4_K_L | 1.18GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| Lucy-Q4_1.gguf | Q4_1 | 1.14GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| Lucy-Q4_K_M.gguf | Q4_K_M | 1.11GB | false | Good quality, default size for most use cases, recommended. |
| Lucy-Q3_K_XL.gguf | Q3_K_XL | 1.08GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Lucy-Q4_K_S.gguf | Q4_K_S | 1.06GB | false | Slightly lower quality with more space savings, recommended. |
| Lucy-Q4_0.gguf | Q4_0 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Lucy-IQ4_NL.gguf | IQ4_NL | 1.05GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Lucy-IQ4_XS.gguf | IQ4_XS | 1.01GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Lucy-Q3_K_L.gguf | Q3_K_L | 1.00GB | false | Lower quality but usable, good for low RAM availability. |
| Lucy-Q3_K_M.gguf | Q3_K_M | 0.94GB | false | Low quality. |
| Lucy-IQ3_M.gguf | IQ3_M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Lucy-Q3_K_S.gguf | Q3_K_S | 0.87GB | false | Low quality, not recommended. |
| Lucy-Q2_K_L.gguf | Q2_K_L | 0.85GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Lucy-IQ3_XS.gguf | IQ3_XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| Lucy-Q2_K.gguf | Q2_K | 0.78GB | false | Very low quality but surprisingly usable. |
| Lucy-IQ3_XXS.gguf | IQ3_XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method with the embedding and output weights quantized to Q8_0 instead of the usual default.
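For context, this is roughly how such a variant can be produced with llama.cpp's llama-quantize tool. The file names, the target type, and the use of an imatrix file here are illustrative assumptions, not the exact commands used to build this repository.

```bash
# Quantize to Q4_K_M while forcing the token-embedding and output tensors to Q8_0
# (the kind of override the _L/_XL variants above refer to).
llama-quantize \
  --imatrix imatrix.dat \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  Lucy-bf16.gguf Lucy-Q4_K_L.gguf Q4_K_M
```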
ARM/AVX Information
Previously, you would download Q4_0_4_4/4_8/8_8 files, whose weights were interleaved in memory to improve performance on ARM and AVX machines. Now there is "online repacking" of weights; details can be found in this PR. If you use Q4_0 and your hardware would benefit from repacking, it is done automatically at load time.
As of llama.cpp build b4282, you cannot run the Q4_0_X_X files and need to use Q4_0 instead.
Additionally, you can use IQ4_NL for slightly better quality. Thanks to this PR, it will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower, but it will result in an overall speed increase.
Which File to Choose
A great write-up with charts comparing the performance of various quant types is provided by Artefact2 here.
First, determine how big a model you can run by checking your available RAM and/or VRAM.
If you want the model to run as fast as possible, choose a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, and then choose a quant with a file size 1-2GB smaller than that total.
Next, decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, choose a K-quant (e.g., Q5_K_M). If you want more details, check the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and using cuBLAS (Nvidia) or rocBLAS (AMD), consider the I-quants (e.g., IQ3_M), which are newer and offer better performance for their size. However, I-quants on CPU are slower than their K-quant equivalents, so you need to balance speed and performance.
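As a quick way to apply the sizing rule above on a Linux machine with an NVIDIA GPU, the commands below report total VRAM and system RAM; subtract 1-2GB from the relevant figure and pick a quant from the table that fits. These commands are only one way to check and assume nvidia-smi and free are available.

```bash
# Total GPU VRAM (NVIDIA):
nvidia-smi --query-gpu=memory.total --format=csv,noheader
# Total system RAM:
free -h | awk '/^Mem:/ {print $2}'
```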
Technical Details
This project uses llama.cpp release b5924 for quantization. All quants are made using the imatrix option with a dataset from here.
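For readers who want to reproduce a similar pipeline, here is a minimal sketch of how imatrix-based quants are generally produced with llama.cpp's llama-imatrix and llama-quantize tools. The calibration file and output names are assumptions; the exact commands and dataset used for this repository are not reproduced here.

```bash
# 1) Compute an importance matrix from a calibration text file.
llama-imatrix -m Lucy-bf16.gguf -f calibration.txt -o imatrix.dat
# 2) Quantize with that importance matrix.
llama-quantize --imatrix imatrix.dat Lucy-bf16.gguf Lucy-Q4_K_M.gguf Q4_K_M
```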
License
No license information is provided in the original document.
Credits
- Thank you to kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output.
- Thank you to LM Studio for sponsoring the work.
If you want to support the work, visit the ko-fi page here: https://ko-fi.com/bartowski