Llamacpp imatrix Quantizations of bagel-8b-v1.0
This project provides LlamaCpp imatrix quantizations of the bagel-8b-v1.0 model. It offers various quantization options for different performance and quality requirements, making it easier to run the model on diverse hardware.
Quick Start
Downloading using huggingface-cli
First, ensure you have huggingface-cli installed:

```bash
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:

```bash
huggingface-cli download bartowski/bagel-8b-v1.0-GGUF --include "bagel-8b-v1.0-Q4_K_M.gguf" --local-dir ./ --local-dir-use-symlinks False
```
If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:
```bash
huggingface-cli download bartowski/bagel-8b-v1.0-GGUF --include "bagel-8b-v1.0-Q8_0.gguf/*" --local-dir bagel-8b-v1.0-Q8_0 --local-dir-use-symlinks False
```
You can either specify a new `local-dir` (e.g., `bagel-8b-v1.0-Q8_0`) or download them all in place (`./`).
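If you want to quickly verify that a downloaded file loads and generates text, the sketch below shows one way to do it with a local llama.cpp build. It is only a sketch: the CLI example binary is named `main` in builds around release b2854 (newer builds ship it as `llama-cli`), the file name assumes you grabbed the Q4_K_M quant, and `-ngl 99` only has an effect if llama.cpp was built with GPU support.

```bash
# Minimal smoke test of a downloaded quant with llama.cpp (sketch).
# "main" is the CLI example binary in builds around b2854; newer builds call it "llama-cli".
./main \
  -m ./bagel-8b-v1.0-Q4_K_M.gguf \
  -p "Hello, how are you?" \
  -n 64 \
  -ngl 99
```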
Features
Quantization
- Multiple Quantization Options: Offers a wide range of quantization types, such as Q8_0, Q6_K, Q5_K_M, etc., to balance between model quality and file size.
- Using LlamaCpp: Utilizes the llama.cpp release b2854 for quantization.
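For context on the "imatrix" part, the sketch below shows the usual two-step flow of importance-matrix quantization with the llama.cpp tools. The file names and the calibration text are placeholders, and tool names and flags can vary between llama.cpp releases, so treat it as an illustration rather than the exact commands used for this repo.

```bash
# Rough sketch of an imatrix quantization flow with llama.cpp tools
# (placeholder file names; exact tool names/flags depend on the release).

# 1. Compute an importance matrix from a calibration text file.
./imatrix -m bagel-8b-v1.0-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize the fp16 GGUF, guided by the importance matrix.
./quantize --imatrix imatrix.dat bagel-8b-v1.0-f16.gguf bagel-8b-v1.0-Q4_K_M.gguf Q4_K_M
```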
Prompt Format
The specific prompt format is as follows:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
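If you drive llama.cpp directly from the command line, you can assemble this template yourself and pass it with `-p`. The sketch below is one way to do that, reusing the hypothetical `main` binary and Q4_K_M file from the download example above; `-e` tells llama.cpp to turn the `\n` escapes into the real newlines the template expects, and the system/user text is placeholder content.

```bash
# Sketch: fill in the prompt template by hand and pass it to llama.cpp.
SYSTEM="You are a helpful assistant."
USER="Write a haiku about bagels."
PROMPT="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n${SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n${USER}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# -e makes llama.cpp process the \n escapes in the prompt string.
./main -m ./bagel-8b-v1.0-Q4_K_M.gguf -e -p "$PROMPT" -n 128
```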
Downloadable Files
You can choose from a variety of quantized files based on your hardware and performance requirements.

| Property | Details |
| -------- | ------- |
| Model Type | Llamacpp imatrix Quantizations of bagel-8b-v1.0 |
| Training Data | ai2_arc, allenai/ultrafeedback_binarized_cleaned, argilla/distilabel-intel-orca-dpo-pairs, etc. |

The available files are listed in the table below:
| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| bagel-8b-v1.0-Q8_0.gguf | Q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| bagel-8b-v1.0-Q6_K.gguf | Q6_K | 6.59GB | Very high quality, near perfect, recommended. |
| bagel-8b-v1.0-Q5_K_M.gguf | Q5_K_M | 5.73GB | High quality, recommended. |
| bagel-8b-v1.0-Q5_K_S.gguf | Q5_K_S | 5.59GB | High quality, recommended. |
| bagel-8b-v1.0-Q4_K_M.gguf | Q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, recommended. |
| bagel-8b-v1.0-Q4_K_S.gguf | Q4_K_S | 4.69GB | Slightly lower quality with more space savings, recommended. |
| bagel-8b-v1.0-IQ4_NL.gguf | IQ4_NL | 4.67GB | Decent quality, slightly smaller than Q4_K_S with similar performance, recommended. |
| bagel-8b-v1.0-IQ4_XS.gguf | IQ4_XS | 4.44GB | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| bagel-8b-v1.0-Q3_K_L.gguf | Q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| bagel-8b-v1.0-Q3_K_M.gguf | Q3_K_M | 4.01GB | Even lower quality. |
| bagel-8b-v1.0-IQ3_M.gguf | IQ3_M | 3.78GB | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| bagel-8b-v1.0-IQ3_S.gguf | IQ3_S | 3.68GB | Lower quality, new method with decent performance, recommended over Q3_K_S quant, same size with better performance. |
| bagel-8b-v1.0-Q3_K_S.gguf | Q3_K_S | 3.66GB | Low quality, not recommended. |
| bagel-8b-v1.0-IQ3_XS.gguf | IQ3_XS | 3.51GB | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| bagel-8b-v1.0-IQ3_XXS.gguf | IQ3_XXS | 3.27GB | Lower quality, new method with decent performance, comparable to Q3 quants. |
| bagel-8b-v1.0-Q2_K.gguf | Q2_K | 3.17GB | Very low quality but surprisingly usable. |
| bagel-8b-v1.0-IQ2_M.gguf | IQ2_M | 2.94GB | Very low quality, uses SOTA techniques to also be surprisingly usable. |
| bagel-8b-v1.0-IQ2_S.gguf | IQ2_S | 2.75GB | Very low quality, uses SOTA techniques to be usable. |
| bagel-8b-v1.0-IQ2_XS.gguf | IQ2_XS | 2.60GB | Very low quality, uses SOTA techniques to be usable. |
| bagel-8b-v1.0-IQ2_XXS.gguf | IQ2_XXS | 2.39GB | Lower quality, uses SOTA techniques to be usable. |
| bagel-8b-v1.0-IQ1_M.gguf | IQ1_M | 2.16GB | Extremely low quality, not recommended. |
| bagel-8b-v1.0-IQ1_S.gguf | IQ1_S | 2.01GB | Extremely low quality, not recommended. |
Documentation
Which file should I choose?
A great write-up with charts showing the performance of the various quant types is provided by Artefact2 here.
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
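As a quick way to apply that rule of thumb, the sketch below prints your total VRAM and the size of a downloaded quant so you can compare them; it assumes an Nvidia GPU with `nvidia-smi` on the PATH and that the Q4_K_M file is in the current directory.

```bash
# Compare total VRAM against the size of a downloaded quant (sketch, Nvidia-only).
nvidia-smi --query-gpu=memory.total --format=csv,noheader
du -h ./bagel-8b-v1.0-Q4_K_M.gguf
```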
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart: the llama.cpp feature matrix.
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. They are newer and offer better performance for their size.
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
The I-quants are not compatible with Vulkan, which also supports AMD cards, so if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
Support
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
License
This project uses the llama3 license.