Llamacpp imatrix Quantizations of llama-3-cat-8b-instruct-v1
This project provides quantized versions of the llama-3-cat-8b-instruct-v1 model, produced with llama.cpp. A range of quant types is offered, trading file size against output quality, so you can pick the one that fits your hardware and needs.
Quick Start
Prerequisites
Ensure you have huggingface-cli installed. You can install it using the following command:
pip install -U "huggingface_hub[cli]"
Download a Specific File
To download a specific file, use the following command. For example, to download llama-3-cat-8b-instruct-v1-Q4_K_M.gguf:
huggingface-cli download bartowski/llama-3-cat-8b-instruct-v1-GGUF --include "llama-3-cat-8b-instruct-v1-Q4_K_M.gguf" --local-dir ./ --local-dir-use-symlinks False
Download Split Files
If the model is larger than 50GB and split into multiple files, you can download all of them to a local folder using the following command:
huggingface-cli download bartowski/llama-3-cat-8b-instruct-v1-GGUF --include "llama-3-cat-8b-instruct-v1-Q8_0.gguf/*" --local-dir llama-3-cat-8b-instruct-v1-Q8_0 --local-dir-use-symlinks False
Features
- Multiple Quantization Types: Offers a wide range of quantized models, including Q8_0, Q6_K, Q5_K_M, etc., to balance between quality and file size.
- Easy Download: Provides clear instructions on how to download files using huggingface-cli.
- Performance Guidance: Offers guidance on choosing the appropriate quantized file based on available RAM/VRAM and performance requirements.
Installation
The only installation step is to install huggingface-cli using the command:
pip install -U "huggingface_hub[cli]"
Usage Examples
Download a Specific File
huggingface-cli download bartowski/llama-3-cat-8b-instruct-v1-GGUF --include "llama-3-cat-8b-instruct-v1-Q4_K_M.gguf" --local-dir ./ --local-dir-use-symlinks False
Download Split Files
huggingface-cli download bartowski/llama-3-cat-8b-instruct-v1-GGUF --include "llama-3-cat-8b-instruct-v1-Q8_0.gguf/*" --local-dir llama-3-cat-8b-instruct-v1-Q8_0 --local-dir-use-symlinks False
Documentation
Prompt Format
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
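For illustration, here is one way a prompt wrapped in this template could be passed to llama.cpp's main example after downloading a quant. This is a minimal sketch: the binary name, model path, system prompt, and generation settings are assumptions about a local setup, not part of this repository.

```bash
# Sketch only: run a downloaded quant with llama.cpp's main example,
# passing a prompt already wrapped in the template above.
# -e tells main to process the \n escapes in the prompt string.
./main -m ./llama-3-cat-8b-instruct-v1-Q4_K_M.gguf -e -n 256 \
  -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\nHello, who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
```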
File Download Table
| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| llama-3-cat-8b-instruct-v1-Q8_0.gguf | Q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| llama-3-cat-8b-instruct-v1-Q6_K.gguf | Q6_K | 6.59GB | Very high quality, near perfect, recommended. |
| llama-3-cat-8b-instruct-v1-Q5_K_M.gguf | Q5_K_M | 5.73GB | High quality, recommended. |
| llama-3-cat-8b-instruct-v1-Q5_K_S.gguf | Q5_K_S | 5.59GB | High quality, recommended. |
| llama-3-cat-8b-instruct-v1-Q4_K_M.gguf | Q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, recommended. |
| llama-3-cat-8b-instruct-v1-Q4_K_S.gguf | Q4_K_S | 4.69GB | Slightly lower quality with more space savings, recommended. |
| llama-3-cat-8b-instruct-v1-IQ4_NL.gguf | IQ4_NL | 4.67GB | Decent quality, slightly smaller than Q4_K_S with similar performance, recommended. |
| llama-3-cat-8b-instruct-v1-IQ4_XS.gguf | IQ4_XS | 4.44GB | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| llama-3-cat-8b-instruct-v1-Q3_K_L.gguf | Q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| llama-3-cat-8b-instruct-v1-Q3_K_M.gguf | Q3_K_M | 4.01GB | Even lower quality. |
| llama-3-cat-8b-instruct-v1-IQ3_M.gguf | IQ3_M | 3.78GB | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| llama-3-cat-8b-instruct-v1-IQ3_S.gguf | IQ3_S | 3.68GB | Lower quality, new method with decent performance, recommended over Q3_K_S quant, same size with better performance. |
| llama-3-cat-8b-instruct-v1-Q3_K_S.gguf | Q3_K_S | 3.66GB | Low quality, not recommended. |
| llama-3-cat-8b-instruct-v1-IQ3_XS.gguf | IQ3_XS | 3.51GB | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| llama-3-cat-8b-instruct-v1-IQ3_XXS.gguf | IQ3_XXS | 3.27GB | Lower quality, new method with decent performance, comparable to Q3 quants. |
| llama-3-cat-8b-instruct-v1-Q2_K.gguf | Q2_K | 3.17GB | Very low quality but surprisingly usable. |
| llama-3-cat-8b-instruct-v1-IQ2_M.gguf | IQ2_M | 2.94GB | Very low quality, uses SOTA techniques to also be surprisingly usable. |
| llama-3-cat-8b-instruct-v1-IQ2_S.gguf | IQ2_S | 2.75GB | Very low quality, uses SOTA techniques to be usable. |
| llama-3-cat-8b-instruct-v1-IQ2_XS.gguf | IQ2_XS | 2.60GB | Very low quality, uses SOTA techniques to be usable. |
| llama-3-cat-8b-instruct-v1-IQ2_XXS.gguf | IQ2_XXS | 2.39GB | Lower quality, uses SOTA techniques to be usable. |
| llama-3-cat-8b-instruct-v1-IQ1_M.gguf | IQ1_M | 2.16GB | Extremely low quality, not recommended. |
| llama-3-cat-8b-instruct-v1-IQ1_S.gguf | IQ1_S | 2.01GB | Extremely low quality, not recommended. |
Which File to Choose
A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 here.
- Determine Available Resources: First, figure out how much RAM and/or VRAM you have. If you want the model to run as fast as possible, choose a quant with a file size 1-2GB smaller than your GPU's total VRAM; for example, with 8GB of VRAM, Q6_K (6.59GB) leaves roughly 1.4GB of headroom, while Q5_K_M (5.73GB) leaves a bit more. If you want the maximum quality, add your system RAM and GPU's VRAM together and choose a quant 1-2GB smaller than that total. One way to check your total VRAM on Nvidia hardware is shown after this list.
- Choose between 'I-quant' and 'K-quant': If you don't want to think too much, choose a K-quant (e.g., Q5_K_M). If you're aiming for below Q4 and using cuBLAS (Nvidia) or rocBLAS (AMD), consider the I-quants (e.g., IQ3_M). Note that the I-quants are not compatible with Vulkan.
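On Nvidia hardware, one quick way to check the total VRAM figure used in the sizing rule above is nvidia-smi (this assumes the Nvidia driver utilities are installed; the command is not part of this repository):

```bash
# Print each GPU's name and total VRAM; subtract 1-2GB from the reported
# figure and pick a quant from the table whose file size fits underneath.
nvidia-smi --query-gpu=name,memory.total --format=csv
```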
Technical Details
- Quantization Tool: Quantized using llama.cpp release b2854.
- Original Model: https://huggingface.co/TheSkullery/llama-3-cat-8b-instruct-v1
- Quantization Option: All quants are made using the imatrix option with the dataset provided by Kalomaze here. A rough sketch of the workflow is shown below.
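As a rough sketch of how imatrix quants of this kind are typically produced with llama.cpp, the steps below show the general shape of the workflow. The file names and calibration text are placeholders, and the exact commands are an assumption rather than taken from this repository:

```bash
# Sketch only: typical llama.cpp imatrix quantization workflow (circa release b2854).
# Paths, file names, and the calibration text are placeholders.

# 1. Compute an importance matrix from calibration text against the fp16 model.
./imatrix -m llama-3-cat-8b-instruct-v1-f16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize using that importance matrix, e.g. to Q4_K_M.
./quantize --imatrix imatrix.dat \
  llama-3-cat-8b-instruct-v1-f16.gguf \
  llama-3-cat-8b-instruct-v1-Q4_K_M.gguf Q4_K_M
```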
License
The model uses the llama3 license.
Support the Author
If you want to support the author's work, you can visit their ko-fi page: https://ko-fi.com/bartowski