Llamacpp imatrix Quantizations of LLAMA-3_8B_Unaligned_BETA
This project provides quantized versions of the LLAMA-3_8B_Unaligned_BETA model using the llama.cpp library. It aims to offer different quantization types to balance model quality and resource requirements, making it suitable for various hardware configurations.
Quick Start
Prerequisites
First, make sure huggingface-cli is installed:
pip install -U "huggingface_hub[cli]"
Downloading a Specific File
You can target the specific file you want to download:
huggingface-cli download bartowski/LLAMA-3_8B_Unaligned_BETA-GGUF --include "LLAMA-3_8B_Unaligned_BETA-Q4_K_M.gguf" --local-dir ./
Downloading Split Files
If a model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:
huggingface-cli download bartowski/LLAMA-3_8B_Unaligned_BETA-GGUF --include "LLAMA-3_8B_Unaligned_BETA-Q8_0/*" --local-dir ./
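The two commands above differ only in the --include pattern, so they can be scripted. The following is a minimal sketch, not an official helper: `build_download_command` and `REPO_ID` are hypothetical names introduced here, and the only flags used are the ones already shown above (`download`, `--include`, `--local-dir`).

```python
import shlex
from typing import List

# Repository named in the commands above.
REPO_ID = "bartowski/LLAMA-3_8B_Unaligned_BETA-GGUF"


def build_download_command(pattern: str, local_dir: str = "./",
                           repo_id: str = REPO_ID) -> List[str]:
    """Return the huggingface-cli argument list for one --include pattern."""
    return ["huggingface-cli", "download", repo_id,
            "--include", pattern, "--local-dir", local_dir]


# Single-file download (same arguments as the Q4_K_M command above).
single = build_download_command("LLAMA-3_8B_Unaligned_BETA-Q4_K_M.gguf")
# Folder download for a split quant (same arguments as the Q8_0/* command above).
split = build_download_command("LLAMA-3_8B_Unaligned_BETA-Q8_0/*")

# Print shell-quoted versions for copy and paste.
print(shlex.join(single))
print(shlex.join(split))
```

To actually start a download from Python, the returned list could be passed to `subprocess.run(single, check=True)`, assuming huggingface-cli is on the PATH.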
Running the Model
You can run the quantized models in LM Studio.
Features
- Multiple Quantization Types: Offers a wide range of quantization types (e.g., f16, Q8_0, Q6_K_L) to meet different quality and resource requirements.
- Embed/Output Weight Optimization: Some quantizations use Q8_0 for embed and output weights, potentially improving model quality.
- ARM Optimization: Provides specific quantizations optimized for ARM chips.
Installation
Installing huggingface-cli
pip install -U "huggingface_hub[cli]"
Usage Examples
Prompt Format
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
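As an illustration, the template above can be filled in with a small helper. This is a sketch only: `format_chatml` is a hypothetical name, and the trailing newline after the assistant tag is an assumption (the template above simply ends at that tag).

```python
def format_chatml(system_prompt: str, prompt: str) -> str:
    """Fill the prompt format shown above with a system prompt and a user prompt."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        # Assumption: a newline follows the assistant tag so generation starts on a fresh line.
        "<|im_start|>assistant\n"
    )


text = format_chatml("You are a helpful assistant.", "Summarize GGUF in one sentence.")
print(text)
```

Many front ends (LM Studio included) apply the chat template on their own, so manual formatting like this is mainly useful when sending raw text to a completion endpoint.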
Documentation
Model Information
| Property | Details |
|---|---|
| Base Model | SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA |
| Pipeline Tag | text-generation |
| Quantized By | bartowski |
| Quantization Tool | llama.cpp release b3901 |
| Original Model | https://huggingface.co/SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA |
| Calibration Dataset | here |
Available Quantized Files
| Filename | Quant type | File Size | Split | Description |
|---|---|---|---|---|
| LLAMA-3_8B_Unaligned_BETA-f16.gguf | f16 | 16.07GB | false | Full F16 weights. |
| LLAMA-3_8B_Unaligned_BETA-Q8_0.gguf | Q8_0 | 8.54GB | false | Extremely high quality, generally unneeded but max available quant. |
| LLAMA-3_8B_Unaligned_BETA-Q6_K_L.gguf | Q6_K_L | 6.85GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q6_K.gguf | Q6_K | 6.60GB | false | Very high quality, near perfect, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q5_K_L.gguf | Q5_K_L | 6.06GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q5_K_M.gguf | Q5_K_M | 5.73GB | false | High quality, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q5_K_S.gguf | Q5_K_S | 5.60GB | false | High quality, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q4_K_L.gguf | Q4_K_L | 5.31GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q4_K_M.gguf | Q4_K_M | 4.92GB | false | Good quality, default size for most use cases, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q3_K_XL.gguf | Q3_K_XL | 4.78GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| LLAMA-3_8B_Unaligned_BETA-Q4_K_S.gguf | Q4_K_S | 4.69GB | false | Slightly lower quality with more space savings, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q4_0.gguf | Q4_0 | 4.68GB | false | Legacy format, generally not worth using over similarly sized formats. |
| LLAMA-3_8B_Unaligned_BETA-Q4_0_8_8.gguf | Q4_0_8_8 | 4.66GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows. |
| LLAMA-3_8B_Unaligned_BETA-Q4_0_4_8.gguf | Q4_0_4_8 | 4.66GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows. |
| LLAMA-3_8B_Unaligned_BETA-Q4_0_4_4.gguf | Q4_0_4_4 | 4.66GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows. |
| LLAMA-3_8B_Unaligned_BETA-IQ4_XS.gguf | IQ4_XS | 4.45GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| LLAMA-3_8B_Unaligned_BETA-Q3_K_L.gguf | Q3_K_L | 4.32GB | false | Lower quality but usable, good for low RAM availability. |
| LLAMA-3_8B_Unaligned_BETA-Q3_K_M.gguf | Q3_K_M | 4.02GB | false | Low quality. |
| LLAMA-3_8B_Unaligned_BETA-IQ3_M.gguf | IQ3_M | 3.78GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| LLAMA-3_8B_Unaligned_BETA-Q2_K_L.gguf | Q2_K_L | 3.69GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| LLAMA-3_8B_Unaligned_BETA-Q3_K_S.gguf | Q3_K_S | 3.66GB | false | Low quality, not recommended. |
| LLAMA-3_8B_Unaligned_BETA-IQ3_XS.gguf | IQ3_XS | 3.52GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| LLAMA-3_8B_Unaligned_BETA-Q2_K.gguf | Q2_K | 3.18GB | false | Very low quality but surprisingly usable. |
| LLAMA-3_8B_Unaligned_BETA-IQ2_M.gguf | IQ2_M | 2.95GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L etc) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.
Q4_0_X_X
These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).
Which File to Choose
A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 here.
- Determine Model Size: Figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.
- Choose between 'I-quant' and 'K-quant': If you don't want to think too much, grab one of the K-quants (e.g., Q5_K_M). If you want to get more into the weeds, check out the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and running cuBLAS (Nvidia) or rocBLAS (AMD), look towards the I-quants (e.g., IQ3_M). These are newer and offer better performance for their size. Note that I-quants can be used on CPU and Apple Metal but will be slower than their K-quant equivalents, and they are not compatible with Vulkan.
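The sizing rule above (pick a file 1-2GB smaller than the memory you have) can be sketched in a few lines. This is only an illustration: `pick_quant` and `QUANTS` are hypothetical names, the sizes are copied from the file table, and the f16, legacy Q4_0, and ARM-only entries are left out on purpose. The default headroom of 2GB is the upper end of the 1-2GB range mentioned above.

```python
from typing import Optional

# (quant type, file size in GB), copied from the file table.
# f16, legacy Q4_0, and the ARM-only Q4_0_X_X files are intentionally omitted.
QUANTS = [
    ("Q8_0", 8.54), ("Q6_K_L", 6.85), ("Q6_K", 6.60), ("Q5_K_L", 6.06),
    ("Q5_K_M", 5.73), ("Q5_K_S", 5.60), ("Q4_K_L", 5.31), ("Q4_K_M", 4.92),
    ("Q3_K_XL", 4.78), ("Q4_K_S", 4.69), ("IQ4_XS", 4.45), ("Q3_K_L", 4.32),
    ("Q3_K_M", 4.02), ("IQ3_M", 3.78), ("Q2_K_L", 3.69), ("Q3_K_S", 3.66),
    ("IQ3_XS", 3.52), ("Q2_K", 3.18), ("IQ2_M", 2.95),
]


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in memory_gb minus headroom_gb."""
    limit = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANTS if size <= limit]
    if not fitting:
        return None  # nothing fits; free up memory or offload less
    return max(fitting)[1]


# Speed-first: pass GPU VRAM only. Quality-first: pass system RAM + VRAM.
print(pick_quant(8.0))
print(pick_quant(16.0 + 8.0))
```

File size is used here as a stand-in for memory use; context length and runtime overhead also take memory, which is what the headroom is meant to absorb.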
Technical Details
Quantization Process
The quantization is performed using llama.cpp release b3901. The calibration dataset is sourced from here.
ARM Optimization
The Q4_0_X_X quants are optimized for ARM chips. They require specific features such as 'sve' or 'i8mm' support. You can check AArch64 SoC features to determine the best fit for your ARM chip.
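One way to turn the feature requirements above into a choice is sketched below. It assumes a Linux system on AArch64, where /proc/cpuinfo lists CPU flags on "Features" lines; `parse_cpu_features` and `pick_arm_quant` are hypothetical helpers, and checking 'sve' before 'i8mm' is an assumption made for the sketch, not a documented ranking.

```python
from typing import Iterable, Set


def parse_cpu_features(cpuinfo_text: str) -> Set[str]:
    """Collect flags from 'Features' lines in /proc/cpuinfo-style text."""
    features: Set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def pick_arm_quant(features: Iterable[str]) -> str:
    """Map CPU flags to a Q4_0_X_X variant using the requirements listed in the file table."""
    flags = set(features)
    if "sve" in flags:      # Q4_0_8_8 requires 'sve' (checked first by assumption)
        return "Q4_0_8_8"
    if "i8mm" in flags:     # Q4_0_4_8 requires 'i8mm'
        return "Q4_0_4_8"
    return "Q4_0_4_4"       # described above as working on all ARM chips


# Example with made-up cpuinfo text; on a real device, read /proc/cpuinfo instead.
sample = "processor\t: 0\nFeatures\t: fp asimd aes i8mm\n"
print(pick_arm_quant(parse_cpu_features(sample)))
```

On a real device the text would come from `open("/proc/cpuinfo").read()`; the speed comparisons on the original pull request remain the better guide when more than one variant is supported.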
License
No license information provided in the original document.
Credits
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
Usage Tip
If you use the models with embed/output weights quantized to Q8_0, please share your findings in the comments. This will help determine if these quantizations are actually useful.
Important Note
The Q4_0_X_X quants are only for ARM chips and not for Metal (Apple) offloading. Also, the I-quants are not compatible with Vulkan. Double-check your hardware and software configuration before using these models.