🚀 Llamacpp imatrix Quantizations of Phi-3.5-mini-instruct_Uncensored
This project provides llama.cpp imatrix quantizations of the Phi-3.5-mini-instruct_Uncensored model. It offers a range of quantization types that trade off model quality against resource usage.
🚀 Quick Start
Prerequisites
- Ensure you have `huggingface-cli` installed. You can install it using the following command:
pip install -U "huggingface_hub[cli]"
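To confirm the CLI is available after installing, you can run its built-in help (a quick sanity check, not part of the original instructions):

```bash
# Should list the available huggingface-cli commands if the install succeeded.
huggingface-cli --help
```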
Download a Model
You can download a specific model file using `huggingface-cli`. For example, to download the `Phi-3.5-mini-instruct_Uncensored-Q4_K_M.gguf` file:
huggingface-cli download bartowski/Phi-3.5-mini-instruct_Uncensored-GGUF --include "Phi-3.5-mini-instruct_Uncensored-Q4_K_M.gguf" --local-dir ./
If the model is split into multiple files (models larger than 50GB), you can download all the relevant files using:
huggingface-cli download bartowski/Phi-3.5-mini-instruct_Uncensored-GGUF --include "Phi-3.5-mini-instruct_Uncensored-Q8_0/*" --local-dir ./
Run the Model
You can run the quantized models in LM Studio.
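Because these are llama.cpp quantizations, you can also run them directly with a local llama.cpp build instead of LM Studio. A minimal sketch, assuming a recent llama.cpp release that provides the `llama-server` binary (the context size and port below are illustrative values, not taken from this card):

```bash
# Sketch: serve the downloaded GGUF locally with llama.cpp's OpenAI-compatible server.
# -m      path to the quantized model file
# -c      context size in tokens (illustrative value)
# --port  local port to listen on (illustrative value)
./llama-server -m ./Phi-3.5-mini-instruct_Uncensored-Q4_K_M.gguf -c 4096 --port 8080
```

Once the server is up, any OpenAI-compatible client pointed at `http://localhost:8080` can query the model.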
✨ Features
- Multiple Quantization Types: Offers a wide range of quantization types (f16, Q8_0, Q6_K_L, and more) to meet different resource and quality requirements.
- Embed/Output Weights Optimization: Some quantizations use Q8_0 for embeddings and output weights, potentially improving model quality.
📦 Installation
Install huggingface-cli
pip install -U "huggingface_hub[cli]"
💻 Usage Examples
Prompt Format
<s><|system|> {system_prompt}<|end|><|user|> {prompt}<|end|><|assistant|><|end|>
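As a minimal sketch of using this template from the command line, assuming a llama.cpp build that provides `llama-cli` (the system prompt, user prompt, and token limit below are illustrative; the prompt is left open after `<|assistant|>` so the model generates the assistant turn):

```bash
# Sketch: substitute {system_prompt} and {prompt} into the template and generate up to 256 tokens.
./llama-cli -m ./Phi-3.5-mini-instruct_Uncensored-Q4_K_M.gguf -n 256 \
  -p "<s><|system|> You are a helpful assistant.<|end|><|user|> Summarize what a GGUF file is.<|end|><|assistant|>"
```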
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Base Model | SicariusSicariiStuff/Phi-3.5-mini-instruct_Uncensored |
| License | apache-2.0 |
| Pipeline Tag | text-generation |
| Quantized By | bartowski |
Model Download Table
| Filename | Quant type | File Size | Split | Description |
|----------|------------|-----------|-------|-------------|
| Phi-3.5-mini-instruct_Uncensored-f16.gguf | f16 | 7.64GB | false | Full F16 weights. |
| Phi-3.5-mini-instruct_Uncensored-Q8_0.gguf | Q8_0 | 4.06GB | false | Extremely high quality, generally unneeded but max available quant. |
| Phi-3.5-mini-instruct_Uncensored-Q6_K_L.gguf | Q6_K_L | 3.18GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q6_K.gguf | Q6_K | 3.14GB | false | Very high quality, near perfect, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q5_K_L.gguf | Q5_K_L | 2.88GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q5_K_M.gguf | Q5_K_M | 2.82GB | false | High quality, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q5_K_S.gguf | Q5_K_S | 2.64GB | false | High quality, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q4_K_L.gguf | Q4_K_L | 2.47GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q4_K_M.gguf | Q4_K_M | 2.39GB | false | Good quality, default size for most use cases, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q4_K_S.gguf | Q4_K_S | 2.19GB | false | Slightly lower quality with more space savings, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q3_K_XL.gguf | Q3_K_XL | 2.17GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Phi-3.5-mini-instruct_Uncensored-Q3_K_L.gguf | Q3_K_L | 2.09GB | false | Lower quality but usable, good for low RAM availability. |
| Phi-3.5-mini-instruct_Uncensored-IQ4_XS.gguf | IQ4_XS | 2.06GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Phi-3.5-mini-instruct_Uncensored-Q3_K_M.gguf | Q3_K_M | 1.96GB | false | Low quality. |
| Phi-3.5-mini-instruct_Uncensored-IQ3_M.gguf | IQ3_M | 1.86GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Phi-3.5-mini-instruct_Uncensored-Q3_K_S.gguf | Q3_K_S | 1.68GB | false | Low quality, not recommended. |
| Phi-3.5-mini-instruct_Uncensored-IQ3_XS.gguf | IQ3_XS | 1.63GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| Phi-3.5-mini-instruct_Uncensored-Q2_K_L.gguf | Q2_K_L | 1.51GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Phi-3.5-mini-instruct_Uncensored-Q2_K.gguf | Q2_K | 1.42GB | false | Very low quality but surprisingly usable. |
| Phi-3.5-mini-instruct_Uncensored-IQ2_M.gguf | IQ2_M | 1.32GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Embed/Output Weights
Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method, but with the embedding and output weights quantized to Q8_0 instead of their usual default. Whether this actually improves quality is still debated; if you use these models, please comment with your findings.
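For reference, this kind of variant can be produced with llama.cpp's `llama-quantize` tool, which accepts per-tensor type overrides for the token-embedding and output tensors. A rough sketch, assuming a llama.cpp build that exposes the `--token-embedding-type` and `--output-tensor-type` options (file names here are illustrative):

```bash
# Sketch: quantize to Q4_K_M while keeping embeddings and output weights at Q8_0,
# roughly how the *_L / *_XL variants above differ from their base quants.
./llama-quantize --imatrix imatrix.dat \
  --token-embedding-type q8_0 --output-tensor-type q8_0 \
  Phi-3.5-mini-instruct_Uncensored-f16.gguf Phi-3.5-mini-instruct_Uncensored-Q4_K_L.gguf Q4_K_M
```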
Model Selection Guide
A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 here.
To choose a model, first determine how much RAM and/or VRAM you have. If you want the model to run as fast as possible, choose a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the maximum quality, add your system RAM and your GPU's VRAM together and choose a quant 1-2GB smaller than that total. For example, on a GPU with 4GB of VRAM, a file of roughly 2-3GB (such as Q5_K_M at 2.82GB or Q4_K_M at 2.39GB) fits comfortably on the GPU.
You also need to decide between 'I-quants' (e.g., IQ3_M) and 'K-quants' (e.g., Q5_K_M). If you don't want to think too much, choose a K-quant. If you're aiming for below Q4 and running cuBLAS (Nvidia) or rocBLAS (AMD), consider I-quants, which are newer and offer better performance for their size. Note that I-quants are not compatible with Vulkan.
🔧 Technical Details
- Quantization Method: Uses llama.cpp release b3600 for quantization.
- Calibration Dataset: All quants are made using the imatrix option with the dataset from here.
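A rough sketch of that workflow, assuming a llama.cpp build (around the cited b3600 release) that provides the `llama-imatrix` tool; the calibration file name below is illustrative:

```bash
# Sketch: compute an importance matrix from a calibration text file, then
# pass it to the quantizer so low-bit quants preserve the most important weights.
./llama-imatrix -m Phi-3.5-mini-instruct_Uncensored-f16.gguf -f calibration_data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat Phi-3.5-mini-instruct_Uncensored-f16.gguf Phi-3.5-mini-instruct_Uncensored-Q4_K_M.gguf Q4_K_M
```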
📄 License
This project is licensed under the apache-2.0 license.
👨‍💻 Credits
- Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.
- Thank you ZeroWw for the inspiration to experiment with embed/output.
💡 Usage Tip
If you want to support the developer's work, visit the ko-fi page: https://ko-fi.com/bartowski