# 🚀 Llamacpp imatrix Quantizations of NeuralDaredevil-8B-abliterated
This project provides Llama.cpp imatrix quantizations of the NeuralDaredevil-8B-abliterated model, offering various quantization options for different performance and quality requirements.
## 📚 Documentation

### Model Information

| Property | Details |
| --- | --- |
| Model Type | Llamacpp imatrix Quantizations of NeuralDaredevil-8B-abliterated |
| Training Data | mlabonne/orpo-dpo-mix-40k |
### Model Performance
The model has been evaluated on several benchmarks. Here are the results:
- AI2 Reasoning Challenge (25-Shot): Normalized accuracy of 69.28%
- HellaSwag (10-Shot): Normalized accuracy of 85.05%
- MMLU (5-Shot): Accuracy of 69.1%
- TruthfulQA (0-shot): mc2 score of 60.0
- Winogrande (5-shot): Accuracy of 78.69%
- GSM8k (5-shot): Accuracy of 71.8%
You can find more details on the Open LLM Leaderboard.
### Quantization Details
The quantizations are created using llama.cpp release b3086. All quants are made using the imatrix option with a dataset from here.
The original model can be found here.
### Prompt Format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
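If you're scripting against this template, a minimal Python sketch is shown below; the `build_prompt` helper is hypothetical (not part of llama.cpp) and simply fills in the placeholders above:

```python
# Hypothetical helper: fills the Llama 3 prompt template shown above.
# The "\n\n" after each <|end_header_id|> mirrors the blank lines in the template.
def build_prompt(system_prompt: str, prompt: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_prompt("You are a helpful assistant.", "Explain GGUF in one sentence."))
```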
### Download Options
You can download a specific file from the following table:
| Filename | Quant type | File Size | Description |
| --- | --- | --- | --- |
| NeuralDaredevil-8B-abliterated-Q8_0.gguf | Q8_0 | 8.54GB | Extremely high quality, generally unneeded but max available quant. |
| NeuralDaredevil-8B-abliterated-Q6_K.gguf | Q6_K | 6.59GB | Very high quality, near perfect, recommended. |
| NeuralDaredevil-8B-abliterated-Q5_K_M.gguf | Q5_K_M | 5.73GB | High quality, recommended. |
| NeuralDaredevil-8B-abliterated-Q5_K_S.gguf | Q5_K_S | 5.59GB | High quality, recommended. |
| NeuralDaredevil-8B-abliterated-Q4_K_M.gguf | Q4_K_M | 4.92GB | Good quality, uses about 4.83 bits per weight, recommended. |
| NeuralDaredevil-8B-abliterated-Q4_K_S.gguf | Q4_K_S | 4.69GB | Slightly lower quality with more space savings, recommended. |
| NeuralDaredevil-8B-abliterated-IQ4_XS.gguf | IQ4_XS | 4.44GB | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| NeuralDaredevil-8B-abliterated-Q3_K_L.gguf | Q3_K_L | 4.32GB | Lower quality but usable, good for low RAM availability. |
| NeuralDaredevil-8B-abliterated-Q3_K_M.gguf | Q3_K_M | 4.01GB | Even lower quality. |
| NeuralDaredevil-8B-abliterated-IQ3_M.gguf | IQ3_M | 3.78GB | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| NeuralDaredevil-8B-abliterated-Q3_K_S.gguf | Q3_K_S | 3.66GB | Low quality, not recommended. |
| NeuralDaredevil-8B-abliterated-IQ3_XS.gguf | IQ3_XS | 3.51GB | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| NeuralDaredevil-8B-abliterated-IQ3_XXS.gguf | IQ3_XXS | 3.27GB | Lower quality, new method with decent performance, comparable to Q3 quants. |
| NeuralDaredevil-8B-abliterated-Q2_K.gguf | Q2_K | 3.17GB | Very low quality but surprisingly usable. |
| NeuralDaredevil-8B-abliterated-IQ2_M.gguf | IQ2_M | 2.94GB | Very low quality, uses SOTA techniques to also be surprisingly usable. |
| NeuralDaredevil-8B-abliterated-IQ2_S.gguf | IQ2_S | 2.75GB | Very low quality, uses SOTA techniques to be usable. |
| NeuralDaredevil-8B-abliterated-IQ2_XS.gguf | IQ2_XS | 2.60GB | Very low quality, uses SOTA techniques to be usable. |
### Downloading using huggingface-cli

#### Installation

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```
#### Download a Specific File

You can target the specific file you want:

```
huggingface-cli download bartowski/NeuralDaredevil-8B-abliterated-GGUF --include "NeuralDaredevil-8B-abliterated-Q4_K_M.gguf" --local-dir ./
```
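If you'd rather download from Python than the CLI, `huggingface_hub` exposes the same functionality; a minimal sketch, with the repo and filename taken from the command above:

```python
from huggingface_hub import hf_hub_download

# Downloads a single quant file into the current directory.
hf_hub_download(
    repo_id="bartowski/NeuralDaredevil-8B-abliterated-GGUF",
    filename="NeuralDaredevil-8B-abliterated-Q4_K_M.gguf",
    local_dir=".",
)
```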
#### Download Split Files

If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:

```
huggingface-cli download bartowski/NeuralDaredevil-8B-abliterated-GGUF --include "NeuralDaredevil-8B-abliterated-Q8_0.gguf/*" --local-dir NeuralDaredevil-8B-abliterated-Q8_0
```

You can either specify a new `--local-dir` (e.g., `NeuralDaredevil-8B-abliterated-Q8_0`) or download them all in place (`./`).
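The Python equivalent of the `--include` pattern is `snapshot_download` with `allow_patterns`; a minimal sketch mirroring the command above:

```python
from huggingface_hub import snapshot_download

# Downloads every shard matching the pattern into the target folder.
snapshot_download(
    repo_id="bartowski/NeuralDaredevil-8B-abliterated-GGUF",
    allow_patterns=["NeuralDaredevil-8B-abliterated-Q8_0.gguf/*"],
    local_dir="NeuralDaredevil-8B-abliterated-Q8_0",
)
```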
### Which File Should I Choose?

Artefact2 provides a great write-up here, with charts comparing the performance of the various quant types.
#### Determine Model Size
The first thing to figure out is how big a model you can run. You'll need to determine how much RAM and/or VRAM you have.
- Fastest Performance: If you want your model running as fast as possible, aim to fit the whole thing on your GPU's VRAM. Choose a quant with a file size 1-2GB smaller than your GPU's total VRAM.
- Maximum Quality: If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then select a quant with a file size 1-2GB smaller than that total (a sketch applying this rule follows this list).
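As an illustration of the sizing rule, the sketch below picks the largest quant from the table above that fits a given memory budget with ~1.5GB of headroom; the helper and its name are hypothetical, not a tool shipped with this repo:

```python
# File sizes (GB) copied from the download table above.
QUANT_SIZES_GB = {
    "Q8_0": 8.54, "Q6_K": 6.59, "Q5_K_M": 5.73, "Q5_K_S": 5.59,
    "Q4_K_M": 4.92, "Q4_K_S": 4.69, "IQ4_XS": 4.44, "Q3_K_L": 4.32,
    "Q3_K_M": 4.01, "IQ3_M": 3.78, "Q3_K_S": 3.66, "IQ3_XS": 3.51,
    "IQ3_XXS": 3.27, "Q2_K": 3.17, "IQ2_M": 2.94, "IQ2_S": 2.75,
    "IQ2_XS": 2.60,
}

def largest_fitting_quant(memory_gb: float, headroom_gb: float = 1.5) -> str | None:
    """Return the largest quant whose file fits within memory_gb minus headroom."""
    budget = memory_gb - headroom_gb
    fitting = {q: size for q, size in QUANT_SIZES_GB.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None

# Example: an 8GB GPU leaves a 6.5GB budget, so Q5_K_M (5.73GB) is the pick.
print(largest_fitting_quant(8.0))
```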
#### Choose between 'I-quant' and 'K-quant'
- K-quants: If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M.
- I-quants: If you're aiming for below Q4 and running cuBLAS (Nvidia) or rocBLAS (AMD), you should consider the I-quants. These are in the format IQX_X, like IQ3_M. They are newer and offer better performance for their size.
The I-quants can also be used on CPU and Apple Metal, but they will be slower than their K-quant equivalents.
## ⚠️ Important Note

The I-quants are not compatible with Vulkan, which also supports AMD cards. So, if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines offer specific builds for ROCm.
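To summarize the backend guidance above as a rough, hedged rule of thumb (the function and its name are illustrative, not official llama.cpp logic):

```python
# Illustrative decision rule distilled from the guidance above.
def recommend_quant_family(backend: str, below_q4: bool) -> str:
    backend = backend.lower()
    if backend == "vulkan":
        return "K-quant"  # I-quants are not compatible with Vulkan
    if below_q4 and backend in ("cublas", "rocblas"):
        return "I-quant"  # better quality per byte below Q4 on these backends
    # I-quants also run on CPU and Apple Metal, but slower than K-quants.
    return "K-quant"

print(recommend_quant_family("rocblas", below_q4=True))  # -> "I-quant"
```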
## 💡 Usage Tip
You can check out the llama.cpp feature matrix for more detailed information.
## 📄 License

The license for this project is listed as `other`.
If you want to support the author's work, visit the ko-fi page.