LLAMA-3_8B_Unaligned_BETA-GGUF Open-source Model - Multi-quantization Versions for Different Hardware Requirements

LLAMA 3 8B Unaligned BETA GGUF

Quantized by bartowski (base model by SicariusSicariiStuff)

An 8B-parameter unaligned beta model based on the LLaMA-3 architecture, offered in multiple quantization versions to suit different hardware.

Large Language Model #Multiple quantization versions #Lightweight deployment #ARM optimization

Downloads 542

Release Time : 10/12/2024

Model Overview

This is an 8B-parameter unaligned beta version of the LLaMA-3 model, quantized with several methods so it can run on different hardware configurations. It is intended for local deployment and experimentation.

Model Features

Multiple quantization options

Offers 24 quantization versions, from f16 (16.07GB) down to IQ2_M (2.95GB), covering needs that range from maximum quality to minimal resource use

imatrix quantization technology

Uses llama.cpp's imatrix option for quantization to improve post-quantization model quality

ARM-optimized version

Provides ARM-optimized variants (Q4_0_4_4, Q4_0_4_8, Q4_0_8_8) that give a substantial inference speedup on ARM chips; they are not intended for Apple Metal offloading

Embedding/output weight optimization

Certain quantization versions (Q3_K_XL, Q4_K_L, etc.) use Q8_0 quantization for embedding and output weights, potentially enhancing model quality

Model Capabilities

Text generation

Dialogue systems

Content creation

Code generation

Use Cases

Local AI applications

Personal AI assistant

Run a personal AI assistant on local devices for privacy protection

Can operate smoothly on consumer-grade hardware

Content creation tool

Used for generating creative writing, stories, and poetry

Provides creative text output

Development & research

Model quantization research

Study the impact of different quantization methods on model performance

Offers multiple quantization versions for comparison

Edge AI experiments

Deploy large language models on resource-constrained devices

The smallest quantized version (IQ2_M, 2.95GB) can run on low-end devices

🚀 Llamacpp imatrix Quantizations of LLAMA-3_8B_Unaligned_BETA

This project provides quantized versions of the LLAMA-3_8B_Unaligned_BETA model using the llama.cpp library. It aims to offer different quantization types to balance model quality and resource requirements, making it suitable for various hardware configurations.

🚀 Quick Start

Prerequisites

First, ensure you have the huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Downloading a Specific File

You can target the specific file you want to download:

huggingface-cli download bartowski/LLAMA-3_8B_Unaligned_BETA-GGUF --include "LLAMA-3_8B_Unaligned_BETA-Q4_K_M.gguf" --local-dir ./

Downloading Split Files

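Models larger than 50GB are split into multiple files stored in a folder. None of the files listed for this model are split, but if one were, you would download the whole folder to a local directory (the Q8_0 path below only illustrates the pattern):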

huggingface-cli download bartowski/LLAMA-3_8B_Unaligned_BETA-GGUF --include "LLAMA-3_8B_Unaligned_BETA-Q8_0/*" --local-dir ./
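
The two download commands above can also be issued from Python through the `huggingface_hub` package that the CLI ships with. The following is a minimal sketch, not part of the original card: the helper names `gguf_filename`, `split_pattern`, and `download_quant` are hypothetical, and the file naming is inferred from the commands shown above.

```python
REPO_ID = "bartowski/LLAMA-3_8B_Unaligned_BETA-GGUF"
MODEL_NAME = "LLAMA-3_8B_Unaligned_BETA"


def gguf_filename(quant: str) -> str:
    """File name of a single-file quant, e.g. quant='Q4_K_M'."""
    return f"{MODEL_NAME}-{quant}.gguf"


def split_pattern(quant: str) -> str:
    """Glob matching a quant stored as a folder of split files."""
    return f"{MODEL_NAME}-{quant}/*"


def download_quant(quant: str, local_dir: str = "./", split: bool = False) -> str:
    """Download one quant and return the local path reported by huggingface_hub."""
    # Imported lazily so the naming helpers work without the package installed.
    from huggingface_hub import hf_hub_download, snapshot_download

    if split:
        # Equivalent of: --include "<model>-<quant>/*"
        return snapshot_download(
            repo_id=REPO_ID,
            allow_patterns=[split_pattern(quant)],
            local_dir=local_dir,
        )
    # Equivalent of: --include "<model>-<quant>.gguf"
    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

Calling `download_quant("Q4_K_M")` would fetch the same file as the first CLI command; it needs network access and enough free disk space for the chosen quant.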

Running the Model

You can run the quantized models in LM Studio.

✨ Features

Multiple Quantization Types: Offers a wide range of quantization types (e.g., f16, Q8_0, Q6_K_L, etc.) to meet different quality and resource requirements.
Embed/Output Weight Optimization: Some quantizations use Q8_0 for embed and output weights, potentially improving model quality.
ARM Optimization: Provides specific quantizations optimized for ARM chips.

💻 Usage Examples

Prompt Format

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
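
Filling in this template by hand is error-prone, so a small helper can assemble it. This is a sketch only: the function name `build_prompt` is hypothetical, and the trailing newline after `assistant` is an assumption based on how the template above is laid out.

```python
def build_prompt(prompt: str, system_prompt: str = "") -> str:
    """Assemble a prompt in the format shown above.

    The string ends right after the assistant header so the model
    continues with the assistant's reply.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"  # assumed trailing newline
    )


if __name__ == "__main__":
    print(build_prompt("Write a haiku about autumn.", "You are a helpful assistant."))
```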

📚 Documentation

Model Information

Property	Details
Base Model	SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA
Pipeline Tag	text-generation
Quantized By	bartowski
Quantization Tool	llama.cpp release b3901
Original Model	https://huggingface.co/SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA
Calibration Dataset	here

Available Quantized Files

Filename	Quant type	File Size	Split	Description
LLAMA-3_8B_Unaligned_BETA-f16.gguf	f16	16.07GB	false	Full F16 weights.
LLAMA-3_8B_Unaligned_BETA-Q8_0.gguf	Q8_0	8.54GB	false	Extremely high quality, generally unneeded but max available quant.
LLAMA-3_8B_Unaligned_BETA-Q6_K_L.gguf	Q6_K_L	6.85GB	false	Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended.
LLAMA-3_8B_Unaligned_BETA-Q6_K.gguf	Q6_K	6.60GB	false	Very high quality, near perfect, recommended.
LLAMA-3_8B_Unaligned_BETA-Q5_K_L.gguf	Q5_K_L	6.06GB	false	Uses Q8_0 for embed and output weights. High quality, recommended.
LLAMA-3_8B_Unaligned_BETA-Q5_K_M.gguf	Q5_K_M	5.73GB	false	High quality, recommended.
LLAMA-3_8B_Unaligned_BETA-Q5_K_S.gguf	Q5_K_S	5.60GB	false	High quality, recommended.
LLAMA-3_8B_Unaligned_BETA-Q4_K_L.gguf	Q4_K_L	5.31GB	false	Uses Q8_0 for embed and output weights. Good quality, recommended.
LLAMA-3_8B_Unaligned_BETA-Q4_K_M.gguf	Q4_K_M	4.92GB	false	Good quality, default size for most use cases, recommended.
LLAMA-3_8B_Unaligned_BETA-Q3_K_XL.gguf	Q3_K_XL	4.78GB	false	Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability.
LLAMA-3_8B_Unaligned_BETA-Q4_K_S.gguf	Q4_K_S	4.69GB	false	Slightly lower quality with more space savings, recommended.
LLAMA-3_8B_Unaligned_BETA-Q4_0.gguf	Q4_0	4.68GB	false	Legacy format, generally not worth using over similarly sized formats.
LLAMA-3_8B_Unaligned_BETA-Q4_0_8_8.gguf	Q4_0_8_8	4.66GB	false	Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows.
LLAMA-3_8B_Unaligned_BETA-Q4_0_4_8.gguf	Q4_0_4_8	4.66GB	false	Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows.
LLAMA-3_8B_Unaligned_BETA-Q4_0_4_4.gguf	Q4_0_4_4	4.66GB	false	Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows.
LLAMA-3_8B_Unaligned_BETA-IQ4_XS.gguf	IQ4_XS	4.45GB	false	Decent quality, smaller than Q4_K_S with similar performance, recommended.
LLAMA-3_8B_Unaligned_BETA-Q3_K_L.gguf	Q3_K_L	4.32GB	false	Lower quality but usable, good for low RAM availability.
LLAMA-3_8B_Unaligned_BETA-Q3_K_M.gguf	Q3_K_M	4.02GB	false	Low quality.
LLAMA-3_8B_Unaligned_BETA-IQ3_M.gguf	IQ3_M	3.78GB	false	Medium-low quality, new method with decent performance comparable to Q3_K_M.
LLAMA-3_8B_Unaligned_BETA-Q2_K_L.gguf	Q2_K_L	3.69GB	false	Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable.
LLAMA-3_8B_Unaligned_BETA-Q3_K_S.gguf	Q3_K_S	3.66GB	false	Low quality, not recommended.
LLAMA-3_8B_Unaligned_BETA-IQ3_XS.gguf	IQ3_XS	3.52GB	false	Lower quality, new method with decent performance, slightly better than Q3_K_S.
LLAMA-3_8B_Unaligned_BETA-Q2_K.gguf	Q2_K	3.18GB	false	Very low quality but surprisingly usable.
LLAMA-3_8B_Unaligned_BETA-IQ2_M.gguf	IQ2_M	2.95GB	false	Relatively low quality, uses SOTA techniques to be surprisingly usable.
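
To compare the file sizes in the table on a common scale, they can be converted to an approximate average number of bits per weight. The sketch below assumes that the f16 file stores every weight in 16 bits and that metadata overhead is negligible, so the ratio of a quant's size to the f16 size, multiplied by 16, estimates its bits per weight. The function name `bits_per_weight` is hypothetical and the result is a whole-file average, not an exact figure.

```python
# File sizes in GB, copied from the table above.
F16_SIZE_GB = 16.07
SIZES_GB = {
    "Q8_0": 8.54,
    "Q6_K": 6.60,
    "Q5_K_M": 5.73,
    "Q4_K_M": 4.92,
    "IQ4_XS": 4.45,
    "Q3_K_M": 4.02,
    "Q2_K": 3.18,
    "IQ2_M": 2.95,
}


def bits_per_weight(size_gb: float, f16_size_gb: float = F16_SIZE_GB) -> float:
    """Approximate average bits per weight, assuming f16 uses exactly 16."""
    return 16.0 * size_gb / f16_size_gb


if __name__ == "__main__":
    for name, size in SIZES_GB.items():
        print(f"{name:8s} {size:5.2f} GB  ~{bits_per_weight(size):.2f} bits/weight")
```

Under these assumptions Q8_0 comes out at roughly 8.5 bits per weight and Q4_K_M at roughly 4.9, which is why the files are larger than their nominal bit width alone would suggest.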

Embed/Output Weights

Some of these quants (Q3_K_XL, Q4_K_L, etc.) use the standard quantization method but with the embedding and output weights quantized to Q8_0 instead of their usual default. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.

Q4_0_X_X

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).

Which File to Choose

A great write up with charts showing various performances is provided by Artefact2 here.

Determine Model Size: Figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.
Choose between 'I-quant' and 'K-quant': If you don't want to think too much, grab one of the K-quants (e.g., Q5_K_M). If you want to get more into the weeds, check out the llama.cpp feature matrix. Generally, if you're aiming for below Q4 and running cuBLAS (Nvidia) or rocBLAS (AMD), look towards the I-quants (e.g., IQ3_M). These are newer and offer better performance for their size. Note that I-quants can be used on CPU and Apple Metal but will be slower than their K-quant equivalents, and they are not compatible with Vulkan.
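
The sizing rule above (leave 1-2GB of headroom below VRAM for speed, or below RAM plus VRAM for maximum quality) can be sketched as a small selection function. This is an illustration only: the function name `pick_quant` and its parameters are hypothetical, the sizes are copied from the file table, and the ARM-only Q4_0_X_X files are left out because they target specific chips.

```python
from typing import Optional

# File sizes in GB from the table of available quantized files.
QUANT_SIZES_GB = {
    "f16": 16.07, "Q8_0": 8.54, "Q6_K_L": 6.85, "Q6_K": 6.60,
    "Q5_K_L": 6.06, "Q5_K_M": 5.73, "Q5_K_S": 5.60, "Q4_K_L": 5.31,
    "Q4_K_M": 4.92, "Q3_K_XL": 4.78, "Q4_K_S": 4.69, "Q4_0": 4.68,
    "IQ4_XS": 4.45, "Q3_K_L": 4.32, "Q3_K_M": 4.02, "IQ3_M": 3.78,
    "Q2_K_L": 3.69, "Q3_K_S": 3.66, "IQ3_XS": 3.52, "Q2_K": 3.18,
    "IQ2_M": 2.95,
}


def pick_quant(vram_gb: float, ram_gb: float = 0.0,
               headroom_gb: float = 2.0, max_speed: bool = True) -> Optional[str]:
    """Return the largest quant that fits the memory budget, or None.

    max_speed=True  -> budget is VRAM only (model fully on the GPU).
    max_speed=False -> budget is system RAM plus VRAM (maximum quality).
    """
    total = vram_gb if max_speed else vram_gb + ram_gb
    budget = total - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    if not fitting:
        return None
    return max(fitting)[1]  # largest file that still fits


if __name__ == "__main__":
    print(pick_quant(vram_gb=8.0))                                  # GPU-only budget
    print(pick_quant(vram_gb=8.0, ram_gb=16.0, max_speed=False))    # RAM + VRAM budget
```

File size is used here as a stand-in for quality, which is only a rough heuristic; the I-quant versus K-quant considerations described above still apply when two candidates are close in size.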

🔧 Technical Details

Quantization Process

The quantization is performed using llama.cpp release b3901. The calibration dataset is sourced from here.

ARM Optimization

The Q4_0_X_X quants are optimized for ARM chips. They require specific features such as 'sve' or 'i8mm' support. You can check AArch64 SoC features to determine the best fit for your ARM chip.
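
On a Linux ARM device, the CPU features are usually listed on the `Features` line of `/proc/cpuinfo`, which makes the check easy to script. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the feature requirements come from the file table (Q4_0_8_8 needs 'sve', Q4_0_4_8 needs 'i8mm', Q4_0_4_4 works on all ARM chips), and preferring Q4_0_8_8 over Q4_0_4_8 when both are available is an assumption, not a documented recommendation.

```python
def cpu_features(cpuinfo_text: str) -> set:
    """Extract the flag names from the first 'Features' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            return set(value.split())
    return set()


def pick_arm_quant(features: set) -> str:
    """Map CPU features to a Q4_0_X_X variant (preference order is assumed)."""
    if "sve" in features:
        return "Q4_0_8_8"
    if "i8mm" in features:
        return "Q4_0_4_8"
    return "Q4_0_4_4"  # listed as working on all ARM chips


if __name__ == "__main__":
    # Assumes a Linux system; other platforms expose CPU features differently.
    with open("/proc/cpuinfo", encoding="utf-8") as fh:
        print(pick_arm_quant(cpu_features(fh.read())))
```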

📄 License

No license information provided in the original document.

Credits

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

💡 Usage Tip

If you use the models with embed/output weights quantized to Q8_0, please share your findings in the comments. This will help determine if these quantizations are actually useful.

⚠️ Important Note

The Q4_0_X_X quants are only for ARM chips and not for Metal (Apple) offloading. Also, the I-quants are not compatible with Vulkan. Make sure to double-check your hardware and software configurations before using these models.
