# Athena v4 - GPTQ

This repository provides GPTQ model files for Athena v4, offering multiple quantisation options for different hardware and requirements.
## Quick Start

To start using the Athena v4 - GPTQ model quickly, follow the download and usage instructions below.
## Features

- Multiple Quantisation Options: multiple GPTQ parameter permutations are provided, allowing you to choose the best one for your hardware and requirements.
- Multiple Download Methods: each quantisation is stored in its own branch, and branches can be downloaded through text-generation-webui, the command line, or git.
- Easy to Use in text-generation-webui: you can easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
- Serving with TGI: the model can be served from Text Generation Inference (TGI).
## Installation

### In text-generation-webui

To download from the `main` branch, enter `TheBloke/Athena-v4-GPTQ` in the "Download model" box.

To download from another branch, add `:branchname` to the end of the download name, e.g. `TheBloke/Athena-v4-GPTQ:gptq-4bit-32g-actorder_True`.
### From the command line

I recommend using the `huggingface-hub` Python library:

```shell
pip3 install huggingface-hub
```

To download the `main` branch to a folder called `Athena-v4-GPTQ`:

```shell
mkdir Athena-v4-GPTQ
huggingface-cli download TheBloke/Athena-v4-GPTQ --local-dir Athena-v4-GPTQ --local-dir-use-symlinks False
```

To download from a different branch, add the `--revision` parameter:

```shell
mkdir Athena-v4-GPTQ
huggingface-cli download TheBloke/Athena-v4-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir Athena-v4-GPTQ --local-dir-use-symlinks False
```
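The same downloads can also be scripted from Python with the `huggingface_hub` library installed above. A minimal sketch using `snapshot_download` (the local folder name is just an example):

```python
from huggingface_hub import snapshot_download

# Download the gptq-4bit-32g-actorder_True branch to a local folder;
# omit `revision` to fetch the main branch instead.
snapshot_download(
    repo_id="TheBloke/Athena-v4-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    local_dir="Athena-v4-GPTQ",
    local_dir_use_symlinks=False,
)
```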
### With git (not recommended)

To clone a specific branch with `git`, use a command like this:

```shell
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Athena-v4-GPTQ
```

Git is not recommended for HF repos: it is slower than `huggingface-hub`, and with Git LFS the large model files end up stored twice (once in the working tree and again under `.git`), doubling disk usage.
## Usage Examples
### Use in text-generation-webui

1. Click the Model tab.
2. Under Download custom model or LoRA, enter `TheBloke/Athena-v4-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Athena-v4-GPTQ:gptq-4bit-32g-actorder_True`; see Provided files below for the list of branches for each option.
3. Click Download.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to Model.
6. In the Model dropdown, choose the model you just downloaded: `Athena-v4-GPTQ`.
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
   - Note that you no longer need to (and should not) set GPTQ parameters manually; they are read automatically from the file `quantize_config.json`.
9. Once you're ready, click the Text Generation tab and enter a prompt to get started!
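To use the model from Python instead, here is a minimal sketch with the Transformers library. It assumes `transformers`, `optimum`, `auto-gptq` and `accelerate` are installed; the generation settings are illustrative, and the prompt uses the Alpaca template documented below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Athena-v4-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantised weights on the available GPU(s).
# To load a non-main branch, add e.g. revision="gptq-4bit-32g-actorder_True".
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the instruction in the model's Alpaca prompt template.
instruction = "Tell me about AI"
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Response:\n"
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```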
### Serving this model from Text Generation Inference (TGI)

It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`

Example Docker parameters (note `--quantize gptq` for these GPTQ files):

```shell
--model-id TheBloke/Athena-v4-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```
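Once the container is up, it can be queried from Python. A minimal sketch using `huggingface_hub.InferenceClient`; the endpoint URL assumes the `--port 3000` setting above, and the generation settings are illustrative:

```python
from huggingface_hub import InferenceClient

# Point the client at the locally running TGI container.
client = InferenceClient(model="http://127.0.0.1:3000")

# Wrap the instruction in the model's Alpaca prompt template (see Documentation below).
instruction = "Tell me about AI"
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Response:\n"
)

print(client.text_generation(prompt, max_new_tokens=512, temperature=0.7))
```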
## Documentation
### Model Information

- Model creator: IkariDev + Undi95
- Original model: [Athena v4](https://huggingface.co/IkariDev/Athena-v4)
- Model type: llama
- Prompt template: Alpaca

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
```
### Repositories available

- [AWQ model(s) for GPU inference](https://huggingface.co/TheBloke/Athena-v4-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/Athena-v4-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Athena-v4-GGUF)
- [IkariDev + Undi95's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/IkariDev/Athena-v4)
### Provided files, and GPTQ parameters
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
Each separate quant is in a different branch. See below for instructions on fetching from different branches.
Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.
### Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
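To make these parameters concrete, here is a sketch of how they map onto the `GPTQConfig` class in Transformers when quantising a model yourself. The values mirror the `main` branch settings in the table below; this is an illustration, not the exact script used to produce these files:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "IkariDev/Athena-v4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Bits=4, GS=128, Act Order (desc_act) on, Damp %=0.1, wikitext calibration dataset.
gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    damp_percent=0.1,
    dataset="wikitext2",
    tokenizer=tokenizer,
)

# Quantise the fp16 model with the settings above (requires optimum + auto-gptq).
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
```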
| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| [main](https://huggingface.co/TheBloke/Athena-v4-GPTQ/tree/main) | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/Athena-v4-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/Athena-v4-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/Athena-v4-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| [gptq-8bit-32g-actorder_True](https://huggingface.co/TheBloke/Athena-v4-GPTQ/tree/gptq-8bit-32g-actorder_True) | 8 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 14.54 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
| [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/Athena-v4-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
## Technical Details

This section concerns the implementation details of the GPTQ quantisation, including how the quantisation parameters are handled and how the calibration dataset is chosen. For specifics, see the Explanation of GPTQ parameters above.
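Each branch also ships a `quantize_config.json` file recording the settings used, which loaders such as text-generation-webui read automatically (see the usage steps above). As an illustration only, a file matching the `main` branch settings would look roughly like this; the actual file may contain additional fields:

```json
{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.1,
  "desc_act": true,
  "sym": true,
  "true_sequential": true
}
```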
## License

The creator of the source model has listed its license as `cc-by-nc-4.0`, and this quantisation has therefore used that same license.
As this model is based on Llama 2, it is also subject to the Meta Llama 2 license terms, and the license files for that are additionally included. It should therefore be considered as being claimed to be licensed under both licenses. I contacted Hugging Face for clarification on dual licensing but they do not yet have an official position. Should this change, or should Meta provide any feedback on this situation, I will update this section accordingly.
In the meantime, any questions regarding licensing, and in particular how these two licenses might interact, should be directed to the original model repository: [IkariDev + Undi95's Athena v4](https://huggingface.co/IkariDev/Athena-v4).

