Dolphin Llama 13B - GPTQ
This repository offers GPTQ model files for Eric Hartford's Dolphin Llama 13B, providing multiple quantisation parameter options to suit different hardware and requirements.
Features
- Multiple GPTQ parameter permutations are provided to meet various hardware and requirement needs.
- Compatibility with multiple inference tools, including AutoGPTQ, ExLlama, and Huggingface Text Generation Inference (TGI).
Installation
Install the necessary packages
Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
pip3 install "transformers>=4.32.0" "optimum>=1.12.0"
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
For CodeLlama models only: you must use Transformers 4.33.0 or later.
If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
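If you want to confirm that the installed versions meet the minimums above, a quick check like the following can help (a minimal sketch using Python's importlib.metadata; not part of the original install steps):

# Optional sanity check: print the installed versions of the packages above.
from importlib.metadata import version

for pkg in ("transformers", "optimum", "auto-gptq"):
    print(pkg, version(pkg))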
Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name_or_path = "TheBloke/Dolphin-Llama-13B-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
prompt = "Tell me about AI"
prompt_template=f'''SYSTEM: {system_message}
USER: {prompt}
ASSISTANT:
'''
print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
# Inference can also be done using transformers' pipeline
print("*** Pipeline:")
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
Advanced Usage
You can adjust the generation parameters in the code to suit your needs, for example changing temperature, top_p, top_k, and max_new_tokens to control the output, as illustrated in the sketch below.
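As an illustration only (reusing model, tokenizer and input_ids from the Basic Usage example above; the values here are arbitrary examples, not tuned recommendations), a more conservative sampling setup might look like this:

# Lower temperature and tighter top_p/top_k give more focused, less varied output.
output = model.generate(
    inputs=input_ids,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    top_k=20,
    max_new_tokens=256,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))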
Documentation
Model Information
Property | Details |
---|---|
Model Type | Llama |
Model Creator | Eric Hartford |
Original Model | Dolphin Llama 13B |
Prompt Template | Orca-Vicuna |
Repositories available
- AWQ model(s) for GPU inference.
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference.
- Eric Hartford's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions.
Prompt template: Orca-Vicuna
SYSTEM: {system_message}
USER: {prompt}
ASSISTANT:
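If you are building prompts in code, a small helper like this keeps the template consistent (a hypothetical convenience function, not part of the model's API; it simply reproduces the Orca-Vicuna format shown above):

def build_prompt(system_message: str, prompt: str) -> str:
    # Orca-Vicuna format, exactly as documented above.
    return f"SYSTEM: {system_message}\nUSER: {prompt}\nASSISTANT:"

print(build_prompt("You are a helpful AI assistant.", "Tell me about AI"))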
Provided files and GPTQ parameters
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
Each separate quant is in a different branch. See below for instructions on fetching from different branches.
All recent GPTQ files are made with AutoGPTQ, as are all files in non-main branches. Files in the main branch which were uploaded before August 2023 were made with GPTQ-for-LLaMa.
Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" means no grouping, which uses the least VRAM at the cost of accuracy.
- Act Order: True or False. Also known as desc_act. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
---|---|---|---|---|---|---|---|---|---|
main | 4 | 128 | No | 0.1 | wikitext | 2048 | 7.26 GB | Yes | 4-bit, without Act Order and group size 128g. |
gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | wikitext | 2048 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | wikitext | 2048 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.1 | wikitext | 2048 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | wikitext | 2048 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
gptq-8bit-128g-actorder_False | 8 | 128 | No | 0.1 | wikitext | 2048 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | wikitext | 2048 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
gptq-8bit-64g-actorder_True | 8 | 64 | Yes | 0.1 | wikitext | 2048 | 13.95 GB | No | 8-bit, with group size 64g and Act Order for even higher inference quality. Poor AutoGPTQ CUDA speed. |
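For instance (a sketch reusing the Transformers loading code from the usage example above), a non-main branch from the table can be loaded by passing its name as revision:

from transformers import AutoModelForCausalLM

# Branch names come from the table above; this one is the highest-accuracy 4-bit quant.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Dolphin-Llama-13B-GPTQ",
    device_map="auto",
    revision="gptq-4bit-32g-actorder_True",
)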
How to download from branches
- In text-generation-webui, you can add :branch to the end of the download name, e.g. TheBloke/Dolphin-Llama-13B-GPTQ:main
- With Git, you can clone a branch with:
git clone --single-branch --branch main https://huggingface.co/TheBloke/Dolphin-Llama-13B-GPTQ
- In Python Transformers code, the branch is the revision parameter; see below.
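If you prefer to download a specific branch from Python without Git, huggingface_hub's snapshot_download can also be used (a sketch; the branch and local directory names below are only examples):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Dolphin-Llama-13B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # any branch from the table above
    local_dir="Dolphin-Llama-13B-GPTQ-4bit-32g",
)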
How to easily download and use this model in text-generation-webui
Please make sure you're using the latest version of text-generation-webui. It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
- Click the Model tab.
- Under Download custom model or LoRA, enter TheBloke/Dolphin-Llama-13B-GPTQ.
- To download from a specific branch, enter for example TheBloke/Dolphin-Llama-13B-GPTQ:main - see Provided Files above for the list of branches for each option.
- Click Download.
- The model will start downloading. Once it's finished it will say "Done".
- In the top left, click the refresh icon next to Model.
- In the Model dropdown, choose the model you just downloaded: Dolphin-Llama-13B-GPTQ
- The model will automatically load, and is now ready for use!
- If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
- Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file quantize_config.json.
- Once you're ready, click the Text Generation tab and enter a prompt to get started!
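If you are curious which GPTQ parameters a downloaded branch uses, you can inspect the quantize_config.json file mentioned above directly (a sketch; the path is an example and should point at wherever you downloaded the model):

import json

# Example path: replace with your local copy of the model.
with open("Dolphin-Llama-13B-GPTQ/quantize_config.json") as f:
    print(json.load(f))  # typically shows bits, group_size, desc_act, damp_percent, etc.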
Technical Details
The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork. ExLlama is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility. Huggingface Text Generation Inference (TGI) is compatible with all GPTQ models.
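As a sketch of querying such a model once it is served behind TGI (assuming you already have a TGI endpoint running for this repo; the endpoint URL below is a placeholder), huggingface_hub's InferenceClient can be used:

from huggingface_hub import InferenceClient

# Placeholder URL: replace with the address of your own TGI server.
client = InferenceClient("http://127.0.0.1:8080")

response = client.text_generation(
    "SYSTEM: You are a helpful AI assistant.\nUSER: Tell me about AI\nASSISTANT:",
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
print(response)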
License
License: other
Discord
For further support, and discussions on these models and AI in general, join us at: TheBloke AI's Discord server
Thanks, and how to contribute
Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding the offerings.

