OpenChat 3.5 7B - GPTQ
This repository provides GPTQ model files for OpenChat's OpenChat 3.5 7B, offering multiple quantisation options to suit different hardware and requirements.
Quick Start
Downloading the Model
You can download the model in different ways:
In text-generation-webui
To download from the main branch, enter TheBloke/openchat_3.5-GPTQ in the "Download model" box. To download from another branch, add :branchname to the end of the download name, e.g., TheBloke/openchat_3.5-GPTQ:gptq-4bit-32g-actorder_True.
From the command line
First, install the huggingface-hub Python library:
pip3 install huggingface-hub
To download the main branch to a folder called openchat_3.5-GPTQ:
mkdir openchat_3.5-GPTQ
huggingface-cli download TheBloke/openchat_3.5-GPTQ --local-dir openchat_3.5-GPTQ --local-dir-use-symlinks False
To download from a different branch, add the --revision parameter:
mkdir openchat_3.5-GPTQ
huggingface-cli download TheBloke/openchat_3.5-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir openchat_3.5-GPTQ --local-dir-use-symlinks False
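Alternatively, the same huggingface-hub library can be driven from Python. Below is a minimal sketch using snapshot_download; the folder and branch names simply mirror the CLI examples above:

```python
from huggingface_hub import snapshot_download

# Download the main branch into ./openchat_3.5-GPTQ
snapshot_download(repo_id="TheBloke/openchat_3.5-GPTQ",
                  local_dir="openchat_3.5-GPTQ",
                  local_dir_use_symlinks=False)

# Pass revision= to fetch a specific quantisation branch instead
snapshot_download(repo_id="TheBloke/openchat_3.5-GPTQ",
                  revision="gptq-4bit-32g-actorder_True",
                  local_dir="openchat_3.5-GPTQ",
                  local_dir_use_symlinks=False)
```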
With git (not recommended)
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/openchat_3.5-GPTQ
Cloning with git is generally slower than huggingface-cli and stores the large model files twice (once under .git and once in the working tree), so the download methods above are preferred.
Using the Model in text-generation-webui
- Make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
- Click the Model tab.
- Under Download custom model or LoRA, enter TheBloke/openchat_3.5-GPTQ. To download from a specific branch, enter for example TheBloke/openchat_3.5-GPTQ:gptq-4bit-32g-actorder_True.
- Click Download.
- Once the model is downloaded, click the refresh icon next to Model in the top left.
- In the Model dropdown, choose the model you just downloaded: openchat_3.5-GPTQ.
- The model will automatically load and be ready for use.
- If you want custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
- Click the Text Generation tab and enter a prompt to start.
Serving the Model from Text Generation Inference (TGI)
It's recommended to use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0
Example Docker parameters:
--model-id TheBloke/openchat_3.5-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
pip3 install huggingface-hub
from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"
prompt = "Tell me about AI"
prompt_template = f'''GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:
'''

# Illustrative sampling settings; tune them for your use case.
client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template, max_new_tokens=128, do_sample=True,
                                  temperature=0.7, top_p=0.95, top_k=40, repetition_penalty=1.1)
print(f"Model output: {response}")
Features
- Multiple quantisation parameters are provided, allowing users to choose the best option for their hardware and requirements.
- Compatibility with multiple inference servers/webuis, including [text-generation-webui](https://github.com/oobabooga/text-generation-webui), KoboldAI United, [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), and [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference).
Installation
Prerequisites
- Install the huggingface-hub Python library if downloading from the command line:
pip3 install huggingface-hub
Model Download
See the "Quick Start" section for detailed download instructions.
Usage Examples
Prompt Template
GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:
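The template can also be used for local inference. Below is a minimal sketch with Transformers; it assumes the optimum and auto-gptq packages are installed so Transformers can load the GPTQ weights, and the sampling settings are illustrative rather than recommended values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/openchat_3.5-GPTQ"  # pass revision= to pick another quantisation branch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", revision="main")

prompt = "Tell me about AI"
prompt_template = f"GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:"

# Tokenise the formatted prompt and generate a completion (settings are illustrative).
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=True,
                        temperature=0.7, top_p=0.95, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```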
Using the Model in Python with TGI
A complete client example is given in the "Serving the Model from Text Generation Inference (TGI)" section above; it uses the same prompt template shown here.
Documentation
Model Information
- Model creator: OpenChat
- Original model: OpenChat 3.5 7B
- Model type: mistral
- License: apache-2.0
Repositories Available
- [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/openchat_3.5-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/openchat_3.5-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/openchat_3.5-GGUF)
- OpenChat's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions
Provided Files and GPTQ Parameters
Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
---|---|---|---|---|---|---|---|---|---|
[main](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/main) | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.16 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
[gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.57 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
[gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.95 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
[gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 5.00 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
[gptq-8bit-32g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-8bit-32g-actorder_True) | 8 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.97 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
[gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.30 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as desc_act. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4 - bit.
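For reference, these columns correspond directly to fields in AutoGPTQ's quantisation configuration. A minimal sketch of the settings for the main branch row above (the calibration-data preparation and the quantisation run itself are omitted):

```python
from auto_gptq import BaseQuantizeConfig

# Settings matching the main branch: 4-bit, group size 128, Act Order on, Damp 0.1.
# Use group_size=-1 to reproduce the "GS: None" rows.
quantize_config = BaseQuantizeConfig(
    bits=4,            # "Bits" column
    group_size=128,    # "GS" column
    desc_act=True,     # "Act Order" column
    damp_percent=0.1,  # "Damp %" column
)
```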
Technical Details
These files were quantised using hardware kindly provided by Massed Compute. Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.
License
This model is licensed under the [apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license.

