OpenChat 3.5 7B - GPTQ
This repository provides GPTQ model files for OpenChat's OpenChat 3.5 7B, offering multiple quantisation options to suit different hardware and requirements.
Quick Start
Downloading the Model
You can download the model in different ways:
In text-generation-webui
To download from the main branch, enter TheBloke/openchat_3.5-GPTQ in the "Download model" box. To download from another branch, add :branchname to the end of the download name, e.g., TheBloke/openchat_3.5-GPTQ:gptq-4bit-32g-actorder_True.
From the command line
First, install the huggingface-hub Python library:
pip3 install huggingface-hub
To download the main branch to a folder called openchat_3.5-GPTQ:
mkdir openchat_3.5-GPTQ
huggingface-cli download TheBloke/openchat_3.5-GPTQ --local-dir openchat_3.5-GPTQ --local-dir-use-symlinks False
To download from a different branch, add the --revision parameter:
mkdir openchat_3.5-GPTQ
huggingface-cli download TheBloke/openchat_3.5-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir openchat_3.5-GPTQ --local-dir-use-symlinks False
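Alternatively, the same huggingface-hub library can be driven from Python. Below is a minimal sketch using snapshot_download; the folder and branch names simply mirror the CLI examples above:

```python
from huggingface_hub import snapshot_download

# Download the main branch into ./openchat_3.5-GPTQ
snapshot_download(repo_id="TheBloke/openchat_3.5-GPTQ",
                  local_dir="openchat_3.5-GPTQ",
                  local_dir_use_symlinks=False)

# Pass revision= to fetch a specific quantisation branch instead
snapshot_download(repo_id="TheBloke/openchat_3.5-GPTQ",
                  revision="gptq-4bit-32g-actorder_True",
                  local_dir="openchat_3.5-GPTQ",
                  local_dir_use_symlinks=False)
```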
With git (not recommended)
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/openchat_3.5-GPTQ
Cloning with git is generally slower than huggingface-cli and stores the large model files twice (once under .git and once in the working tree), so the download methods above are preferred.
Using the Model in text-generation-webui
- Make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
- Click the Model tab.
- Under Download custom model or LoRA, enter TheBloke/openchat_3.5-GPTQ. To download from a specific branch, enter for example TheBloke/openchat_3.5-GPTQ:gptq-4bit-32g-actorder_True.
- Click Download.
- Once the model is downloaded, click the refresh icon next to Model in the top left.
- In the Model dropdown, choose the model you just downloaded: openchat_3.5-GPTQ.
- The model will automatically load and be ready for use.
- If you want custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
- Click the Text Generation tab and enter a prompt to start.
Serving the Model from Text Generation Inference (TGI)
It's recommended to use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0
Example Docker parameters:
--model-id TheBloke/openchat_3.5-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
pip3 install huggingface-hub
from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"
prompt = "Tell me about AI"
prompt_template = f'''GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:
'''

# Illustrative sampling settings; tune them for your use case.
client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template, max_new_tokens=128, do_sample=True,
                                  temperature=0.7, top_p=0.95, top_k=40, repetition_penalty=1.1)
print(f"Model output: {response}")
Features
- Multiple quantisation parameters are provided, allowing users to choose the best option for their hardware and requirements.
- Compatibility with multiple inference servers/webuis, including [text-generation-webui](https://github.com/oobabooga/text-generation-webui), KoboldAI United, [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), and [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference).
Installation
Prerequisites
- Install the huggingface-hub Python library if downloading from the command line:
pip3 install huggingface-hub
Model Download
See the "Quick Start" section for detailed download instructions.
Usage Examples
Prompt Template
GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:
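The template can also be used for local inference. Below is a minimal sketch with Transformers; it assumes the optimum and auto-gptq packages are installed so Transformers can load the GPTQ weights, and the sampling settings are illustrative rather than recommended values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/openchat_3.5-GPTQ"  # pass revision= to pick another quantisation branch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", revision="main")

prompt = "Tell me about AI"
prompt_template = f"GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:"

# Tokenise the formatted prompt and generate a completion (settings are illustrative).
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=True,
                        temperature=0.7, top_p=0.95, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```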
Using the Model in Python with TGI
A complete client example is given in the "Serving the Model from Text Generation Inference (TGI)" section above; it uses the same prompt template shown here.
Documentation
Model Information
- Model creator: OpenChat
- Original model: OpenChat 3.5 7B
- Model type: mistral
- License: apache-2.0
Repositories Available
- [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/openchat_3.5-AWQ)
- [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/openchat_3.5-GPTQ)
- [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/openchat_3.5-GGUF)
- OpenChat's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions
Provided Files and GPTQ Parameters
Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
---|---|---|---|---|---|---|---|---|---|
[main](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/main) | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.16 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
[gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.57 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
[gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.95 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
[gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 5.00 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
[gptq-8bit-32g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-8bit-32g-actorder_True) | 8 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.97 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
[gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/openchat_3.5-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 4.30 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as desc_act. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4 - bit.
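For reference, these columns correspond directly to fields in AutoGPTQ's quantisation configuration. A minimal sketch of the settings for the main branch row above (the calibration-data preparation and the quantisation run itself are omitted):

```python
from auto_gptq import BaseQuantizeConfig

# Settings matching the main branch: 4-bit, group size 128, Act Order on, Damp 0.1.
# Use group_size=-1 to reproduce the "GS: None" rows.
quantize_config = BaseQuantizeConfig(
    bits=4,            # "Bits" column
    group_size=128,    # "GS" column
    desc_act=True,     # "Act Order" column
    damp_percent=0.1,  # "Damp %" column
)
```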
Technical Details
These files were quantised using hardware kindly provided by Massed Compute. Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.
License
This model is licensed under the [apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license.

