CodeLlama 34B Instruct - GPTQ
This repository provides GPTQ model files for Meta's CodeLlama 34B Instruct, offering multiple quantisation options to suit different hardware and requirements.
Quick Start
Downloading the Model
- In text-generation-webui: add `:branch` to the end of the download name, e.g. `TheBloke/CodeLlama-34B-Instruct-GPTQ:main`.
- With Git: clone a branch with `git clone --single-branch --branch main https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ`.
- In Python Transformers code: use the `revision` parameter to specify the branch, as shown in the sketch below.
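For scripted downloads, a minimal sketch using the `huggingface_hub` library is shown below; the library is a separate install and the local directory name is only an example, not something this repository prescribes.

```python
from huggingface_hub import snapshot_download

# Download all files from a specific quantisation branch (revision).
# "gptq-4bit-32g-actorder_True" is one of the branches documented below;
# pick whichever quant suits your hardware.
snapshot_download(
    repo_id="TheBloke/CodeLlama-34B-Instruct-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    local_dir="CodeLlama-34B-Instruct-GPTQ",  # example destination folder
)
```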
Using the Model in text-generation-webui
- Ensure you are using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui). It is recommended to use the one-click installers.
- Click the Model tab.
- Under Download custom model or LoRA, enter `TheBloke/CodeLlama-34B-Instruct-GPTQ`. To download from a specific branch, add `:branch` at the end.
- Click Download and wait for the download to complete.
- Click the refresh icon next to Model in the top-left corner.
- Select the downloaded model `CodeLlama-34B-Instruct-GPTQ` from the Model dropdown.
- The model will load automatically and be ready for use.
- Set any custom settings, then click Save settings for this model followed by Reload the Model in the top-right corner.
- Click the Text Generation tab and enter a prompt to start generating text.
Using the Model from Python Code
- Install the necessary packages:
- Requires Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later:

```bash
pip3 install 'transformers>=4.32.0' 'optimum>=1.12.0'
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```
- If there are problems installing AutoGPTQ using the pre-built wheels, install it from source:
```bash
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
```
- For CodeLlama models only: you must use Transformers 4.33.0 or later. If 4.33.0 has not yet been released, install it from source:
```bash
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
```
- Use the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
```
Features
- Multiple Quantisation Options: Different GPTQ parameter permutations are provided, allowing users to choose the best option for their hardware and requirements.
- Branch-Based Management: Each separate quant is in a different branch, making it easy to manage and download specific versions.
- Compatibility: The model files are compatible with various tools such as AutoGPTQ, ExLlama (4-bit Llama models), and Huggingface Text Generation Inference (TGI).
Installation
Prerequisites
- For Python usage, you need to install Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
Steps
- Follow the steps in the "Quick Start" section for downloading and using the model in text-generation-webui or from Python code.
Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=False, revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
```
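The decode above prints the prompt together with the completion. As a small optional follow-up (reusing the `input_ids` and `output` variables from the snippet above), the completion alone can be recovered like this:

```python
# Skip the prompt tokens and decode only the newly generated completion.
new_tokens = output[0][input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```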
Advanced Usage
```python
# Using the pipeline for text generation
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=False, revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
```
Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | Llama |
| Model Creator | [Meta](https://huggingface.co/meta-llama) |
| Base Model | [codellama/CodeLlama-34b-instruct-hf](https://huggingface.co/codellama/CodeLlama-34b-instruct-hf) |
| License | Llama2 |
| Pipeline Tag | text-generation |
| Prompt Template | [INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```: {prompt} [/INST] |
| Quantized By | TheBloke |
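As an illustration of the prompt template listed above, a minimal helper is sketched below; the `build_prompt` function is purely illustrative and not part of the repository.

```python
def build_prompt(problem: str) -> str:
    """Wrap a coding problem in the instruct template used by this repo."""
    return (
        "[INST] Write code to solve the following coding problem that obeys the "
        "constraints and passes the example test cases. Please wrap your code answer "
        "using ```:\n"
        f"{problem}\n"
        "[/INST]\n"
    )

print(build_prompt("Write a function that reverses a string."))
```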
Provided Files and GPTQ Parameters
Multiple quantisation parameters are provided; each separate quant is in a different branch. All recent GPTQ files, and all files in non-main branches, are made with AutoGPTQ. Files in the `main` branch that were uploaded before August 2023 were made with GPTQ-for-LLaMa.
Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
|---|---|---|---|---|---|---|---|---|---|
| main | 4 | 128 | No | 0.1 | Evol Instruct Code | 4096 | 18.33 GB | Yes | 4-bit, without Act Order and group size 128g. |
| [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | Evol Instruct Code | 4096 | 20.28 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | Evol Instruct Code | 4096 | 18.98 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
| [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | Evol Instruct Code | 4096 | 18.33 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | Evol Instruct Code | 4096 | 34.30 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | Evol Instruct Code | 4096 | 35.07 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
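To confirm which of these parameters a given branch actually uses, one option is to read its quantisation config straight from the Hub. The sketch below assumes the branch ships a `quantize_config.json` file (the usual AutoGPTQ config name) and that `huggingface_hub` is installed.

```python
import json
from huggingface_hub import hf_hub_download

# Fetch the AutoGPTQ config for a specific quantisation branch (revision).
config_path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-34B-Instruct-GPTQ",
    filename="quantize_config.json",  # assumed AutoGPTQ config filename
    revision="gptq-4bit-32g-actorder_True",
)

with open(config_path) as f:
    quant_config = json.load(f)

# Typically contains fields such as bits, group_size, desc_act and damp_percent.
print(quant_config)
```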
Compatibility
The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork.
ExLlama is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
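As an illustration of TGI usage, the sketch below posts the instruct prompt to a TGI server that is assumed to already be running and serving this model; the host, port and generation parameters are placeholders based on TGI's standard `/generate` REST endpoint.

```python
import requests

# Placeholder endpoint for a locally running TGI instance serving this model.
TGI_URL = "http://localhost:8080/generate"

prompt = "Tell me about AI"
prompt_template = f"""[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
"""

response = requests.post(
    TGI_URL,
    json={
        "inputs": prompt_template,
        "parameters": {"max_new_tokens": 512, "temperature": 0.7, "top_p": 0.95, "top_k": 40},
    },
    timeout=120,
)
print(response.json()["generated_text"])
```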
Technical Details
The GPTQ quantisation process involves several parameters such as bit size, group size, act order, damp percentage, and the dataset used for quantisation. These parameters affect the trade-off between VRAM usage and quantisation accuracy. For example, a higher group size uses less VRAM but may result in lower accuracy, while enabling act order generally improves accuracy.
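As a rough illustration of that trade-off, the back-of-envelope sketch below estimates file sizes from bits and group size; it ignores embeddings, norms and exact packing details, and the 2.5 bytes per group figure is an assumption, so treat the output as approximate only.

```python
def approx_gptq_size_gb(n_params, bits, group_size=None):
    """Very rough size estimate for GPTQ-quantised weights (illustrative only)."""
    weight_bytes = n_params * bits / 8  # packed quantised weights
    if group_size:
        # Each group stores a scale and zero-point; ~2.5 bytes per group assumed.
        weight_bytes += (n_params / group_size) * 2.5
    return weight_bytes / 1e9

# ~34e9 parameters for CodeLlama 34B (approximate)
for gs in (32, 64, 128, None):
    label = f"group size {gs}" if gs else "no group size"
    print(f"4-bit, {label}: ~{approx_gptq_size_gb(34e9, 4, gs):.1f} GB")
```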
License
The model is licensed under Llama2.
Discord
For further support, and for discussions on these models and AI in general, join the Discord server.
Thanks, and how to contribute
Thanks to the chirper.ai team!
Thanks to Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing these models and services.