CodeLlama 34B Instruct - GPTQ
This repository provides GPTQ model files for Meta's CodeLlama 34B Instruct, offering multiple quantisation options to suit different hardware and requirements.
Quick Start
Downloading the Model
- In text-generation-webui: add `:branch` to the end of the download name, e.g. `TheBloke/CodeLlama-34B-Instruct-GPTQ:main`.
- With Git: clone a branch with `git clone --single-branch --branch main https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ`.
- In Python Transformers code: use the `revision` parameter to specify the branch, as shown in the sketch below.
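For scripted downloads, a minimal sketch using the `huggingface_hub` library is shown below; the library is a separate install and the local directory name is only an example, not something this repository prescribes.

```python
from huggingface_hub import snapshot_download

# Download all files from a specific quantisation branch (revision).
# "gptq-4bit-32g-actorder_True" is one of the branches documented below;
# pick whichever quant suits your hardware.
snapshot_download(
    repo_id="TheBloke/CodeLlama-34B-Instruct-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    local_dir="CodeLlama-34B-Instruct-GPTQ",  # example destination folder
)
```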
Using the Model in text-generation-webui
- Ensure you are using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui). It is recommended to use the one-click installers.
- Click the Model tab.
- Under Download custom model or LoRA, enter `TheBloke/CodeLlama-34B-Instruct-GPTQ`. To download from a specific branch, add `:branch` at the end.
- Click Download and wait for the download to complete.
- Click the refresh icon next to Model in the top-left corner.
- Select the downloaded model `CodeLlama-34B-Instruct-GPTQ` from the Model dropdown.
- The model will load automatically and be ready for use.
- Set any custom settings, then click Save settings for this model followed by Reload the Model in the top-right corner.
- Click the Text Generation tab and enter a prompt to start generating text.
Using the Model from Python Code
- Install the necessary packages:
- Requires Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later:

```bash
pip3 install 'transformers>=4.32.0' 'optimum>=1.12.0'
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```
- If there are problems installing AutoGPTQ using the pre-built wheels, install it from source:
```bash
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
```
- For CodeLlama models only: you must use Transformers 4.33.0 or later. If 4.33.0 has not yet been released, install it from source:
```bash
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
```
- Use the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
```
Features
- Multiple Quantisation Options: Different GPTQ parameter permutations are provided, allowing users to choose the best option for their hardware and requirements.
- Branch-Based Management: Each separate quant is in a different branch, making it easy to manage and download specific versions.
- Compatibility: The model files are compatible with various tools such as AutoGPTQ, ExLlama (4-bit Llama models), and Huggingface Text Generation Inference (TGI).
Installation
Prerequisites
- For Python usage, you need to install Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
Steps
- Follow the steps in the "Quick Start" section for downloading and using the model in text-generation-webui or from Python code.
Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=False, revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
```
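The decode above prints the prompt together with the completion. As a small optional follow-up (reusing the `input_ids` and `output` variables from the snippet above), the completion alone can be recovered like this:

```python
# Skip the prompt tokens and decode only the newly generated completion.
new_tokens = output[0][input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```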
Advanced Usage
```python
# Using the pipeline for text generation
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=False, revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
'''

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
```
Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | Llama |
| Model Creator | [Meta](https://huggingface.co/meta-llama) |
| Base Model | [codellama/CodeLlama-34b-instruct-hf](https://huggingface.co/codellama/CodeLlama-34b-instruct-hf) |
| License | Llama2 |
| Pipeline Tag | text-generation |
| Prompt Template | [INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```: {prompt} [/INST] |
| Quantized By | TheBloke |
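As an illustration of the prompt template listed above, a minimal helper is sketched below; the `build_prompt` function is purely illustrative and not part of the repository.

```python
def build_prompt(problem: str) -> str:
    """Wrap a coding problem in the instruct template used by this repo."""
    return (
        "[INST] Write code to solve the following coding problem that obeys the "
        "constraints and passes the example test cases. Please wrap your code answer "
        "using ```:\n"
        f"{problem}\n"
        "[/INST]\n"
    )

print(build_prompt("Write a function that reverses a string."))
```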
Provided Files and GPTQ Parameters
Multiple quantisation parameters are provided; each separate quant is in a different branch. All recent GPTQ files, and all files in non-main branches, are made with AutoGPTQ. Files in the `main` branch that were uploaded before August 2023 were made with GPTQ-for-LLaMa.
Explanation of GPTQ parameters
- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is the default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
|---|---|---|---|---|---|---|---|---|---|
| main | 4 | 128 | No | 0.1 | Evol Instruct Code | 4096 | 18.33 GB | Yes | 4-bit, without Act Order and group size 128g. |
| [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | Evol Instruct Code | 4096 | 20.28 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | Evol Instruct Code | 4096 | 18.98 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
| [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | Evol Instruct Code | 4096 | 18.33 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | Evol Instruct Code | 4096 | 34.30 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | Evol Instruct Code | 4096 | 35.07 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
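To confirm which of these parameters a given branch actually uses, one option is to read its quantisation config straight from the Hub. The sketch below assumes the branch ships a `quantize_config.json` file (the usual AutoGPTQ config name) and that `huggingface_hub` is installed.

```python
import json
from huggingface_hub import hf_hub_download

# Fetch the AutoGPTQ config for a specific quantisation branch (revision).
config_path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-34B-Instruct-GPTQ",
    filename="quantize_config.json",  # assumed AutoGPTQ config filename
    revision="gptq-4bit-32g-actorder_True",
)

with open(config_path) as f:
    quant_config = json.load(f)

# Typically contains fields such as bits, group_size, desc_act and damp_percent.
print(quant_config)
```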
Compatibility
The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with Occ4m's GPTQ-for-LLaMa fork.
ExLlama is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
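As an illustration of TGI usage, the sketch below posts the instruct prompt to a TGI server that is assumed to already be running and serving this model; the host, port and generation parameters are placeholders based on TGI's standard `/generate` REST endpoint.

```python
import requests

# Placeholder endpoint for a locally running TGI instance serving this model.
TGI_URL = "http://localhost:8080/generate"

prompt = "Tell me about AI"
prompt_template = f"""[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]
"""

response = requests.post(
    TGI_URL,
    json={
        "inputs": prompt_template,
        "parameters": {"max_new_tokens": 512, "temperature": 0.7, "top_p": 0.95, "top_k": 40},
    },
    timeout=120,
)
print(response.json()["generated_text"])
```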
Technical Details
The GPTQ quantisation process involves several parameters such as bit size, group size, act order, damp percentage, and the dataset used for quantisation. These parameters affect the trade-off between VRAM usage and quantisation accuracy. For example, a higher group size uses less VRAM but may result in lower accuracy, while enabling act order generally improves accuracy.
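As a rough illustration of that trade-off, the back-of-envelope sketch below estimates file sizes from bits and group size; it ignores embeddings, norms and exact packing details, and the 2.5 bytes per group figure is an assumption, so treat the output as approximate only.

```python
def approx_gptq_size_gb(n_params, bits, group_size=None):
    """Very rough size estimate for GPTQ-quantised weights (illustrative only)."""
    weight_bytes = n_params * bits / 8  # packed quantised weights
    if group_size:
        # Each group stores a scale and zero-point; ~2.5 bytes per group assumed.
        weight_bytes += (n_params / group_size) * 2.5
    return weight_bytes / 1e9

# ~34e9 parameters for CodeLlama 34B (approximate)
for gs in (32, 64, 128, None):
    label = f"group size {gs}" if gs else "no group size"
    print(f"4-bit, {label}: ~{approx_gptq_size_gb(34e9, 4, gs):.1f} GB")
```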
License
The model is licensed under Llama2.
Discord
For further support, and for discussions on these models and AI in general, join the Discord server.
Thanks, and how to contribute
Thanks to the chirper.ai team!
Thanks to Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing these models and services.