# Llama-3-8B-Instruct-GPTQ-4-Bit
This repository provides 4-bit quantized GPTQ model files for meta-llama/Meta-Llama-3-8B-Instruct, enabling efficient deployment with reduced VRAM requirements.
## 🚀 Quick Start
### Serving the Model
This model has been tested for serving via vLLM on an Nvidia T4 (16 GB VRAM). Use the following command:
```bash
python -m vllm.entrypoints.openai.api_server --model astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit --max-model-len 8192 --dtype float16
```
To address the non-stop token generation bug, include `"stop_token_ids": [128001, 128009]` in every request sent to the vLLM endpoint. Here is an example:
```json
{
  "model": "astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"}
  ],
  "max_tokens": 2000,
  "stop_token_ids": [128001, 128009]
}
```
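For reference, a minimal Python sketch of sending this request to the OpenAI-compatible vLLM endpoint. The local URL and default port 8000 are assumptions; adjust them to your deployment:

```python
import requests

# Assumes the vLLM OpenAI-compatible server started with the command above
# is running locally on the default port 8000.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    "max_tokens": 2000,
    # Required to work around the non-stop generation bug.
    "stop_token_ids": [128001, 128009],
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```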
### Prompt Template
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{{prompt}}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
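If you call a raw completions endpoint rather than the chat endpoint (which applies the template for you), the prompt can be assembled in code. A minimal sketch; `build_prompt` is an illustrative helper, not part of this repository:

```python
def build_prompt(user_message: str) -> str:
    """Fill the prompt template above with a single user turn."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )

print(build_prompt("Who created Llama 3?"))
```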
## ✨ Features
- Low VRAM Requirement: This model can be loaded with less than 6 GB of VRAM, a significant reduction from the original 16.07 GB model.
- Fast Serving: It can be served lightning fast on affordable Nvidia GPUs such as the Nvidia T4, Nvidia K80, and RTX 4070.
- Quantization Benefits: The 4-bit GPTQ quantization introduces a small quality degradation relative to the original `bfloat16` model but offers improved latency and throughput on smaller GPUs.
## 📦 Installation
No specific installation steps are provided in the original document.
## 📚 Documentation
### Model Information
| Property | Details |
|----------|---------|
| Base Model | meta-llama/Meta-Llama-3-8B-Instruct |
| Model Creator | astronomer-io |
| Model Name | Meta-Llama-3-8B-Instruct |
| Model Type | llama |
| Pipeline Tag | text-generation |
| Quantized By | davidxmle |
| License | other (llama-3-community-license) |
| License Link | https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/LICENSE |
| Tags | llama, llama-3, facebook, meta, astronomer, gptq, pretrained, quantized, finetuned, autotrain_compatible, endpoints_compatible |
| Datasets | wikitext |
### GPTQ Quantization Method
- This model is quantized using the AutoGPTQ library, following the best practices from the GPTQ paper.
- Quantization is calibrated with random samples from the specified dataset (currently wikitext) to minimize accuracy loss; see the sketch after the table below.
| Branch | Bits | Group Size | Act Order | Damp % | GPTQ Dataset | Sequence Length | VRAM Size | ExLlama | Description |
|--------|------|------------|-----------|--------|--------------|-----------------|-----------|---------|-------------|
| main | 4 | 128 | Yes | 0.1 | wikitext | 8192 | 5.74 GB | Yes | 4-bit, with Act Order and group size 128g. Smallest model possible, with small accuracy loss. |
| More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | Additional GPTQ 4-bit variants using different parameters (e.g. other group sizes) may be uploaded in the future. |
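For illustration, a minimal sketch of how a quantization run with the parameters in the `main` row could look using AutoGPTQ. The calibration-sample selection and output directory are assumptions, not the exact script used to produce this repository:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
output_dir = "Llama-3-8B-Instruct-GPTQ-4-Bit"  # assumed output path

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Calibration samples drawn from wikitext, as listed in the table above.
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in data["text"] if len(t) > 256][:128]
examples = [tokenizer(t) for t in texts]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # group size 128g
    desc_act=True,   # act order enabled
    damp_percent=0.1,
)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
```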
## 🔧 Technical Details
### Serving with vLLM
- When loading this model with vLLM, ensure all requests include `"stop_token_ids": [128001, 128009]` to address the non-stop generation issue; see the client sketch below.
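Because `stop_token_ids` is not part of the standard OpenAI request schema, clients built on the official `openai` Python SDK can forward it via `extra_body`. A minimal sketch, assuming the server from the Quick Start command is running locally on port 8000:

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    max_tokens=2000,
    # Non-standard parameter, forwarded to vLLM via extra_body.
    extra_body={"stop_token_ids": [128001, 128009]},
)
print(response.choices[0].message.content)
```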
### Serving with oobabooga/text-generation-webui
- Load the model via AutoGPTQ with `no_inject_fused_attention` enabled, due to a bug in the AutoGPTQ library (a programmatic equivalent is sketched after this list).
- Under `Parameters` -> `Generation` -> `Skip special tokens`, deselect this option.
- Under `Parameters` -> `Generation` -> `Custom stopping strings`, add `"<|end_of_text|>","<|eot_id|>"` to the field.
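Outside the webui, the same workaround can be applied when loading the quantized checkpoint directly with AutoGPTQ. A minimal sketch, assuming a local CUDA device:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
    # Disable fused attention injection, mirroring the
    # no_inject_fused_attention setting mentioned above.
    inject_fused_attention=False,
)

inputs = tokenizer("Who created Llama 3?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```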
## 📄 License
This model is released under the llama-3-community-license.
## ⚠️ Important Note
- When serving the model with vLLM, make sure to include `"stop_token_ids": [128001, 128009]` in requests to avoid non-stop generation.
- When using oobabooga/text-generation-webui, follow the specific loading and parameter settings mentioned above.
## 👥 Contributors