# Llama-3-8B-Instruct-GPTQ-4-Bit
This repository provides 4-bit quantized GPTQ model files for meta-llama/Meta-Llama-3-8B-Instruct, enabling efficient deployment with reduced VRAM requirements.
## 🚀 Quick Start
### Serving the Model
This model has been tested for serving via vLLM on an Nvidia T4 (16 GB VRAM). Use the following command:
```bash
python -m vllm.entrypoints.openai.api_server --model astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit --max-model-len 8192 --dtype float16
```
To address the non-stop token generation bug, include `"stop_token_ids": [128001, 128009]` in every request sent to the vLLM endpoint. Here is an example:
```json
{
  "model": "astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"}
  ],
  "max_tokens": 2000,
  "stop_token_ids": [128001, 128009]
}
```
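For reference, a minimal Python sketch of sending this request to the OpenAI-compatible vLLM endpoint. The local URL and default port 8000 are assumptions; adjust them to your deployment:

```python
import requests

# Assumes the vLLM OpenAI-compatible server started with the command above
# is running locally on the default port 8000.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    "max_tokens": 2000,
    # Required to work around the non-stop generation bug.
    "stop_token_ids": [128001, 128009],
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```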
### Prompt Template
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{{prompt}}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
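If you call a raw completions endpoint rather than the chat endpoint (which applies the template for you), the prompt can be assembled in code. A minimal sketch; `build_prompt` is an illustrative helper, not part of this repository:

```python
def build_prompt(user_message: str) -> str:
    """Fill the prompt template above with a single user turn."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )

print(build_prompt("Who created Llama 3?"))
```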
## ✨ Features
- Low VRAM Requirement: This model can be loaded with less than 6 GB of VRAM, a significant reduction from the original 16.07 GB model.
- Fast Serving: It can be served lightning fast on affordable Nvidia GPUs such as the Nvidia T4, Nvidia K80, and RTX 4070.
- Quantization Benefits: The 4-bit GPTQ quantization introduces a small quality degradation relative to the original `bfloat16` model but offers improved latency and throughput on smaller GPUs.
## 📦 Installation
No specific installation steps are provided in the original document.
## 📚 Documentation
### Model Information
| Property | Details |
|----------|---------|
| Base Model | meta-llama/Meta-Llama-3-8B-Instruct |
| Model Creator | astronomer-io |
| Model Name | Meta-Llama-3-8B-Instruct |
| Model Type | llama |
| Pipeline Tag | text-generation |
| Quantized By | davidxmle |
| License | other (llama-3-community-license) |
| License Link | https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/LICENSE |
| Tags | llama, llama-3, facebook, meta, astronomer, gptq, pretrained, quantized, finetuned, autotrain_compatible, endpoints_compatible |
| Datasets | wikitext |
### GPTQ Quantization Method
- This model is quantized using the AutoGPTQ library, following the best practices from the GPTQ paper.
- Quantization is calibrated with random samples from the specified dataset (currently wikitext) to minimize accuracy loss; see the sketch after the table below.
| Branch | Bits | Group Size | Act Order | Damp % | GPTQ Dataset | Sequence Length | VRAM Size | ExLlama | Description |
|--------|------|------------|-----------|--------|--------------|-----------------|-----------|---------|-------------|
| main | 4 | 128 | Yes | 0.1 | wikitext | 8192 | 5.74 GB | Yes | 4-bit, with Act Order and group size 128g. Smallest model possible, with small accuracy loss. |
| More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | Additional GPTQ 4-bit variants using different parameters (e.g. other group sizes) may be uploaded in the future. |
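For illustration, a minimal sketch of how a quantization run with the parameters in the `main` row could look using AutoGPTQ. The calibration-sample selection and output directory are assumptions, not the exact script used to produce this repository:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
output_dir = "Llama-3-8B-Instruct-GPTQ-4-Bit"  # assumed output path

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Calibration samples drawn from wikitext, as listed in the table above.
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in data["text"] if len(t) > 256][:128]
examples = [tokenizer(t) for t in texts]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # group size 128g
    desc_act=True,   # act order enabled
    damp_percent=0.1,
)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
```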
## 🔧 Technical Details
### Serving with vLLM
- When loading this model with vLLM, ensure all requests include `"stop_token_ids": [128001, 128009]` to address the non-stop generation issue; see the client sketch below.
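Because `stop_token_ids` is not part of the standard OpenAI request schema, clients built on the official `openai` Python SDK can forward it via `extra_body`. A minimal sketch, assuming the server from the Quick Start command is running locally on port 8000:

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    max_tokens=2000,
    # Non-standard parameter, forwarded to vLLM via extra_body.
    extra_body={"stop_token_ids": [128001, 128009]},
)
print(response.choices[0].message.content)
```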
### Serving with oobabooga/text-generation-webui
- Load the model via AutoGPTQ with `no_inject_fused_attention` enabled, due to a bug in the AutoGPTQ library (a programmatic equivalent is sketched after this list).
- Under `Parameters` -> `Generation` -> `Skip special tokens`, deselect this option.
- Under `Parameters` -> `Generation` -> `Custom stopping strings`, add `"<|end_of_text|>","<|eot_id|>"` to the field.
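Outside the webui, the same workaround can be applied when loading the quantized checkpoint directly with AutoGPTQ. A minimal sketch, assuming a local CUDA device:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
    # Disable fused attention injection, mirroring the
    # no_inject_fused_attention setting mentioned above.
    inject_fused_attention=False,
)

inputs = tokenizer("Who created Llama 3?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```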
## 📄 License
This model is released under the llama-3-community-license.
## ⚠️ Important Note
- When serving the model with vLLM, make sure to include `"stop_token_ids": [128001, 128009]` in requests to avoid non-stop generation.
- When using oobabooga/text-generation-webui, follow the specific loading and parameter settings mentioned above.
## 👥 Contributors