Meta-Llama-3-70B-Instruct-quantized.w8a16
A quantized version of Meta-Llama-3-70B-Instruct, optimized for reduced disk space and GPU memory usage, suitable for commercial and research use in English.
🚀 Quick Start
This is a quantized version of Meta-Llama-3-70B-Instruct. Like the original model, it is intended for commercial and research use in English, for assistant-like chat.
✨ Features
- Model Architecture: Based on Meta-Llama-3; it takes text as input and outputs text.
- Model Optimizations:
  - Weight quantization: weights quantized to the INT8 data type, reducing disk size and GPU memory requirements by roughly 50% (see the sketch after this list).
- Intended Use Cases: Commercial and research use in English, specifically assistant-like chat.
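The roughly 50% reduction can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a 16-bit (BF16/FP16) baseline, an approximate parameter count of 70.6B, and ignores unquantized layers and file-format overhead.

```python
# Rough estimate of the ~50% size reduction from 16-bit to INT8 weights.
# Illustrative only; exact figures depend on which layers are quantized.
num_params = 70.6e9              # approximate parameter count of Llama-3-70B (assumption)
bf16_gb = num_params * 2 / 1e9   # 2 bytes per parameter at 16-bit precision
int8_gb = num_params * 1 / 1e9   # 1 byte per parameter at INT8
print(f"BF16: ~{bf16_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB")
```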
📦 Installation
No dedicated installation steps are required; the model can be loaded directly with vLLM or Hugging Face Transformers, as shown in the usage examples below.
💻 Usage Examples
Basic Usage
Use with vLLM
This model can be efficiently deployed using the vLLM backend. The following example shows how to use it with 2 GPUs:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string using the model's chat template
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across the available GPUs with tensor parallelism
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
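As a rough sketch of the client side, the snippet below assumes a vLLM OpenAI-compatible server is already running locally and serving this model; the endpoint URL, port, and `api_key` placeholder are illustrative assumptions, not values from this card.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running at this address (assumption)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```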
Use with transformers
This model is supported by Transformers through its integration with the AutoGPTQ data format. The following example shows how to use the generate() function:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Tokenize the chat messages using the model's chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generation at either the standard EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens (everything after the prompt)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
📚 Documentation
Model Overview
| Property | Details |
|----------|---------|
| Model Type | Meta-Llama-3 |
| Input | Text |
| Output | Text |
| Model Optimizations | Weight quantization to INT8 |
| Intended Use Cases | Commercial and research use in English, for assistant-like chat |
| Out-of-scope | Use that violates laws or regulations, use in languages other than English |
| Release Date | 7/2/2024 |
| Version | 1.0 |
| License | Llama3 |
| Model Developers | Neural Magic |
This model achieves an average score of 77.90 on version 1 of the OpenLLM benchmark, compared to 79.18 for the unquantized model.
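The recovery figures reported in the Accuracy table below are simply the ratio of the quantized score to the unquantized score; for example, for the averages quoted above:

```python
# Recovery = quantized score / unquantized score
quantized_avg, baseline_avg = 77.90, 79.18
print(f"Recovery: {100 * quantized_avg / baseline_avg:.1f}%")  # ~98.4%
```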
Model Optimizations
This model was obtained by quantizing the weights of Meta-Llama-3-70B-Instruct to INT8. Only the weights of the linear operators within transformer blocks are quantized, using symmetric per-channel quantization. AutoGPTQ is used for quantization with a 10% damping factor and 128 sequences from Neural Magic's LLM compression calibration dataset.
Creation
This model was created using the AutoGPTQ library as shown in the following code:
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

num_samples = 128
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

# Calibration data: 128 sequences from Neural Magic's LLM compression calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

examples = [tokenizer(example["text"], padding=False, max_length=max_seq_len, truncation=True) for example in ds]

quantize_config = BaseQuantizeConfig(
    bits=8,                          # INT8 weights
    group_size=-1,                   # per-channel quantization (one scale per output channel)
    desc_act=False,
    model_file_base_name="model",
    damp_percent=0.1,                # 10% damping factor
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    device_map="auto",
)

model.quantize(examples)
model.save_pretrained("Meta-Llama-3-70B-Instruct-quantized.w8a16")
```
Neural Magic is transitioning to llm-compressor, which supports more quantization schemes and models.
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (version 1) with lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command (with 8 GPUs):
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16",tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto
```
Accuracy
| Benchmark | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-quantized.w8a16 (this model) | Recovery |
|-----------|---------------------------|--------------------------------------------------------|----------|
| MMLU (5-shot) | 80.18 | 78.69 | 98.1% |
| ARC Challenge (25-shot) | 72.44 | 71.59 | 98.8% |
| GSM-8K (5-shot, strict-match) | 90.83 | 86.43 | 95.2% |
| Hellaswag (10-shot) | 85.54 | 85.65 | 100.1% |
| Winogrande (5-shot) | 83.19 | 83.11 | 99.9% |
| TruthfulQA (0-shot) | 62.92 | 61.94 | 98.4% |
| Average | 79.18 | 77.90 | 98.4% |
🔧 Technical Details
This model was optimized by quantizing the weights of Meta-Llama-3-70B-Instruct to INT8. Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a linear scaling per output dimension maps the INT8 representation of the quantized weights to their floating-point counterparts. AutoGPTQ performs the quantization with the damping factor and calibration data described above.
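As a minimal sketch of what this means in practice, the snippet below applies naive symmetric per-channel INT8 quantization to a weight matrix: one scale per output channel, chosen so the largest-magnitude weight in each row maps to ±127. This only illustrates the number format; it omits the calibration-driven error compensation that AutoGPTQ performs.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Naive symmetric per-channel INT8 quantization (illustrative only)."""
    # weight: [out_features, in_features] of a linear operator
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output channel
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor):
    # Linear per-output-dimension scaling maps INT8 back to floating point: W ≈ scales * Q
    return q.to(torch.float32) * scales

w = torch.randn(4096, 4096)
q, s = quantize_per_channel_int8(w)
print((dequantize(q, s) - w).abs().max())  # small reconstruction error
```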
📄 License
This model is licensed under the Llama 3 license.