Meta-Llama-3-70B-Instruct-quantized.w8a16
A quantized version of Meta-Llama-3-70B-Instruct, optimized for reduced disk space and GPU memory usage, suitable for commercial and research use in English.
🚀 Quick Start
This is a quantized version of Meta-Llama-3-70B-Instruct. Like the original model, it is intended for commercial and research use in English, for assistant-like chat.
✨ Features
- Model Architecture: Based on Meta-Llama-3; it takes text as input and outputs text.
- Model Optimizations:
  - Weight quantization: weights quantized to the INT8 data type, reducing disk size and GPU memory requirements by roughly 50% (see the sketch after this list).
- Intended Use Cases: Commercial and research use in English, specifically assistant-like chat.
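The roughly 50% reduction can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a 16-bit (BF16/FP16) baseline, an approximate parameter count of 70.6B, and ignores unquantized layers and file-format overhead.

```python
# Rough estimate of the ~50% size reduction from 16-bit to INT8 weights.
# Illustrative only; exact figures depend on which layers are quantized.
num_params = 70.6e9              # approximate parameter count of Llama-3-70B (assumption)
bf16_gb = num_params * 2 / 1e9   # 2 bytes per parameter at 16-bit precision
int8_gb = num_params * 1 / 1e9   # 1 byte per parameter at INT8
print(f"BF16: ~{bf16_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB")
```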
📦 Installation
No dedicated installation steps are required; the model can be loaded directly with vLLM or Hugging Face Transformers, as shown in the usage examples below.
💻 Usage Examples
Basic Usage
Use with vLLM
This model can be efficiently deployed using the vLLM backend. The following example shows how to use it with 2 GPUs:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string using the model's chat template
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across the available GPUs with tensor parallelism
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
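As a rough sketch of the client side, the snippet below assumes a vLLM OpenAI-compatible server is already running locally and serving this model; the endpoint URL, port, and `api_key` placeholder are illustrative assumptions, not values from this card.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running at this address (assumption)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```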
Use with transformers
This model is supported by Transformers through its integration with the AutoGPTQ data format. The following example shows how to use the generate() function:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Tokenize the chat messages using the model's chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generation at either the standard EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens (everything after the prompt)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
📚 Documentation
Model Overview
| Property | Details |
|----------|---------|
| Model Type | Meta-Llama-3 |
| Input | Text |
| Output | Text |
| Model Optimizations | Weight quantization to INT8 |
| Intended Use Cases | Commercial and research use in English, for assistant-like chat |
| Out-of-scope | Use that violates laws or regulations, use in languages other than English |
| Release Date | 7/2/2024 |
| Version | 1.0 |
| License | Llama3 |
| Model Developers | Neural Magic |
This model achieves an average score of 77.90 on version 1 of the OpenLLM benchmark, compared to 79.18 for the unquantized model.
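The recovery figures reported in the Accuracy table below are simply the ratio of the quantized score to the unquantized score; for example, for the averages quoted above:

```python
# Recovery = quantized score / unquantized score
quantized_avg, baseline_avg = 77.90, 79.18
print(f"Recovery: {100 * quantized_avg / baseline_avg:.1f}%")  # ~98.4%
```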
Model Optimizations
This model was obtained by quantizing the weights of Meta-Llama-3-70B-Instruct to INT8. Only the weights of the linear operators within transformer blocks are quantized, using symmetric per-channel quantization. AutoGPTQ is used for quantization with a 10% damping factor and 128 sequences from Neural Magic's LLM compression calibration dataset.
Creation
This model was created using the AutoGPTQ library as shown in the following code:
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

num_samples = 128
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

# Calibration data: 128 sequences from Neural Magic's LLM compression calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

examples = [tokenizer(example["text"], padding=False, max_length=max_seq_len, truncation=True) for example in ds]

quantize_config = BaseQuantizeConfig(
    bits=8,                          # INT8 weights
    group_size=-1,                   # per-channel quantization (one scale per output channel)
    desc_act=False,
    model_file_base_name="model",
    damp_percent=0.1,                # 10% damping factor
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    device_map="auto",
)

model.quantize(examples)
model.save_pretrained("Meta-Llama-3-70B-Instruct-quantized.w8a16")
```
Neural Magic is transitioning to llm-compressor, which supports more quantization schemes and models.
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (version 1) with lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command (with 8 GPUs):
```bash
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16",tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto
```
Accuracy
| Benchmark | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-quantized.w8a16 (this model) | Recovery |
|-----------|---------------------------|--------------------------------------------------------|----------|
| MMLU (5-shot) | 80.18 | 78.69 | 98.1% |
| ARC Challenge (25-shot) | 72.44 | 71.59 | 98.8% |
| GSM-8K (5-shot, strict-match) | 90.83 | 86.43 | 95.2% |
| Hellaswag (10-shot) | 85.54 | 85.65 | 100.1% |
| Winogrande (5-shot) | 83.19 | 83.11 | 99.9% |
| TruthfulQA (0-shot) | 62.92 | 61.94 | 98.4% |
| Average | 79.18 | 77.90 | 98.4% |
🔧 Technical Details
This model was optimized by quantizing the weights of Meta-Llama-3-70B-Instruct to INT8. Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied: a linear scaling per output dimension maps the INT8 representation of the quantized weights to their floating-point counterparts. AutoGPTQ performs the quantization with the damping factor and calibration data described above.
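As a minimal sketch of what this means in practice, the snippet below applies naive symmetric per-channel INT8 quantization to a weight matrix: one scale per output channel, chosen so the largest-magnitude weight in each row maps to ±127. This only illustrates the number format; it omits the calibration-driven error compensation that AutoGPTQ performs.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Naive symmetric per-channel INT8 quantization (illustrative only)."""
    # weight: [out_features, in_features] of a linear operator
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output channel
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor):
    # Linear per-output-dimension scaling maps INT8 back to floating point: W ≈ scales * Q
    return q.to(torch.float32) * scales

w = torch.randn(4096, 4096)
q, s = quantize_per_channel_int8(w)
print((dequantize(q, s) - w).abs().max())  # small reconstruction error
```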
📄 License
This model is licensed under the Llama 3 license.