
Meta-Llama-3.1-8B-Instruct-quantized.w8a8
This is a quantized version of Meta-Llama-3.1-8B-Instruct, intended for multilingual commercial and research use. It recovers roughly 98% or more of the unquantized model's score on each reported benchmark.
Quick Start
This model can be deployed efficiently using the vLLM backend. Here is an example:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
number_gpus = 1
max_model_len = 8192
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
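As a rough illustration of that serving mode, the sketch below builds a chat completion request with the Python standard library only. It assumes a server has already been started locally (for example with `vllm serve` and this model ID) and is listening on port 8000 with the standard `/v1/chat/completions` route. The base URL and the helper names `build_chat_request` and `chat` are assumptions made for this example, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
# Assumption: the server runs locally on port 8000; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(messages, temperature=0.6, top_p=0.9, max_tokens=256):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(messages)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat([{"role": "user", "content": "Who are you?"}])` should return the generated reply. The sampling values mirror the `SamplingParams` used in the offline example.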
Features
- Model Architecture: Meta-Llama-3, with text input and text output.
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages, similar to [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 7/11/2024
- Version: 1.0
- License(s): Llama3.1
- Model Developers: Neural Magic
This model was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. It achieves 105.4% recovery for the Arena-Hard evaluation, 100.3% for OpenLLM v1 (using Meta's prompting when available), 101.5% for OpenLLM v2, 99.7% for HumanEval pass@1, and 98.8% for HumanEval+ pass@1.
Technical Details
Model Optimizations
This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to the INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
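The "approximately 50%" figure follows from simple arithmetic on the bit width. The sketch below works it through, assuming a rounded count of 8 billion parameters (an illustrative figure, not taken from the model card). It ignores layers kept at 16 bits, the stored scaling factors, and runtime memory such as the KV cache, so real savings will be somewhat smaller.

```python
# Assumption: ~8 billion parameters, rounded for illustration only.
NUM_PARAMS = 8.0e9


def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param / 8


fp16_gb = weight_bytes(NUM_PARAMS, 16) / 1e9
int8_gb = weight_bytes(NUM_PARAMS, 8) / 1e9
saving = 1 - int8_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"INT8 weights:   {int8_gb:.1f} GB")
print(f"Reduction:      {saving:.0%}")
```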
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The GPTQ algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.
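To make the two scaling schemes concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per weight output channel and one scale per activation token. It uses plain round-to-nearest rather than GPTQ, so it illustrates only the number format, not how the released weights were chosen. All function names and the tiny matrices are made up for this example.

```python
def quantize_symmetric(values, num_bits=8):
    """Quantize one vector with a single symmetric scale (zero point fixed at 0)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int8_linear(a_q, a_scales, w_q, w_scales):
    """Integer matmul, then rescale each output by its token and channel scales."""
    output = []
    for token, a_scale in zip(a_q, a_scales):
        row = []
        for channel, w_scale in zip(w_q, w_scales):
            accumulator = sum(a * w for a, w in zip(token, channel))  # integer sum
            row.append(accumulator * a_scale * w_scale)
        output.append(row)
    return output


# Weights, shape (out_channels, in_features): one static scale per output channel.
weights = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
w_q, w_scales = zip(*(quantize_symmetric(row) for row in weights))

# Activations, shape (tokens, in_features): one scale per token, computed at runtime.
activations = [[2.0, -0.5, 1.0], [0.1, 0.3, -0.2]]
a_q, a_scales = zip(*(quantize_symmetric(row) for row in activations))

approx = int8_linear(a_q, a_scales, w_q, w_scales)
exact = [[sum(a * w for a, w in zip(tok, ch)) for ch in weights] for tok in activations]
print(approx)
print(exact)
```

Per-channel weight scales let a channel with small values (the second row above) keep its resolution instead of sharing a scale with a larger one. Per-token activation scales serve the same purpose for inputs whose magnitudes are not known ahead of time.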
Creation
This model was created by using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as shown in the following code:
from transformers import AutoTokenizer
from datasets import Dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
import random
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_samples = 256
max_seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_token_id = len(tokenizer.get_vocab()) - 1
input_ids = [[random.randint(0, max_token_id) for _ in range(max_seq_len)] for _ in range(num_samples)]
attention_mask = num_samples * [max_seq_len * [1]]
ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    dampening_frac=0.01,
)
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)
model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
Documentation
Evaluation
This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine.
- Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository. The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
- OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
- HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository.
Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
Note: Results have been updated after Meta modified the chat template.
Accuracy
| Category | Benchmark | Meta-Llama-3.1-8B-Instruct | Meta-Llama-3.1-8B-Instruct-quantized.w8a8 (this model) | Recovery |
|---|---|---|---|---|
| LLM as a judge | Arena Hard | 25.8 (25.1 / 26.5) | 27.2 (27.6 / 26.7) | 105.4% |
| OpenLLM v1 | MMLU (5-shot) | 68.3 | 67.8 | 99.3% |
| OpenLLM v1 | MMLU (CoT, 0-shot) | 72.8 | 72.2 | 99.1% |
| OpenLLM v1 | ARC Challenge (0-shot) | 81.4 | 81.7 | 100.3% |
| OpenLLM v1 | GSM-8K (CoT, 8-shot, strict-match) | 82.8 | 84.8 | 102.5% |
| OpenLLM v1 | Hellaswag (10-shot) | 80.5 | 80.3 | 99.8% |
| OpenLLM v1 | Winogrande (5-shot) | 78.1 | 78.5 | 100.5% |
| OpenLLM v1 | TruthfulQA (0-shot, mc2) | 54.5 | 54.7 | 100.3% |
| OpenLLM v1 | Average | 74.1 | 74.3 | 100.3% |
| OpenLLM v2 | MMLU-Pro (5-shot) | 30.8 | 30.9 | 100.3% |
| OpenLLM v2 | IFEval (0-shot) | 77.9 | 78.0 | 100.1% |
| OpenLLM v2 | BBH (3-shot) | 30.1 | 31.0 | 102.9% |
| OpenLLM v2 | Math-lvl-5 (4-shot) | 15.7 | 15.5 | 98.9% |
| OpenLLM v2 | GPQA (0-shot) | 3.7 | 5.4 | 146.2% |
| OpenLLM v2 | MuSR (0-shot) | 7.6 | 7.6 | 100.0% |
| OpenLLM v2 | Average | 27.6 | 28.0 | 101.5% |
| Coding | HumanEval pass@1 | 67.3 | 67.1 | 99.7% |
| Coding | HumanEval+ pass@1 | 60.7 | 60.0 | 98.8% |
| Multilingual | Portuguese MMLU (5-shot) | 59.96 | 59.36 | 99.0% |
| Multilingual | Spanish MMLU (5-shot) | 60.25 | 59.77 | 99.2% |
| Multilingual | Italian MMLU (5-shot) | 59.23 | 58.61 | 99.0% |
| Multilingual | German MMLU (5-shot) | 58.63 | 58.23 | 99.3% |
| Multilingual | French MMLU (5-shot) | 59.65 | 58.70 | 98.4% |
| Multilingual | Hindi MMLU (5-shot) | 50.10 | 49.33 | 98.5% |
| Multilingual | Thai MMLU (5-shot) | 49.12 | 48.09 | 97.9% |
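
The Recovery column appears to be the quantized model's score expressed as a percentage of the unquantized model's score. The card does not state this definition outright, but it is consistent with the reported values. The sketch below recomputes a few rows under that assumption; the helper name `recovery` is made up for this example. Because the table shows rounded scores, recomputing other rows this way can differ from the published figure in the last digit.

```python
def recovery(baseline_score, quantized_score):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized_score / baseline_score


# (baseline, quantized) pairs copied from the accuracy table.
reported = {
    "Arena Hard": (25.8, 27.2),
    "OpenLLM v1 average": (74.1, 74.3),
    "HumanEval pass@1": (67.3, 67.1),
    "HumanEval+ pass@1": (60.7, 60.0),
}

for name, (baseline, quantized) in reported.items():
    print(f"{name}: {recovery(baseline, quantized):.1f}%")
```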
Reproduction
The results were obtained using the following commands:
MMLU
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
--tasks mmlu_llama_3.1_instruct \
--fewshot_as_multiturn \
--apply_chat_template \
--num_fewshot 5 \
--batch_size auto
MMLU-CoT
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
--tasks mmlu_cot_0shot_llama_3.1_instruct \
--apply_chat_template \
--num_fewshot 0 \
--batch_size auto
ARC-Challenge
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
--tasks arc_challenge_llama_3.1_instruct \
--apply_chat_template \
--num_fewshot 0 \
--batch_size auto
GSM-8K
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
--tasks gsm8k_cot_llama_3.1_instruct \
--fewshot_as_multiturn \
--apply_chat_template \
--num_fewshot 8 \
--batch_size auto
Hellaswag
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
--tasks hellaswag \
--num_fewshot 10 \
--batch_size auto
Winogrande
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
--tasks winogrande \
--num_fewshot 5 \
--batch_size auto
TruthfulQA
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
--tasks truthfulqa \
--num_fewshot 0 \
--batch_size auto
OpenLLM v2
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
--apply_chat_template \
--fewshot_as_multiturn \
--tasks leaderboard \
--batch_size auto
MMLU Portuguese
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
--tasks mmlu_pt_llama_3.1_instruct \
--fewshot_as_multiturn \
--apply_chat_template \
--num_fewshot
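The commands above share one shape and differ only in the task name, the few-shot count, a handful of model arguments, and the chat-template flags. As a convenience, the sketch below assembles such a command from those pieces using only flags that appear in the commands above. The helper `build_lm_eval_command` is hypothetical and not part of lm-evaluation-harness.

```python
import shlex

MODEL_ID = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"


def build_lm_eval_command(task, num_fewshot=None, extra_model_args=None,
                          apply_chat_template=False, fewshot_as_multiturn=False):
    """Return the lm_eval argument list for one benchmark run (hypothetical helper)."""
    model_args = {"pretrained": MODEL_ID, "dtype": "auto"}
    model_args.update(extra_model_args or {})
    model_args["tensor_parallel_size"] = 1
    command = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", ",".join(f"{key}={value}" for key, value in model_args.items()),
        "--tasks", task,
    ]
    if fewshot_as_multiturn:
        command.append("--fewshot_as_multiturn")
    if apply_chat_template:
        command.append("--apply_chat_template")
    if num_fewshot is not None:
        command += ["--num_fewshot", str(num_fewshot)]
    command += ["--batch_size", "auto"]
    return command


# Rebuild the GSM-8K command from the values listed above.
gsm8k = build_lm_eval_command(
    "gsm8k_cot_llama_3.1_instruct",
    num_fewshot=8,
    extra_model_args={"max_model_len": 4096, "max_gen_toks": 1024},
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(shlex.join(gsm8k))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues; `shlex.join` is used here only to print a copy-pasteable line.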
License
The license for this model is Llama3.1.

