Meta-Llama-3.1-8B-FP8 Open-Source Model - Supports Multiple Languages and Suitable for Commercial and Research Scenarios

Meta Llama 3.1 8B FP8

Developed by RedHatAI

FP8 quantized version of Meta-Llama-3.1-8B, suitable for multilingual business and research applications.

Large Language Model

Transformers

#Supports multiple languages #FP8 quantization #Multilingual generation #Efficient inference

Downloads 4,154

Release Time : 7/31/2024

Model Overview

This model is a quantized version of Meta-Llama-3.1-8B. Quantizing the weights and activations to the FP8 data type reduces disk size and GPU memory requirements by approximately 50%.

Model Features

FP8 quantization

Quantization of weights and activations to FP8 data type reduces disk size and GPU memory requirements by approximately 50%.

Multilingual support

Supports text generation tasks in multiple languages including English, German, French, and more.

High performance recovery rate

Recovers 99.14% of the original model's average score on the OpenLLM benchmark (version 1), closely matching the performance of the original model.

Model Capabilities

Text generation

Multilingual support

Business applications

Research purposes

Use Cases

Business applications

Multilingual customer service chatbot

Leverage the model's multilingual support to build efficient customer service chatbots.

Enables real-time interaction in multiple languages, improving customer satisfaction.

Research purposes

Language model research

Used to study the impact of quantization on language model performance.

Provides efficient quantized models for research and experimentation.

🚀 Meta-Llama-3.1-8B-FP8

A quantized version of Meta-Llama-3.1-8B, optimized for inference with reduced disk and memory requirements.

🚀 Quick Start

This README provides an overview of the Meta-Llama-3.1-8B-FP8 model, including its architecture, optimizations, creation process, and evaluation results.

✨ Features

Quantization: The weights and activations of the model are quantized to FP8 data type, reducing the disk size and GPU memory requirements by approximately 50%.
Multi-language Support: Supports multiple languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
High Performance: Achieves an average score of 65.90 on the OpenLLM benchmark (version 1).

📦 Installation

The original README gives no installation steps. It notes only that the model is ready for inference with vLLM built from source.

💻 Usage Examples

No usage examples are provided in the original README.

📚 Documentation

Model Overview

Model Architecture: Meta-Llama-3.1
- Input: Text
- Output: Text
Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Intended Use Cases: Intended for commercial and research use in multiple languages. Similar to Meta-Llama-3.1-8B, this model serves as a base version.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Release Date: 7/23/2024
Version: 1.0
License(s): llama3.1
Model Developers: Neural Magic

Model Optimizations

This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B to FP8 data type, ready for inference with vLLM built from source. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
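The approximately 50% figure follows from simple arithmetic on bits per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes a nominal 8 billion parameters (the "8B" in the model name taken literally), and the count of parameters left at 16 bits is a hypothetical value chosen only to show why the saving is approximate rather than exact.

```python
def weight_bytes(total_params: int, params_kept_16bit: int = 0) -> int:
    """Estimate weight storage when every parameter not kept at 16 bits uses 8 bits."""
    quantized_params = total_params - params_kept_16bit
    return quantized_params * 1 + params_kept_16bit * 2


NOMINAL_PARAMS = 8_000_000_000  # assumption: "8B" taken literally
KEPT_16BIT = 1_000_000_000      # hypothetical: parameters excluded from quantization

baseline = NOMINAL_PARAMS * 2                          # 16 bits per parameter
fully_fp8 = weight_bytes(NOMINAL_PARAMS)               # every parameter at 8 bits
partly_fp8 = weight_bytes(NOMINAL_PARAMS, KEPT_16BIT)  # some parameters left at 16 bits

print(f"16-bit baseline: {baseline / 1e9:.1f} GB")
print(f"all FP8:         {fully_fp8 / 1e9:.1f} GB ({fully_fp8 / baseline:.0%} of baseline)")
print(f"partly FP8:      {partly_fp8 / 1e9:.1f} GB ({partly_fp8 / baseline:.0%} of baseline)")
```

Any parameters that stay at 16 bits push the real ratio above one half, which is presumably why the reduction is described as approximate.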

Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale per tensor maps between the original values and the FP8 representations of the quantized weights and activations. LLM Compressor is used for quantization, with 512 sequences from UltraChat as calibration data.
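To make "symmetric per-tensor" concrete, here is a minimal pure-Python sketch of the idea. It assumes the E4M3 variant of FP8 (largest finite value 448), which the text above does not state, and it simulates FP8 rounding in software. The function names are illustrative and are not part of LLM Compressor or vLLM.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value; the FP8 variant is an assumption


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest value on a simulated E4M3 grid, saturating at the max."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), FP8_E4M3_MAX)
    _, exp = math.frexp(mag)  # mag = m * 2**exp with 0.5 <= m < 1
    # 3 mantissa bits give a spacing of 2**(exp - 4); subnormals share spacing 2**-9.
    step = 2.0 ** max(exp - 4, -9)
    return math.copysign(min(round(mag / step) * step, FP8_E4M3_MAX), v)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale for the whole tensor, no zero point."""
    abs_max = max((abs(v) for v in values), default=0.0)
    scale = abs_max / FP8_E4M3_MAX if abs_max > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map simulated FP8 values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.13, 0.25]  # hypothetical tensor values
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
print(quantized, scale, restored)
```

The largest-magnitude element lands on the edge of the FP8 range, and every other element is rounded to the nearest representable value after dividing by the same scale.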

Creation

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "meta-llama/Meta-Llama-3.1-8B"
model_name = model_stub.split("/")[-1]

device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

output_dir = f"./{model_name}-FP8"

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 4096

ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

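# Apply the recipe in a single post-training pass over the calibration data,
# then save the compressed (FP8) checkpoint to output_dir.
oneshot(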
    model=model,
    output_dir=output_dir,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
)
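The recipe sets `dynamic: false` for input activations, so their scales are fixed ahead of time from the calibration samples rather than computed per request. The sketch below illustrates that idea with a simple running abs-max observer. This is an assumption made for illustration and may differ from what LLM Compressor actually does internally. The class name and the sample batches are hypothetical, and the E4M3 maximum of 448 is likewise assumed.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value; the FP8 variant is an assumption


class AbsMaxObserver:
    """Track the largest absolute activation seen across calibration batches."""

    def __init__(self):
        self.abs_max = 0.0

    def update(self, activations):
        batch_max = max((abs(a) for a in activations), default=0.0)
        self.abs_max = max(self.abs_max, batch_max)

    def scale(self):
        # One symmetric scale for the whole tensor, frozen after calibration.
        return self.abs_max / FP8_E4M3_MAX if self.abs_max > 0 else 1.0


# Hypothetical activation values standing in for real calibration batches.
calibration_batches = [[0.5, -2.0, 1.0], [3.5, -0.25], [-7.0, 6.0]]

observer = AbsMaxObserver()
for batch in calibration_batches:
    observer.update(batch)

static_scale = observer.scale()
print(f"abs max = {observer.abs_max}, static scale = {static_scale}")
```

With a static scale, activations larger than anything seen during calibration would saturate at inference time, which is one reason the calibration set needs to be representative.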

Evaluation

The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA. Evaluation was conducted using the Neural Magic fork of lm-evaluation-harness (branch llama_3.1_instruct) and the vLLM engine. This version of lm-evaluation-harness includes a version of ARC-Challenge that matches the prompting style of Meta-Llama-3.1-evals.

Accuracy

Benchmark	Meta-Llama-3.1-8B	Meta-Llama-3.1-8B-FP8 (this model)	Recovery
MMLU (5-shot)	65.19	65.01	99.72%
ARC Challenge (25-shot)	78.84	77.73	98.59%
GSM-8K (5-shot, strict-match)	50.34	48.82	96.98%
Hellaswag (10-shot)	82.33	81.96	99.55%
Winogrande (5-shot)	77.98	78.06	100.10%
TruthfulQA (0-shot, mc2)	44.14	43.83	99.30%
Average	66.47	65.90	99.14%
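The Recovery column is the FP8 score expressed as a percentage of the baseline score. The short sketch below recomputes it from the numbers in the table. The variable and function names are illustrative only.

```python
# (benchmark, baseline score, FP8 score) copied from the table above
SCORES = [
    ("MMLU (5-shot)", 65.19, 65.01),
    ("ARC Challenge (25-shot)", 78.84, 77.73),
    ("GSM-8K (5-shot, strict-match)", 50.34, 48.82),
    ("Hellaswag (10-shot)", 82.33, 81.96),
    ("Winogrande (5-shot)", 77.98, 78.06),
    ("TruthfulQA (0-shot, mc2)", 44.14, 43.83),
]


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for name, base, quant in SCORES:
    print(f"{name:32s} {recovery(base, quant):7.2f}%")

avg_base = sum(base for _, base, _ in SCORES) / len(SCORES)
avg_quant = sum(quant for _, _, quant in SCORES) / len(SCORES)
avg_recovery = recovery(avg_base, avg_quant)
mean_of_recoveries = sum(recovery(b, q) for _, b, q in SCORES) / len(SCORES)

print(f"{'Average':32s} {avg_recovery:7.2f}%  ({avg_base:.2f} -> {avg_quant:.2f})")
```

The Average row's 99.14% matches the ratio of the two average scores. Averaging the six per-benchmark recoveries instead gives about 99.04%, so the two definitions should not be mixed when comparing models.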

Reproduction

The results were obtained using the following commands:

MMLU

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto

ARC-Challenge

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --num_fewshot 25 \
  --batch_size auto

GSM-8K

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto

Hellaswag

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto

Winogrande

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto

TruthfulQA

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
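The six commands above differ only in the task name and the few-shot count. As a convenience, the following sketch generates them from a small table. The helper name `build_command` is hypothetical, and the script only prints the command lines; it does not run lm_eval.

```python
import shlex

MODEL_ARGS = (
    "pretrained=neuralmagic/Meta-Llama-3.1-8B-FP8,dtype=auto,"
    "add_bos_token=True,max_model_len=4096,tensor_parallel_size=1"
)

# (task name, number of few-shot examples), as used in the commands above
TASKS = [
    ("mmlu", 5),
    ("arc_challenge_llama_3.1_instruct", 25),
    ("gsm8k", 5),
    ("hellaswag", 10),
    ("winogrande", 5),
    ("truthfulqa", 0),
]


def build_command(task: str, num_fewshot: int) -> list[str]:
    """Return the lm_eval argument list for one benchmark."""
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", MODEL_ARGS,
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
    ]


for task, shots in TASKS:
    print(shlex.join(build_command(task, shots)))
```

Each argument list can also be passed to `subprocess.run` in an environment where the Neural Magic fork of lm-evaluation-harness and vLLM are installed.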

🔧 Technical Details

The model uses symmetric per-tensor quantization for the weights and activations of the linear operators within transformer blocks. LLM Compressor is used for quantization, with 512 sequences from UltraChat as calibration data.

📄 License

This model is licensed under llama3.1.
