Meta-Llama-3-8B-Instruct-FP8-KV: Open-Source FP8-Quantized Model for vLLM Inference

Meta Llama 3 8B Instruct FP8 KV

Developed by RedHatAI

The Meta-Llama-3-8B-Instruct model has been quantized to FP8 weights and activations using per-tensor quantization and is ready for inference with vLLM >= 0.5.0. The checkpoint also includes per-tensor scaling parameters for an FP8-quantized KV cache.

Large Language Model

Transformers

#FP8 quantized inference #KV cache optimization #vLLM compatibility

Downloads 3,153

Release Time: 5/20/2024

Model Overview

This is an FP8 quantized Meta-Llama-3-8B-Instruct model that supports FP8 KV cache, designed for efficient inference.

Model Features

FP8 Quantization

Model weights and activations are quantized to FP8 with per-tensor scales, reducing memory usage with only a small accuracy drop (74.98 vs. 75.44 for the unquantized model on gsm8k 5-shot).

FP8 KV Cache Support

Includes per-tensor scaling parameters for an FP8-quantized KV cache, which vLLM uses when started with --kv-cache-dtype fp8 (kv_cache_dtype="fp8" in the Python API).

Efficient Inference

Ready for inference with vLLM >= 0.5.0, which can load the FP8 weights, activation scales, and KV cache scales directly from the checkpoint.

Model Capabilities

Text generation

Dialogue systems

Instruction following

Use Cases

Dialogue systems

Chatbot

Build efficient chatbot applications

Content generation

Text creation

Assist in various text creation tasks

🚀 Meta-Llama-3-8B-Instruct-FP8-KV

This model is Meta-Llama-3-8B-Instruct quantized to FP8 weights and activations, enabling efficient inference with vLLM.

🚀 Quick Start

Meta-Llama-3-8B-Instruct is quantized to FP8 weights and activations using per-tensor quantization. It is ready for inference with vLLM >= 0.5.0. This model checkpoint also includes per-tensor scales for FP8 quantized KV Cache, which can be accessed through the --kv-cache-dtype fp8 argument in vLLM.

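from vllm import LLM

# Load the FP8 checkpoint and enable the FP8 KV cache using the scales stored in it
model = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)  # generated continuation of the prompt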
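
To give a rough sense of what an FP8 KV cache saves, the sketch below estimates the KV cache size per token at 16-bit and at FP8 precision. The architecture values (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 3 8B configuration, not from this page, and the helper name kv_cache_bytes_per_token is hypothetical.

```python
# Architecture values for Llama 3 8B (assumed from the published model config)
NUM_LAYERS = 32
NUM_KV_HEADS = 8   # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128


def kv_cache_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes of KV cache stored per token; the factor 2 covers keys and values."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


fp16_bytes = kv_cache_bytes_per_token(2)  # 16-bit cache
fp8_bytes = kv_cache_bytes_per_token(1)   # FP8 cache

print(f"16-bit KV cache: {fp16_bytes / 1024:.0f} KiB per token")
print(f"FP8 KV cache:    {fp8_bytes / 1024:.0f} KiB per token")
```

Under these assumptions the FP8 cache needs half the memory per token, so roughly twice as many tokens fit in the same cache budget. The per-tensor scales themselves add only a few scalars per layer.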

✨ Features

FP8 Quantization: The model uses per-tensor quantization to FP8 weights and activations, optimizing for efficient inference.
KV Cache Support: It includes per-tensor scales for FP8 quantized KV Cache, accessible via vLLM.
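
As an illustration of what per-tensor FP8 quantization means, here is a minimal sketch in plain Python. It is not the AutoFP8 or vLLM implementation. It assumes the E4M3 FP8 format (largest finite value 448), and the helper names round_to_e4m3, quantize_per_tensor, and dequantize are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in E4M3 (assumed FP8 format)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the leading bit is at e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(values):
    """Quantize a whole tensor (here a flat list) with a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8 values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.13, 0.896]  # toy stand-in for a weight tensor
quantized, scale = quantize_per_tensor(weights)
print("scale:", scale)
print("quantized:", quantized)
print("dequantized:", dequantize(quantized, scale))
```

The single scale maps the largest magnitude in the tensor onto the top of the FP8 range. Every other value is rounded to one of the 8 steps available per power of two, which is where the small quantization error comes from.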

💻 Usage Examples

Basic Usage

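from vllm import LLM

# kv_cache_dtype="fp8" tells vLLM to store the KV cache in FP8 using the checkpoint's scales
model = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)  # generated continuation of the prompt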

Advanced Usage

The model was produced using AutoFP8 with calibration samples from ultrachat.

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-KV"

# Llama 3 has no pad token by default, so reuse EOS for padding the calibration batch
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Render each calibration conversation with the chat template, then tokenize as one batch
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static activation scales are computed from the calibration data;
# lm_head stays unquantized, and KV cache scales are collected for k_proj and v_proj
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    kv_cache_quant_targets=("k_proj", "v_proj"),
)

# Quantize using the calibration examples and write the FP8 checkpoint to disk
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
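
The "static" activation scheme above means the activation scale is fixed ahead of time from calibration data rather than recomputed for each input. The following is a conceptual sketch of that idea in plain Python, not AutoFP8's actual code. It assumes the E4M3 format (largest finite value 448) and a simple "largest magnitude seen" rule, and the function name static_scale is hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in E4M3 (assumed FP8 format)


def static_scale(calibration_batches):
    """Return one per-tensor scale from the largest magnitude seen in calibration."""
    amax = 0.0
    for batch in calibration_batches:
        amax = max(amax, max(abs(v) for v in batch))
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


# Toy activation values standing in for tensors captured during calibration
batches = [[0.5, -2.0, 1.25], [3.5, -0.75, 0.1], [-1.0, 2.25, 0.0]]
scale = static_scale(batches)
print(scale)  # 3.5 / 448 = 0.0078125
```

Because the scale is stored in the checkpoint, no per-request statistics are needed at inference time. The trade-off is that activations larger than anything seen during calibration saturate at the FP8 maximum, which is why representative calibration samples matter.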

📚 Documentation

Open LLM Leaderboard evaluation scores

Benchmark	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct-FP8	Meta-Llama-3-8B-Instruct-FP8-KV (this model)
gsm8k 5-shot	75.44	74.37	74.98
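
One way to read these scores is as accuracy recovery relative to the unquantized baseline. The short calculation below uses only the three gsm8k numbers reported above; the variable names are illustrative.

```python
baseline = 75.44  # Meta-Llama-3-8B-Instruct, gsm8k 5-shot
scores = {
    "Meta-Llama-3-8B-Instruct-FP8": 74.37,
    "Meta-Llama-3-8B-Instruct-FP8-KV": 74.98,
}

# Recovery = quantized score as a percentage of the baseline score
recovery = {name: round(100 * score / baseline, 2) for name, score in scores.items()}

for name, pct in recovery.items():
    print(f"{name}: {pct}% of baseline")
```

On this single benchmark both quantized checkpoints stay within about 1.5% of the baseline score. Differences this small can fall within run-to-run evaluation noise, so the slightly higher FP8-KV score should not be read as the KV cache quantization improving accuracy.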
