Llama-3.2-1B-Instruct-FP8
An FP8-quantized version of Llama-3.2-1B-Instruct that matches the unquantized model's accuracy to within about 1% while roughly halving GPU memory and disk requirements.
Quick Start
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Llama-3.2-1B-Instruct-FP8"
number_gpus = 1
max_model_len = 8192

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# Build the prompt using the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the model with vLLM and generate a completion.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
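For example, after starting a server with "vllm serve neuralmagic/Llama-3.2-1B-Instruct-FP8" (exact flags depend on your vLLM version and hardware), a minimal client sketch using the OpenAI Python SDK could look like the following; the endpoint URL and placeholder API key below assume vLLM's defaults.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000; the key is not checked unless the server sets one).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-1B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)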
Features
- Multilingual Support: Supports multiple languages including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Optimized Architecture: Built on the Llama-3 architecture; takes text as input and produces text as output.
- Quantization Optimization: Uses FP8 quantization for both activations and weights, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
- High Accuracy: Achieves scores within 1.0% of the unquantized model on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA.
Documentation
Model Overview
- Model Architecture: Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
- Activation quantization: FP8
- Weight quantization: FP8
- Intended Use Cases: Commercial and research use in multiple languages. Like Llama-3.2-1B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 9/25/2024
- Version: 1.0
- License(s): Llama3.2
- Model Developers: Neural Magic
This is a quantized version of Llama-3.2-1B-Instruct. It achieves scores within 1.0% of the unquantized model on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA.
Model Optimizations
This model was obtained by quantizing the weights and activations of Llama-3.2-1B-Instruct to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
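As a rough, back-of-the-envelope illustration of the weight-memory saving alone (the ~1.24B parameter count is approximate, and KV cache, activations, and runtime overhead are ignored):

num_params = 1.24e9                    # approximate parameter count of Llama-3.2-1B-Instruct
bf16_gib = num_params * 2 / 2**30      # 16-bit weights: 2 bytes per parameter
fp8_gib = num_params * 1 / 2**30       # FP8 weights: 1 byte per parameter
print(f"BF16: ~{bf16_gib:.1f} GiB, FP8: ~{fp8_gib:.1f} GiB")  # roughly 2.3 GiB vs 1.2 GiB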
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and floating-point representations for each output channel dimension. Activations are quantized with a symmetric per-tensor scheme, where a fixed linear scaling factor is applied between FP8 and floating-point representations for the entire activation tensor. Weights are quantized by rounding to the nearest FP8 representation. The llm-compressor library was used to quantize the model, with 512 sequences taken from Neural Magic's LLM compression calibration dataset.
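The following is a simplified sketch of that scheme rather than the actual llm-compressor implementation; it assumes the FP8 E4M3 format (maximum representable magnitude 448) and PyTorch's float8_e4m3fn dtype.

import torch

FP8_MAX = 448.0  # assumed E4M3 maximum magnitude

def quantize_weight_per_channel(w: torch.Tensor):
    # Symmetric static per-channel scheme: one scale per output channel (row).
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # round to nearest FP8 value
    return w_fp8, scale

def quantize_activation_per_tensor(x: torch.Tensor, scale: torch.Tensor):
    # Symmetric static per-tensor scheme: a single scale for the whole tensor,
    # precomputed from calibration data rather than recomputed at runtime.
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

w = torch.randn(4096, 4096)
w_fp8, w_scale = quantize_weight_per_channel(w)
w_dequant = w_fp8.to(torch.float32) * w_scale  # dequantize by reapplying the scale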
Creation
This model was created with the llm-compressor library, as shown in the code snippet below.
from transformers import AutoTokenizer
from datasets import load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "meta-llama/Llama-3.2-1B-Instruct"
num_samples = 512
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render each calibration example with the model's chat template.
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

# Sample 512 sequences from Neural Magic's LLM compression calibration dataset.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# FP8 quantization of all Linear layers, leaving the lm_head unquantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head"],
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

# Apply the recipe in one shot using the calibration data, then save the result.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)
model.save_pretrained("Llama-3.2-1B-Instruct-FP8")
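If the exported folder will be loaded directly with transformers or vLLM, it can also be convenient to save the tokenizer alongside the quantized weights (not part of the original snippet):

tokenizer.save_pretrained("Llama-3.2-1B-Instruct-FP8")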
Evaluation
The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA. Evaluation was conducted using the Neural Magic fork of lm-evaluation-harness (branch llama_3.1_instruct) and the vLLM engine. This version of lm-evaluation-harness includes variants of MMLU, ARC-Challenge, and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals.
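One way to set up that environment (assuming the fork is hosted at github.com/neuralmagic/lm-evaluation-harness) is sketched below:

pip install vllm
pip install git+https://github.com/neuralmagic/lm-evaluation-harness.git@llama_3.1_instruct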
Accuracy
Open LLM Leaderboard evaluation scores
Benchmark | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-FP8 (this model) | Recovery
---|---|---|---
MMLU (5-shot) | 47.66 | 47.76 | 100.2%
MMLU (CoT, 0-shot) | 47.10 | 47.24 | 94.8%
ARC Challenge (0-shot) | 58.36 | 57.85 | 99.1%
GSM-8K (CoT, 8-shot, strict-match) | 45.72 | 45.49 | 99.5%
Hellaswag (10-shot) | 61.01 | 61.00 | 100.0%
Winogrande (5-shot) | 62.27 | 62.35 | 100.1%
TruthfulQA (0-shot, mc2) | 43.52 | 43.08 | 99.0%
Average | 52.24 | 52.11 | 99.8%
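The Recovery column reports the quantized score as a percentage of the unquantized baseline; for example, for the MMLU (5-shot) row:

baseline, quantized = 47.66, 47.76  # MMLU (5-shot) scores from the table above
recovery = 100 * quantized / baseline
print(f"{recovery:.1f}%")           # 100.2%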
Reproduction
The results were obtained using the following commands:
MMLU
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
--tasks mmlu_llama_3.1_instruct \
--fewshot_as_multiturn \
--apply_chat_template \
--num_fewshot 5 \
--batch_size auto
MMLU-CoT
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
--tasks mmlu_cot_0shot_llama_3.1_instruct \
--apply_chat_template \
--num_fewshot 0 \
--batch_size auto
ARC-Challenge
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
--tasks arc_challenge_llama_3.1_instruct \
--apply_chat_template \
--num_fewshot 0 \
--batch_size auto
GSM-8K
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
--tasks gsm8k_cot_llama_3.1_instruct \
--fewshot_as_multiturn \
--apply_chat_template \
--num_fewshot 8 \
--batch_size auto
Hellaswag
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
--tasks hellaswag \
--num_fewshot 10 \
--batch_size auto
Winogrande
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
--tasks winogrande \
--num_fewshot 5 \
--batch_size auto
TruthfulQA
lm_eval \
--model vllm \
--model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-FP8",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
--tasks truthfulqa \
--num_fewshot 0 \
--batch_size auto
License
The model is licensed under the Llama 3.2 license.

