🚀 Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Built with Llama, this is an FP8-quantized version of Llama-4-Scout-17B-16E-Instruct that accepts text and image input and generates text output with reduced memory and compute requirements.
🚀 Quick Start
This model can be deployed efficiently using the vLLM backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Format the request with the model's chat template before generating.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
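The model can also be served through vLLM's OpenAI-compatible endpoint. A minimal sketch, with flag values that are illustrative rather than tuned recommendations:

```bash
vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic \
  --tensor-parallel-size 4
```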
✨ Features
Model Overview
- Model Architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Release Date: 04/15/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)
Model Optimizations
This model was obtained by quantizing the weights and activations of Llama-4-Scout-17B-16E-Instruct to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x. Weight quantization also reduces disk size requirements by approximately 50%. Quantization is performed with the llm-compressor library.
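For intuition, the sketch below illustrates symmetric, dynamic per-token FP8 (E4M3) activation quantization of the kind described above. It is a conceptual illustration in plain PyTorch, not the fused kernel an inference engine would actually run:

```python
import torch

def dynamic_per_token_fp8(x: torch.Tensor):
    """Symmetric dynamic per-token FP8 quantization (illustrative only).

    x: [num_tokens, hidden_dim] activations in bf16/fp16. One scale per
    token is computed at runtime ("dynamic"), so no calibration data is
    needed for activations.
    """
    finfo = torch.finfo(torch.float8_e4m3fn)  # representable max is 448
    # One symmetric scale per token, from that token's absolute maximum.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 8, dtype=torch.bfloat16)
x_fp8, scale = dynamic_per_token_fp8(x)
print(x_fp8.dtype, scale.shape)  # torch.float8_e4m3fn torch.Size([4, 1])
```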
💻 Usage Examples
Basic Usage
The basic usage of this model is demonstrated in the deployment example above.
Advanced Usage
The following script shows how this model can be created with llm-compressor:

```python
#!/usr/bin/env python3
"""
This script loads an LLM and applies FP8 quantization to its weights and
activations. Activations are quantized dynamically, i.e. at actual runtime.
"""

import argparse

from transformers import Llama4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)


def parse_arguments():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="Quantize a causal language model")
    parser.add_argument(
        "--model_path",
        type=str,
        required=True,
        help="Path to the pre-trained model",
    )
    parser.add_argument(
        "--quant_path",
        type=str,
        required=True,
        help="Output path for the quantized model",
    )
    return parser.parse_args()


def main():
    """Load the model and apply FP8 quantization."""
    args = parse_arguments()

    print(f"Loading model from {args.model_path}...")
    model = Llama4ForConditionalGeneration.from_pretrained(
        args.model_path,
        device_map="auto",
        torch_dtype="auto",
        trust_remote_code=True,
    )

    # FP8 weights (static, per-channel) and FP8 activations (dynamic, per-token).
    quant_scheme = QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.CHANNEL,
            symmetric=True,
            observer="mse",
        ),
        input_activations=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.TOKEN,
            symmetric=True,
            dynamic=True,
        ),
        output_activations=None,
    )

    # Quantize Linear layers, but keep the lm_head, attention, MoE routers,
    # and vision components in the original precision.
    recipe = QuantizationModifier(
        targets="Linear",
        config_groups={"group_0": quant_scheme},
        ignore=[
            "re:.*lm_head",
            "re:.*self_attn",
            "re:.*router",
            "re:.*vision_model",
            "re:.*multi_modal_projector",
        ],
    )

    print("Applying quantization...")
    oneshot(
        model=model,
        recipe=recipe,
        trust_remote_code_model=True,
    )

    model.save_pretrained(
        args.quant_path,
        save_compressed=True,
        skip_compression_stats=True,
        disable_sparse_compression=True,
    )
    print(f"Quantized model saved to {args.quant_path}")


if __name__ == "__main__":
    main()
```
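Assuming the script above is saved as quantize_fp8.py (the filename is hypothetical), it can be invoked against the original checkpoint like so:

```bash
# Hypothetical filename; --model_path and --quant_path are the script's own flags.
python quantize_fp8.py \
  --model_path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quant_path ./Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
```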
📚 Documentation
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long-context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations were obtained with lm-evaluation-harness.
Evaluation details
OpenLLM v1

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto
```

OpenLLM v2

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

Long Context RULER

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --batch_size auto
```

Multimodal MMMU

```bash
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto
```

Multimodal ChartQA

```bash
export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto
```
Accuracy
| Benchmark | Recovery (%) | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic (this model) |
|---|---|---|---|
| ARC-Challenge (25-shot) | 100.36 | 69.37 | 69.62 |
| GSM8k (5-shot) | 99.24 | 90.45 | 89.76 |
| HellaSwag (10-shot) | 99.94 | 85.23 | 85.18 |
| MMLU (5-shot) | 99.94 | 80.54 | 80.49 |
| TruthfulQA (0-shot) | 99.17 | 61.41 | 60.90 |
| WinoGrande (5-shot) | 98.88 | 77.90 | 77.03 |
| **OpenLLM v1 Average Score** | **99.59** | **77.48** | **77.16** |
| IFEval (0-shot, avg of inst and prompt acc) | 100.91 | 86.90 | 87.69 |
| Big Bench Hard (3-shot) | 99.82 | 65.13 | 65.01 |
| Math Lvl 5 (4-shot) | 98.82 | 57.78 | 57.10 |
| GPQA (0-shot) | 100.53 | 31.88 | 32.05 |
| MuSR (0-shot) | 102.18 | 42.20 | 43.12 |
| MMLU-Pro (5-shot) | 99.82 | 55.70 | 55.60 |
| **OpenLLM v2 Average Score** | **100.28** | **56.60** | **56.76** |
| RULER (seqlen = 131072) niah_multikey_1 | 101.36 | 88.20 | 89.40 |
| RULER (seqlen = 131072) niah_multikey_2 | 100.72 | 83.60 | 84.20 |
| RULER (seqlen = 131072) niah_multikey_3 | 96.19 | 78.80 | 75.80 |
| RULER (seqlen = 131072) niah_multiquery | 100.79 | 95.40 | 96.15 |
| RULER (seqlen = 131072) niah_multivalue | 97.22 | 73.75 | 71.70 |
| RULER (seqlen = 131072) niah_single_1 | 100.00 | 100.00 | 100.00 |
| RULER (seqlen = 131072) niah_single_2 | 100.00 | 99.80 | 99.80 |
| RULER (seqlen = 131072) niah_single_3 | 100.00 | 99.80 | 99.80 |
| RULER (seqlen = 131072) ruler_cwe | 96.19 | 39.42 | 37.92 |
| RULER (seqlen = 131072) ruler_fwe | 98.86 | 92.93 | 91.87 |
| RULER (seqlen = 131072) ruler_qa_hotpot | 100.00 | 48.20 | 48.20 |
| RULER (seqlen = 131072) ruler_qa_squad | 98.81 | 53.57 | 52.93 |
| RULER (seqlen = 131072) ruler_qa_vt | 100.35 | 92.28 | 92.60 |
| **RULER (seqlen = 131072) Average Score** | **99.49** | **80.44** | **80.03** |
| MMMU (0-shot) | 97.92 | 53.44 | 52.33 |
| ChartQA (0-shot, exact_match) | 100.12 | 65.88 | 65.96 |
| ChartQA (0-shot, relaxed_accuracy) | 99.69 | 88.92 | 88.64 |
| **Multimodal Average Score** | **99.38** | **69.41** | **68.98** |
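Consistent with the numbers above, the Recovery column is the quantized model's score expressed as a percentage of the baseline score:

```python
def recovery(quantized: float, baseline: float) -> float:
    """Recovery (%) = 100 * quantized score / baseline score."""
    return 100.0 * quantized / baseline

# ARC-Challenge (25-shot), from the table above
print(f"{recovery(69.62, 69.37):.2f}")  # 100.36
```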
📄 License
This model is released under the llama4 license, which is listed under the "other" license category.