Llama-4-Scout-17B-16E-Instruct-FP8-dynamic開源多語言指令模型

首頁

Llama 4 Scout 17B 16E Instruct FP8 Dynamic

由RedHatAI開發

基於Llama-4構建的17B參數多語言指令模型，採用FP8量化優化，顯著降低資源需求

圖像生成文本

Safetensors

支持多種語言開源協議:其他 #FP8量化加速 #多模態指令理解 #多語言生成

下載量 5,812

發布時間 : 4/10/2025

模型概述

這是一個經過FP8量化的多語言大語言模型，支持文本和圖像輸入，輸出文本響應。通過量化技術減少50%內存需求和磁盤空間，同時提升計算效率。

模型特點

FP8量化優化

權重和激活值均採用FP8量化，減少50%內存需求和磁盤空間，提升2倍計算吞吐量

多模態支持

支持圖像和文本輸入，可處理多模態任務

多語言能力

支持12種語言的文本處理和生成

模型能力

文本生成

圖像理解

多語言處理

指令跟隨

使用案例

智能助手

多語言客服機器人

構建支持多種語言的智能客服系統

可流暢處理12種語言的客戶諮詢

內容生成

多語言內容創作

自動生成多語言營銷文案或社交媒體內容

🚀 Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

Llama-4-Scout-17B-16E-Instruct-FP8-dynamic 是基於 Llama 構建的模型，通過量化優化減少了 GPU 內存需求並提高了計算吞吐量，支持多語言，可用於圖像 - 文本到文本的任務。

🚀 快速開始

本模型可以使用 vLLM 後端進行高效部署，示例代碼如下：

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM 還支持與 OpenAI 兼容的服務，更多詳細信息請參閱文檔。

✨ 主要特性

多語言支持：支持阿拉伯語（ar）、德語（de）、英語（en）等多種語言。
模型架構：採用 Llama4ForConditionalGeneration 架構，輸入可以是文本或圖像，輸出為文本。
模型優化：對激活和權重進行 FP8 量化，減少 GPU 內存需求和磁盤大小要求，提高計算吞吐量。
多任務評估：在多個任務上進行了評估，包括 OpenLLM 排行榜任務、長上下文 RULER、多模態 MMMU 和多模態 ChartQA。

📦 安裝指南

文檔未提供具體安裝步驟，故跳過該章節。

💻 使用示例

基礎用法

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

高級用法

文檔未提供高級用法示例，故跳過該部分。

📚 詳細文檔

模型概述

屬性	詳情
模型架構	Llama4ForConditionalGeneration，輸入為文本或圖像，輸出為文本
模型優化	激活量化：FP8；權重量化：FP8
發佈日期	2025 年 4 月 15 日
版本	1.0
模型開發者	Red Hat (Neural Magic)

模型優化

本模型是通過將 Llama-4-Scout-17B-16E-Instruct 的激活和權重量化為 FP8 數據類型得到的。這種優化將表示權重和激活的位數從 16 位減少到 8 位，減少了 GPU 內存需求（約 50%），並提高了矩陣乘法的計算吞吐量（約 2 倍）。權重量化還將磁盤大小要求降低了約 50%。量化使用了 llm-compressor 庫。

模型創建

本模型使用 llm-compressor 創建，代碼如下：

#!/usr/bin/env python3
"""
This script loads an LLM model and applies FP8 quantization to
weights and activations. Activations are dynamically quantized, i.e. during
actual runtime.
"""

import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Llama4ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor import oneshot
from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)


def parse_arguments():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="Quantize a causal language model")
    parser.add_argument(
        "--model_path",
        type=str,
        required=True,
        help="Path to the pre-trained model",
    )
    parser.add_argument(
        "--quant_path",
        type=str,
        required=True,
        help="Output path for the quantized model",
    )
    return parser.parse_args()


def main():
    """Main function to load and quantize the model."""
    args = parse_arguments()

    print(f"Loading model from {args.model_path}...")
    model = Llama4ForConditionalGeneration.from_pretrained(
        args.model_path,
        device_map="auto",
        torch_dtype="auto",
        trust_remote_code=True,
    )

    quant_scheme = QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.CHANNEL,
            symmetric=True,
            observer="mse",
        ),
        input_activations=QuantizationArgs(
            num_bits=8,
            type=QuantizationType.FLOAT,
            strategy=QuantizationStrategy.TOKEN,
            symmetric=True,
            dynamic=True,
        ),
        output_activations=None,
    )

    recipe = QuantizationModifier(
        targets="Linear",
        config_groups={"group_0": quant_scheme},
        ignore=[
            're:.*lm_head',
            're:.*self_attn',
            're:.*router',
            're:.*vision_model',
            're:.*multi_modal_projector',
        ]
    )

    print("Applying quantization...")
    oneshot(
        model=model,
        recipe=recipe,
        trust_remote_code_model=True,
    )

    model.save_pretrained(args.quant_path, save_compressed=True, skip_compression_stats=True, disable_sparse_compression=True)
    print(f"Quantized model saved to {args.quant_path}")


if __name__ == "__main__":
    main()

模型評估

模型在 OpenLLM 排行榜任務（v1 和 v2）、長上下文 RULER、多模態 MMMU 和多模態 ChartQA 上進行了評估。所有評估均通過 lm-evaluation-harness 進行。

評估詳情

OpenLLM v1

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

OpenLLM v2

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

Long Context RULER

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --batch_size auto

Multimodal MMMU

lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto

Multimodal ChartQA

export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto

準確率

	恢復率 (%)	meta-llama/Llama-4-Scout-17B-16E-Instruct	RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic (本模型)
ARC-Challenge 25-shot	100.36	69.37	69.62
GSM8k 5-shot	99.24	90.45	89.76
HellaSwag 10-shot	99.94	85.23	85.18
MMLU 5-shot	99.94	80.54	80.49
TruthfulQA 0-shot	99.17	61.41	60.90
WinoGrande 5-shot	98.88	77.90	77.03
OpenLLM v1 平均得分	99.59	77.48	77.16
IFEval 0-shot 指令和提示準確率的平均值	100.91	86.90	87.69
Big Bench Hard 3-shot	99.82	65.13	65.01
Math Lvl 5 4-shot	98.82	57.78	57.10
GPQA 0-shot	100.53	31.88	32.05
MuSR 0-shot	102.18	42.20	43.12
MMLU-Pro 5-shot	99.82	55.70	55.60
OpenLLM v2 平均得分	100.28	56.60	56.76
RULER 序列長度 = 131072 niah_multikey_1	101.36	88.20	89.40
RULER 序列長度 = 131072 niah_multikey_2	100.72	83.60	84.20
RULER 序列長度 = 131072 niah_multikey_3	96.19	78.80	75.80
RULER 序列長度 = 131072 niah_multiquery	100.79	95.40	96.15
RULER 序列長度 = 131072 niah_multivalue	97.22	73.75	71.70
RULER 序列長度 = 131072 niah_single_1	100.00	100.00	100.00
RULER 序列長度 = 131072 niah_single_2	100.00	99.80	99.80
RULER 序列長度 = 131072 niah_single_3	100.00	99.80	99.80
RULER 序列長度 = 131072 ruler_cwe	96.19	39.42	37.92
RULER 序列長度 = 131072 ruler_fwe	98.86	92.93	91.87
RULER 序列長度 = 131072 ruler_qa_hotpot	100.00	48.20	48.20
RULER 序列長度 = 131072 ruler_qa_squad	98.81	53.57	52.93
RULER 序列長度 = 131072 ruler_qa_vt	100.35	92.28	92.60
RULER 序列長度 = 131072 平均得分	99.49	80.44	80.03
MMMU 0-shot	97.92	53.44	52.33
ChartQA 0-shot 精確匹配	100.12	65.88	65.96
ChartQA 0-shot 寬鬆準確率	99.69	88.92	88.64
多模態平均得分	99.38	69.41	68.98