Quantized Llama-4-Scout-17B-16E-Instruct Released: About 75% Less GPU Memory, Multilingual Image-and-Text Input

Llama 4 Scout 17B 16E Instruct Quantized.w4a16

Developed by RedHatAI

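An INT4 weight-quantized version of Llama-4-Scout-17B-16E-Instruct. It needs about 75% less GPU memory and supports multilingual image-text-to-text tasks.

Image-Text-to-Text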

Safetensors

Supports Multiple Languages

Open Source License: Other

#Multimodal image-text generation #Efficient INT4 quantization #Enterprise deployment optimization

Downloads 11.03k

Release Time : 4/25/2025

Model Overview

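This is an optimized multilingual large language model that accepts text and image input and produces text output. The weights are quantized to INT4, which significantly reduces resource requirements.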

Model Features

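Efficient quantization

Uses INT4 weight quantization, which reduces GPU memory requirements by about 75% and disk space requirements by about 75% as well.

Multilingual support

Supports image-and-text tasks in 12 languages, including major Asian and European languages.

Enterprise deployment

Optimized for Red Hat enterprise AI platforms, including RHEL AI and OpenShift AI.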

Model Capabilities

Text generation

Multilingual processing

Image-text understanding

Use Cases

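Content creation

Multilingual content generation

Automatically generates culturally appropriate content for users of different languages.

Produces content efficiently in all 12 supported languages.

Enterprise applications

Enterprise knowledge Q&A

Deployed as an internal enterprise knowledge question-answering system.

Responds quickly to employee queries and improves productivity.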

🚀 Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

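This is a quantized model based on Llama-4-Scout-17B-16E-Instruct. Quantization reduces its GPU memory and disk space requirements. The model supports multiple languages and can be deployed on several platforms.

🔍 Model Information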

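Property	Details
Library	vllm
Supported languages	Arabic, German, English, Spanish, French, Hindi, Indonesian, Italian, Portuguese, Thai, Tagalog, Vietnamese
Base model	meta-llama/Llama-4-Scout-17B-16E-Instruct
Task type	Image-Text-to-Text
Tags	facebook, meta, pytorch, llama, llama4, neuralmagic, redhat, llmcompressor, quantized, W4A16, INT4
License	Other (llama4)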

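🚀 Quick Start

The model can be deployed efficiently on several platforms. Deployment instructions for each follow.

💻 Usage Examples

vLLM deployment example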

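from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"
number_gpus = 4  # tensor-parallel degree; set to the number of available GPUs

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the user message in the model's chat template, since this is an instruction-tuned model
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

# One prompt was submitted, so read the first completion of the first request
generated_text = outputs[0].outputs[0].text
print(generated_text)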

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
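As a minimal sketch of what a client for the OpenAI-compatible endpoint could look like, the snippet below builds a `/v1/chat/completions` request using only the Python standard library. The base URL `http://localhost:8000` and the helper names `build_chat_payload`, `build_request`, and `send` are illustrative assumptions, not part of the model card. Nothing is sent until `send` is called against a server you have started yourself.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"


def build_chat_payload(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat payload (hypothetical helper).

    With image_url set, the user message carries a text part and an image part,
    matching the model's text-plus-image input.
    """
    if image_url is None:
        content = prompt
    else:
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.8,
    }


def build_request(payload, base_url="http://localhost:8000"):
    """Wrap the payload in a POST request; base_url is an assumed local server."""
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send(build_request(build_chat_payload("Describe this chart.", image_url="https://example.com/chart.png")))` would return the reply text; the image URL here is a placeholder.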

Red Hat AI Inference Server deployment example

$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
 --ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
--name=vllm \
registry.access.redhat.com/rhaiis/rh-vllm-cuda \
vllm serve \
--tensor-parallel-size 8 \
--max-model-len 32768  \
--enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

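See the Red Hat AI Inference Server documentation for more details.

Red Hat Enterprise Linux AI deployment example

# Download the model from the Red Hat Registry via docker
# Note: the model is downloaded to ~/.cache/instructlab/models unless --model-dir is specified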
ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5

# Serve the model with ilab
ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-quantized-w4a16

# Chat with the model
ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-quantized-w4a16

See the Red Hat Enterprise Linux AI documentation for more details.

Red Hat OpenShift AI deployment example

# Set up the vLLM server with a ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # Optional: change to a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.20-cuda # Change if needed. For AMD: quay.io/modh/vllm:rhoai-2.20-rocm
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

# Attach the model to the vLLM server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 # Optional: change
    serving.kserve.io/deploymentMode: RawDeployment
  name: Llama-4-Scout-17B-16E-Instruct-quantized.w4a16          # Model name. This value is used to invoke the model in the request payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'  # Model-specific
          memory: 8Gi  # Model-specific
          nvidia.com/gpu: '1'  # Accelerator-specific
        requests:  # The same applies to this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime  # Must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

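# Make sure you are in the project where the model should be deployed
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml

# Replace <inference-service-name> and <cluster-ingress-domain> below
# - If unsure, run `oc get inferenceservice` to find the URL

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \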
        -H "Content-Type: application/json" \
        -d '{
    "model": "Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'
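Because the request sets `"stream": true`, the reply arrives as a sequence of server-sent events rather than one JSON body. The sketch below shows one way to reassemble such a stream. It assumes the OpenAI-style chunk format: `data: {...}` lines carrying `choices[0].delta.content`, a final usage-only chunk when `include_usage` is set, and a `data: [DONE]` terminator. The function name `collect_stream` and the sample lines are made up for illustration and are not real server output.

```python
import json


def collect_stream(lines):
    """Join streamed text deltas and return (text, usage)."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]  # only present on the final chunk
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
    return "".join(parts), usage


# Hand-written sample events in the assumed format
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Bees"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " beat their wings fast."}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 6, "total_tokens": 18}}',
    "data: [DONE]",
]

text, usage = collect_stream(sample)
print(text)                   # Bees beat their wings fast.
print(usage["total_tokens"])  # 18
```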

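See the Red Hat OpenShift AI documentation for more details.

🔧 Technical Details

Model Overview

Model architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
Model optimizations:
- Activation quantization: None
- Weight quantization: INT4
Release date: April 25, 2025
Version: 1.0
Model developers: Red Hat (Neural Magic)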

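Model Optimizations

This model was obtained by quantizing the weights of Llama-4-Scout-17B-16E-Instruct to the INT4 data type. This reduces the number of bits used to represent each weight from 16 to 4, which cuts GPU memory requirements by approximately 75% and disk space requirements by approximately 75% as well. Weight quantization was performed with the llm-compressor library.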
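The 75% figure follows directly from the bit widths: storing each weight in 4 bits instead of 16 leaves 4/16 = 25% of the original size. A rough back-of-the-envelope sketch is below. The parameter count is an assumption chosen for illustration, and the estimate ignores quantization scales, any layers left unquantized, activations, and the KV cache, so real savings will be somewhat smaller.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


def reduction_percent(bits_before, bits_after):
    """Relative size reduction when moving between two bit widths."""
    return (1 - bits_after / bits_before) * 100


n_params = 109e9  # assumed total parameter count, for illustration only

fp16_gib = weight_bytes(n_params, 16) / 2**30
int4_gib = weight_bytes(n_params, 4) / 2**30

print(f"16-bit weights: {fp16_gib:.0f} GiB")
print(f"4-bit weights:  {int4_gib:.0f} GiB")
print(f"reduction:      {reduction_percent(16, 4):.0f}%")  # 75%
```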

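📊 Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long-context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations were run with lm-evaluation-harness.

Evaluation details

OpenLLM v1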

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.7,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto

OpenLLM v2

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=8,gpu_memory_utilization=0.5,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

Long Context RULER

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=524288,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks ruler \
  --metadata='{"max_seq_lengths":[131072]}' \
  --batch_size auto

Multimodal MMMU

lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks mmmu_val \
  --apply_chat_template \
  --batch_size auto

Multimodal ChartQA

export VLLM_MM_INPUT_CACHE_GIB=8
lm_eval \
  --model vllm-vlm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",dtype=auto,add_bos_token=False,max_model_len=1000000,tensor_parallel_size=8,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True,max_images=10 \
  --tasks chartqa \
  --apply_chat_template \
  --batch_size auto

Accuracy

Benchmark	Recovery (%)	meta-llama/Llama-4-Scout-17B-16E-Instruct	RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (this model)
ARC-Challenge 25-shot	98.51	69.37	68.34
GSM8k 5-shot	100.4	90.45	90.90
HellaSwag 10-shot	99.67	85.23	84.95
MMLU 5-shot	99.75	80.54	80.34
TruthfulQA 0-shot	99.82	61.41	61.30
WinoGrande 5-shot	98.98	77.90	77.11
OpenLLM v1 average score	99.59	77.48	77.16
IFEval 0-shot (average of instruction and prompt accuracy)	99.51	86.90	86.47
Big Bench Hard 3-shot	99.46	65.13	64.78
Math Lvl 5 4-shot	99.22	57.78	57.33
GPQA 0-shot	100.0	31.88	31.88
MuSR 0-shot	100.9	42.20	42.59
MMLU-Pro 5-shot	98.67	55.70	54.96
OpenLLM v2 average score	99.54	56.60	56.34
MMMU 0-shot	100.6	53.44	53.78
ChartQA 0-shot exact match	100.1	65.88	66.00
ChartQA 0-shot relaxed accuracy	99.55	88.92	88.52
Multimodal average score	100.0	69.41	69.43
RULER seqlen = 131072 niah_multikey_1	98.41	88.20	86.80
RULER seqlen = 131072 niah_multikey_2	94.73	83.60	79.20
RULER seqlen = 131072 niah_multikey_3	96.44	78.80	76.00
RULER seqlen = 131072 niah_multiquery	98.79	95.40	94.25
RULER seqlen = 131072 niah_multivalue	101.6	73.75	74.95
RULER seqlen = 131072 niah_single_1	100.0	100.00	100.0
RULER seqlen = 131072 niah_single_2	100.0	99.80	99.80
RULER seqlen = 131072 niah_single_3	100.2	99.80	100.0
RULER seqlen = 131072 ruler_cwe	87.39	39.42	33.14
RULER seqlen = 131072 ruler_fwe	98.13	92.93	91.20
RULER seqlen = 131072 ruler_qa_hotpot	100.4	48.20	48.40
RULER seqlen = 131072 ruler_qa_squad	96.22	53.57	51.55
RULER seqlen = 131072 ruler_qa_vt	98.82	92.28	91.20
RULER seqlen = 131072 average score	98.16	80.44	78.96
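The recovery column can be reproduced, to within rounding, by dividing the quantized model's score by the baseline score. The sketch below assumes exactly that definition, since the page does not state the formula, and uses the OpenLLM v1 scores copied from the table; the function and variable names are illustrative. Small differences in the last digit are expected because the published scores are themselves rounded.

```python
from statistics import mean


def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return quantized / baseline * 100


# (baseline, quantized) pairs taken from the OpenLLM v1 rows of the table
openllm_v1 = {
    "ARC-Challenge": (69.37, 68.34),
    "GSM8k": (90.45, 90.90),
    "HellaSwag": (85.23, 84.95),
    "MMLU": (80.54, 80.34),
    "TruthfulQA": (61.41, 61.30),
    "WinoGrande": (77.90, 77.11),
}

for task, (base, quant) in openllm_v1.items():
    print(f"{task:15s} {recovery(base, quant):6.2f}%")

# Average each column first, then compute recovery on the averages
base_avg = mean(b for b, _ in openllm_v1.values())
quant_avg = mean(q for _, q in openllm_v1.values())
print(f"average: {base_avg:.2f} -> {quant_avg:.2f} "
      f"({recovery(base_avg, quant_avg):.2f}% recovery)")
```

A recovery above 100% simply means the quantized model scored slightly higher than the baseline on that task, which is within normal run-to-run variation for these benchmarks.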