🚀 Llama 4 - Multimodal AI Models
Llama 4 is a collection of natively multimodal AI models developed by Meta. These models offer industry-leading performance in text and image understanding, enabling both text-only and multimodal experiences. They use a mixture-of-experts (MoE) architecture and early fusion for native multimodality.
🚀 Quick Start
Prerequisites
Please make sure you have transformers v4.51.0 installed, or upgrade using `pip install -U transformers`.
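If you are unsure which version is installed, a quick check such as the following can confirm the requirement before running the examples (a minimal sketch; any equivalent version check works):

```python
# Illustrative version check: confirm transformers >= 4.51.0 before running the examples.
import transformers
from packaging import version  # packaging is already installed as a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old; run `pip install -U transformers`"
    )
```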
Example Code
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

# Build model inputs (tokenized text plus preprocessed images) from the chat template.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens (everything after the prompt).
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
```
✨ Features
- Multimodal Capabilities: Llama 4 models support both text and image understanding, enabling a wide range of multimodal applications such as visual reasoning, captioning, and answering questions about images.
- Mixture-of-Experts Architecture: Leveraging this architecture, the models offer industry-leading performance across tasks including reasoning, knowledge, code generation, and multilingual processing (a conceptual routing sketch follows this list).
- Two Efficient Models: The Llama 4 series includes Llama 4 Scout (17 billion activated parameters, 16 experts) and Llama 4 Maverick (17 billion activated parameters, 128 experts).
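As referenced in the list above, the sketch below illustrates the general idea behind mixture-of-experts routing: a router picks a small subset of expert MLPs per token, so only a fraction of the total parameters is activated per forward pass. This is a conceptual, hypothetical PyTorch toy and not Meta's actual Llama 4 implementation (layer shapes, routing details, and shared-expert handling differ).

```python
# Conceptual toy example of mixture-of-experts routing (NOT the real Llama 4 code):
# a router scores experts per token and only the top-k experts run for that token.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        scores = self.router(x).softmax(dim=-1)             # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # chosen experts per token
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = indices[:, slot] == expert_id         # tokens routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = ToyMoELayer(hidden_size=64, num_experts=8, top_k=1)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```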
📦 Installation
To use Llama 4 with the `transformers` library, ensure you have transformers v4.51.0 installed. You can upgrade it using the following command:

```bash
pip install -U transformers
```
💻 Usage Examples
Basic Usage
The Python code in the "Quick Start" section demonstrates basic usage of Llama 4 for multimodal input processing and generation: two images and a text question are combined in a single chat message, preprocessed, and passed to the model. A text-only variant is sketched below.
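For text-only use, the same processor and model can be driven without image entries in the message content. The following is a minimal sketch under that assumption; the prompt text and generation settings are illustrative:

```python
# Text-only chat with Llama 4 (illustrative prompt; no image entries in the message content).
# Uses the same model and processor setup as the Quick Start.
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Write a haiku about llamas."}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```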
📚 Documentation
Model Information
- Model Developer: Meta
- Model Architecture: Auto-regressive language models using a mixture-of-experts (MoE) architecture and early fusion for native multimodality.
| Property | Details |
|---|---|
| Model Type | Llama 4 Scout, Llama 4 Maverick |
| Training Data | A mix of publicly available, licensed data and information from Meta's products and services, including publicly shared posts from Instagram and Facebook and people's interactions with Meta AI. Cutoff: August 2024. |
| Params (Llama 4 Scout) | 17B (Activated), 109B (Total) |
| Params (Llama 4 Maverick) | 17B (Activated), 400B (Total) |
| Input modalities | Multilingual text and image |
| Output modalities | Multilingual text and code |
| Context length (Llama 4 Scout) | 10M |
| Context length (Llama 4 Maverick) | 1M |
| Token count (Llama 4 Scout) | ~40T |
| Token count (Llama 4 Maverick) | ~22T |
| Knowledge cutoff | August 2024 |
| Supported languages | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese |
| Model Release Date | April 5, 2025 |
| Status | Static model trained on an offline dataset. Future tuned versions may be released. |
| License | Llama 4 Community License Agreement: [https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE) |
| Feedback | Instructions on providing feedback can be found in the Llama [README](https://github.com/meta-llama/llama-models/blob/main/README.md). Technical information about generation parameters and usage recipes can be found in the [Llama cookbook](https://github.com/meta-llama/llama-cookbook). |
Intended Use
- Intended Use Cases: Commercial and research use in multiple languages. Instruction-tuned models are intended for assistant-like chat and visual reasoning tasks. Pretrained models can be adapted for natural language generation. The models also support leveraging their outputs to improve other models, including synthetic data generation and distillation.
- Out-of-scope: Any use that violates applicable laws or regulations, the Acceptable Use Policy, or the Llama 4 Community License, and use in languages or capabilities beyond those explicitly supported.
Hardware and Software
- Training Factors: Custom training libraries, Meta's custom-built GPU clusters, and production infrastructure were used for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.
- Training Energy Use: Model pre-training used a cumulative 7.38M GPU hours of computation on H100-80GB (TDP of 700W) hardware.
| Model Name | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
|---|---|---|---|---|
| Llama 4 Scout | 5.0M | 700 | 1,354 | 0 |
| Llama 4 Maverick | 2.38M | 700 | 645 | 0 |
| Total | 7.38M | - | 1,999 | 0 |
The methodology for determining training energy use and greenhouse gas emissions can be found here.
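As a rough, back-of-the-envelope illustration only (not Meta's published methodology, which also accounts for data-center efficiency and grid carbon intensity), multiplying the reported GPU-hours by the 700W TDP gives an upper bound on direct GPU energy:

```python
# Back-of-the-envelope upper bound on direct GPU energy at full TDP (illustrative only).
gpu_hours = {"Llama 4 Scout": 5.0e6, "Llama 4 Maverick": 2.38e6}
tdp_kw = 0.7  # H100-80GB TDP of 700 W

for model_name, hours in gpu_hours.items():
    print(f"{model_name}: {hours * tdp_kw / 1e6:.2f} GWh")
print(f"Total: {sum(gpu_hours.values()) * tdp_kw / 1e6:.2f} GWh")  # ~5.17 GWh
```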
Benchmarks
Pre-trained models

| Category | Benchmark | # Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Reasoning & Knowledge | MMLU | 5 | macro_avg/acc_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | | | 89.4 | 91.6 |
Instruction tuned models
| Category | Benchmark | # Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed_accuracy | | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024 - 02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro_avg/acc | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng->kgv/kgv->eng | - | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
| | MTOB (full book) eng->kgv/kgv->eng | - | chrF | | | 39.7/36.3 | 50.8/46.7 |
^Reported numbers for MMMU Pro are the average of the Standard and Vision tasks.
Quantization
- The Llama 4 Scout model is released as BF16 weights and can fit within a single H100 GPU with on-the-fly int4 quantization.
- The Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while maintaining quality. Code for on-the-fly int4 quantization is also provided to minimize performance degradation (see the loading sketch after this list).
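As referenced above, one way to approximate on-the-fly low-bit loading with the `transformers` library is 4-bit quantization via bitsandbytes. This is a minimal sketch under that assumption and is not Meta's official int4 inference code from the llama-models repository; memory savings and output quality may differ:

```python
# Minimal sketch: load Llama 4 Scout with on-the-fly 4-bit quantization via bitsandbytes.
# This approximates low-bit loading in transformers; it is not Meta's official int4 code.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stored in 4-bit, compute in bf16
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```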
Safeguards
The Llama 4 models are subject to the Llama 4 Community License Agreement. When using or distributing the models, please ensure compliance with the license terms, including providing a copy of the agreement, displaying the "Built with Llama" notice, and adhering to the Acceptable Use Policy.
Important Notes
⚠️ Important Note
- Llama 4 has been trained on a broader collection of languages than the 12 supported languages. Developers may fine-tune the models for additional languages, provided they comply with the Llama 4 Community License and the Acceptable Use Policy.
- Llama 4 has been tested for image understanding with up to 5 input images. When going beyond this, developers are responsible for risk mitigation, additional testing, and tuning tailored to their specific applications.
💡 Usage Tip
When using Llama 4, follow the instructions in the Llama [README](https://github.com/meta-llama/llama-models/blob/main/README.md) for providing feedback, and see the [Llama cookbook](https://github.com/meta-llama/llama-cookbook) for more technical information about generation parameters and usage recipes.