Llama-4-Scout-17B-16E-Instruct Open-Source Multimodal AI - Supports 12 Languages and Image Understanding

Llama 4 Scout 17B 16E Instruct

Developed by chutesai

Llama 4 Scout is a multimodal mixture-of-experts model from Meta with 17B active parameters and 16 experts. It supports 12 languages and image understanding.

Multimodal Fusion

Transformers

Supports Multiple Languages
Open Source License: Other
#Multimodal Mixture of Experts #12 Native Language Support #10M Ultra-Long Context

Downloads 173.52k

Release Date: April 5, 2025

Model Overview

A native multimodal AI model featuring a mixture-of-experts architecture, supporting text and multimodal experiences, suitable for multilingual applications, conversational assistants, and visual reasoning scenarios.

Model Features

Multimodal Support

Supports both text and image inputs for cross-modal understanding and reasoning

Mixture-of-Experts Architecture

Features a 16-expert design with 17B active parameters and 109B total parameters, balancing performance and efficiency

Multilingual Capabilities

Natively supports 12 languages. Pretraining covers 200 languages, and developers may fine-tune the model for additional languages subject to the license terms

Long Context Processing

Supports a context window of up to 10M tokens, for long-document tasks such as book-length translation

Model Capabilities

Multilingual text generation

Image understanding and captioning

Cross-modal reasoning

Code generation

Long-form text translation

Synthetic data generation

Use Cases

Commercial Applications

Multilingual Customer Support Assistant

Intelligent customer support system supporting 12 languages

The pre-trained model scored 31.5 (average F1, 1-shot) on the TydiQA multilingual benchmark

Visual Reasoning

Image Content Analysis

Processes multiple images simultaneously for comparative analysis

The pre-trained model scored 89.4 (ANLS, 0-shot) on the DocVQA benchmark

Education & Research

Multilingual Teaching Tool

Generates multilingual teaching materials and exercises

Half-book translation on MTOB achieved chrF scores of 42.2 (eng->kgv) and 36.6 (kgv->eng)

🚀 Llama 4 Model

The Llama 4 models are natively multimodal AI models, enabling text and multimodal experiences. They leverage a mixture-of-experts architecture, offering industry-leading performance in text and image understanding.

🚀 Quick Start

Prerequisites

Make sure transformers v4.51.0 is installed, or upgrade using pip install -U transformers.

Example Code

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Load the processor (tokenizer plus image preprocessing) and the model.
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",           # place weights on the available devices
    torch_dtype=torch.bfloat16,  # the released weights are BF16
)

# Two example images to compare.
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

# A single user turn containing both images followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

# Apply the chat template, tokenize, and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens by slicing off the prompt tokens.
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
# Raw token IDs of the full sequence (prompt plus generation).
print(outputs[0])
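
The decode step above slices off the prompt before decoding, because `generate` returns the prompt tokens followed by the new tokens. The sketch below reproduces that indexing with plain Python lists rather than tensors. The helper name `strip_prompt` and the token IDs are made up for illustration; real IDs come from the processor.

```python
# Toy illustration of outputs[:, prompt_len:] using plain lists.
# Token IDs here are invented and carry no meaning.

def strip_prompt(sequences, prompt_len):
    """Return only the tokens that follow the prompt in each sequence."""
    return [seq[prompt_len:] for seq in sequences]

prompt_ids = [[101, 7, 8, 9]]              # a batch with one 4-token prompt
generated = [prompt_ids[0] + [42, 43, 2]]  # prompt followed by 3 new tokens

new_tokens = strip_prompt(generated, prompt_len=len(prompt_ids[0]))
print(new_tokens)  # [[42, 43, 2]]
```

Without the slice, the decoded response would begin with the full chat-formatted prompt.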

✨ Features

Multimodal Capability: The Llama 4 models are natively multimodal, accepting text and image inputs, with strong performance in text and image understanding.
Mixture-of-Experts Architecture: Each model activates 17B parameters out of a larger expert pool (16 experts for Scout, 128 for Maverick).
Model Variants: The Llama 4 series launches with two efficient models, Llama 4 Scout and Llama 4 Maverick.

📦 Installation

Ensure you have transformers v4.51.0 installed. You can upgrade it using the following command:

pip install -U transformers
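
To confirm the requirement before loading the model, a small check along the following lines may help. It is a minimal sketch using only the standard library. The helpers `parse_release` and `meets_minimum` are hypothetical names introduced here, and the parser reads only the leading numeric `X.Y.Z` part, so pre-release suffixes such as `rc1` or `.dev0` are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (4, 51, 0)  # minimum version stated in this model card

def parse_release(version_string):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints.

    Simplification: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def meets_minimum(version_string, minimum=MIN_VERSION):
    """True if the parsed version is at least the required minimum."""
    return parse_release(version_string) >= minimum

try:
    installed = version("transformers")
    status = "OK" if meets_minimum(installed) else "too old, upgrade needed"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```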

📚 Documentation

Model Information

Model developer: Meta
Model Architecture: The Llama 4 models are auto-regressive language models that use a mixture-of-experts (MoE) architecture and incorporate early fusion for native multimodality.

| Property | Details |
| --- | --- |
| Model Type | Llama 4 Scout (17Bx16E), Llama 4 Maverick (17Bx128E) |
| Training Data | A mix of publicly available, licensed data and information from Meta's products and services. This includes publicly shared posts from Instagram and Facebook and people's interactions with Meta AI. Learn more in the Privacy Center. |
| Params (Llama 4 Scout) | 17B (Activated), 109B (Total) |
| Params (Llama 4 Maverick) | 17B (Activated), 400B (Total) |
| Input modalities | Multilingual text and image |
| Output modalities | Multilingual text and code |
| Context length (Llama 4 Scout) | 10M |
| Context length (Llama 4 Maverick) | 1M |
| Token count (Llama 4 Scout) | ~40T |
| Token count (Llama 4 Maverick) | ~22T |
| Knowledge cutoff | August 2024 |
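
The activated and total parameter counts in the table can be turned into a rough share of the model that is activated, which is the efficiency argument behind the mixture-of-experts design. The sketch below only does that arithmetic on the table's figures. The dictionary layout and the function name `active_fraction` are illustrative choices, not part of any Llama tooling.

```python
# Parameter counts in billions, copied from the table above.
MODELS = {
    "Llama 4 Scout": {"active_b": 17, "total_b": 109, "experts": 16},
    "Llama 4 Maverick": {"active_b": 17, "total_b": 400, "experts": 128},
}

def active_fraction(active_b, total_b):
    """Share of all parameters that are activated."""
    return active_b / total_b

for name, spec in MODELS.items():
    share = active_fraction(spec["active_b"], spec["total_b"])
    print(f"{name}: {spec['experts']} experts, "
          f"{share * 100:.2f}% of parameters activated")
```

Both variants activate the same 17B parameters, so Maverick's larger expert pool raises total capacity without raising the activated count.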

Supported languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.

Model Release Date: April 5, 2025

Status: This is a static model trained on an offline dataset. Future versions of the tuned models may be released as we improve model behavior with community feedback.

License: A custom commercial license, the Llama 4 Community License Agreement, is available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE)

Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the Llama [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 4 in applications, please go [here](https://github.com/meta-llama/llama-cookbook).

Intended Use

Intended Use Cases: Llama 4 is intended for commercial and research use in multiple languages. Instruction-tuned models are intended for assistant-like chat and visual reasoning tasks, whereas pretrained models can be adapted for natural language generation. For vision, Llama 4 models are also optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The Llama 4 model collection also supports the ability to leverage the outputs of its models to improve other models, including synthetic data generation and distillation. The Llama 4 Community License allows for these use cases.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 4 Community License. Use in languages or capabilities beyond those explicitly referenced as supported in this model card.

Hardware and Software

Training Factors: Custom training libraries, Meta's custom-built GPU clusters, and production infrastructure were used for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.
Training Energy Use: Model pre-training used a cumulative 7.38M GPU hours of computation on H100-80GB hardware (TDP of 700W). Training time is the total GPU time required for training each model, and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.
Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 1,999 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with clean and renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

| Model Name | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) |
| --- | --- | --- | --- |
| Llama 4 Scout | 5.0M | 700 | 1,354 |
| Llama 4 Maverick | 2.38M | 700 | 645 |
| Total | 7.38M | - | 1,999 |

The methodology used to determine training energy use and greenhouse gas emissions can be found here.
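
As a sanity check on the table, the totals can be recomputed and the GPU hours converted into a rough energy figure. The sketch below assumes every GPU hour draws the full 700 W and ignores the power usage efficiency adjustment mentioned above, so it does not reproduce Meta's methodology. The implied emissions intensity is derived from the table's numbers only and is not a published figure.

```python
GPU_POWER_W = 700  # H100-80GB TDP from the table

# Figures copied from the training table.
TRAINING = {
    "Llama 4 Scout": {"gpu_hours": 5.0e6, "tons_co2eq": 1354},
    "Llama 4 Maverick": {"gpu_hours": 2.38e6, "tons_co2eq": 645},
}

def energy_mwh(gpu_hours, power_w=GPU_POWER_W):
    """Convert GPU hours at a fixed wattage to megawatt-hours (1 MWh = 1e6 Wh)."""
    return gpu_hours * power_w / 1e6

total_hours = sum(m["gpu_hours"] for m in TRAINING.values())
total_tons = sum(m["tons_co2eq"] for m in TRAINING.values())
print(f"Total: {total_hours / 1e6:.2f}M GPU hours, {total_tons} tons CO2eq")

for name, m in TRAINING.items():
    mwh = energy_mwh(m["gpu_hours"])
    # Implied intensity under the fixed-700W assumption (illustrative only).
    intensity = m["tons_co2eq"] / mwh
    print(f"{name}: ~{mwh:,.0f} MWh, ~{intensity:.3f} tons CO2eq per MWh")
```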

Benchmarks

Pre-trained models

| Category | Benchmark | # Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reasoning & Knowledge | MMLU | 5 | macro_avg/acc_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | | | 89.4 | 91.6 |

Instruction-tuned models

| Category | Benchmark | # Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed_accuracy | | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024 - 02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro_avg/acc | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng->kgv/kgv->eng | - | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
| | MTOB (full book) eng->kgv/kgv->eng | - | chrF | | | 39.7/36.3 | 50.8/46.7 |

^Reported numbers for MMMU Pro are the average of the Standard and Vision tasks.

Quantization

The Llama 4 Scout model is released as BF16 weights, but can fit within a single H100 GPU with on-the-fly int4 quantization; the Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while still maintaining quality.
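
A back-of-the-envelope calculation shows why these statements are plausible. The sketch below estimates weight storage only, from the total parameter counts in this card and the nominal bytes per parameter of each format. It ignores activations, the KV cache, and any quantization metadata, and it assumes an H100 DGX host has eight 80 GB GPUs. The function name `weight_gb` is illustrative.

```python
# Nominal storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

H100_GB = 80               # per-GPU memory, from "H100-80GB" in this card
DGX_HOST_GB = 8 * H100_GB  # assumption: 8 GPUs per H100 DGX host

def weight_gb(total_params_billion, dtype):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_billion * BYTES_PER_PARAM[dtype]

scout_bf16 = weight_gb(109, "bf16")    # Scout, as released
scout_int4 = weight_gb(109, "int4")    # Scout, int4 quantized on the fly
maverick_fp8 = weight_gb(400, "fp8")   # Maverick, FP8 release

print(f"Scout BF16:   {scout_bf16:.1f} GB, fits one H100: {scout_bf16 <= H100_GB}")
print(f"Scout int4:   {scout_int4:.1f} GB, fits one H100: {scout_int4 <= H100_GB}")
print(f"Maverick FP8: {maverick_fp8:.1f} GB, fits one DGX host: {maverick_fp8 <= DGX_HOST_GB}")
```

Real deployments need headroom beyond the weights, particularly for long contexts, so these numbers are a lower bound on memory use.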

Intended Use Notes

⚠️ Important Note

Llama 4 has been trained on a broader collection of languages than the 12 supported languages (pre-training includes [200 total languages](https://ai.meta.com/research/no-language-left-behind/)). Developers may fine-tune Llama 4 models for languages beyond the 12 supported languages provided they comply with the Llama 4 Community License and the Acceptable Use Policy. Developers are responsible for ensuring that their use of Llama 4 in additional languages is done in a safe and responsible manner.

Llama 4 has been tested for image understanding up to 5 input images. If leveraging additional image understanding capabilities beyond this, Developers are responsible for ensuring that their deployments are mitigated for risks and should perform additional testing and tuning tailored to their specific applications.
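
One way to keep a deployment within the tested range is to validate the image count when building the chat messages. The helper below is a hypothetical sketch, not part of transformers: `build_messages` and `MAX_TESTED_IMAGES` are names introduced here, and the message layout mirrors the Quick Start example.

```python
MAX_TESTED_IMAGES = 5  # tested limit stated in this model card

def build_messages(image_urls, prompt, max_images=MAX_TESTED_IMAGES):
    """Build a single-turn chat message list with images followed by text.

    Raises ValueError when more images are supplied than max_images.
    """
    if len(image_urls) > max_images:
        raise ValueError(
            f"{len(image_urls)} images supplied; tested limit is {max_images}"
        )
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

# Placeholder URLs for illustration only.
messages = build_messages(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "How do these two images differ?",
)
print(len(messages[0]["content"]))  # 3
```

Raising `max_images` deliberately, rather than removing the check, keeps the decision to go beyond the tested range explicit.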

📄 License

A custom commercial license, the Llama 4 Community License Agreement, is available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE)
