# 🚀 Llama 4 Model

The Llama 4 collection is a set of natively multimodal AI models that support text and multimodal experiences, offering industry-leading performance in text and image understanding.
## 🚀 Quick Start

Please make sure you have `transformers` v4.51.0 or later installed, or upgrade with `pip install -U transformers`.
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

# Load the processor (tokenizer + image preprocessing) and the model.
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

# A single user turn containing two images and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

# Render the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens, slicing off the prompt.
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])  # raw token IDs, including the prompt
```
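By default, `batch_decode` keeps special tokens (such as end-of-turn markers) in the decoded string. If you want a clean reply, you can pass `skip_special_tokens=True`:

```python
# Decode the generated continuation without special-token markers.
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)[0]
```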
## ✨ Features

- Multimodal Capability: The Llama 4 models are natively multimodal, enabling text and multimodal experiences, and are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
- Mixture-of-Experts Architecture: A mixture-of-experts (MoE) architecture delivers industry-leading performance in text and image understanding (see the configuration sketch after this list).
- Multiple Language Support: Supports 12 languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.
- Model Adaptability: Instruction-tuned models are suitable for assistant-like chat and visual reasoning tasks, while pretrained models can be adapted for natural language generation. Model outputs can also be leveraged to improve other models.
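A quick way to see the MoE layout without downloading any weights is to inspect the model configuration. This is a minimal sketch; the exact field names inside `text_config` vary by `transformers` version, so print the object to see what your install exposes:

```python
# Minimal sketch: inspect the MoE configuration without downloading model weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-4-Maverick-17B-128E-Instruct")
print(config.text_config)  # the printed config includes the expert-count fields
```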
## 📦 Installation

To use the Llama 4 models with the `transformers` library, ensure you have `transformers` v4.51.0 or later installed. You can upgrade it using the following command:

```bash
pip install -U transformers
```
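If you want to verify the requirement programmatically, a quick sanity check (using `packaging`, which ships as a `transformers` dependency) looks like this:

```python
# Sanity check: Llama 4 support landed in transformers v4.51.0.
import transformers
from packaging.version import Version

assert Version(transformers.__version__) >= Version("4.51.0"), (
    f"transformers {transformers.__version__} is too old; run: pip install -U transformers"
)
```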
## 💻 Usage Examples

### Basic Usage

The basic multimodal inference flow is the Quick Start snippet above: load the processor and model, build a chat message that mixes image URLs with a text question, apply the chat template, call `generate`, and decode only the newly generated tokens.
## 📚 Documentation

### Model Information
| Property | Details |
|---|---|
| Model Type | The Llama 4 collection of models are natively multimodal AI models. |
| Model Developer | Meta |
| Model Architecture | Auto-regressive language models that use a mixture-of-experts (MoE) architecture and incorporate early fusion for native multimodality. |
| Training Data | A mix of publicly available, licensed data and information from Meta's products and services, including publicly shared posts from Instagram and Facebook and people's interactions with Meta AI. |
| Supported Languages | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. |
| Model Release Date | April 5, 2025 |
| Status | A static model trained on an offline dataset. Future versions of the tuned models may be released. |
| License | A custom commercial license, the Llama 4 Community License Agreement, available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE) |
### Intended Use

Intended Use Cases: Llama 4 is intended for commercial and research use in multiple languages. Instruction-tuned models are for assistant-like chat and visual reasoning tasks, and pretrained models can be adapted for natural language generation. For vision, it is optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. It also supports leveraging model outputs to improve other models.

Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws), use prohibited by the Acceptable Use Policy and Llama 4 Community License, and use in languages or capabilities beyond those explicitly supported in this model card.
### Training Data
Llama 4 Scout was pretrained on ~40 trillion tokens and Llama 4 Maverick was pretrained on ~22 trillion tokens of multimodal data from a mix of publicly available, licensed data and information from Meta’s products and services. The pretraining data has a cutoff of August 2024.
### Benchmarks

#### Pre-trained models
| Category | Benchmark | # Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Reasoning & Knowledge | MMLU | 5 | macro_avg/acc_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | No multimodal support | | 89.4 | 91.6 |
#### Instruction-tuned models

| Category | Benchmark | # Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | No multimodal support | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | No multimodal support | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | No multimodal support | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024-02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro_avg/em | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng->kgv / kgv->eng | - | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
## 🔧 Technical Details
### Training Factors

We used custom training libraries, Meta's custom-built GPU clusters, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.

### Training Energy Use

Model pre-training utilized a cumulative 7.38M GPU hours of computation on H100-80GB (700W TDP) hardware. Training time is the total GPU time required for training each model, and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

### Training Greenhouse Gas Emissions

Estimated total location-based greenhouse gas emissions were 1,999 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with clean and renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.
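As a back-of-the-envelope check, 7.38M GPU hours at a 700W peak draw corresponds to roughly 5,166 MWh, which together with the reported 1,999 tCO2eq implies a location-based carbon intensity of about 0.39 tCO2eq per MWh. A minimal sketch of that arithmetic (the intensity figure is derived here, not reported):

```python
# Back-of-the-envelope check of the reported training footprint.
gpu_hours = 7.38e6              # cumulative H100-80GB GPU hours (reported)
tdp_kw = 0.700                  # 700 W peak power per GPU, in kW (reported)

energy_mwh = gpu_hours * tdp_kw / 1000   # ~5,166 MWh at peak draw
implied_intensity = 1999 / energy_mwh    # tCO2eq per MWh, derived

print(f"Energy: {energy_mwh:,.0f} MWh")
print(f"Implied location-based intensity: {implied_intensity:.2f} tCO2eq/MWh")
```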
## 📄 License

A custom commercial license, the Llama 4 Community License Agreement, is available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE)
## ⚠️ Important Note

The 8-bit model currently only works with Unsloth! See [our collection](https://huggingface.co/collections/unsloth/llama-4-67f19503d764b0f3a2a868d2) for versions of Llama 4 including 4-bit & 16-bit formats.
## 💡 Usage Tip

Unsloth's [Dynamic Quants](https://unsloth.ai/blog/dynamic-4bit) are selectively quantized, greatly improving accuracy over standard 4-bit quantization.
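For the pre-quantized dynamic quants, load the corresponding repository from the collection linked above. If you instead want to run a 4-bit variant through `transformers` directly, one option is on-the-fly bitsandbytes quantization; this is a minimal sketch, not Unsloth's dynamic quantization scheme:

```python
# Minimal sketch: on-the-fly 4-bit loading via bitsandbytes (requires `pip install bitsandbytes`).
import torch
from transformers import BitsAndBytesConfig, Llama4ForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)
model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```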