🚀 Llama 4 - Multimodal AI Models
Llama 4 is a collection of natively multimodal AI models developed by Meta. These models offer industry-leading performance in text and image understanding, enabling both text-only and multimodal experiences. They use a mixture-of-experts (MoE) architecture and early fusion for native multimodality.
🚀 Quick Start
Prerequisites
Please make sure you have transformers v4.51.0 installed, or upgrade using `pip install -U transformers`.
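If you are unsure which version is installed, a quick check such as the following can confirm the requirement before running the examples (a minimal sketch; any equivalent version check works):

```python
# Illustrative version check: confirm transformers >= 4.51.0 before running the examples.
import transformers
from packaging import version  # packaging is already installed as a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old; run `pip install -U transformers`"
    )
```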
Example Code
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

# Build model inputs (tokenized text plus preprocessed images) from the chat template.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens (everything after the prompt).
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
```
✨ Features
- Multimodal Capabilities: Llama 4 models support both text and image understanding, enabling a wide range of multimodal applications such as visual reasoning, captioning, and answering questions about images.
- Mixture-of-Experts Architecture: Leveraging this architecture, the models offer industry-leading performance across tasks including reasoning, knowledge, code generation, and multilingual processing (a conceptual routing sketch follows this list).
- Two Efficient Models: The Llama 4 series includes Llama 4 Scout (17 billion activated parameters, 16 experts) and Llama 4 Maverick (17 billion activated parameters, 128 experts).
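As referenced in the list above, the sketch below illustrates the general idea behind mixture-of-experts routing: a router picks a small subset of expert MLPs per token, so only a fraction of the total parameters is activated per forward pass. This is a conceptual, hypothetical PyTorch toy and not Meta's actual Llama 4 implementation (layer shapes, routing details, and shared-expert handling differ).

```python
# Conceptual toy example of mixture-of-experts routing (NOT the real Llama 4 code):
# a router scores experts per token and only the top-k experts run for that token.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        scores = self.router(x).softmax(dim=-1)             # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # chosen experts per token
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = indices[:, slot] == expert_id         # tokens routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = ToyMoELayer(hidden_size=64, num_experts=8, top_k=1)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```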
📦 Installation
To use Llama 4 with the `transformers` library, ensure you have transformers v4.51.0 installed. You can upgrade it using the following command:

```bash
pip install -U transformers
```
💻 Usage Examples
Basic Usage
The Python code in the "Quick Start" section demonstrates basic usage of Llama 4 for multimodal input processing and generation: two images and a text question are combined in a single chat message, preprocessed, and passed to the model. A text-only variant is sketched below.
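For text-only use, the same processor and model can be driven without image entries in the message content. The following is a minimal sketch under that assumption; the prompt text and generation settings are illustrative:

```python
# Text-only chat with Llama 4 (illustrative prompt; no image entries in the message content).
# Uses the same model and processor setup as the Quick Start.
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Write a haiku about llamas."}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```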
📚 Documentation
Model Information
- Model Developer: Meta
- Model Architecture: Auto-regressive language models using a mixture-of-experts (MoE) architecture and early fusion for native multimodality.
| Property | Details |
|---|---|
| Model Type | Llama 4 Scout, Llama 4 Maverick |
| Training Data | A mix of publicly available, licensed data and information from Meta's products and services, including publicly shared posts from Instagram and Facebook and people's interactions with Meta AI. Cutoff: August 2024. |
| Params (Llama 4 Scout) | 17B (Activated), 109B (Total) |
| Params (Llama 4 Maverick) | 17B (Activated), 400B (Total) |
| Input modalities | Multilingual text and image |
| Output modalities | Multilingual text and code |
| Context length (Llama 4 Scout) | 10M |
| Context length (Llama 4 Maverick) | 1M |
| Token count (Llama 4 Scout) | ~40T |
| Token count (Llama 4 Maverick) | ~22T |
| Knowledge cutoff | August 2024 |
| Supported languages | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese |
| Model Release Date | April 5, 2025 |
| Status | Static model trained on an offline dataset. Future tuned versions may be released. |
| License | Llama 4 Community License Agreement: [https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE) |
| Feedback | Instructions on providing feedback can be found in the Llama [README](https://github.com/meta-llama/llama-models/blob/main/README.md). Technical information about generation parameters and usage recipes can be found in the [Llama cookbook](https://github.com/meta-llama/llama-cookbook). |
Intended Use
- Intended Use Cases: Commercial and research use in multiple languages. Instruction-tuned models are intended for assistant-like chat and visual reasoning tasks. Pretrained models can be adapted for natural language generation. The models also support leveraging their outputs to improve other models, including synthetic data generation and distillation.
- Out-of-scope: Any use that violates applicable laws or regulations, the Acceptable Use Policy, or the Llama 4 Community License, and use in languages or capabilities beyond those explicitly supported.
Hardware and Software
- Training Factors: Custom training libraries, Meta's custom-built GPU clusters, and production infrastructure were used for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.
- Training Energy Use: Model pre-training used a cumulative 7.38M GPU hours of computation on H100-80GB (TDP of 700W) hardware.
| Model Name | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
|---|---|---|---|---|
| Llama 4 Scout | 5.0M | 700 | 1,354 | 0 |
| Llama 4 Maverick | 2.38M | 700 | 645 | 0 |
| Total | 7.38M | - | 1,999 | 0 |
The methodology for determining training energy use and greenhouse gas emissions can be found here.
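As a rough, back-of-the-envelope illustration only (not Meta's published methodology, which also accounts for data-center efficiency and grid carbon intensity), multiplying the reported GPU-hours by the 700W TDP gives an upper bound on direct GPU energy:

```python
# Back-of-the-envelope upper bound on direct GPU energy at full TDP (illustrative only).
gpu_hours = {"Llama 4 Scout": 5.0e6, "Llama 4 Maverick": 2.38e6}
tdp_kw = 0.7  # H100-80GB TDP of 700 W

for model_name, hours in gpu_hours.items():
    print(f"{model_name}: {hours * tdp_kw / 1e6:.2f} GWh")
print(f"Total: {sum(gpu_hours.values()) * tdp_kw / 1e6:.2f} GWh")  # ~5.17 GWh
```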
Benchmarks
Pre-trained models

| Category | Benchmark | # Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Reasoning & Knowledge | MMLU | 5 | macro_avg/acc_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | | | 89.4 | 91.6 |
Instruction tuned models
| Category | Benchmark | # Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed_accuracy | | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024 - 02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro_avg/acc | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng->kgv/kgv->eng | - | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
| | MTOB (full book) eng->kgv/kgv->eng | - | chrF | | | 39.7/36.3 | 50.8/46.7 |
^Reported numbers for MMMU Pro are the average of the Standard and Vision tasks.
Quantization
- The Llama 4 Scout model is released as BF16 weights and can fit within a single H100 GPU with on-the-fly int4 quantization.
- The Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while maintaining quality. Code for on-the-fly int4 quantization is also provided to minimize performance degradation (see the loading sketch after this list).
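As referenced above, one way to approximate on-the-fly low-bit loading with the `transformers` library is 4-bit quantization via bitsandbytes. This is a minimal sketch under that assumption and is not Meta's official int4 inference code from the llama-models repository; memory savings and output quality may differ:

```python
# Minimal sketch: load Llama 4 Scout with on-the-fly 4-bit quantization via bitsandbytes.
# This approximates low-bit loading in transformers; it is not Meta's official int4 code.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stored in 4-bit, compute in bf16
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```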
Safeguards
The Llama 4 models are subject to the Llama 4 Community License Agreement. When using or distributing the models, please ensure compliance with the license terms, including providing a copy of the agreement, displaying the "Built with Llama" notice, and adhering to the Acceptable Use Policy.
Important Notes
⚠️ Important Note
- Llama 4 has been trained on a broader collection of languages than the 12 supported languages. Developers may fine-tune the models for additional languages, provided they comply with the Llama 4 Community License and the Acceptable Use Policy.
- Llama 4 has been tested for image understanding with up to 5 input images. When going beyond this, developers are responsible for risk mitigation, additional testing, and tuning tailored to their specific applications.
💡 Usage Tip
When using Llama 4, follow the instructions in the Llama [README](https://github.com/meta-llama/llama-models/blob/main/README.md) for providing feedback, and see the [Llama cookbook](https://github.com/meta-llama/llama-cookbook) for more technical information about generation parameters and usage recipes.