Llama 4 Models
The Llama 4 models are natively multimodal AI models, enabling text and multimodal experiences. They use a mixture-of-experts architecture to achieve industry-leading performance in text and image understanding.
🚀 Quick Start
Prerequisites
Please ensure you have `transformers` v4.51.0 installed. You can upgrade it using the following command:
```bash
pip install -U transformers
```
Usage Example
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ],
    },
]

# Apply the chat template, tokenize the prompt (including both images),
# and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Decode only the newly generated tokens (everything after the prompt).
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])  # raw token IDs, including the prompt
```
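The same chat-template flow also works for text-only prompts. A minimal sketch reusing the `processor` and `model` loaded above (the prompt text is only illustrative):
```python
# Text-only chat with the same processor and model as above.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize the idea behind mixture-of-experts models in two sentences."}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])
```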
✨ Features
- Native Multimodality: The Llama 4 models support text and multimodal experiences, offering industry-leading performance in text and image understanding.
- Mixture-of-Experts Architecture: They leverage a mixture-of-experts architecture to provide high performance.
- Two Efficient Models: The Llama 4 series includes Llama 4 Scout (17 billion active parameters, 16 experts) and Llama 4 Maverick (17 billion active parameters, 128 experts).
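To check these MoE hyperparameters without downloading the weights, you can load just the Hugging Face config. This is a minimal sketch; the attribute names (`num_local_experts`, `num_experts_per_tok`) are assumptions that may differ between `transformers` versions, so the lookups fall back to "n/a":
```python
from transformers import AutoConfig

# Requires access to the gated meta-llama repo (accept the license on Hugging Face first).
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

# Downloads only the config file (a few KB), not the multi-hundred-GB weights.
config = AutoConfig.from_pretrained(model_id)

# Llama 4 is multimodal, so the language-model settings live in a nested text config.
text_config = getattr(config, "text_config", config)

# Attribute names below are assumptions; they may vary across transformers versions.
print("Experts per MoE layer:", getattr(text_config, "num_local_experts", "n/a"))
print("Experts routed per token:", getattr(text_config, "num_experts_per_tok", "n/a"))
print("Hidden size:", getattr(text_config, "hidden_size", "n/a"))
```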
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Developer | Meta |
| Model Architecture | Auto-regressive language models using a mixture-of-experts (MoE) architecture with early fusion for native multimodality |
| Model Name | Llama 4 Scout (17Bx16E), Llama 4 Maverick (17Bx128E) |
| Training Data | A mix of publicly available, licensed data and information from Meta's products and services, including publicly shared posts from Instagram and Facebook and people's interactions with Meta AI |
| Params | Llama 4 Scout: 17B (Activated), 109B (Total); Llama 4 Maverick: 17B (Activated), 400B (Total) |
| Input modalities | Multilingual text and image |
| Output modalities | Multilingual text and code |
| Context length | Llama 4 Scout: 10M; Llama 4 Maverick: 1M |
| Token count | Llama 4 Scout: ~40T; Llama 4 Maverick: ~22T |
| Knowledge cutoff | August 2024 |
| Supported languages | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese |
| Model Release Date | April 5, 2025 |
| Status | A static model trained on an offline dataset; future versions of the tuned models may be released |
| License | Llama 4 Community License Agreement |
| Feedback | Instructions on how to provide feedback or comments on the model can be found in the Llama README |
Intended Use
- Intended Use Cases: Commercial and research use in multiple languages. Instruction tuned models are for assistant-like chat and visual reasoning tasks, while pretrained models can be adapted for natural language generation. For vision, the models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The Llama 4 Community License allows for these use cases.
- Out-of-scope: Use that violates applicable laws or regulations, the Acceptable Use Policy, or the Llama 4 Community License. Use in languages or capabilities beyond those explicitly supported.
Hardware and Software
- Training Factors: Custom training libraries, Meta's custom built GPU clusters, and production infrastructure were used for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.
- Training Energy Use: Model pre-training utilized a cumulative 7.38M GPU-hours of computation on H100-80GB hardware (700W TDP).
- Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 1,999 tons CO2eq for training. Market-based emissions were 0 tons CO2eq.
| Model Name | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
|---|---|---|---|---|
| Llama 4 Scout | 5.0M | 700 | 1,354 | 0 |
| Llama 4 Maverick | 2.38M | 700 | 645 | 0 |
| Total | 7.38M | - | 1,999 | 0 |
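As a rough sanity check on the table, the GPU-hours and per-GPU TDP imply the following training energy (a back-of-the-envelope figure that ignores data-center overhead such as PUE):
```python
# Back-of-the-envelope training energy from the table above (excludes PUE/overhead).
gpu_hours = 7.38e6      # total GPU hours (Scout + Maverick)
tdp_watts = 700         # per-GPU TDP of H100-80GB
energy_gwh = gpu_hours * tdp_watts / 1e9   # Wh -> GWh
print(f"~{energy_gwh:.2f} GWh")  # ~5.17 GWh
```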
Benchmarks
Pre-trained models
| Category | Benchmark | # Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Reasoning & Knowledge | MMLU | 5 | macro_avg/acc_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | | | 89.4 | 91.6 |
Instruction tuned models
| Category | Benchmark | # Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed_accuracy | | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024 - 02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro_avg/em | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng->kgv/kgv->eng | - | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
| | MTOB (full book) eng->kgv/kgv->eng | - | chrF | | | 39.7/36.3 | 50.8/46.7 |
^ Reported numbers for MMMU Pro are the average of the Standard and Vision tasks.
Quantization
The Llama 4 Scout model is released as BF16 weights and can fit within a single H100 GPU with on-the-fly int4 quantization. The Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while maintaining quality.
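As an illustration of the single-GPU path for Scout, the sketch below loads the model with on-the-fly 4-bit weight quantization via bitsandbytes. The NF4 scheme and the Scout checkpoint id are assumptions made for this example; the model card does not prescribe a specific int4 backend:
```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration

# Assumed Scout checkpoint id, following the naming used for Maverick above.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# One possible on-the-fly 4-bit setup (bitsandbytes NF4); other int4 backends also work.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```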
🔧 Technical Details
Training Methodology
The Llama 4 models were pretrained using custom training libraries, Meta's custom built GPU clusters, and production infrastructure. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.
Energy Use and Emissions
Model pre-training utilized a cumulative 7.38M GPU-hours of computation on H100-80GB hardware (700W TDP). Estimated total location-based greenhouse gas emissions were 1,999 tons CO2eq for training, while market-based emissions were 0 tons CO2eq.
📄 License
The Llama 4 models are licensed under the Llama 4 Community License Agreement.
⚠️ Important Note
The information you provide will be collected, stored, processed and shared in accordance with the Meta Privacy Policy.
💡 Usage Tips
- Llama 4 has been trained on a broader collection of languages than the 12 supported languages. Developers may fine-tune the models for additional languages, but they must comply with the Llama 4 Community License and the Acceptable Use Policy.
- Llama 4 has been tested for image understanding with up to five input images. Developers who go beyond this should assess and mitigate the additional risks, and perform testing and tuning tailored to their applications.