🚀 Llama 3.2-Vision
Llama 3.2-Vision is a collection of multimodal large language models optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
🚀 Quick Start
This README provides detailed information about the Llama 3.2-Vision model, including its features, intended use, installation, and usage examples.
✨ Features
- Multimodal Capabilities: Llama 3.2-Vision supports both text and image inputs, enabling a wide range of applications such as visual question answering, document visual question answering, image captioning, image-text retrieval, and visual grounding.
- Multiple Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported.
- Optimized Architecture: Built on top of the Llama 3.1 text-only model, it uses an optimized transformer architecture and techniques like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to ensure helpfulness and safety.
- Scalability: All model versions use Grouped-Query Attention (GQA) for improved inference scalability (an illustrative sketch of GQA follows this list).
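As a rough illustration of what GQA buys at inference time (a generic PyTorch sketch, not Llama's actual implementation), several query heads share one key/value head, which shrinks the KV cache:

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Generic GQA sketch: each K/V head is shared by a group of query heads.

    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim),
    with n_q_heads divisible by n_kv_heads.
    """
    group_size = q.shape[1] // k.shape[1]
    # Broadcast each K/V head across its group of query heads
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads sharing 2 K/V heads (4 query heads per group)
out = grouped_query_attention(
    torch.randn(1, 8, 16, 64), torch.randn(1, 2, 16, 64), torch.randn(1, 2, 16, 64)
)  # -> (1, 8, 16, 64)
```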
📦 Installation
The upstream model card does not specify installation steps.
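If the `transformers`-based usage sketched in the next section is adopted, installation would typically amount to `pip install "transformers>=4.45.0" torch pillow accelerate`. The minimum `transformers` version is an assumption based on when multimodal Llama 3.2 (Mllama) support landed in the library; verify it against the library's release notes.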
💻 Usage Examples
The upstream model card does not ship code examples in this README; a hedged inference sketch is provided below.
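A minimal sketch of how inference might look through the Hugging Face `transformers` Mllama integration. The model ID, the example image URL, and the `transformers >= 4.45.0` requirement are assumptions for illustration, not part of the upstream README:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed Hub ID

# Load the instruction-tuned 11B vision model and its processor
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this URL is purely illustrative -- replace with a real image or local path
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt: one image placeholder followed by a text question
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same chat-template pattern extends to multi-turn conversations; note that for image+text prompts, English is the only supported language.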
📚 Documentation
Model Information
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, outperforming many available open source and closed multimodal models on common industry benchmarks.
| Property | Details |
|----------|---------|
| Model Developer | Meta |
| Model Architecture | Built on top of the Llama 3.1 text-only model, an auto-regressive language model using an optimized transformer architecture. Tuned versions use SFT and RLHF. A separately trained vision adapter integrates with the pre-trained Llama 3.1 language model, consisting of cross-attention layers feeding image encoder representations into the core LLM. |
| Training Data | (Image, text) pairs |
| Params | 11B (10.6B) and 90B (88.8B) |
| Input modalities | Text + Image |
| Output modalities | Text |
| Context length | 128k |
| GQA | Yes |
| Data volume | 6B (image, text) pairs |
| Knowledge cutoff | December 2023 |
Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages. Note that for image+text applications, English is the only supported language.
Developers may fine-tune Llama 3.2 models for languages beyond the supported ones, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy.
Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
Model Release Date: Sept 25, 2024
Status: This is a static model trained on an offline dataset. Future versions may be released to improve model capabilities and safety.
License: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).
Feedback: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for using Llama 3.2-Vision in applications, refer to the official Llama documentation and recipes.
Intended Use
Intended Use Cases:
- Visual Question Answering (VQA) and Visual Reasoning: Answer natural-language questions about an image and reason about its contents.
- Document Visual Question Answering (DocVQA): Understand the text and layout of a document (e.g., a map or contract) and answer questions directly from the image.
- Image Captioning: Extract details from an image, understand the scene, and create a description.
- Image-Text Retrieval: Match images with their descriptions, similar to a search engine that understands both pictures and words.
- Visual Grounding: Connect natural language descriptions to specific parts of an image, allowing AI models to pinpoint objects or regions.
The Llama 3.2 model collection also supports leveraging its model outputs to improve other models, including synthetic data generation and distillation, as allowed by the Llama 3.2 Community License.
Out of Scope:
- Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Use prohibited by the Acceptable Use Policy and Llama 3.2 Community License.
- Use in languages beyond those explicitly referenced as supported in this model card.
How to use
This repository contains two versions of Llama-3.2-11B-Vision-Instruct for use with the Hugging Face `transformers` library. The inference sketch in the Usage Examples section above applies here as well.
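For memory-constrained setups, one possible variant is to load the weights with 4-bit quantization. This is a sketch, not something the upstream README prescribes: it assumes the `bitsandbytes` and `accelerate` packages are installed, and the model ID is again an assumption:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed Hub ID

# Quantize weights to 4-bit at load time to cut GPU memory use (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
# Prompting then proceeds exactly as in the Usage Examples sketch above.
```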
License Agreement
The full text of the Llama 3.2 Community License Agreement is provided in the `extra_gated_prompt` metadata of the original model card. It details the terms and conditions for use, reproduction, distribution, and modification of the Llama Materials.
Acceptable Use Policy
Meta is committed to promoting safe and fair use of Llama 3.2. The Acceptable Use Policy prohibits various uses, including those that violate the law, cause harm, deceive others, or interact with unlawful tools. The most recent copy of this policy can be found at https://www.llama.com/llama3_2/use-policy.
Reporting Issues
Please report any violation of the Acceptable Use Policy, software bugs, or other problems through Meta's official Llama reporting channels; the specific links are listed in the upstream model card.
🔧 Technical Details
The Llama 3.2-Vision model uses an optimized transformer architecture based on the Llama 3.1 text-only model. To support image recognition, it integrates a separately trained vision adapter. The adapter consists of cross-attention layers that feed image encoder representations into the core LLM, enabling the model to process both text and image inputs effectively.
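A highly simplified sketch of that adapter idea (generic PyTorch for illustration; the layer sizes, zero-initialized gate, and module layout are assumptions, not Meta's actual implementation): image-encoder features are projected to the language model's hidden size and injected through gated cross-attention so that text hidden states can attend to them.

```python
import torch
import torch.nn as nn

class GatedVisionCrossAttention(nn.Module):
    """Illustrative vision-adapter block: text states cross-attend to image features."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_vision: int = 1280):
        super().__init__()
        self.img_proj = nn.Linear(d_vision, d_model)    # map vision features to the LLM width
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))         # zero-init gate: text path starts unchanged

    def forward(self, text_states: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        img = self.img_proj(image_features)              # (batch, n_patches, d_model)
        attended, _ = self.cross_attn(self.norm(text_states), img, img)
        return text_states + torch.tanh(self.gate) * attended  # gated residual into the core LLM

# Example shapes: a short text sequence attending to 256 image patch embeddings
block = GatedVisionCrossAttention()
out = block(torch.randn(1, 128, 4096), torch.randn(1, 256, 1280))  # -> (1, 128, 4096)
```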
📄 License
Use of Llama 3.2 is governed by the Llama 3.2 Community License, a custom, commercial license agreement. The agreement details the terms and conditions for use, reproduction, distribution, and modification of the Llama Materials.