🚀 Llama 3.2-Vision
Llama 3.2-Vision is a collection of multimodal large language models optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
🚀 Quick Start
This README provides detailed information about the Llama 3.2-Vision model, including its features, intended use, installation, and usage examples.
✨ Features
- Multimodal Capabilities: Llama 3.2-Vision supports both text and image inputs, enabling a wide range of applications such as visual question answering, document visual question answering, image captioning, image-text retrieval, and visual grounding.
- Multiple Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported.
- Optimized Architecture: Built on top of the Llama 3.1 text-only model, it uses an optimized transformer architecture and techniques like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to ensure helpfulness and safety.
- Scalability: All model versions use Grouped-Query Attention (GQA) for improved inference scalability (an illustrative sketch of GQA follows this list).
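As a rough illustration of what GQA buys at inference time (a generic PyTorch sketch, not Llama's actual implementation), several query heads share one key/value head, which shrinks the KV cache:

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Generic GQA sketch: each K/V head is shared by a group of query heads.

    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim),
    with n_q_heads divisible by n_kv_heads.
    """
    group_size = q.shape[1] // k.shape[1]
    # Broadcast each K/V head across its group of query heads
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads sharing 2 K/V heads (4 query heads per group)
out = grouped_query_attention(
    torch.randn(1, 8, 16, 64), torch.randn(1, 2, 16, 64), torch.randn(1, 2, 16, 64)
)  # -> (1, 8, 16, 64)
```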
📦 Installation
The upstream model card does not specify installation steps.
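If the `transformers`-based usage sketched in the next section is adopted, installation would typically amount to `pip install "transformers>=4.45.0" torch pillow accelerate`. The minimum `transformers` version is an assumption based on when multimodal Llama 3.2 (Mllama) support landed in the library; verify it against the library's release notes.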
💻 Usage Examples
The upstream model card does not ship code examples in this README; a hedged inference sketch is provided below.
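A minimal sketch of how inference might look through the Hugging Face `transformers` Mllama integration. The model ID, the example image URL, and the `transformers >= 4.45.0` requirement are assumptions for illustration, not part of the upstream README:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed Hub ID

# Load the instruction-tuned 11B vision model and its processor
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this URL is purely illustrative -- replace with a real image or local path
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt: one image placeholder followed by a text question
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same chat-template pattern extends to multi-turn conversations; note that for image+text prompts, English is the only supported language.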
📚 Documentation
Model Information
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, outperforming many available open source and closed multimodal models on common industry benchmarks.
| Property | Details |
|----------|---------|
| Model Developer | Meta |
| Model Architecture | Built on top of the Llama 3.1 text-only model, an auto-regressive language model using an optimized transformer architecture. Tuned versions use SFT and RLHF. A separately trained vision adapter integrates with the pre-trained Llama 3.1 language model, consisting of cross-attention layers feeding image encoder representations into the core LLM. |
| Training Data | (Image, text) pairs |
| Params | 11B (10.6B) and 90B (88.8B) |
| Input modalities | Text + Image |
| Output modalities | Text |
| Context length | 128k |
| GQA | Yes |
| Data volume | 6B (image, text) pairs |
| Knowledge cutoff | December 2023 |
Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages. Note that for image+text applications, English is the only supported language.
Developers may fine-tune Llama 3.2 models for languages beyond the supported ones, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy.
Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
Model Release Date: Sept 25, 2024
Status: This is a static model trained on an offline dataset. Future versions may be released to improve model capabilities and safety.
License: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).
Feedback: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for using Llama 3.2-Vision in applications, refer to the official Llama documentation and recipes.
Intended Use
Intended Use Cases:
- Visual Question Answering (VQA) and Visual Reasoning: Answer natural-language questions about an image and reason about its contents.
- Document Visual Question Answering (DocVQA): Understand the text and layout of a document (e.g., a map or contract) and answer questions directly from the image.
- Image Captioning: Extract details from an image, understand the scene, and create a description.
- Image-Text Retrieval: Match images with their descriptions, similar to a search engine that understands both pictures and words.
- Visual Grounding: Connect natural language descriptions to specific parts of an image, allowing AI models to pinpoint objects or regions.
The Llama 3.2 model collection also supports leveraging its model outputs to improve other models, including synthetic data generation and distillation, as allowed by the Llama 3.2 Community License.
Out of Scope:
- Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Use prohibited by the Acceptable Use Policy and Llama 3.2 Community License.
- Use in languages beyond those explicitly referenced as supported in this model card.
How to use
This repository contains two versions of Llama-3.2-11B-Vision-Instruct for use with the Hugging Face `transformers` library. The inference sketch in the Usage Examples section above applies here as well.
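For memory-constrained setups, one possible variant is to load the weights with 4-bit quantization. This is a sketch, not something the upstream README prescribes: it assumes the `bitsandbytes` and `accelerate` packages are installed, and the model ID is again an assumption:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed Hub ID

# Quantize weights to 4-bit at load time to cut GPU memory use (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
# Prompting then proceeds exactly as in the Usage Examples sketch above.
```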
License Agreement
The full text of the Llama 3.2 Community License Agreement is provided in the `extra_gated_prompt` metadata of the original model card. It details the terms and conditions for use, reproduction, distribution, and modification of the Llama Materials.
Acceptable Use Policy
Meta is committed to promoting safe and fair use of Llama 3.2. The Acceptable Use Policy prohibits various uses, including those that violate the law, cause harm, deceive others, or interact with unlawful tools. The most recent copy of this policy can be found at https://www.llama.com/llama3_2/use-policy.
Reporting Issues
Please report any violation of the Acceptable Use Policy, software bugs, or other problems through Meta's official Llama reporting channels; the specific links are listed in the upstream model card.
🔧 Technical Details
The Llama 3.2-Vision model uses an optimized transformer architecture based on the Llama 3.1 text-only model. To support image recognition, it integrates a separately trained vision adapter. The adapter consists of cross-attention layers that feed image encoder representations into the core LLM, enabling the model to process both text and image inputs effectively.
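A highly simplified sketch of that adapter idea (generic PyTorch for illustration; the layer sizes, zero-initialized gate, and module layout are assumptions, not Meta's actual implementation): image-encoder features are projected to the language model's hidden size and injected through gated cross-attention so that text hidden states can attend to them.

```python
import torch
import torch.nn as nn

class GatedVisionCrossAttention(nn.Module):
    """Illustrative vision-adapter block: text states cross-attend to image features."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_vision: int = 1280):
        super().__init__()
        self.img_proj = nn.Linear(d_vision, d_model)    # map vision features to the LLM width
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))         # zero-init gate: text path starts unchanged

    def forward(self, text_states: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        img = self.img_proj(image_features)              # (batch, n_patches, d_model)
        attended, _ = self.cross_attn(self.norm(text_states), img, img)
        return text_states + torch.tanh(self.gate) * attended  # gated residual into the core LLM

# Example shapes: a short text sequence attending to 256 image patch embeddings
block = GatedVisionCrossAttention()
out = block(torch.randn(1, 128, 4096), torch.randn(1, 256, 1280))  # -> (1, 128, 4096)
```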
📄 License
Use of Llama 3.2 is governed by the Llama 3.2 Community License, a custom, commercial license agreement. The agreement details the terms and conditions for use, reproduction, distribution, and modification of the Llama Materials.