
Llama 3.2 90B Vision Instruct

Developed by meta-llama
Llama 3.2-Vision is a multimodal large language model developed by Meta. It takes image and text as input and produces text as output, and it excels at visual recognition, image reasoning, image captioning, and visual question answering.
Downloads: 15.44k
Release Date: 9/19/2024

Model Overview

A multimodal model built upon the text-only Llama 3.1, integrating image processing capabilities through a vision adapter, suitable for tasks like visual question answering and image caption generation.
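As a rough illustration of this image-plus-text interface, the sketch below loads the instruct model through the Hugging Face transformers library and generates a short description of an image. This is a minimal sketch, not an official recipe: it assumes transformers 4.45 or later (which provides MllamaForConditionalGeneration), gated access to the model weights, and enough GPU memory for the 90B checkpoint; the image URL is a placeholder.

# Minimal sketch: image + text in, text out, via Hugging Face transformers.
# Assumes transformers >= 4.45, gated access to the weights, and enough GPU
# memory for the 90B checkpoint (the 11B variant is used the same way).
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; substitute any reachable image.
url = "https://example.com/photo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn containing an image slot followed by the text instruction.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))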

Model Features

Multimodal Capability
Accepts both image and text input, and can understand and analyze image content to generate relevant text output.
High-Performance Visual Understanding
Outperforms many open-source and proprietary multimodal models on common benchmarks for visual question answering, document understanding, and chart reasoning.
Long Context Support
Supports a 128K-token context length, well suited to complex multimodal tasks.
Safety Alignment
Aligned with human values through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Model Capabilities

Visual Question Answering
Image Reasoning
Image Caption Generation
Image-Text Retrieval Matching
Visual Grounding
Document Visual Question Answering
Chart Reasoning

Use Cases

Visual Question Answering
Image Content Question Answering
Answer natural-language questions about image content (a prompt sketch follows the use cases below)
Achieves 73.6% accuracy on the VQAv2 dataset
Document Processing
Document Visual Question Answering
Understand and answer questions based on document images
Scores 70.7 ANLS on the DocVQA dataset
Image Captioning
Image Caption Generation
Generate natural language descriptions for input images
Capable of producing high-quality image captions and creative text
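For the visual question answering use case above, only the text turn of the prompt changes relative to the caption sketch earlier on this page. The fragment below is a hedged illustration: the question is invented for the example, and the model, processor, and image objects are assumed to be loaded exactly as shown in that earlier sketch.

# VQA-style query; reuses model, processor, and image from the earlier sketch.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many people appear in this picture?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)
answer = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(answer[0], skip_special_tokens=True))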