
Llama 3.2 11B Vision

Developed by meta-llama
Llama 3.2-Vision is a family of multimodal large language models from Meta, available at 11B and 90B parameter scales. The models take image-plus-text input and produce text output, and are optimized for visual recognition, image reasoning, image captioning, and visual question answering.
Downloads: 31.12k
Released: 9/18/2024

Model Overview

This series builds on the text-only Llama 3.1 models and is aligned with human preferences through supervised fine-tuning and reinforcement learning from human feedback. Visual capability is added by a separately trained vision adapter, a stack of cross-attention layers that feeds image-encoder representations into the core language model.
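For readers who want to try the model, a minimal inference sketch using the Hugging Face transformers integration is shown below (the Mllama classes are available in transformers 4.45 and later). The model ID matches the Hugging Face repository name; the image URL and prompt are placeholders:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Gated repository: access must be granted on Hugging Face first.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps memory use manageable
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; any RGB image works.
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The chat template interleaves an image turn with a text turn.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```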

Model Features

Multimodal Capability
Supports joint input of images and text and can understand and generate text grounded in the image content; a message-format sketch follows this list.
Large-Scale Pretraining
Pretrained on 6 billion image-text pairs, giving it strong visual and language understanding.
Instruction Tuning Optimization
Optimized for visual recognition, image reasoning, and related tasks through instruction tuning on 3 million synthetic samples.
Long Context Support
Supports a 128k-token context length, suitable for complex multimodal tasks.
Safety Measures
Includes a three-layer protection strategy and dedicated risk assessments to support safe use.
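To illustrate the joint image-and-text input described above, the sketch below reuses the model, processor, and image from the earlier example and changes only the text turn, turning captioning into visual question answering; the question itself is an arbitrary placeholder:

```python
# Reuses `model`, `processor`, and `image` from the loading sketch above.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many people are visible in this image?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
answer = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(answer[0], skip_special_tokens=True))
```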

Model Capabilities

Visual Question Answering
Image Reasoning
Image Captioning
Image-Text Retrieval
Visual Grounding
Multilingual Text Processing

Use Cases

Education
University-Level Visual Reasoning
Used to solve university-level visual reasoning problems.
Achieved 50.7% (11B) and 60.3% (90B) accuracy on the MMMU validation set.
Business
Chart Understanding
Used to comprehend and interpret data in business charts.
Achieved 83.4% (11B) and 85.5% (90B) accuracy on the ChartQA test set.
General
General Visual Question Answering
Used to answer various questions related to images.
Achieved 75.2% (11B) and 78.1% (90B) accuracy on the VQAv2 test set.
Multilingual
Multilingual Text Processing
Used to handle text tasks in multiple languages.
Achieved 68.9% (11B) and 86.9% (90B) accuracy on the MGSM benchmark with chain-of-thought prompting.