
Llama 3.2 11b Vision R1 Distill

Developed by bababababooey
Built on Llama 3.2-Vision, Meta's multimodal large language model that accepts image and text inputs and is optimized for visual recognition, image reasoning, and description tasks.
Release Time: 2/7/2025

Model Overview

A multimodal model built on the Llama 3.1 text-only backbone, adding visual support through an image adapter wired in with cross-attention layers, with strong performance across common vision benchmarks.
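As a rough illustration (not Meta's actual code), an image adapter of this kind lets text tokens attend over image-patch embeddings through cross-attention. A minimal numpy sketch, with all names and dimensions chosen for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text_h, img_h, Wq, Wk, Wv):
    """Single-head cross-attention: text tokens are queries,
    image patches supply keys and values."""
    q = text_h @ Wq                           # (T, d)
    k = img_h @ Wk                            # (P, d)
    v = img_h @ Wv                            # (P, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, P) text-to-patch scores
    return softmax(scores) @ v                # (T, d) image-conditioned output

# Toy shapes: 8 text tokens, 32 image patches, width 16.
rng = np.random.default_rng(0)
d = 16
text_h = rng.normal(size=(8, d))
img_h = rng.normal(size=(32, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attend(text_h, img_h, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

In the real model this block sits inside the frozen text decoder's layers; here it only shows the data flow.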

Model Features

Multimodal Understanding
Processes both image and text inputs simultaneously for cross-modal understanding and reasoning.
Long Context Support
A 128K-token context window, suitable for long documents, multi-turn conversations, and complex visual scenes.
Efficient Inference
Uses Grouped-Query Attention (GQA) to shrink the key/value cache and speed up inference.
Safety Alignment
Aligned with human preferences through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), with built-in safety mitigations.
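To make the GQA feature concrete: in grouped-query attention several query heads share one key/value head, so the KV cache shrinks by the ratio of query heads to KV heads. A minimal numpy sketch (illustrative only, with made-up head counts, not the model's real configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v, n_q_heads, n_kv_heads, d_head):
    """Grouped-query attention: each group of n_q_heads // n_kv_heads
    query heads reuses a single shared K/V head."""
    T = q.shape[0]
    group = n_q_heads // n_kv_heads
    qh = q.reshape(T, n_q_heads, d_head)
    kh = k.reshape(T, n_kv_heads, d_head)
    vh = v.reshape(T, n_kv_heads, d_head)
    outs = []
    for h in range(n_q_heads):
        kv = h // group  # which shared KV head this query head uses
        scores = qh[:, h] @ kh[:, kv].T / np.sqrt(d_head)  # (T, T)
        outs.append(softmax(scores) @ vh[:, kv])           # (T, d_head)
    return np.concatenate(outs, axis=-1)                   # (T, n_q_heads * d_head)

# Toy setup: 8 query heads sharing 2 KV heads -> 4x smaller KV cache.
rng = np.random.default_rng(1)
T, n_q, n_kv, d_head = 6, 8, 2, 4
q = rng.normal(size=(T, n_q * d_head))
k = rng.normal(size=(T, n_kv * d_head))
v = rng.normal(size=(T, n_kv * d_head))
out = gqa(q, k, v, n_q, n_kv, d_head)
print(out.shape)  # (6, 32)
```

Only the K and V projections shrink; the query side and the output width are unchanged, which is why GQA saves memory at inference time with little quality cost.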

Model Capabilities

Visual Question Answering
Image Caption Generation
Document Understanding
Chart Parsing
Multilingual Text Generation
Visual Grounding
Image-Text Retrieval

Use Cases

Education
Textbook Content Understanding
Analyzes charts and illustrations in textbooks to answer student questions.
Achieved 60.3% accuracy on MMMU, a college-level multimodal benchmark.
Business Analysis
Business Chart Interpretation
Automatically analyzes financial report charts and data visualizations.
Achieved 85.5% accuracy on the ChartQA test set.
Document Processing
Smart Invoice Processing
Extracts key information from invoice images and calculates date differences.
Scored 90.1 ANLS on the DocVQA test set.