
InternVL3 2B Pretrained

Developed by OpenGVLab
InternVL3-2B is a multimodal large language model developed by OpenGVLab. It offers robust visual-language understanding and reasoning and supports a range of multimodal tasks.
Downloads 61
Release Time: 4/17/2025

Model Overview

InternVL3-2B is a multimodal large language model that pairs the Qwen2.5-1.5B language model with the InternViT-300M-448px-V2_5 vision encoder. It was trained with native multimodal pretraining and shows strong overall performance.

Model Features

Native Multimodal Pretraining
Integrates language and visual learning into a single pretraining phase, enhancing multimodal representation capabilities.
Variable Visual Position Encoding (V2PE)
Uses smaller, more flexible position increments for visual tokens to improve long-context understanding.
Mixed Preference Optimization (MPO)
Supervises the model with both positive and negative samples to align its response distribution, improving reasoning performance.
Dynamic Resolution Processing
Supports division into 448×448 pixel tiles, adapting to inputs of varying sizes.
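
The tile division described under Dynamic Resolution Processing can be illustrated with a minimal sketch. It picks the grid of 448×448 tiles whose aspect ratio is closest to the image's, then lists the crop boxes on the resized image. The function names, the `max_tiles=12` default, and the tie-breaking rule (prefer fewer tiles) are assumptions made for illustration, not the model's actual preprocessing code.

```python
from typing import List, Tuple

TILE = 448  # tile side in pixels, as stated in the feature description


def candidate_grids(min_tiles: int, max_tiles: int) -> List[Tuple[int, int]]:
    """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    }
    # Sort by tile count so that ties in aspect ratio prefer fewer tiles.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def choose_grid(width: int, height: int,
                min_tiles: int = 1, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_tiles, max_tiles):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:  # strict comparison keeps the earliest (smallest) grid
            best, best_diff = (cols, rows), diff
    return best


def tile_boxes(width: int, height: int,
               max_tiles: int = 12) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) on the image after it has been
    resized to (cols * TILE, rows * TILE)."""
    cols, rows = choose_grid(width, height, max_tiles=max_tiles)
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]
```

For example, under these assumptions an 896×448 image maps to a 2×1 grid, so `tile_boxes(896, 448)` yields two boxes side by side.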

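The idea behind Variable Visual Position Encoding, listed above, can be sketched in a few lines: text tokens advance the position index by 1, while visual tokens advance it by a smaller step. The function name `position_ids` and the value `delta=0.25` are illustrative assumptions and are not taken from the model's implementation.

```python
from typing import List


def position_ids(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign a position to each token: the first token sits at 0, every
    later text token adds 1, and every later visual token adds `delta`."""
    positions: List[float] = []
    current = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            current += delta if visual else 1.0
        positions.append(current)
    return positions
```

With `delta=1.0` the function reduces to ordinary consecutive positions. With a smaller step, a long run of image tokens occupies a much narrower position range, which is how the feature is meant to help with long contexts.
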
Model Capabilities

Multimodal reasoning
Image caption generation
Document understanding
Multi-image analysis
Video understanding
GUI localization
Spatial reasoning
Multilingual understanding

Use Cases

Visual Content Analysis
Image Caption Generation
Generates detailed, high-quality natural-language descriptions of input images.
Multi-image Comparison
Analyzes similarities and differences among multiple images and produces a comparative analysis.
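
For multi-image comparison, each image usually needs its own numbered slot in the prompt. The sketch below builds such a prompt. It assumes the `Image-N: <image>` placeholder convention used in OpenGVLab's published InternVL chat examples. The helper name `build_comparison_prompt` is hypothetical, and the resulting string would still have to be passed to the model together with the preprocessed image tensors.

```python
IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder token for one image


def build_comparison_prompt(num_images: int, instruction: str) -> str:
    """Return a prompt with one numbered image slot per line, followed by
    the instruction text."""
    if num_images < 2:
        raise ValueError("a comparison needs at least two images")
    lines = [f"Image-{i}: {IMAGE_PLACEHOLDER}" for i in range(1, num_images + 1)]
    lines.append(instruction)
    return "\n".join(lines)
```

Numbering the slots lets the instruction refer to individual images, for example "What does Image-2 show that Image-1 does not?".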
Industrial Applications
Industrial Image Analysis
Analyzes image data in industrial scenarios to detect and classify defects.
Interactive Applications
GUI Agent
Understands graphical user interfaces, recognizing interface elements and operating them.
© 2025 AIbase