
InternViT-300M-448px-V2_5

Developed by OpenGVLab
InternViT-300M-448px-V2_5 is a major upgrade of InternViT-300M-448px. It improves visual feature extraction through ViT incremental learning with NTP (next-token prediction) loss, and is particularly strong on multilingual OCR data and complex inputs such as mathematical charts.
Downloads 23.29k
Release Time: 11/22/2024

Model Overview

This is a visual encoder used primarily for image feature extraction. It captures more comprehensive visual information than its predecessor, especially in domains that are underrepresented in large-scale web datasets.
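As an illustration of image feature extraction with this encoder, the sketch below loads the model through Hugging Face Transformers and returns the hidden states for one image. It is a minimal sketch under assumptions: the repository id `OpenGVLab/InternViT-300M-448px-V2_5`, the helper name `extract_features`, and the choice of `bfloat16` are not stated on this page, and the checkpoint is assumed to ship custom model code that requires `trust_remote_code=True`.

```python
# Assumed Hugging Face repository id; not stated on this page.
MODEL_ID = "OpenGVLab/InternViT-300M-448px-V2_5"


def extract_features(image_path, model_id=MODEL_ID):
    """Return the encoder's last hidden states for one image (hypothetical helper)."""
    # Imported lazily so this file loads without the heavy dependencies installed.
    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Assumption: the checkpoint ships custom model code, hence trust_remote_code.
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).to(device).eval()
    processor = CLIPImageProcessor.from_pretrained(model_id)

    # Resize and normalize the image into a batch of pixel tensors.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device=device, dtype=torch.bfloat16)

    with torch.no_grad():
        outputs = model(pixel_values)
    return outputs.last_hidden_state
```

Calling `extract_features("page.jpg")` (a hypothetical file) downloads the weights on first use, so it needs network access and enough memory for the checkpoint.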

Model Features

ViT Incremental Learning
Strengthens the visual encoder's feature extraction through incremental learning with NTP loss, particularly in complex domains such as multilingual OCR and mathematical charts.
Dynamic High-Resolution Training
Handles multi-image and video inputs, and trains efficiently at high resolution by dynamically allocating tiles to each input.
Multimodal Support
Integrates incrementally pretrained InternViT with various pretrained LLMs to support multimodal tasks.
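The dynamic tile allocation named under Dynamic High-Resolution Training can be illustrated with a small sketch. It is not the authors' implementation: it assumes each tile is 448 px square (taken from the model name), that a hypothetical budget `max_tiles` caps the number of tiles per image, and that the grid is chosen by closest aspect ratio with ties going to the smaller grid. The function names are illustrative only.

```python
def choose_tile_grid(width, height, max_tiles=12, min_tiles=1):
    """Pick a (cols, rows) grid whose aspect ratio best matches the image.

    max_tiles and min_tiles are hypothetical budget parameters.
    """
    aspect = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with fewer tiles.
    return min(
        candidates,
        key=lambda g: (abs(aspect - g[0] / g[1]), g[0] * g[1]),
    )


def tile_boxes(cols, rows, tile_size=448):
    """List (left, top, right, bottom) crop boxes, row by row.

    Assumes the image was first resized to (cols * tile_size, rows * tile_size).
    """
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_tile_grid(896, 448)
    print(cols, rows, tile_boxes(cols, rows))
```

Under these assumptions a wide 896x448 image maps to a 2x1 grid and a square image to a single tile, so the number of encoder passes scales with the input's shape rather than being fixed.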

Model Capabilities

Image Feature Extraction
Multilingual OCR Processing
Mathematical Chart Analysis
Multimodal Task Support

Use Cases

Visual Feature Extraction
Multilingual OCR
Extracts high-quality visual features from images containing multilingual text.
Performs well in domains that are underrepresented in large-scale web datasets.
Mathematical Chart Analysis
Extracts visual features from mathematical charts, supporting recognition of complex mathematical symbols and structures.
Captures more comprehensive visual information from such inputs.
Multimodal Tasks
Image-Text Alignment
Aligns visual features with textual information to support multimodal understanding and generation tasks.
Improves robustness in cross-modal alignment.