InternViT-300M-448px

Developed by OpenGVLab
InternViT-300M-448px is an efficient vision foundation model distilled from InternViT-6B-448px-V1-5. It uses a dynamic input resolution built from 448×448 patches and can process 1 to 40 patches per image.
Downloads 7,506
Release Time: 5/24/2024

Model Overview

InternViT-300M-448px is a vision foundation model used primarily for image feature extraction. It inherits the robustness, OCR capability, and high-resolution processing capability of InternViT-6B-448px-V1-5.

Model Features

High-resolution processing capability
Supports dynamic input resolution built from 448×448 patches: 1 to 12 patches per image during training, scaling to 1 to 40 patches per image at test time.
Powerful OCR capability
Enhanced with additional OCR data, the model excels in handling Chinese and English OCR tasks.
Efficient model
Achieves high efficiency through knowledge distillation from larger models, with only 304 million parameters.
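
The patch ranges listed under high-resolution processing can be illustrated with a small sketch. The text only states the patch size (448×448) and the allowed patch counts (1 to 12 in training, 1 to 40 in testing); how the grid is chosen is not described here. The code below assumes one plausible scheme, picking the grid whose aspect ratio is closest to the image's, and the names choose_grid, candidate_grids, and tile_boxes are hypothetical helpers, not part of any released API.

```python
from typing import List, Tuple

TILE = 448  # patch side length in pixels, as stated in the text


def candidate_grids(min_tiles: int, max_tiles: int) -> List[Tuple[int, int]]:
    """Return every (cols, rows) grid whose patch count is in [min_tiles, max_tiles]."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Stable sort by patch count, so ties in aspect ratio prefer fewer patches.
    return sorted(grids, key=lambda g: g[0] * g[1])


def choose_grid(width: int, height: int,
                min_tiles: int = 1, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.

    ASSUMPTION: closest-aspect-ratio selection is illustrative only;
    the source does not specify the selection rule.
    """
    aspect = width / height
    return min(candidate_grids(min_tiles, max_tiles),
               key=lambda g: abs(aspect - g[0] / g[1]))


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) for an image resized to the grid."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    for limit in (12, 40):
        cols, rows = choose_grid(1920, 1080, max_tiles=limit)
        print(limit, (cols, rows), (cols * TILE, rows * TILE))
```

Under these assumptions, a 1920×1080 image maps to a 2×1 grid (2 patches) with the training limit of 12, and to a 7×4 grid (28 patches) with the testing limit of 40. Each box from tile_boxes could then be cropped from the resized image and passed to the model as one 448×448 input.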

Model Capabilities

Image feature extraction
High-resolution image processing
OCR
Multimodal task support

Use Cases

Multimodal large language models
Building MLLMs
The InternViT V2.5 series is suitable for constructing multimodal large language models (MLLMs).
OCR tasks
Chinese OCR
Uses PaddleOCR to perform Chinese OCR on images from Wukong.
English OCR
Performs English OCR on images from LAION-COCO.