InternViT-6B-448px-V1-5

Developed by OpenGVLab
InternViT-6B-448px-V1-5 is a vision foundation model fine-tuned from InternViT-6B-448px-V1-2. It offers strong robustness, OCR capability, and high-resolution processing.
Downloads 155
Release Time: 4/17/2024

Model Overview

This model is a vision foundation model used primarily for image feature extraction. It is fine-tuned from InternViT-6B-448px-V1-2 with a higher-quality, more diverse pre-training dataset and an expanded training image resolution.

Model Features

Dynamic resolution processing
Processes each image as 1 to 12 tiles (patches) of 448×448 pixels, enabling high-resolution input.
Enhanced OCR capability
Significantly improves text recognition by incorporating OCR-related datasets.
Optimized model structure
Discards the last 3 blocks, reducing the parameter count from 5.9B to 5.54B, which saves GPU memory while maintaining performance.
Diverse pre-training data
Uses datasets like LAION, COYO, and GRIT to enhance model robustness and generalization.
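
The dynamic resolution feature above can be pictured as choosing a grid of 448×448 tiles for each image. The following is a minimal Python sketch, not the authors' preprocessing code: the function names (candidate_grids, choose_grid, tile_boxes) are hypothetical, and the rule of matching the image's aspect ratio, with ties going to the fewest tiles, is an assumption made for illustration. Only the 448-pixel tile size and the 1 to 12 tile range come from the feature description.

```python
from typing import List, Tuple

TILE = 448                      # base tile edge in pixels (from the feature description)
MIN_TILES, MAX_TILES = 1, 12    # allowed tile-count range (from the feature description)


def candidate_grids(min_tiles: int = MIN_TILES,
                    max_tiles: int = MAX_TILES) -> List[Tuple[int, int]]:
    """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Sort by tile count so that ties in choose_grid go to the smallest grid.
    return sorted(grids, key=lambda g: g[0] * g[1])


def choose_grid(width: int, height: int) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.

    This selection rule is an illustrative assumption, not the documented one.
    """
    aspect = width / height
    return min(candidate_grids(), key=lambda g: abs(aspect - g[0] / g[1]))


def tile_boxes(cols: int, rows: int,
               tile: int = TILE) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom), row by row, for an image
    that has already been resized to (cols * tile) x (rows * tile)."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
```

For example, choose_grid(896, 448) selects a 2×1 grid, so the image would be cut into two 448×448 tiles placed side by side.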

Model Capabilities

Image feature extraction
High-resolution image processing
Text recognition (OCR)
Multimodal task support

Use Cases

Computer vision
Image feature extraction
Extracts high-level feature representations of images for downstream tasks like classification and detection.
Document OCR
Recognizes text content in images, suitable for document digitization.
Multimodal learning
Vision-language model construction
Serves as a vision backbone for building multimodal large language models (MLLM).
Note: the V2.5 series is recommended for building MLLMs.
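
The image feature extraction use case above might look roughly like the following Python sketch. It rests on several assumptions that this page does not state: that the checkpoint is published on the Hugging Face Hub under the repo ID OpenGVLab/InternViT-6B-448px-V1-5, that it loads through the transformers AutoModel class with trust_remote_code=True, that a CLIPImageProcessor configuration ships with it, and that the output exposes a last_hidden_state field. The function name extract_features is hypothetical. Treat it as a starting point to check against the official model card rather than a definitive recipe.

```python
MODEL_ID = "OpenGVLab/InternViT-6B-448px-V1-5"  # assumed Hugging Face repo ID


def extract_features(image_path: str, model_id: str = MODEL_ID, device: str = "cuda"):
    """Return per-token image features for one image (sketch under the stated assumptions)."""
    # Heavy dependencies are imported lazily so this module can be
    # imported on machines that do not have them installed.
    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    # trust_remote_code=True is assumed to be needed because the model
    # class is expected to ship with the checkpoint, not with transformers.
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).to(device).eval()

    # Resize and normalize the image using the processor config
    # assumed to be stored alongside the checkpoint.
    processor = CLIPImageProcessor.from_pretrained(model_id)
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device=device, dtype=torch.bfloat16)

    with torch.no_grad():
        outputs = model(pixel_values)

    # Assumed output field: one feature vector per image token.
    return outputs.last_hidden_state
```

The returned features could then feed a downstream head for classification or detection, or a projection layer in front of a language model.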