InternViT-6B-448px-V1-0 Open-Source Vision Model - Efficiently Extract Image Features, Enhance OCR and Chinese Dialogue

InternViT-6B-448px-V1-0

Developed by OpenGVLab

InternViT-6B-448px-V1-0 is a vision foundation model focused on image feature extraction, supporting 448x448 resolution with enhanced OCR capabilities and improved Chinese dialogue support.

Image Feature Extraction

Transformers

Open Source License: MIT

Tags: #448 high resolution #multimodal feature extraction #Chinese OCR enhancement

Downloads 24

Release Time: 1/30/2024

Model Overview

This model is a vision foundation model primarily used for image feature extraction, and it is especially suited to building multimodal large language models (MLLMs). The 448x448 input resolution and the OCR-related pretraining data improve optical character recognition (OCR), and the model offers better support for Chinese dialogue.

Model Features

High-resolution support

Supports high-resolution image input at 448x448, improving detail capture capabilities.

Enhanced OCR capabilities

Improves optical character recognition (OCR) performance, helped by the higher input resolution and the OCR-related datasets included in pretraining.

Chinese dialogue optimization

Specifically optimized for Chinese dialogue, making it suitable for Chinese multimodal application scenarios.

Efficient feature extraction

For MLLM use, the output of the fourth-to-last of the model's 48 blocks was found to work best, so that layer's features are the recommended ones when building a multimodal large language model (MLLM).

Model Capabilities

Image feature extraction

Optical character recognition (OCR)

Multimodal dialogue support

High-resolution image processing

Use Cases

Multimodal applications

Multimodal dialogue systems

Build dialogue systems that support image and text interaction, especially in Chinese environments.

Enhances the visual understanding and response capabilities of dialogue systems.

Document OCR processing

Used for high-precision text recognition and extraction from document images.

Improves OCR accuracy.

Computer vision

Image feature extraction

Used for image feature extraction in downstream tasks such as classification and detection.

Provides high-quality feature representations.

🚀 InternViT-6B-448px-V1-0

We are excited to introduce InternViT-6B-448px-V1-0, an advanced vision foundation model. This model is integrated into InternVL-Chat-V1-1, offering enhanced capabilities in image feature extraction, OCR, and Chinese conversation support.

🚀 Quick Start

⚠️ Important Note

In our experience, the InternViT V2.5 series is better suited for building MLLMs than traditional computer vision tasks.

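import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the vision backbone in bfloat16. trust_remote_code=True lets
# transformers run the model code shipped with the checkpoint repository.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-0',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

# Replace this path with any local image.
image = Image.open('./examples/image1.jpg').convert('RGB')

# The image processor turns the PIL image into the model's pixel_values input.
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-0')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(pixel_values)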

✨ Features

Increased Resolution: We explored increasing the resolution to 448x448, enhancing the model's ability to capture fine details in images.
Enhanced OCR Capabilities: The model shows improved performance in Optical Character Recognition tasks.
Better Chinese Conversation Support: It offers better support for Chinese conversations, making it more suitable for multilingual applications.
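
To make the effect of the higher resolution concrete, the sketch below counts how many patch tokens a ViT-style encoder produces for a square input. The patch size of 14 is an assumption (it is not stated in this document), and `patch_token_count` is a hypothetical helper name used only for illustration.

```python
def patch_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image.

    patch_size=14 is an assumed value, not taken from this document.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Compare the 448x448 input with a 224x224 input under the same patch size.
tokens_448 = patch_token_count(448)
tokens_224 = patch_token_count(224)
print(tokens_448, tokens_224, tokens_448 // tokens_224)  # prints "1024 256 4"
```

Under that assumption, doubling the side length quadruples the number of patch tokens, which is where both the extra detail and the extra compute come from.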

📦 Installation

Installation only requires the necessary Python libraries. The Quick Start example imports transformers, torch, and PIL (provided by the Pillow package), so install all three:

pip install transformers torch pillow
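
Before running the Quick Start example, it can help to confirm that the modules it imports are actually available. The following is a minimal sketch using only the standard library; `find_missing` and `REQUIRED_MODULES` are hypothetical names, and the module list is taken from the imports in the Quick Start code.

```python
import importlib.util

# Import names used by the Quick Start example (PIL is installed as "pillow").
REQUIRED_MODULES = ["transformers", "torch", "PIL"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```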

📚 Documentation

Model Details

Property	Details
Model Type	vision foundation model, feature backbone
Model Stats	Params (M): 5903; Image size: 448 x 448
Pretrain Dataset	LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi, OCR-related datasets
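
As a rough guide to hardware requirements, the parameter count above can be turned into an estimate of weight storage. This is back-of-the-envelope arithmetic, not a measured figure: it covers the weights only and ignores activations and framework overhead. `weight_memory_gb` is a hypothetical helper name.

```python
PARAMS_MILLIONS = 5903  # from the Model Stats row above
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}  # standard dtype sizes


def weight_memory_gb(params_millions: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(PARAMS_MILLIONS, dtype):.1f} GB")
# prints:
# float32: ~23.6 GB
# bfloat16: ~11.8 GB
```

Loading in bfloat16, as the Quick Start example does, halves the weight footprint relative to float32.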

Note

This model has 48 blocks, and we found that using the output after the fourth-to-last block worked best for MLLMs. Therefore, when building an MLLM with this model, please use the features from the fourth-to-last layer.
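
The indexing behind "fourth-to-last" is easy to get wrong by one, so here is a minimal sketch of it. It assumes the common Hugging Face convention in which a tuple of hidden states holds the embedding output first, followed by one entry per block (49 entries for 48 blocks). `select_block_output` is a hypothetical helper, and plain integers stand in for real tensors.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_block_output(hidden_states: Sequence[T], num_blocks: int = 48,
                        from_last: int = 4) -> T:
    """Return the output of the block that is `from_last` blocks from the end.

    Assumes hidden_states[0] is the embedding output and hidden_states[i] is
    the output of block i, so the sequence has num_blocks + 1 entries.
    """
    if len(hidden_states) != num_blocks + 1:
        raise ValueError("expected num_blocks + 1 hidden states")
    if not 1 <= from_last <= num_blocks:
        raise ValueError("from_last must be between 1 and num_blocks")
    return hidden_states[-from_last]


# Stand-in values instead of real tensors: the entry for block i is just i.
dummy_hidden_states = list(range(49))
print(select_block_output(dummy_hidden_states))  # prints 45
```

With the model from the Quick Start example, the hidden states would presumably come from something like `model(pixel_values, output_hidden_states=True).hidden_states`; whether the checkpoint's remote code honors that flag and follows the layout assumed here should be verified before relying on it.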

📄 License

This project is licensed under the MIT License.

📖 Citation

If you find this project useful in your research, please consider citing:

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{gao2024mini,
  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={arXiv preprint arXiv:2410.16261},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}

