ViTamin-XL-384px Open-source Visual-Language Model - Supports High-resolution Images and Multimodal Feature Extraction

ViTamin-XL-384px

Developed by jienengchen

ViTamin-XL-384px is a large-scale vision-language model built on the ViTamin architecture. It processes images at 384px resolution and extracts image and text features for multimodal tasks.

Image-to-Text

Transformers

Open Source License: MIT
Tags: #Multimodal Vision-Language #High-Resolution Image Processing #Open-Vocabulary Recognition

Release Date: 4/2/2024

Model Overview

ViTamin-XL-384px is a vision-language model used mainly for image feature extraction and text-image matching. It is based on the ViTamin architecture, takes 384px image inputs, and was pretrained with CLIP on DataComp-1B.

Model Features

High-Resolution Support

Accepts 384px image inputs (576 patches per image), which preserves finer image detail than the 224px and 256px variants.

Multimodal Feature Extraction

Extracts image and text features in a single forward pass, which supports cross-modal matching tasks.

Efficient Training

Pretrained with CLIP on DataComp-1B. The 384px ViTamin-XL checkpoints reach 82.6 to 82.9 ImageNet accuracy and an average of 68.1 over 38 datasets, depending on the number of seen samples.

Downstream Task Adaptation

ViTamin encoders have been evaluated on open-vocabulary detection, open-vocabulary segmentation, and large multimodal models. The downstream results reported on this page use the ViTamin-L encoder.

Model Capabilities

Image feature extraction

Text-image matching

Open-vocabulary detection

Open-vocabulary segmentation

Multimodal understanding

Use Cases

Computer Vision

Open-Vocabulary Object Detection

Object detection on unseen categories

OV-COCO (AP50 novel) 37.5, OV-LVIS (APr) 35.6 (ViTamin-L with the Sliding F-ViT detector)

Open-Vocabulary Image Segmentation

Panoptic and semantic segmentation of images, supporting recognition of new categories

ADE 27.3 PQ, Cityscapes 44.0 PQ (ViTamin-L with the Sliding FC-CLIP segmentor)

Multimodal Applications

Visual Question Answering

Answering natural language questions about image content

VQAv2 78.9, GQA 61.6 (ViTamin-L at 384px as the image encoder)

Image Retrieval

Retrieving relevant images based on text queries

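Average retrieval score of 63.8 (12.8B+0.5B seen samples) or 62.5 (40B+1B seen samples) for ViTamin-XL at 384px; the 61.8 figure belongs to ViTamin-L at 384px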
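Text-to-image retrieval with a CLIP-style model usually comes down to ranking image embeddings by their cosine similarity to a text embedding. The sketch below illustrates that ranking step only. The function names, the file names, and the three-dimensional toy vectors are hypothetical placeholders for features that would come from the model's image and text encoders; this is not the evaluation code behind the reported scores.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return the top_k (name, score) pairs, best match first."""
    scored = [(cosine(text_embedding, emb), name)
              for name, emb in image_embeddings.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Toy gallery: hypothetical stand-ins for real image features.
gallery = {
    "dog.png": [0.9, 0.1, 0.0],
    "cat.png": [0.1, 0.9, 0.0],
    "car.png": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # hypothetical text feature for a dog-related query

results = rank_images(query, gallery, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In practice the gallery embeddings would be computed once and cached, so each new text query only costs one text-encoder pass plus the similarity ranking.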

🚀 ViTamin-XL-384px Model Card

This is the official Hugging Face model of ViTamin, introduced in the following CVPR 2024 paper:

ViTamin: Design Scalable Vision Models in the Vision-language Era.
✨ Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille and Liang-Chieh Chen
🏠 Johns Hopkins University, Bytedance

🚀 Quick Start

You can load the model from Hugging Face with transformers.AutoModel; the tokenizer comes from open_clip:

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Use the GPU when one is available; every tensor below follows this device.
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

# Load the image and preprocess it.
image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
# Keep the pixels on the same device as the model.
pixel_values = pixel_values.to(torch.bfloat16).to(device)

# Tokenize the candidate labels.
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.autocast(device_type=device):
    image_features, text_features, logit_scale = model(pixel_values, text)
    # Cast both feature sets to float32 so the similarity product uses a single dtype.
    text_probs = (100.0 * image_features.to(torch.float) @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs)
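The last step of the snippet above multiplies image-text similarities by a fixed factor of 100.0 and applies a softmax; the returned logit_scale is not used there. The following is a minimal, dependency-free sketch of that computation on toy vectors, so the effect of the scale can be inspected without downloading the model. The function names and the three-dimensional vectors are hypothetical, and the sketch normalizes the features itself rather than assuming the model returns unit-length features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def label_probs(image_feat, text_feats, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_feat)
    logits = [scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
              for t in text_feats]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy features: hypothetical stand-ins for the model outputs.
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = label_probs(image_feat, text_feats)
soft_probs = label_probs(image_feat, text_feats, scale=1.0)
print("scale=100:", probs)
print("scale=1:  ", soft_probs)
```

A large scale sharpens the distribution toward the best-matching label, while a small scale leaves it close to uniform. This is why CLIP-style models apply a temperature before the softmax.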

✨ Features

Main Results with CLIP Pre-training on DataComp-1B

Property	Details
Model Type	ViTamin-XL-384px
Training Data	mlfoundations/datacomp_1b

Image Encoder	Image Size	Num Patches	Text Encoder Depth/Width	Seen Samples (B)	Trainable Params Image+Text (M)	MACs Image+Text (G)	ImageNet Acc.	Avg. 38 Datasets	ImageNet Dist. Shift.	VTAB	Retrieval
ViTamin-L	224	196	12/768	12.8	333.3+123.7	72.6+6.6	80.8	66.7	69.8	65.3	60.3
ViTamin-L	256	256	12/768	12.8+0.2	333.4+123.7	94.8+6.6	81.2	67.0	71.1	65.3	61.2
ViTamin-L	336	441	12/768	12.8+0.2	333.6+123.7	163.4+6.6	81.6	67.0	72.1	64.4	61.6
ViTamin-L	384	576	12/768	12.8+0.2	333.7+123.7	213.4+6.6	81.8	67.2	72.4	64.7	61.8
ViTamin-L2	224	196	24/1024	12.8	333.6+354.0	72.6+23.3	80.9	66.4	70.6	63.4	61.5
ViTamin-L2	256	256	24/1024	12.8+0.5	333.6+354.0	94.8+23.3	81.5	67.4	71.9	64.1	63.1
ViTamin-L2	336	441	24/1024	12.8+0.5	333.8+354.0	163.4+23.3	81.8	67.8	73.0	64.5	63.6
ViTamin-L2	384	576	24/1024	12.8+0.5	334.0+354.0	213.4+23.3	82.1	68.1	73.4	64.8	63.7
ViTamin-XL	256	256	27/1152	12.8+0.5	436.1+488.7	125.3+33.1	82.1	67.6	72.3	65.4	62.7
ViTamin-XL	384	576	27/1152	12.8+0.5	436.1+488.7	281.9+33.1	82.6	68.1	73.6	65.6	63.8
ViTamin-XL	256	256	27/1152	40	436.1+488.7	125.3+33.1	82.3	67.5	72.8	64.0	62.1
ViTamin-XL	336	441	27/1152	40+1	436.1+488.7	215.9+33.1	82.7	68.0	73.9	64.1	62.6
ViTamin-XL	384	576	27/1152	40+1	436.1+488.7	281.9+33.1	82.9	68.1	74.1	64.0	62.5
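The Num Patches column in the table is consistent with an overall stride of 16 pixels per patch, and the image-encoder MACs grow almost exactly in proportion to the patch count. The sketch below reproduces both relationships from the table's own numbers. The stride of 16 and the linear scaling rule are inferences from the table, not statements from the authors, and the function names are hypothetical.

```python
def num_patches(image_size: int, stride: int = 16) -> int:
    """Patch (token) count for a square image, assuming a total stride of 16."""
    side = image_size // stride
    return side * side

def scaled_macs(base_macs: float, base_size: int, target_size: int) -> float:
    """Estimate image-encoder MACs (G) by scaling linearly with the patch count."""
    return base_macs * num_patches(target_size) / num_patches(base_size)

# Patch counts for the resolutions listed in the table.
for size in (224, 256, 336, 384):
    print(size, num_patches(size))

# ViTamin-XL: start from the 125.3 GMACs reported at 256px.
xl_384 = scaled_macs(125.3, 256, 384)
# ViTamin-L: start from the 72.6 GMACs reported at 224px.
l_384 = scaled_macs(72.6, 224, 384)
print(round(xl_384, 1), round(l_384, 1))
```

The estimates land within 0.1 GMACs of the table values (281.9 for ViTamin-XL and 213.4 for ViTamin-L at 384px), which makes the rule a handy way to budget compute when choosing an input resolution.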

Main Results on Downstream tasks

Open-Vocab Detection

Image Encoder	Detector	OV-COCO (AP₅₀^novel)	OV-LVIS (AP_r)
ViT-L/14	Sliding F-ViT	36.1	32.5
ViTamin-L	Sliding F-ViT	37.5	35.6

Open-Vocab Segmentation

Image Encoder	Segmentor	ADE	Cityscapes	MV	A-150	A-847	PC-459	PC-59	PAS-21
ViT-L/14	Sliding FC-CLIP	24.6	40.7	16.5	31.8	14.3	18.3	55.1	81.5
ViTamin-L	Sliding FC-CLIP	27.3	44.0	18.2	35.6	16.1	20.4	58.4	83.4

Note: Panoptic datasets (ADE, Cityscapes, MV) are evaluated with PQ. Semantic datasets (A-150, A-847, PC-459, PC-59, PAS-21) are evaluated with mIoU.
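For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined. It is a simplified illustration, not the benchmarks' evaluation code: the function names and toy inputs are hypothetical, mIoU is computed over flat label lists, and PQ takes already-matched segments (predicted and ground-truth segments are conventionally matched when their IoU exceeds 0.5) for a single class, whereas benchmark PQ is averaged over classes. Both functions return fractions in [0, 1], while the table reports percentages.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

# Toy semantic example: four pixels, two classes.
miou = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)

# Toy panoptic example: two matched segments, one false positive, one false negative.
pq = panoptic_quality(matched_ious=[0.8, 0.6], num_fp=1, num_fn=1)

print(f"mIoU = {miou:.4f}, PQ = {pq:.4f}")
```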

Large Multi-modal Models

Image Encoder	Image Size	VQAv2	GQA	VizWiz	SQA	T-VQA	POPE	MME	MM-Bench	MM-B-CN	SEED	LLaVA-Wild	MM-Vet
ViTamin-L	336	78.4	61.6	51.1	66.9	58.7	84.6	1421	65.4	58.4	57.7	64.5	33.6
ViTamin-L	384	78.9	61.6	55.4	67.6	59.8	85.5	1447	64.5	58.3	57.9	66.1	33.6

📄 License

This project is licensed under the MIT license.
