ViTamin-XL-256px Open-source Vision-language Model - Efficient Feature Extraction for High-resolution Image Processing


Developed by jienengchen

ViTamin-XL-256px is a vision-language model based on the ViTamin architecture, designed for efficient visual feature extraction and multimodal tasks, supporting high-resolution image processing.

Text-to-Image

Transformers

Open Source License: MIT

Tags: Multimodal Vision-Language Model, High-Resolution Image Processing, Open-Vocabulary Detection

Downloads 655

Release Time: 4/8/2024

Model Overview

ViTamin-XL-256px is a scalable vision model that combines visual and language processing capabilities, suitable for image classification, open-vocabulary detection, segmentation, and multimodal tasks.

Model Features

High-Resolution Support

Supports image resolutions from 256px to 384px, adaptable to various scenario requirements.

Excellent Multi-Task Performance

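ViTamin-XL reaches 82.1% ImageNet accuracy at 256px, and the ViTamin family is also evaluated on open-vocabulary detection, open-vocabulary segmentation, and large multimodal model benchmarks (see the results tables).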

Scalable Architecture

The ViTamin design allows flexible adjustment of model scale and computational load, balancing performance and efficiency.

Model Capabilities

Image feature extraction

Text feature extraction

Multimodal alignment

Open-vocabulary detection

Semantic segmentation

Visual question answering

Use Cases

Computer Vision

Image Classification

Efficiently classifies images, supporting open-vocabulary labels.

ImageNet accuracy 82.1% (256px resolution)

Open-Vocabulary Detection

Detects objects from novel categories that were not present in the training set.

OV-COCO novel-class AP50 reaches 37.5% (reported for the ViTamin-L encoder with the Sliding F-ViT detector)

Multimodal Applications

Visual Question Answering

Answers complex questions by combining image and text inputs.

VQAv2 accuracy 78.4% (reported for the ViTamin-L encoder at 336px)

Image-Text Retrieval

Achieves cross-modal image-text matching and retrieval.

Retrieval scores range from 61.2 (ViTamin-L, 256px) to 63.8 (ViTamin-XL, 384px); ViTamin-XL at 256px scores 62.7
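
To make cross-modal retrieval concrete, here is a minimal sketch that ranks images by cosine similarity to a text embedding and computes recall@k. The source does not say how its retrieval score is defined, so recall@k is an assumption chosen for illustration. The embeddings are toy values, not model outputs, and the helper names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_embs, image_embs, k):
    """Fraction of texts whose paired image (same index) appears in the top k."""
    hits = sum(
        1 for i, text_emb in enumerate(text_embs)
        if i in rank_images(text_emb, image_embs)[:k]
    )
    return hits / len(text_embs)

# Toy embeddings (hypothetical values); text i is paired with image i.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]

r1 = recall_at_k(text_embs, image_embs, k=1)
r2 = recall_at_k(text_embs, image_embs, k=2)
```

In practice the embeddings would come from the model's image and text encoders, and the ranking would typically be done with a single batched matrix product rather than Python loops.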

🚀 Model card for ViTamin-XL-256px

This is the official Hugging Face model of ViTamin, introduced in the following CVPR 2024 paper:

ViTamin: Design Scalable Vision Models in the Vision-language Era.
✨ Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille and Liang-Chieh Chen
🏠 Johns Hopkins University, Bytedance

🚀 Quick Start

You can load the model from Hugging Face using transformers.AutoModel. The example below also needs torch, open_clip, and Pillow installed:

import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Use the GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-256px',
    trust_remote_code=True).to(device).eval()

# Load the image and preprocess it with the checkpoint's processor settings.
image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-256px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
# Keep the input on the same device as the model.
pixel_values = pixel_values.to(device=device, dtype=torch.bfloat16)

# Tokenize the candidate labels with the open_clip tokenizer.
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

# torch.autocast selects mixed precision for whichever device is in use.
with torch.no_grad(), torch.autocast(device_type=device):
    image_features, text_features, logit_scale = model(pixel_values, text)
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs)
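
The last step of the quick start turns image-text similarities into label probabilities. The sketch below reproduces that arithmetic in plain Python on toy vectors: scale the similarities by 100 and apply a softmax. It normalizes the features explicitly, which assumes cosine similarity is intended; whether the model already returns normalized features is not stated here. The function names and feature values are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_feature, text_features, scale=100.0):
    """Return one probability per text prompt for a single image feature."""
    img = l2_normalize(image_feature)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(text)))
        for text in text_features
    ]
    return softmax([scale * s for s in sims])

# Toy 3-dimensional features (hypothetical values, not real model outputs).
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image_feature, text_features)
```

The fixed scale of 100 mirrors the quick start, which does not use the returned logit_scale; a large scale sharpens the softmax so the best-matching prompt receives most of the probability mass.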

✨ Features

Main Results with CLIP Pre-training on DataComp-1B

Property	Details
Model Type	ViTamin-XL-256px
Training Data	mlfoundations/datacomp_1b

Image Encoder	Image Size	Num Patches	Text Encoder Depth/Width	Seen Samples (B)	Trainable Params Image+Text (M)	MACs Image+Text (G)	ImageNet Acc.	Avg. 38 Datasets	ImageNet Dist. Shift.	VTAB	Retrieval
ViTamin-L	224	196	12/768	12.8	333.3+123.7	72.6+6.6	80.8	66.7	69.8	65.3	60.3
ViTamin-L	256	256	12/768	12.8+0.2	333.4+123.7	94.8+6.6	81.2	67.0	71.1	65.3	61.2
ViTamin-L	336	441	12/768	12.8+0.2	333.6+123.7	163.4+6.6	81.6	67.0	72.1	64.4	61.6
ViTamin-L	384	576	12/768	12.8+0.2	333.7+123.7	213.4+6.6	81.8	67.2	72.4	64.7	61.8
ViTamin-L2	224	196	24/1024	12.8	333.6+354.0	72.6+23.3	80.9	66.4	70.6	63.4	61.5
ViTamin-L2	256	256	24/1024	12.8+0.5	333.6+354.0	94.8+23.3	81.5	67.4	71.9	64.1	63.1
ViTamin-L2	336	441	24/1024	12.8+0.5	333.8+354.0	163.4+23.3	81.8	67.8	73.0	64.5	63.6
ViTamin-L2	384	576	24/1024	12.8+0.5	334.0+354.0	213.4+23.3	82.1	68.1	73.4	64.8	63.7
ViTamin-XL	256	256	27/1152	12.8+0.5	436.1+488.7	125.3+33.1	82.1	67.6	72.3	65.4	62.7
ViTamin-XL	384	576	27/1152	12.8+0.5	436.1+488.7	281.9+33.1	82.6	68.1	73.6	65.6	63.8
ViTamin-XL	256	256	27/1152	40	436.1+488.7	125.3+33.1	82.3	67.5	72.8	64.0	62.1
ViTamin-XL	336	441	27/1152	40+1	436.1+488.7	215.9+33.1	82.7	68.0	73.9	64.1	62.6
ViTamin-XL	384	576	27/1152	40+1	436.1+488.7	281.9+33.1	82.9	68.1	74.1	64.0	62.5
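
As a reading aid for the table, the sketch below checks that the Num Patches column matches a square grid with one patch per 16 pixels, and sums the image and text columns for ViTamin-XL at 256px. The stride of 16 is inferred from the table values and is not stated in the source; the function name is hypothetical.

```python
def num_patches(image_size, stride=16):
    """Patches on a square grid; stride=16 is inferred from the table, not stated."""
    side = image_size // stride
    return side * side

# (Image Size, Num Patches) pairs copied from the table.
table_rows = [(224, 196), (256, 256), (336, 441), (384, 576)]
for size, expected in table_rows:
    assert num_patches(size) == expected

# ViTamin-XL at 256px: image + text values copied from the table.
params_m = round(436.1 + 488.7, 1)  # total trainable parameters, in millions
macs_g = round(125.3 + 33.1, 1)     # total MACs, in units of 10^9
```

Under this reading, ViTamin-XL at 256px has about 924.8M trainable parameters and 158.4G MACs across both encoders, and moving from 256px to 384px raises the patch count from 256 to 576.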

Main Results on Downstream Tasks

Open-Vocab Detection

Image Encoder	Detector	OV-COCO (AP₅₀^novel)	OV-LVIS (AP_r)
ViT-L/14	Sliding F-ViT	36.1	32.5
ViTamin-L	Sliding F-ViT	37.5	35.6

Open-Vocab Segmentation

Image Encoder	Segmentor	ADE	Cityscapes	MV	A-150	A-847	PC-459	PC-59	PAS-21
ViT-L/14	Sliding FC-CLIP	24.6	40.7	16.5	31.8	14.3	18.3	55.1	81.5
ViTamin-L	Sliding FC-CLIP	27.3	44.0	18.2	35.6	16.1	20.4	58.4	83.4

Note: The panoptic datasets (ADE, Cityscapes, MV) are evaluated with PQ. The semantic datasets (A-150, A-847, PC-459, PC-59, PAS-21) are evaluated with mIoU.
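
For readers unfamiliar with mIoU, the following is a minimal sketch of the usual definition: per-class intersection over union, averaged over classes, computed from a confusion matrix. It illustrates the metric only; the exact evaluation protocol behind the numbers above is not described here, and the pixel counts are hypothetical.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    Classes absent from both prediction and ground truth are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)

# Toy 2-class confusion matrix (hypothetical pixel counts).
confusion = [[8, 2],
             [1, 9]]
miou = mean_iou(confusion)
```

Here class 0 has IoU 8/11 and class 1 has IoU 9/12, so the mean is roughly 0.739; the table reports the same kind of quantity as a percentage.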

Large Multi-modal Models

Image Encoder	Image Size	VQAv2	GQA	VizWiz	SQA	T-VQA	POPE	MME	MM-Bench	MM-B-CN	SEED	LLaVA-Wild	MM-Vet
ViTamin-L	336	78.4	61.6	51.1	66.9	58.7	84.6	1421	65.4	58.4	57.7	64.5	33.6
ViTamin-L	384	78.9	61.6	55.4	67.6	59.8	85.5	1447	64.5	58.3	57.9	66.1	33.6

📄 License

This project is licensed under the MIT License.
