
ViTamin-XL-384px

Developed by jienengchen
ViTamin-XL-384px is a large-scale vision-language model built on the ViTamin architecture. It processes high-resolution (384px) images and extracts multimodal features.
Downloads 104
Release Time: 4/2/2024

Model Overview

ViTamin-XL-384px is a vision-language model used primarily for image feature extraction and text-image matching. It is based on the ViTamin architecture, takes high-resolution image input (384px), and reports results on open-vocabulary detection, open-vocabulary segmentation, visual question answering, and image retrieval.
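
Text-image matching with a vision-language model of this kind usually comes down to comparing an image embedding against one or more text embeddings. The sketch below shows one common way to do that, assuming the features have already been extracted as plain lists of floats. The function names, the toy vectors, and the scale value of 100.0 are illustrative assumptions and are not part of the model's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def match_probabilities(image_feat, text_feats, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    `scale` is an assumed temperature-like factor, chosen here for illustration.
    """
    logits = [scale * cosine_similarity(image_feat, t) for t in text_feats]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional features standing in for real model outputs (hypothetical values).
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = match_probabilities(image_feat, text_feats)
print(probs)
```

The text whose probability is highest is the best match for the image. Real embeddings have far more dimensions, but the comparison step is the same.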

Model Features

High-Resolution Support
Accepts image inputs at 384px resolution, which preserves finer image details.
Multimodal Feature Extraction
Extracts both image and text features, supporting cross-modal matching tasks.
Efficient Training
Pretrained on large-scale data such as DataComp-1B, which supports generalization to downstream tasks.
Downstream Task Adaptation
Adapts to downstream tasks such as open-vocabulary detection, segmentation, and multimodal understanding.

Model Capabilities

Image feature extraction
Text-image matching
Open-vocabulary detection
Open-vocabulary segmentation
Multimodal understanding

Use Cases

Computer Vision
Open-Vocabulary Object Detection
Detecting objects from categories not seen during training
OV-COCO: 37.5 AP50 (novel categories); OV-LVIS: 35.6 APr
Open-Vocabulary Image Segmentation
Segmenting images, including recognition of new categories
ADE: 27.3 PQ; Cityscapes: 44.0 PQ
Multimodal Applications
Visual Question Answering
Answering natural language questions about image content
VQAv2 78.9, GQA 61.6
Image Retrieval
Retrieving relevant images based on text queries
Average score of 61.8 in retrieval tasks
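
Text-to-image retrieval can be sketched as ranking a gallery of image embeddings by similarity to a text query embedding. The example below also computes Recall@K, a commonly used retrieval metric. The page does not say which metric the 61.8 average refers to, so Recall@K is used here only as an assumption. The features are assumed to be unit-normalized, and all names and toy values are hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_images(text_feat, image_feats):
    """Return image indices sorted from most to least similar to the query."""
    scores = [dot(text_feat, img) for img in image_feats]
    return sorted(range(len(image_feats)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_feats, image_feats, ground_truth, k):
    """Fraction of queries whose correct image index appears in the top k results."""
    hits = 0
    for query, target in zip(text_feats, ground_truth):
        if target in rank_images(query, image_feats)[:k]:
            hits += 1
    return hits / len(text_feats)


# Toy unit-length 2-dimensional features (hypothetical values).
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
text_feats = [[0.8, 0.6], [0.0, 1.0]]
ground_truth = [0, 1]  # index of the correct image for each text query

print(rank_images(text_feats[0], image_feats))
print(recall_at_k(text_feats, image_feats, ground_truth, k=1))
print(recall_at_k(text_feats, image_feats, ground_truth, k=2))
```

For large galleries, the exhaustive scoring loop would normally be replaced by a batched matrix product or an approximate nearest-neighbor index; the ranking logic stays the same.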