vit_so400m_patch14_siglip_378.v2_webli
A SigLIP 2-based Vision Transformer image encoder designed for image feature extraction, pretrained on the WebLI dataset.
Release date: February 21, 2025
Model Overview
This is a Vision Transformer based on the SigLIP 2 architecture, containing only the image encoder, and is intended for image feature extraction tasks. It is implemented with the timm library and is functionally equivalent to the image tower of the ViT-SO400M-14-SigLIP2-378 model on Hugging Face.
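A minimal sketch of loading the encoder with timm and extracting a pooled image embedding. The model name is assumed from the page title (`vit_so400m_patch14_siglip_378.v2_webli`), and the input image path is a placeholder; adjust both for your environment.

```python
# Minimal sketch: load the image encoder via timm and extract a pooled embedding.
# Model name assumed from the page title; "example.jpg" is a hypothetical input.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_so400m_patch14_siglip_378.v2_webli",
    pretrained=True,
    num_classes=0,  # drop the classification head so the forward pass returns pooled features
)
model.eval()

# Build preprocessing from the model's pretrained config (378x378 input, SigLIP normalization).
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))
print(features.shape)  # expected: torch.Size([1, 1152]) for the SO400M width
```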
Model Features
SigLIP 2 Architecture
Utilizes the improved SigLIP 2 architecture with enhanced semantic understanding and localization capabilities
Dense Feature Extraction
Extracts dense, per-patch feature representations from images (see the sketch after this list)
Large-scale Pretraining
Pretrained on the large-scale WebLI dataset
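A minimal sketch of dense feature extraction, assuming the same timm model name as above: `forward_features` returns unpooled patch tokens rather than a single pooled vector.

```python
# Minimal sketch: extract dense (per-patch) features with forward_features.
# Model name assumed from the page title; a random tensor stands in for a real image batch.
import timm
import torch

model = timm.create_model("vit_so400m_patch14_siglip_378.v2_webli", pretrained=True)
model.eval()

with torch.no_grad():
    tokens = model.forward_features(torch.randn(1, 3, 378, 378))

# 378 / 14 = 27 patches per side -> 27 * 27 = 729 tokens, each 1152-dimensional
# (SigLIP ViTs use attention pooling in the head, so there is no class token here).
print(tokens.shape)  # expected: torch.Size([1, 729, 1152])
```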
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Image Localization
Use Cases
Computer Vision
Image Retrieval
Uses extracted image embeddings to find and rank similar images (see the retrieval sketch at the end of this section)
Visual Localization
Identifies and locates specific objects or regions in images
Multimodal Applications
Vision-Language Tasks
Serves as a visual encoder for tasks like image-text matching
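A minimal sketch of the image retrieval use case, assuming the same timm model name as above: embed a small gallery and a query with the pooled features, then rank the gallery by cosine similarity. All file names are placeholders.

```python
# Minimal sketch: similarity-based image retrieval with pooled embeddings.
# Model name assumed from the page title; image paths are hypothetical.
import timm
import torch
import torch.nn.functional as F
from PIL import Image

model = timm.create_model(
    "vit_so400m_patch14_siglip_378.v2_webli", pretrained=True, num_classes=0
).eval()
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

def embed(paths):
    # Stack preprocessed images into a batch and return L2-normalized embeddings.
    batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        return F.normalize(model(batch), dim=-1)

gallery_paths = ["cat.jpg", "dog.jpg", "car.jpg"]  # hypothetical gallery
gallery = embed(gallery_paths)
query = embed(["query.jpg"])                       # hypothetical query image

# Cosine similarity between the query and each gallery image; higher means more similar.
scores = (query @ gallery.T).squeeze(0)
best = scores.argmax().item()
print(f"Most similar image: {gallery_paths[best]} (score {scores[best]:.3f})")
```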