vit_so400m_patch14_siglip_gap_378.v2_webli
Vision Transformer model based on the SigLIP 2 architecture, pre-trained on the WebLI dataset, with the attention pooling head removed and global average pooling applied.
Downloads: 20
Release date: 2/21/2025
Model Overview
This model is the visual encoder component of SigLIP 2, designed for image feature extraction and suitable for visual understanding in multimodal tasks.
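As a rough sketch of the encoder's geometry: the 378-pixel input resolution and 14-pixel patch size come from the model name, while the 1152-dimensional embedding width is an assumption based on the usual SoViT-400m configuration.

```python
# Token geometry implied by the model name: a 378x378 input
# split into non-overlapping 14x14 patches.
image_size = 378
patch_size = 14
embed_dim = 1152  # assumed width of the SoViT-400m backbone

patches_per_side = image_size // patch_size  # 27
num_tokens = patches_per_side ** 2           # 729 patch tokens (no CLS token with GAP)

print(patches_per_side, num_tokens)  # 27 729
```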
Model Features
SigLIP 2 Architecture Improvements
Utilizes an enhanced vision-language pre-training architecture for improved semantic understanding and localization capabilities
Global Average Pooling
Removes the attention pooling head, simplifying feature extraction with global average pooling (GAP)
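A minimal sketch of what GAP does to the encoder's token sequence: instead of a learned attention-pooling head, the patch tokens are simply averaged into one feature vector per image. The shapes below are illustrative (the 1152-dim width is an assumption).

```python
import numpy as np

# Illustrative token sequence: a batch of 2 images,
# 729 patch tokens each, 1152-dim features per token.
tokens = np.random.rand(2, 729, 1152).astype(np.float32)

# Global average pooling: a parameter-free mean over the token axis
# replaces the learned attention-pooling head.
pooled = tokens.mean(axis=1)

print(pooled.shape)  # (2, 1152): one feature vector per image
```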
Large-Scale Pre-training
Pre-trained on the large-scale WebLI dataset, providing robust visual representation capabilities
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Multimodal Task Visual Encoding
Use Cases
Computer Vision
Image Retrieval
Extracts image features for similar image search
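A minimal retrieval sketch, assuming image features have already been extracted; random vectors stand in for real embeddings, and the query is a slightly perturbed copy of one gallery image so the expected match is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pooled image features: a gallery of 100 images and one query
# that is a lightly perturbed copy of gallery image 42.
gallery = rng.normal(size=(100, 1152)).astype(np.float32)
query = gallery[42] + 0.01 * rng.normal(size=1152).astype(np.float32)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Rank the gallery by cosine similarity to the query.
scores = l2_normalize(gallery) @ l2_normalize(query)
best = int(np.argmax(scores))
print(best)  # 42: the perturbed source image ranks first
```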
Vision-Language Tasks
Serves as the visual encoder for multimodal models
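A hedged sketch of how a multimodal model might consume the pooled feature: a learned linear projection maps the visual vector into the language model's embedding space. Here the projection weights are random stand-ins, the 1152-dim input is the assumed encoder width, and the 4096-dim target is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

visual_dim, text_dim = 1152, 4096  # assumed encoder width; illustrative LM width

# Pooled feature for one image, as the encoder would emit after GAP.
visual_feature = rng.normal(size=(1, visual_dim)).astype(np.float32)

# A learned projection (random stand-in here) maps the visual feature
# into the language model's token-embedding space.
projection = rng.normal(size=(visual_dim, text_dim)).astype(np.float32)
visual_token = visual_feature @ projection

print(visual_token.shape)  # (1, 4096)
```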
© 2025 AIbase