vit_large_patch16_siglip_gap_384.v2_webli
A Vision Transformer model based on the SigLIP 2 architecture. This Global Average Pooling (GAP) variant removes the attention pooling head, making it suitable for image feature extraction tasks.
Downloads 95
Release Time: 2/21/2025
Model Overview
This is a SigLIP 2 ViT image encoder packaged for timm. It uses Global Average Pooling (GAP) for feature pooling and is intended for image feature extraction in computer vision tasks.
Model Features
SigLIP 2 Architecture
Utilizes an improved SigLIP 2 architecture with enhanced semantic understanding and localization capabilities.
Global Average Pooling
Removes the attention pooling head and employs Global Average Pooling (GAP) for feature processing.
High-Resolution Processing
Supports 384×384 resolution input.
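The GAP variant replaces SigLIP's learned attention pooling head with a plain mean over the patch tokens. A minimal sketch of that pooling step, using random stand-in tokens (a ViT-Large/16 at 384×384 yields (384/16)² = 576 patch tokens of dimension 1024):

```python
import torch

# Hypothetical token features: (batch, tokens, dim) = (2, 576, 1024).
tokens = torch.randn(2, 576, 1024)

# Global Average Pooling: a parameter-free mean over the token axis,
# in place of an attention pooling head.
pooled = tokens.mean(dim=1)  # (batch, dim)

print(pooled.shape)
```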
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Image Localization
Use Cases
Computer Vision
Image Retrieval
Extracts image features for similar image retrieval.
Vision-Language Tasks
Serves as a visual encoder for multimodal tasks.
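For the image-retrieval use case, a common approach is cosine similarity over the extracted embeddings. A small sketch with random 1024-dim vectors standing in for real image features (the gallery size and top-k value are arbitrary choices for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical gallery of image embeddings; random stand-ins for real features.
gallery = F.normalize(torch.randn(100, 1024), dim=-1)
query = F.normalize(torch.randn(1, 1024), dim=-1)

# Cosine similarity reduces to a dot product on L2-normalized vectors.
scores = query @ gallery.T        # (1, 100) similarity scores
top5 = scores.topk(5, dim=-1)     # indices of the best-matching gallery images
print(top5.indices)
```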