vit_large_patch16_siglip_gap_256.v2_webli
A ViT image encoder pretrained with SigLIP 2 that replaces the attention pooling head with global average pooling, designed for image feature extraction.
Release date: 2/21/2025
Model Overview
This model is a Vision Transformer (ViT) image encoder pretrained with the SigLIP 2 method, intended for image feature extraction tasks.
Model Features
SigLIP 2 Pretraining
Pretrained using the improved SigLIP 2 method, offering better semantic understanding and localization capabilities
Global Average Pooling
Employs global average pooling instead of an attention pooling head, simplifying the model structure
Dense Feature Extraction
Capable of extracting high-quality dense image features
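The global-average-pooling design above can be illustrated with a minimal NumPy sketch: the image-level feature is simply the mean of the per-patch token embeddings, with no learned pooling head. Shapes are illustrative assumptions: a 256x256 image with 16x16 patches yields 256 tokens, and ViT-Large uses a 1024-dim embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the encoder's output: one embedding per image patch.
patch_tokens = rng.standard_normal((256, 1024))  # (num_patches, embed_dim)

# Global average pooling: mean over the patch axis replaces the
# attention pooling head used by other SigLIP variants.
image_feature = patch_tokens.mean(axis=0)        # (embed_dim,)

print(image_feature.shape)  # (1024,)
```

Because every patch token contributes equally, the same token grid can also be reshaped into a dense feature map for localization-style tasks.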
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Image Localization
Use Cases
Computer Vision
Image Retrieval
Use the extracted image features for similar-image search
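A minimal sketch of such a retrieval step, assuming features have already been extracted for an indexed database: rank images by cosine similarity between the query feature and each stored feature. The vectors here are synthetic stand-ins.

```python
import numpy as np

def cosine_similarities(query, database):
    """Cosine similarity between one query vector and a matrix of features."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

rng = np.random.default_rng(1)
# Stand-in for 100 indexed image features (1024-dim, as from this encoder).
database = rng.standard_normal((100, 1024))
# A query that is a near-duplicate of indexed image 42.
query = database[42] + 0.01 * rng.standard_normal(1024)

scores = cosine_similarities(query, database)
best_match = int(np.argmax(scores))
print(best_match)  # 42
```

At larger scale the same dot-product ranking is usually delegated to an approximate nearest-neighbor index rather than a full matrix product.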
Visual Question Answering
Serves as the image encoder component in vision-language models