ViT Base Patch16 SigLIP GAP 512.v2 WebLI
A SigLIP 2 ViT image encoder that uses global average pooling in place of the attention pooling head, intended for image feature extraction.
Downloads 105
Release Time: 2/21/2025
Model Overview
This model is a SigLIP 2 ViT image encoder packaged for use with timm and intended primarily for image feature extraction. It is trained on the WebLI dataset and uses global average pooling (GAP) in place of the attention pooling head.
Model Features
SigLIP 2 Architecture
Builds on the SigLIP 2 architecture, which improves semantic understanding and localization over the original SigLIP.
Global Average Pooling
Replaces the attention pooling head with global average pooling (GAP), simplifying the model structure.
Dense Feature Extraction
Produces dense (per-patch) image features in addition to a single pooled image-level feature.
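To make the pooling and dense-feature ideas concrete, here is a minimal pure-Python sketch rather than the model's actual implementation. It assumes the encoder emits one token vector per 16x16 patch and no class token, so a 512x512 input yields a 32x32 grid of 1024 tokens. The helper names `global_average_pool` and `tokens_to_grid` are hypothetical, and the toy token values stand in for real encoder outputs.

```python
from typing import List

Vector = List[float]

# Geometry implied by the model name: 512-pixel input, 16-pixel patches.
IMAGE_SIZE = 512
PATCH_SIZE = 16
GRID_SIDE = IMAGE_SIZE // PATCH_SIZE   # patches per side
NUM_PATCHES = GRID_SIDE * GRID_SIDE    # tokens per image (assuming no class token)


def global_average_pool(tokens: List[Vector]) -> Vector:
    """Average all patch-token vectors into one image-level vector (GAP)."""
    if not tokens:
        raise ValueError("expected at least one token")
    dim = len(tokens[0])
    sums = [0.0] * dim
    for token in tokens:
        if len(token) != dim:
            raise ValueError("all tokens must share the same dimension")
        for i, value in enumerate(token):
            sums[i] += value
    return [s / len(tokens) for s in sums]


def tokens_to_grid(tokens: List[Vector], grid_h: int, grid_w: int) -> List[List[Vector]]:
    """Reshape a flat, row-major token sequence into a grid of dense features."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the requested grid")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: a 2x2 grid of 3-dimensional tokens (hypothetical values).
toy_tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
pooled = global_average_pool(toy_tokens)     # one vector for the whole image
dense = tokens_to_grid(toy_tokens, 2, 2)     # one vector per spatial location
```

The pooled vector suits image-level tasks such as retrieval, while the grid keeps the spatial layout that localization-style tasks need.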
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Image Localization
Use Cases
Computer Vision
Image Retrieval
Uses extracted image features for similar image retrieval.
Visual Question Answering
Serves as the image encoder component for vision-language models.
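As an illustration of the image retrieval use case, the sketch below ranks a small gallery by cosine similarity to a query feature. It is a simplified example under stated assumptions: the feature vectors are hand-written toy values rather than real encoder outputs, and the helper names `l2_normalize`, `cosine_similarity`, and `retrieve` are hypothetical. Cosine similarity is a common choice for this kind of ranking, not something the model card prescribes.

```python
import math
from typing import Dict, List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same dimension")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(
    query: List[float],
    gallery: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k gallery items most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, feat)) for name, feat in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy gallery of 2-dimensional "image features" (hypothetical values).
gallery = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.0, 1.0],
    "c.jpg": [1.0, 1.0],
}
results = retrieve([1.0, 0.1], gallery, top_k=2)
top_names = [name for name, _ in results]
```

For large galleries, the gallery features would typically be normalized once and stored, so that each query needs only dot products.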