vit_base_patch16_siglip_gap_384.v2_webli
A ViT image encoder based on SigLIP 2 that uses global average pooling (GAP) instead of an attention pooling head, suitable for image feature extraction tasks.
Release Time: 2/21/2025
Model Overview
This model is the Vision Transformer (ViT) image encoder from SigLIP 2, intended for extracting image features. The attention pooling head is removed and replaced with global average pooling, making it suitable for visual tasks that require dense features.
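A minimal usage sketch of pooled feature extraction with timm is shown below; it assumes the timm model id `vit_base_patch16_siglip_gap_384.v2_webli` and a local image `example.jpg` (both placeholders to adapt to your setup).

```python
# Minimal sketch: extract GAP-pooled image features with timm.
# Assumes the timm model id below and a local image "example.jpg".
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_base_patch16_siglip_gap_384.v2_webli",
    pretrained=True,
    num_classes=0,  # no classifier head; the output is the pooled feature vector
)
model.eval()

# Build the preprocessing pipeline (resize to 384x384, normalize) from the model's config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open("example.jpg").convert("RGB")
x = transform(image).unsqueeze(0)  # shape: (1, 3, 384, 384)

with torch.no_grad():
    features = model(x)  # global-average-pooled embedding, shape (1, 768) for ViT-Base

print(features.shape)
```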
Model Features
Global Average Pooling
Uses GAP instead of an attention pooling head, simplifying the model structure while preserving important features
SigLIP 2 Improvements
Based on the SigLIP 2 architecture, with improved semantic understanding, localization, and dense feature capabilities (a dense-feature sketch follows this feature list)
High-Resolution Support
Supports 384×384 resolution input, suitable for tasks requiring fine-grained features
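The sketch below illustrates the dense-feature use mentioned above: per-patch tokens from `forward_features` alongside their global average. It assumes the same model id and placeholder image as the loading example.

```python
# Minimal sketch of dense (per-patch) features; same assumptions as the loading example.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_base_patch16_siglip_gap_384.v2_webli", pretrained=True, num_classes=0
).eval()
transform = timm.data.create_transform(
    **timm.data.resolve_model_data_config(model), is_training=False
)
x = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    tokens = model.forward_features(x)  # per-patch tokens, (1, 576, 768) at 384x384 / patch 16
    pooled = tokens.mean(dim=1)         # global average pooling over the 576 patch tokens

# The per-patch `tokens` can feed dense tasks such as localization or segmentation heads;
# `pooled` approximates the model's own GAP output, up to final norm/head handling.
print(tokens.shape, pooled.shape)
```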
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Dense Feature Generation
Use Cases
Computer Vision
Image Retrieval
Extracts image features for similar-image search (a retrieval sketch follows the use-case list)
Visual Localization
Identifies specific objects or regions in images
Multimodal Applications
Vision-Language Tasks
Serves as a visual encoder for tasks like image-text matching
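The retrieval sketch referenced above ranks a small gallery by cosine similarity to a query image. The image paths are placeholders, and the model id is the same assumption as in the earlier examples.

```python
# Minimal retrieval sketch: rank gallery images by cosine similarity to a query.
import timm
import torch
import torch.nn.functional as F
from PIL import Image

model = timm.create_model(
    "vit_base_patch16_siglip_gap_384.v2_webli", pretrained=True, num_classes=0
).eval()
transform = timm.data.create_transform(
    **timm.data.resolve_model_data_config(model), is_training=False
)

def embed(paths):
    # Stack preprocessed images into a batch and return unit-normalized GAP features.
    batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model(batch)           # (N, 768) pooled features
    return F.normalize(feats, dim=-1)  # unit-normalize for cosine similarity

query = embed(["query.jpg"])
gallery = embed(["photo_a.jpg", "photo_b.jpg", "photo_c.jpg"])

scores = query @ gallery.T                       # cosine similarities, shape (1, 3)
print(scores.argsort(dim=-1, descending=True))   # gallery indices, best match first
```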