vit_base_patch16_siglip_512.v2_webli
Vision Transformer model based on SigLIP 2, designed for image feature extraction and pre-trained on the WebLI dataset
Downloads: 2,664
Release Date: 2025-02-21
Model Overview
This is a Vision Transformer (ViT) model based on the SigLIP 2 architecture, containing only the image encoder component. The model uses a patch size of 16, an input resolution of 512x512, and is pre-trained using a sigmoid loss function.
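Since the released weights follow the timm naming convention, the encoder can be used directly as a feature extractor. Below is a minimal sketch, assuming the checkpoint is available in timm under the identifier vit_base_patch16_siglip_512.v2_webli and that a local file example.jpg exists (both are assumptions, not guarantees from this page):

```python
# Minimal sketch: load the image encoder and extract a pooled embedding.
# Assumes the checkpoint is published in timm as
# "vit_base_patch16_siglip_512.v2_webli"; adjust the name if your hub differs.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_base_patch16_siglip_512.v2_webli",
    pretrained=True,
    num_classes=0,  # strip the head; return the pooled image embedding
)
model.eval()

# Build the matching 512x512 preprocessing pipeline from the model's config.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input file
with torch.no_grad():
    embedding = model(transform(image).unsqueeze(0))  # shape: (1, embed_dim)
```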
Model Features
SigLIP 2 Architecture
Utilizes the improved SigLIP 2 architecture with enhanced semantic understanding and localization capabilities
High-Resolution Processing
Supports high-resolution image input at 512x512
Dense Feature Extraction
Capable of extracting dense per-patch feature representations from images (see the sketch after this feature list)
Sigmoid Loss Function
Pre-trained using a sigmoid loss function to optimize vision-language alignment
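For the dense-feature item above, timm models expose forward_features, which returns the unpooled token sequence rather than a single vector. A short sketch under the same timm-identifier assumption; the token count follows from the 512x512 input and 16x16 patches (32 x 32 = 1,024 patches):

```python
# Sketch: dense (per-patch) features instead of a single pooled vector.
import timm
import torch

model = timm.create_model("vit_base_patch16_siglip_512.v2_webli", pretrained=True)
model.eval()

pixels = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed image
with torch.no_grad():
    # SigLIP ViTs carry no class token, so this is one token per patch:
    tokens = model.forward_features(pixels)  # expected shape: (1, 1024, embed_dim)
```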
Model Capabilities
Image Feature Extraction
Visual Semantic Understanding
Image Localization Analysis
Use Cases
Computer Vision
Image Retrieval
Extracts image features for similar-image retrieval (a retrieval sketch follows the use cases below)
Provides high-quality image embeddings
Vision-Language Tasks
Serves as a visual encoder for multimodal tasks
Enhanced visual semantic understanding capabilities
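To make the retrieval use case concrete, here is a hedged sketch that ranks a gallery of images by cosine similarity of their pooled embeddings. The retrieve helper is hypothetical, and the embeddings are assumed to come from the loading example above:

```python
# Sketch of similar-image retrieval: embed a gallery of images, then rank
# them by cosine similarity to a query embedding. Embeddings are pooled
# features from the model loaded with num_classes=0, as shown earlier.
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return indices of the k gallery embeddings closest to the query."""
    query = F.normalize(query_emb, dim=-1)       # (1, D)
    gallery = F.normalize(gallery_embs, dim=-1)  # (N, D)
    scores = query @ gallery.T                   # cosine similarities, (1, N)
    return scores.topk(k, dim=-1).indices

# Usage: indices = retrieve(embedding, torch.stack(list_of_embeddings), k=5)
```

Normalizing both sides before the dot product turns the score into a cosine similarity, the standard choice for ranking image embeddings.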