vit_base_patch16_siglip_384.webli
Vision Transformer image encoder from SigLIP, containing only the image tower and using the original attention pooling head
Downloads: 64
Release Time: 12/24/2024
Model Overview
This is a Vision Transformer based on the SigLIP architecture, designed for image feature extraction. The model uses a 384x384 input resolution with a 16x16 patch size, making it suitable for a wide range of computer vision tasks.
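The 384x384 resolution and 16x16 patch size together determine how many tokens the encoder processes per image. A quick sanity check (the 768-dimensional embedding width is the standard ViT-Base value, assumed here rather than stated in the card):

```python
# Patch-grid arithmetic for a ViT with 384x384 input and 16x16 patches.
image_size = 384
patch_size = 16

patches_per_side = image_size // patch_size   # 384 / 16 = 24
num_patches = patches_per_side ** 2           # 24 * 24 = 576 tokens

embed_dim = 768  # standard ViT-Base width (assumption, not stated above)

print(patches_per_side, num_patches)  # → 24 576
```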
Model Features
SigLIP Architecture
Vision Transformer using the SigLIP architecture, focused on image encoding
Original Attention Pooling
Uses the original attention pooling head, retaining more image feature information than simple average pooling
High-Resolution Processing
Supports high-resolution 384x384 inputs, suitable for detailed image analysis
Model Capabilities
Image Feature Extraction
Visual Representation Learning
Image Classification
Image Retrieval
Use Cases
Computer Vision
Image Classification
Serves as a feature-extraction backbone for image classification tasks
Image Retrieval
Extracted image features can be used for similar image retrieval
Visual Representation Learning
Used as a pre-trained model for downstream vision tasks
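Once embeddings are extracted, image retrieval reduces to nearest-neighbor search over them. A minimal sketch with made-up 4-dimensional vectors standing in for the encoder's 768-dimensional features (all filenames and values here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "image embeddings" (in practice: 768-dim vectors from the encoder).
database = {
    "cat.jpg": [0.9, 0.1, 0.0, 0.1],
    "dog.jpg": [0.1, 0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1, 0.1],
}
query = [0.85, 0.15, 0.05, 0.1]

# Rank database images by similarity to the query embedding.
ranked = sorted(
    database.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # → ['cat.jpg', 'kitten.jpg', 'dog.jpg']
```

At scale, the same idea is typically served by an approximate nearest-neighbor index rather than an exhaustive sort.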