vit_large_patch16_siglip_384.webli
A SigLIP-based Vision Transformer containing only the image encoder, using the original attention pooling, suitable for image feature extraction tasks.
Release Time: 12/24/2024
Model Overview
This model is a Vision Transformer based on the SigLIP architecture, designed specifically for image feature extraction. It splits 384x384 input images into 16x16 patches and efficiently extracts high-level visual features from them.
Model Features
SigLIP Architecture
Built on the SigLIP Vision Transformer architecture; the `.webli` suffix indicates pretraining on the WebLI dataset, optimized for image feature extraction.
Original Attention Pooling
Utilizes the original attention pooling mechanism, a learned attention-based head that aggregates patch tokens into a single image embedding and enhances the model's ability to capture key image features.
High-Resolution Support
Supports high-resolution 384x384 inputs, making it suitable for images with fine-grained detail.
Model Capabilities
Image Feature Extraction
Image Classification
Visual Representation Learning
Use Cases
Computer Vision
Image Classification
Used for image classification tasks: extract image features and train a lightweight classifier on top of them.
Visual Search
Used in visual search systems to extract image features for similarity matching against an indexed gallery.
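The visual-search use case above reduces to nearest-neighbor lookup over image embeddings. A minimal sketch with cosine similarity, using random placeholder embeddings (in practice these would come from the encoder, 1024-dim for this ViT-Large model); `top_k_matches` is a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def top_k_matches(query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Return indices of the k gallery embeddings most similar to the query."""
    q = F.normalize(query, dim=-1)   # unit-normalize so dot product = cosine
    g = F.normalize(gallery, dim=-1)
    scores = g @ q                   # (N,) cosine similarities
    return torch.topk(scores, k=min(k, gallery.shape[0])).indices

# Placeholder 1024-dim embeddings for a gallery of 100 images.
gallery = torch.randn(100, 1024)
# A query that is a near-duplicate of gallery image 7.
query = gallery[7] + 0.01 * torch.randn(1024)
print(top_k_matches(query, gallery, k=3))  # image 7 should rank first
```

In high dimensions, random embeddings are nearly orthogonal, so a near-duplicate query reliably retrieves its source image; real systems typically swap the brute-force matrix product for an approximate-nearest-neighbor index at scale.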