vit_so400m_patch14_siglip_gap_384.webli
Vision Transformer image encoder trained with SigLIP, using global average pooling to produce image features
Downloads: 96
Release Date: 12/24/2024
Model Overview
This model is an image encoder built on the Vision Transformer architecture and trained with the SigLIP method, designed primarily for image feature extraction. It accepts 384×384 input images with a 14×14 patch size and outputs features via Global Average Pooling (GAP).
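The model is distributed through the timm library under the identifier above. A minimal feature-extraction sketch, assuming a recent timm release; the image path is a hypothetical placeholder:

```python
# A minimal sketch of feature extraction with timm; the model identifier is
# from this card, while the image path is a hypothetical placeholder.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_so400m_patch14_siglip_gap_384.webli",
    pretrained=True,
    num_classes=0,  # no classifier head: forward() returns the pooled GAP features
)
model.eval()

# Build the matching preprocessing (resize to 384x384, normalize) from the
# model's pretrained data config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # shape: (1, feature_dim)
```

Setting `num_classes=0` strips any classification head, so the forward pass yields the pooled feature vector directly.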
Model Features
SigLIP Training Method
Trained with SigLIP (Sigmoid Loss for Language-Image Pre-training), which optimizes image-text alignment (see the loss sketch after this list)
Global Average Pooling
Applies a Global Average Pooling (GAP) layer at the end of the network to extract image features, yielding a single compact feature vector per image
High-resolution Processing
Supports 384×384-pixel input resolution, making it suitable for detail-rich images
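For context on the training objective named above: SigLIP replaces the usual softmax contrastive loss with a pairwise sigmoid loss, treating each image-text pair in a batch as an independent binary classification. A minimal sketch under that description (illustrative names, not this repository's API):

```python
# A minimal sketch of the SigLIP pairwise sigmoid loss (an illustration, not
# code from this repository). img_emb and txt_emb are L2-normalized (N, D)
# batches; t is a learnable log-temperature scalar, b a learnable bias.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    logits = img_emb @ txt_emb.T * t.exp() + b          # (N, N) pair scores
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(img_emb.size(0), device=img_emb.device) - 1.0
    # Every pair is an independent binary classification under a sigmoid.
    return -F.logsigmoid(labels * logits).sum() / img_emb.size(0)
```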
Model Capabilities
Image Feature Extraction
Visual Representation Learning
Use Cases
Computer Vision
Image Retrieval
Extracts image features for similar-image search (see the retrieval sketch below)
Visual Content Analysis
Analyzes image content and generates compact feature representations
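For the image-retrieval use case above, pooled GAP features are typically compared with cosine similarity. A minimal sketch, assuming a hypothetical precomputed gallery of features from this model:

```python
# A minimal sketch of similar-image search over GAP features; `gallery` is a
# hypothetical (N, D) tensor of precomputed features from this model and
# `query` a (D,) feature extracted the same way.
import torch
import torch.nn.functional as F

def top_k_similar(query, gallery, k=5):
    q = F.normalize(query.unsqueeze(0), dim=-1)  # (1, D) unit vector
    g = F.normalize(gallery, dim=-1)             # (N, D) unit vectors
    scores = (q @ g.T).squeeze(0)                # cosine similarities, (N,)
    return torch.topk(scores, k)                 # top-k values and indices
```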