
vit_so400m_patch14_siglip_gap_378.v2_webli

Developed by timm
A Vision Transformer based on the SigLIP 2 architecture, pre-trained on the WebLI dataset, with the attention pooling head removed and global average pooling applied.
Release Time: 2/21/2025

Model Overview

This model is the vision encoder component of SigLIP 2, designed for image feature extraction and suitable as the visual backbone in multimodal tasks.

Model Features

SigLIP 2 Architecture Improvements
Utilizes an enhanced vision-language pre-training architecture for improved semantic understanding and localization capabilities
Global Average Pooling
Removes the attention pooling head, simplifying feature extraction with global average pooling (GAP)
Large-Scale Pre-training
Pre-trained on the large-scale WebLI dataset, providing robust visual representation capabilities
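To make the pooling step concrete: at a 378x378 input with patch size 14, the encoder produces a 27x27 = 729 token grid, and GAP simply averages over the token axis (the width of 1152 is an assumption for SoViT-400M):

```python
import numpy as np

# 729 patch tokens: a 27x27 grid from a 378px input with 14px patches.
# The embedding width of 1152 is assumed for illustration.
tokens = np.random.randn(729, 1152)

# Global average pooling: mean over the token axis gives one vector per image.
pooled = tokens.mean(axis=0)  # shape: (1152,)
```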

Model Capabilities

Image Feature Extraction
Visual Semantic Understanding
Multimodal Task Visual Encoding

Use Cases

Computer Vision
Image Retrieval
Extracts image features for similar image search
Vision-Language Tasks
Serves as the visual encoder for multimodal models
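For the image-retrieval use case, a common pattern is to rank a gallery of feature vectors by cosine similarity to a query embedding. A small sketch with toy 3-dimensional vectors (real features would come from the encoder):

```python
import numpy as np

def cosine_rank(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Return gallery row indices sorted by descending cosine similarity."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)

# Toy 4-image gallery with 3-dim "embeddings" for illustration only.
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])

ranking = cosine_rank(query, gallery)  # most similar gallery images first
```

Normalizing both sides makes the dot product equal to cosine similarity, so ranking is invariant to feature magnitude.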