
Vit So400m Patch14 Siglip Gap 224.v2 Webli

Developed by timm
A ViT image encoder based on SigLIP 2, employing global average pooling with the attention pooling head removed, suitable for image feature extraction tasks.
Downloads 179
Release Time: 2/21/2025

Model Overview

This is a SigLIP 2 ViT image encoder packaged for use with timm. It is equivalent to the image tower of the ViT-SO400M-14-SigLIP2 model on Hugging Face, except that the gap variant replaces the attention pooling head with global average pooling.
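As a rough illustration of how such an encoder is typically loaded through timm, the sketch below wraps the usual steps in one function. The identifier in MODEL_NAME is an assumption inferred from this page's title, and extract_features is a hypothetical helper, not part of timm. Calling it needs timm, torch, and Pillow installed, and downloads the pretrained weights on first use.

```python
# Assumed timm identifier, inferred from the page title (not confirmed by the page).
MODEL_NAME = "vit_so400m_patch14_siglip_gap_224.v2_webli"


def extract_features(image_path: str):
    """Hypothetical helper: return one pooled feature vector for an image file."""
    # Heavy dependencies are imported lazily so the module loads without them.
    import timm
    import torch
    from PIL import Image

    # num_classes=0 drops any classifier so the model returns pooled features.
    model = timm.create_model(MODEL_NAME, pretrained=True, num_classes=0)
    model.eval()

    # Build the preprocessing pipeline from the model's pretrained config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0)  # add a batch dimension

    with torch.no_grad():
        return model(batch)  # shape: (1, model.num_features)
```

The function rebuilds the model on every call to keep the sketch short; real code would create the model once and reuse it across images.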

Model Features

SigLIP 2 Architecture
Built on SigLIP 2, which improves on the original SigLIP in semantic understanding, localization, and dense feature extraction.
Global Average Pooling
Uses global average pooling (gap) instead of the attention pooling head, simplifying the model structure.
Large-scale Pretraining
Pretrained on the WebLI dataset, offering robust visual representation capabilities.
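To make the global average pooling feature concrete, here is a minimal pure-Python sketch of the idea: the per-patch token vectors are averaged element-wise into one feature vector. The function names and the toy two-dimensional tokens are illustrative assumptions; the real model does this on tensors and may apply extra normalization that is not shown here.

```python
from typing import List


def num_patches(image_size: int = 224, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side


def global_average_pool(tokens: List[List[float]]) -> List[float]:
    """Average a sequence of patch-token vectors into one feature vector."""
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    sums = [0.0] * dim
    for token in tokens:
        if len(token) != dim:
            raise ValueError("all tokens must have the same dimension")
        for i, value in enumerate(token):
            sums[i] += value
    return [total / len(tokens) for total in sums]


# Toy example: three 2-dimensional "patch tokens".
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(toy_tokens)
print(pooled)         # [3.0, 4.0]
print(num_patches())  # 256 patches for a 224x224 image with 14x14 patches
```

Unlike an attention pooling head, this step has no learned parameters, which is the sense in which the gap variant is simpler.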

Model Capabilities

Image Feature Extraction
Visual Semantic Understanding
Image Localization
Dense Feature Extraction

Use Cases

Computer Vision
Image Classification
Can serve as a feature extractor for image classification tasks.
Visual Question Answering
Provides image feature representations for visual question answering systems.
Multimodal Applications
Image-Text Matching
Used for image encoding in image-text matching tasks.
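For the image-text matching use case, a common scoring step is cosine similarity between an image embedding and candidate text embeddings. The sketch below assumes both sets of embeddings already live in a shared space. This checkpoint provides only the image tower, and with the attention pooling head removed its pooled features may need a separately trained projection before they are comparable to text embeddings. All function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_texts(image_emb: Sequence[float],
               text_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return text indices sorted from best to worst match for the image."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for real model outputs.
image_embedding = [1.0, 0.0]
text_embeddings = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranking = rank_texts(image_embedding, text_embeddings)
print(ranking)  # [1, 2, 0]
```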