vit_large_patch16_siglip_gap_384.v2_webli

Developed by timm
A Vision Transformer (ViT) image encoder based on the SigLIP 2 architecture; this Global Average Pooling (GAP) variant replaces the attention pooling head and is suited to image feature extraction tasks.
Downloads: 95
Released: 2025-02-21

Model Overview

This is a SigLIP 2 ViT image encoder packaged for timm. It pools patch tokens with global average pooling rather than an attention pooling head, making it well suited to image feature extraction in computer vision pipelines.

Model Features

SigLIP 2 Architecture
Builds on the SigLIP 2 training recipe, which improves semantic understanding and localization over the original SigLIP.
Global Average Pooling
Removes the attention pooling head and employs Global Average Pooling (GAP) for feature processing.
High-Resolution Processing
Supports 384×384 resolution input.
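What the GAP head does can be illustrated in plain NumPy. The shapes below are assumptions derived from the card: a 384×384 input with patch size 16 yields (384/16)² = 576 patch tokens, and 1024 is ViT-Large's embedding width:

```python
import numpy as np

# Dummy patch-token output of the transformer backbone: (batch, tokens, dim).
# (384 // 16) ** 2 = 576 patch tokens; 1024 = ViT-Large embedding width.
tokens = np.random.rand(2, 576, 1024)

# Global average pooling: a simple mean over the token axis replaces the
# learned attention pooling head, producing one vector per image.
gap = tokens.mean(axis=1)

print(gap.shape)  # (2, 1024)
```

The trade-off is that GAP has no learned parameters, so the pooled feature weights every patch equally instead of attending to the most informative ones.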

Model Capabilities

Image Feature Extraction
Visual Semantic Understanding
Image Localization

Use Cases

Computer Vision
Image Retrieval
Extracts image features for similar image retrieval.
Vision-Language Tasks
Serves as a visual encoder for multimodal tasks.
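The image-retrieval use case reduces to nearest-neighbor search over L2-normalized feature vectors. A toy sketch with random stand-in embeddings (real ones would come from the encoder above); the 1024-dimensional size is an assumption matching ViT-Large:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in database of 5 image embeddings, L2-normalized so that a dot
# product equals cosine similarity.
db = rng.normal(size=(5, 1024))
db /= np.linalg.norm(db, axis=1, keepdims=True)

# Query: a slightly perturbed copy of database entry 2.
query = db[2] + 0.01 * rng.normal(size=1024)
query /= np.linalg.norm(query)

# Cosine similarity against every database entry; highest score wins.
sims = db @ query
best = int(np.argmax(sims))
print(best)  # → 2
```

At scale, the same dot-product search is typically handed to an approximate nearest-neighbor index rather than a dense matrix product.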