
vit_base_patch16_siglip_gap_384.v2_webli

Developed by timm
ViT image encoder based on SigLIP 2 that uses Global Average Pooling (GAP) instead of an attention pooling head, suitable for image feature extraction tasks.
Downloads: 105
Release date: 2025-02-21

Model Overview

This model is a Vision Transformer (ViT) implementation of SigLIP 2, designed for extracting image features. The attention pooling head is removed and replaced with global average pooling, making the model suitable for visual tasks that require dense features.

Model Features

Global Average Pooling
Uses GAP instead of an attention pooling head, simplifying the model (no learned pooling parameters) while keeping the full set of patch features available
SigLIP 2 Improvements
Based on SigLIP 2 architecture with enhanced semantic understanding, localization, and dense feature capabilities
High-Resolution Support
Supports 384×384 resolution input, suitable for tasks requiring fine-grained features
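To make the GAP feature concrete, here is a toy NumPy sketch of the shapes involved: a 384×384 input with 16×16 patches yields (384/16)² = 576 patch tokens of width 768 (ViT-Base), and GAP is simply the parameter-free mean over the token axis.

```python
import numpy as np

# Stand-in for the encoder's token output: 576 patch tokens, 768-dim each.
num_tokens = (384 // 16) ** 2   # 24 x 24 patches = 576 tokens
embed_dim = 768
tokens = np.random.rand(num_tokens, embed_dim)

# Global average pooling: a plain mean over tokens, with no learned
# parameters, unlike an attention pooling head that weights tokens
# via learned query vectors.
gap_feature = tokens.mean(axis=0)

print(num_tokens, gap_feature.shape)   # 576 (768,)
```

The per-token array before pooling is what "dense feature" tasks consume; the pooled vector is the single image-level descriptor.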

Model Capabilities

Image Feature Extraction
Visual Semantic Understanding
Dense Feature Generation

Use Cases

Computer Vision
Image Retrieval
Extracts image features for similar-image search
Visual Localization
Identifies specific objects or regions in images
Multimodal Applications
Vision-Language Tasks
Serves as the visual encoder for tasks such as image-text matching
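The image-retrieval use case reduces to nearest-neighbor search over the pooled embeddings. A small NumPy sketch with hypothetical pre-computed 768-dim vectors (in practice these would come from the encoder):

```python
import numpy as np

# Hypothetical gallery of five pre-computed 768-dim image embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 768))
# The query is a near-duplicate of gallery item 2 (tiny perturbation).
query = gallery[2] + 0.01 * rng.normal(size=768)

def cosine_sim(q, mat):
    # Cosine similarity between a query vector and each row of mat.
    q = q / np.linalg.norm(q)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return mat @ q

scores = cosine_sim(query, gallery)
best = int(np.argmax(scores))
print(best)   # 2 -- the near-duplicate ranks first
```

L2-normalizing before the dot product makes the score insensitive to embedding magnitude, which is the standard choice for retrieval over ViT features.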