
vit_base_patch16_siglip_gap_224.v2_webli

Developed by timm
Vision Transformer model based on SigLIP 2, utilizing global average pooling for image features
Downloads: 303
Release Date: 2/21/2025

Model Overview

This is a SigLIP 2 ViT image encoder packaged for timm. The attention pooling head is removed, and image features are obtained by global average pooling (GAP) over the patch tokens.
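Below is a minimal feature-extraction sketch using the standard timm API. It assumes the model identifier matches this card's title and that the pretrained weights are available from the Hugging Face Hub; the image path is a placeholder.

```python
from PIL import Image
import timm
import torch

# Assumed model id (taken from this card's title); weights are fetched on first use.
model = timm.create_model(
    'vit_base_patch16_siglip_gap_224.v2_webli',
    pretrained=True,
    num_classes=0,  # no classifier head: the model returns pooled image features
)
model = model.eval()

# Preprocessing that matches the model's pretraining configuration.
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # placeholder path, replace with a real image

with torch.no_grad():
    features = model(transforms(img).unsqueeze(0))  # globally averaged embedding, e.g. (1, 768)
```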

Model Features

Global Average Pooling
Uses global average pooling (GAP) in place of an attention pooling head, which simplifies feature extraction
SigLIP 2 Improvements
Built on the SigLIP 2 architecture, with improved semantic understanding and localization over the original SigLIP
Dense Feature Extraction
Produces high-quality dense, per-patch feature representations (see the sketch after this list)
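The dense tokens and the pooled embedding can both be accessed through timm's generic forward_features / forward_head split, as in this hedged sketch; the shapes in the comments are the values expected for a Base/16 model at 224x224 input, not verified against this checkpoint.

```python
import torch
import timm

# Same assumed model id as above.
model = timm.create_model(
    'vit_base_patch16_siglip_gap_224.v2_webli', pretrained=True, num_classes=0,
).eval()

x = torch.randn(1, 3, 224, 224)  # dummy batch; use the model's transforms for real images

with torch.no_grad():
    tokens = model.forward_features(x)                    # dense patch tokens, expected (1, 196, 768)
    pooled = model.forward_head(tokens, pre_logits=True)  # GAP over the tokens, expected (1, 768)
```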

Model Capabilities

Image Feature Extraction
Visual Semantic Understanding
Multimodal Task Support

Use Cases

Computer Vision
Image Retrieval
Uses the extracted image embeddings for similar-image search (see the retrieval sketch after this list)
Multimodal Tasks
Serves as a visual encoder for vision-language joint tasks
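As an illustration of the image-retrieval use case, here is a small, hypothetical helper (not part of timm) that ranks a gallery of precomputed embeddings by cosine similarity to a query embedding produced by the encoder above.

```python
import torch
import torch.nn.functional as F

def top_k_similar(query_emb: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    """Return indices and cosine-similarity scores of the k closest gallery embeddings.

    query_emb: (D,) embedding of the query image.
    gallery_embs: (N, D) embeddings of the indexed images.
    """
    q = F.normalize(query_emb.unsqueeze(0), dim=-1)  # (1, D), unit length
    g = F.normalize(gallery_embs, dim=-1)            # (N, D), unit length
    sims = (q @ g.T).squeeze(0)                      # cosine similarities, shape (N,)
    scores, idx = sims.topk(min(k, gallery_embs.shape[0]))
    return idx, scores
```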