vit_base_patch16_siglip_gap_256.v2_webli Open-source Model

Vit Base Patch16 Siglip Gap 256.v2 Webli

Developed by timm

A SigLIP 2 ViT image encoder that uses global average pooling (GAP) in place of the attention pooling head. It is suited to image feature extraction.

Multimodal Fusion

Transformers

Open Source License: Apache-2.0

#Multimodal Visual Encoding #Global Average Pooling #Semantic Understanding Enhancement

Release Time: 2/21/2025

Model Overview

This model is a SigLIP 2 ViT image encoder packaged for use with timm and intended primarily for image feature extraction. It was pre-trained on the WebLI dataset and uses global average pooling in place of the attention pooling head.
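As a rough illustration of feature extraction with timm, the sketch below loads the encoder by the model name shown on this page and embeds a single image. It assumes `timm`, `torch`, and `Pillow` are installed and that the weights can be downloaded. The function names `patch_token_count`, `load_encoder`, and `embed_image` are hypothetical helpers introduced here, not part of timm.

```python
def patch_token_count(image_size: int = 256, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def load_encoder(pretrained: bool = True):
    """Create the encoder with timm; num_classes=0 makes it return pooled features."""
    import timm  # imported lazily so patch_token_count works without timm installed

    model = timm.create_model(
        "vit_base_patch16_siglip_gap_256.v2_webli",
        pretrained=pretrained,  # True downloads the pre-trained weights
        num_classes=0,          # drop any classifier, keep the pooled embedding
    )
    return model.eval()


def embed_image(model, path: str):
    """Return one pooled embedding for the image file at `path`."""
    import timm
    import torch
    from PIL import Image

    # Build the preprocessing pipeline from the model's own data config.
    config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**config, is_training=False)

    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(transform(image).unsqueeze(0))  # shape: (1, feature_dim)


# A 256x256 input with 16x16 patches yields 16 * 16 patch tokens.
print(patch_token_count())
```

Calling `embed_image(load_encoder(), "example.jpg")` with a real image path in place of the placeholder `"example.jpg"` would return the pooled feature vector for that image.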

Model Features

SigLIP 2 Architecture

Built on SigLIP 2, which its paper describes as improving semantic understanding, localization, and dense features over the original SigLIP.

Global Average Pooling

Replaces the attention pooling head with global average pooling (GAP), simplifying the model structure.
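To make the pooling step concrete, here is a minimal NumPy sketch of global average pooling over patch tokens. It is an illustration of the general technique, not the timm implementation. The token count of 256 follows from a 256-pixel input with 16-pixel patches. The width of 768 is the standard ViT-Base embedding size and is an assumption, as is the helper name `global_average_pool`.

```python
import numpy as np


def global_average_pool(tokens: np.ndarray) -> np.ndarray:
    """Average patch tokens of shape (batch, num_tokens, dim) into (batch, dim)."""
    if tokens.ndim != 3:
        raise ValueError("expected an array of shape (batch, num_tokens, dim)")
    return tokens.mean(axis=1)


# Random stand-in for encoder outputs: 2 images, 256 patch tokens, width 768 (assumed).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 256, 768)).astype(np.float32)

pooled = global_average_pool(tokens)
print(pooled.shape)  # (2, 768)
```

Unlike an attention pooling head, this step has no learned parameters: every patch token contributes equally to the final embedding.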

WebLI Dataset Training

Pre-trained on the large-scale WebLI dataset.

Model Capabilities

Image Feature Extraction

Visual Semantic Understanding

Image Localization

Use Cases

Computer Vision

Image Retrieval

Uses extracted image features for similar image retrieval.
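One common way to do this is to rank gallery images by cosine similarity to the query embedding. The sketch below shows that approach with NumPy on tiny hand-made vectors. The helper names `l2_normalize` and `top_k_similar` are hypothetical, and in practice the vectors would be embeddings produced by the encoder.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_similar(query: np.ndarray, gallery: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k gallery rows most similar to the query."""
    scores = l2_normalize(gallery) @ l2_normalize(query)  # cosine similarities
    order = np.argsort(-scores)[:k]                       # highest score first
    return order, scores[order]


# Toy 3-dimensional "embeddings" standing in for real image features.
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [-1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0])

indices, scores = top_k_similar(query, gallery, k=2)
print(indices.tolist())  # [0, 2]
```

For large galleries, the normalized gallery matrix would typically be computed once and stored, so each query costs a single matrix-vector product.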

Visual Question Answering

Serves as the image encoder component in vision-language models.

Property	Details
Dataset	webli
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (https://arxiv.org/abs/2502.14786); Sigmoid Loss for Language Image Pre-Training (https://arxiv.org/abs/2303.15343)
