vit_base_patch16_siglip_gap_512.v2_webli: Open-Source Model for Image Feature Extraction


Developed by timm

A ViT image encoder based on SigLIP 2, using global average pooling with the attention pooling head removed, suitable for image feature extraction tasks.

Image Classification

Transformers

Open Source License: Apache-2.0

Tags: #Multimodal Visual Encoding #Global Average Pooling #Dense Feature Extraction


Release Time: 2/21/2025

Model Overview

This model is a SigLIP 2 ViT image encoder packaged for timm and intended primarily for image feature extraction. It was trained on the WebLI dataset and uses global average pooling (GAP) in place of the attention pooling head.

Model Features

SigLIP 2 Architecture

Built on SigLIP 2, which targets improved semantic understanding, localization, and dense features compared with the original SigLIP.

Global Average Pooling

Replaces the attention pooling head with global average pooling (GAP), simplifying the model structure.
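As a rough illustration of what global average pooling does, the sketch below averages a sequence of patch-token vectors into a single image embedding. It is a simplified, dependency-free toy, not timm's implementation: the function names `global_average_pool` and `num_patches` are hypothetical, normalization layers are ignored, and the patch count assumes a square 512x512 input with 16x16 patches, as the model name suggests.

```python
def num_patches(image_size=512, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side


def global_average_pool(tokens):
    """Average a list of token vectors (lists of floats) into one vector.

    Every token contributes equally, unlike an attention pooling head,
    which would learn a weighting over the tokens.
    """
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    count = len(tokens)
    return [sum(tok[d] for tok in tokens) / count for d in range(dim)]


# Toy example: three 2-dimensional "patch tokens".
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(tokens)  # [3.0, 4.0]

# Assumed 512x512 input with 16x16 patches: 32 * 32 = 1024 tokens to average.
patch_count = num_patches()
```

Because the pooling step has no learned parameters, removing the attention pooling head leaves the transformer blocks as the only learned part of the encoder in this simplified picture.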

Dense Feature Extraction

Produces per-patch (dense) image features in addition to the pooled embedding; improved dense features are one of the stated goals of SigLIP 2.

Model Capabilities

Image Feature Extraction

Visual Semantic Understanding

Image Localization

Use Cases

Computer Vision

Image Retrieval

Uses extracted image features for similar image retrieval.
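A minimal sketch of similarity-based retrieval, assuming image embeddings have already been extracted and are available as plain lists of floats. Cosine similarity is used here as a common choice for comparing embeddings; it is an assumption of this example, not something the model card prescribes. The names `cosine_similarity` and `rank_by_similarity` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional "embeddings"; real ones would come from the image encoder.
gallery = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
query = [1.0, 0.1]
ranking = rank_by_similarity(query, gallery)
best_match = ranking[0][0]
```

For large galleries, the same idea is usually implemented with batched matrix operations or an approximate nearest-neighbor index rather than a Python loop.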

Visual Question Answering

Serves as the image encoder component for vision-language models.

🚀 Model card for vit_base_patch16_siglip_gap_512.v2_webli

A SigLIP 2 ViT (image encoder only) for timm, equivalent to the image tower from https://huggingface.co/timm/ViT-B-16-SigLIP2-512. The gap variant uses global average pooling and removes the attention pooling head.

🚀 Quick Start

This SigLIP 2 ViT image encoder is loaded through timm and used for image feature extraction. It is equivalent to the image tower from https://huggingface.co/timm/ViT-B-16-SigLIP2-512; the gap variant replaces the attention pooling head with global average pooling.
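The following is a sketch of how the encoder might be loaded and applied with timm's usual feature-extraction pattern. It assumes `timm`, `torch`, and `Pillow` are installed and that the pretrained weights can be downloaded under the model name shown; the helper name `embed_image` and its `image_path` argument are hypothetical. Heavy imports are placed inside the function so the snippet can be imported without those packages.

```python
MODEL_NAME = "vit_base_patch16_siglip_gap_512.v2_webli"


def embed_image(image_path, model_name=MODEL_NAME):
    """Return a pooled embedding tensor for one image file."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import timm
    import torch
    from PIL import Image

    # num_classes=0 removes any classifier so the model returns pooled features.
    model = timm.create_model(model_name, pretrained=True, num_classes=0)
    model.eval()

    # Build the preprocessing pipeline (resize, normalize) from the model's config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0)  # add a batch dimension

    with torch.no_grad():
        return model(batch)
```

Calling `embed_image("path/to/image.jpg")` would return a tensor with one row per image. In practice the model and transform should be created once and reused across images rather than rebuilt on every call.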

✨ Features

Image Encoder: Specifically designed as an image encoder, suitable for image feature extraction tasks.
SigLIP 2 Technology: Based on the SigLIP 2 research, which improves semantic understanding, localization, and dense features.
gap Variant: Uses global average pooling, removing the attention pooling head for a more straightforward architecture.

📚 Documentation

📋 Model Details

Property	Details
Dataset	webli
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
	Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343

📄 Citation

@article{tschannen2025siglip,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and H{\'e}naff, Olivier and Harmsen, Jeremiah and Steiner, Andreas and Zhai, Xiaohua},
  year={2025},
  journal={arXiv preprint arXiv:2502.14786}
}

@inproceedings{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={11975--11986},
  year={2023}
}

📄 License

This model is licensed under the Apache-2.0 license.
