vit_so400m_patch16_siglip_gap_256.v2_webli Open Source Model

Developed by timm

A ViT image encoder from SigLIP 2 that uses global average pooling in place of the attention pooling head. It is suited to image feature extraction tasks.

Image Feature Extraction

Transformers

Open Source License: Apache-2.0

#Multimodal Visual Encoding #Global Average Pooling #Enhanced Semantic Understanding

Downloads 22

Release Time: 2/21/2025

Model Overview

This model is the image encoder of a SigLIP 2 ViT, packaged for timm. It uses global average pooling (GAP) instead of an attention pooling head and is intended primarily for image feature extraction.

Model Features

SigLIP 2 Architecture

Uses SigLIP 2 weights. According to the SigLIP 2 paper, these improve on the original SigLIP in semantic understanding, localization, and dense features.

Global Average Pooling

Uses global average pooling (GAP) instead of an attention pooling head to simplify the model structure.

Multilingual Support

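Pretrained on the WebLI dataset as part of the multilingual SigLIP 2 vision-language encoders. This checkpoint contains only the image encoder, so it takes images as input, not text.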

Model Capabilities

Image Feature Extraction

Semantic Understanding

Visual Localization

Use Cases

Computer Vision

Image Retrieval

Efficient image retrieval using extracted image features.

Visual Question Answering

Used as the image encoder part of vision-language models.
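
For the image retrieval use case above, one common approach is to rank gallery images by the cosine similarity of their feature vectors to a query vector. The sketch below assumes the features have already been extracted as PyTorch tensors; the function name `rank_by_cosine_similarity` and the toy vectors are illustrative only and do not come from the model card.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine_similarity(query: torch.Tensor, gallery: torch.Tensor, top_k: int = 5):
    """Rank gallery feature vectors by cosine similarity to one query vector.

    query:   tensor of shape (dim,)
    gallery: tensor of shape (num_images, dim)
    Returns (indices, scores) of the best matches, most similar first.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    q = F.normalize(query, dim=-1)
    g = F.normalize(gallery, dim=-1)
    scores = g @ q  # shape: (num_images,)
    k = min(top_k, gallery.shape[0])
    top = torch.topk(scores, k)
    return top.indices, top.values


# Toy 2-D vectors standing in for real image features (hypothetical data).
gallery = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = torch.tensor([2.0, 0.1])
indices, scores = rank_by_cosine_similarity(query, gallery, top_k=3)
print(indices.tolist(), scores.tolist())
```

With real data, `gallery` would hold one feature vector per indexed image. For large collections, an approximate nearest-neighbor index is usually preferable to this brute-force comparison.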

🚀 Model card for vit_so400m_patch16_siglip_gap_256.v2_webli

This is a SigLIP 2 ViT (image encoder only) designed for timm. It is equivalent to the image tower from https://huggingface.co/timm/ViT-SO400M-16-SigLIP2-256. The gap variant of this model uses global average pooling and has the attention pooling head removed.

🚀 Quick Start

This section provides a brief introduction to the model. For more detailed usage, please refer to the official timm documentation.

✨ Features

Based on SigLIP 2: Uses the SigLIP 2 image encoder weights for image feature extraction.
Designed for timm: Packaged as a timm model, so it can be created and used like other timm ViT models.
Global Average Pooling: The gap variant averages the output tokens instead of using an attention pooling head, which simplifies the model structure.
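
To make the global average pooling feature concrete, the sketch below averages a batch of patch tokens into one vector per image. It is a minimal illustration of the pooling operation only. The token count follows from the model name (256-pixel input, 16-pixel patches), while the feature width of 8 is an arbitrary small value chosen for the demo. The actual timm head may apply additional steps, such as normalization, that are not shown here.

```python
import torch


def global_average_pool(tokens: torch.Tensor) -> torch.Tensor:
    """Average tokens of shape (batch, num_tokens, dim) into (batch, dim)."""
    if tokens.ndim != 3:
        raise ValueError(f"expected a 3-D tensor, got shape {tuple(tokens.shape)}")
    return tokens.mean(dim=1)


# A 256x256 input split into 16x16 patches yields (256 // 16) ** 2 = 256 tokens.
num_tokens = (256 // 16) ** 2
tokens = torch.randn(2, num_tokens, 8)  # dim=8 is illustrative, not the real width
pooled = global_average_pool(tokens)
print(pooled.shape)  # torch.Size([2, 8])
```

Unlike an attention pooling head, this operation has no learned parameters, which is the simplification the gap variant refers to.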

📚 Documentation

Model Details

Property	Details
Model Type	SigLIP 2 ViT (image encoder only)
Training Data	WebLI
Related Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (https://arxiv.org/abs/2502.14786); Sigmoid Loss for Language Image Pre-Training (https://arxiv.org/abs/2303.15343)

Citation

@article{tschannen2025siglip,
          title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
          author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and H{\'e}naff, Olivier and Harmsen, Jeremiah and Steiner, Andreas and Zhai, Xiaohua},
          year={2025},
          journal={arXiv preprint arXiv:2502.14786}
        }

@inproceedings{zhai2023sigmoid,
          title={Sigmoid loss for language image pre-training},
          author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
          booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
          pages={11975--11986},
          year={2023}
        }

📄 License

This model is licensed under the Apache-2.0 license.
