vit_so400m_patch16_siglip_gap_512.v2_webli Open-Source Image Encoder


Vit So400m Patch16 Siglip Gap 512.v2 Webli

Developed by timm

A ViT image encoder based on SigLIP 2, utilizing global average pooling, suitable for vision-language tasks.

Image Feature Extraction

Transformers

Open Source License: Apache-2.0

Tags: #Multimodal Visual Encoding #Global Average Pooling #High Semantic Understanding

Downloads 21

Release Time: 2/21/2025

Model Overview

This model is the ViT image encoder from SigLIP 2, packaged for use with timm. The attention pooling head has been removed and replaced by global average pooling (the "gap" in the model name). It is intended primarily for image feature extraction, including as the vision component in vision-language tasks.
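As a rough illustration of the feature-extraction use described above, the following sketch loads the encoder through timm and embeds one image. It assumes a recent timm release that registers this model name, plus PyTorch and network access to download the weights; the helper names `load_encoder` and `embed_image` are hypothetical and exist only for this example.

```python
MODEL_NAME = "vit_so400m_patch16_siglip_gap_512.v2_webli"


def load_encoder(model_name: str = MODEL_NAME):
    """Hypothetical helper: build the encoder and its matching eval transform."""
    import timm  # imported lazily so the module can be inspected without timm installed

    # num_classes=0 asks timm for pooled features instead of classifier logits.
    model = timm.create_model(model_name, pretrained=True, num_classes=0)
    model.eval()

    # Derive resize/normalization settings from the model's pretrained config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)
    return model, transform


def embed_image(model, transform, image):
    """Hypothetical helper: return one pooled embedding for a PIL image."""
    import torch

    with torch.no_grad():
        batch = transform(image).unsqueeze(0)  # add a batch dimension
        return model(batch)  # shape: (1, embedding_dim)
```

A typical call would be `model, transform = load_encoder()` followed by `embed_image(model, transform, some_pil_image)`, where `some_pil_image` is any RGB image opened with Pillow.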

Model Features

SigLIP 2 Architecture

Based on SigLIP 2, which its paper presents as improving semantic understanding, localization, and dense features over the original SigLIP.

Global Average Pooling

The attention pooling head is removed. The image embedding is instead computed by averaging the output patch tokens.
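The pooling step itself is simple enough to sketch in NumPy. The example below is an illustration of global average pooling over patch tokens, not the timm implementation. The token count follows from a 512x512 input split into 16x16 patches; the embedding width of 1152 and the random inputs are assumptions made for the example.

```python
import numpy as np


def global_average_pool(tokens: np.ndarray) -> np.ndarray:
    """Average patch tokens over the sequence axis: (batch, tokens, dim) -> (batch, dim)."""
    if tokens.ndim != 3:
        raise ValueError("expected an array of shape (batch, tokens, dim)")
    return tokens.mean(axis=1)


# A 512x512 image with 16x16 patches gives (512 // 16) ** 2 = 1024 patch tokens.
num_tokens = (512 // 16) ** 2
embed_dim = 1152  # assumed width, used here only for illustration

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((2, num_tokens, embed_dim)).astype(np.float32)

pooled = global_average_pool(patch_tokens)
print(pooled.shape)  # (2, 1152)
```

Unlike an attention pooling head, this step has no learned parameters: every patch token contributes equally to the final embedding.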

Large-scale Pretraining

Pretrained on the WebLI dataset, which gives the encoder general-purpose image features.

Model Capabilities

Image Feature Extraction

Vision-Language Task Processing

Use Cases

Computer Vision

Image Classification

Can be used for image classification by extracting image features and training a classifier on top of them.
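One lightweight way to classify from extracted features is a nearest-centroid classifier over normalized embeddings. The sketch below is only an illustration of that idea: the features are synthetic stand-ins generated with NumPy (not real encoder outputs), and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fit_centroids(features: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Compute one mean embedding per class from normalized features."""
    feats = l2_normalize(features)
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])


def predict(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each feature vector to the class with the most similar centroid."""
    sims = l2_normalize(features) @ l2_normalize(centroids).T
    return sims.argmax(axis=1)


# Synthetic stand-in for encoder outputs: two well-separated clusters.
rng = np.random.default_rng(0)
dim = 32
class_means = np.stack([np.full(dim, 1.0), np.full(dim, -1.0)])
labels = np.array([0] * 10 + [1] * 10)
features = class_means[labels] + 0.1 * rng.standard_normal((20, dim))

centroids = fit_centroids(features, labels, num_classes=2)
preds = predict(features, centroids)
```

With real data, `features` would be the pooled embeddings of labeled training images, and a linear classifier could be swapped in where nearest-centroid is too coarse.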

Vision-Language Tasks

Can serve as the image encoder in vision-language systems, for example as the visual backbone of an image captioning model.

Property	Details
Dataset	WebLI
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
	Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343
