vit_so400m_patch16_siglip_384.v2_webli

Vit So400m Patch16 Siglip 384.v2 Webli

Developed by timm

Vision Transformer (ViT) image encoder from SigLIP 2, intended for image feature extraction and pre-trained on the WebLI dataset.

Downloads 2,073

Release Time: 2/21/2025

Model Overview

This model is the vision encoder of SigLIP 2. It uses the ViT architecture and is suited to image understanding and feature extraction tasks.
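As a rough illustration of feature extraction with this encoder, the sketch below loads the checkpoint through timm and returns one pooled embedding per image. It assumes `timm`, `torch`, and `Pillow` are installed and that the weights can be downloaded; the helper name `embed_image` is hypothetical and not part of timm.

```python
MODEL_NAME = "vit_so400m_patch16_siglip_384.v2_webli"


def embed_image(image_path: str):
    """Return a pooled embedding for one image (hypothetical helper)."""
    # Heavy dependencies are imported lazily so that importing this
    # module does not require them.
    import timm
    import torch
    from PIL import Image

    # num_classes=0 drops any classifier head, so the model returns
    # pooled features instead of logits.
    model = timm.create_model(MODEL_NAME, pretrained=True, num_classes=0)
    model.eval()

    # Build the preprocessing pipeline from the model's pretrained config
    # (input size, interpolation, normalization).
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0)  # add a batch dimension

    with torch.no_grad():
        return model(batch)  # tensor of shape (1, embedding_dim)
```

Calling `embed_image("photo.jpg")` (the path is a placeholder) would load the model on every call; in practice the model and transform would be created once and reused.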

SigLIP 2 Architecture

Built on SigLIP 2, which improves on the original SigLIP in semantic understanding and localization.

Dense Feature Extraction

Extracts dense (per-patch) feature representations from images.
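To make "dense features" concrete, here is a minimal sketch that turns a sequence of patch tokens into a spatial feature map. It assumes, from the model name, a 384-pixel input with 16-pixel patches, and that the tokens come from something like timm's `model.forward_features(batch)` with shape `(batch, tokens, channels)` and no extra prefix tokens; both helper names are hypothetical.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Number of patches along each side of a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side


def tokens_to_feature_map(tokens, image_size: int = 384, patch_size: int = 16):
    """Reshape (B, N, C) patch tokens into a (B, C, H, W) feature map.

    `tokens` is expected to be a torch tensor; only tensor methods are
    used, so this module itself needs no imports.
    """
    height, width = patch_grid(image_size, patch_size)
    batch, num_tokens, channels = tokens.shape
    if num_tokens != height * width:
        # The model may prepend class or register tokens; strip them first.
        raise ValueError(
            f"expected {height * width} patch tokens, got {num_tokens}"
        )
    # (B, N, C) -> (B, H, W, C) -> (B, C, H, W)
    return tokens.reshape(batch, height, width, channels).permute(0, 3, 1, 2)


grid = patch_grid(384, 16)  # 384 / 16 = 24 patches per side
```

Under these assumptions each image yields a 24 x 24 grid, i.e. 576 patch features, which can feed localization or segmentation heads.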

Large-scale Pre-training

Pre-trained on the large-scale WebLI dataset.

Image Feature Extraction

Visual Semantic Understanding

Image Localization

Computer Vision

Image Retrieval

Uses the extracted image features to retrieve visually similar images.
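A retrieval step over such features could look like the sketch below, which ranks a small gallery by cosine similarity to a query embedding. The three-dimensional vectors and image names are made-up toy data standing in for real encoder outputs, and pure Python is used only to keep the example dependency-free; a real system would use tensor operations or a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, gallery, k=3):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in gallery.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings (illustrative only, not real model outputs).
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
results = top_k_similar(query, gallery, k=2)
```

Here the two "cat" entries rank above "car_1" because their vectors point in nearly the same direction as the query.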

Vision-Language Tasks

Serves as a visual encoder for multimodal tasks

Property	Details
Dataset	webli
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
	Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343
