vit_base_patch16_siglip_384.v2_webli Open-source Image Feature Extraction Model - Precise Support for Image Analysis and Processing


Vit Base Patch16 Siglip 384.v2 Webli

Developed by timm

Vision Transformer image encoder from SigLIP 2, designed for image feature extraction and pre-trained on the WebLI dataset

Image Feature Extraction

Transformers

Open Source License: Apache-2.0 #Multimodal Visual Encoding #High Semantic Understanding #Dense Feature Extraction

Downloads 330

Release Time: 2/21/2025

Model Overview

This is a SigLIP 2 Vision Transformer (ViT) model that contains only the image encoder, which makes it suitable for image feature extraction tasks. It was pre-trained with the pairwise sigmoid loss introduced in the original SigLIP paper.
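A minimal sketch of how the encoder might be loaded and applied through timm. It assumes `timm`, `torch`, and Pillow are installed, that the installed timm release includes this checkpoint, and that the weights can be downloaded on first use. The helper names `load_encoder` and `embed_image` are illustrative and not part of the model card.

```python
MODEL_NAME = "vit_base_patch16_siglip_384.v2_webli"


def load_encoder(pretrained=True):
    """Create the image encoder and its matching preprocessing transform."""
    import timm  # imported lazily so this module loads without timm installed

    # num_classes=0 removes any classifier head, so the model returns pooled features.
    model = timm.create_model(MODEL_NAME, pretrained=pretrained, num_classes=0)
    model.eval()
    # Read input size, normalization and interpolation from the model's config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)
    return model, transform


def embed_image(model, transform, image):
    """Return one pooled embedding for a PIL image, shaped (1, embed_dim)."""
    import torch

    batch = transform(image.convert("RGB")).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        return model(batch)
```

Typical use would be `model, transform = load_encoder()` followed by `embed_image(model, transform, some_pil_image)`, where `some_pil_image` is any image opened with Pillow.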

Model Features

SigLIP 2 Improvements

Trained with the SigLIP 2 recipe, which its paper describes as improving semantic understanding, localization, and dense features over the original SigLIP

Dense Feature Extraction

Extracts dense, per-patch feature representations from images
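To make "dense" concrete: the model name implies a 384-pixel input split into 16-pixel patches, which gives a 24 x 24 grid of 576 patch tokens. The sketch below shows that arithmetic and how a flat token sequence could be folded back into a spatial grid. It assumes square inputs, row-major token order, and no extra class token in the sequence. The function names are illustrative.

```python
def patch_grid(image_size=384, patch_size=16):
    """Return (rows, cols) of the patch grid for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side


def tokens_to_grid(tokens, grid_h, grid_w):
    """Fold a flat, row-major token sequence into grid_h rows of grid_w tokens."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the grid size")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
```

With timm, the unpooled token sequence is typically obtained from `model.forward_features(batch)` rather than the plain forward call. Whether that sequence contains only patch tokens depends on the model configuration, so checking its shape first is advisable.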

Large-scale Pre-training

Pre-trained on WebLI, a large-scale dataset of web images paired with text
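The pre-training objective is the pairwise sigmoid loss from the SigLIP paper: every image-text pair in a batch is treated as an independent binary classification, with matching pairs labeled +1 and all others -1. The sketch below is a plain-Python illustration of that formula, assuming L2-normalized embeddings and fixed temperature and bias values (in actual training both are learnable). It is not the training code used for this model.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of paired image/text embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total += log_sigmoid(label * logit)
    return -total / n
```

Because each pair is scored independently, the loss needs no softmax normalization across the batch, which is the main difference from the contrastive loss used by CLIP-style models.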

Model Capabilities

Image Feature Extraction

Visual Semantic Understanding

Image Localization

Use Cases

Computer Vision

Image Retrieval

Using extracted image features to retrieve similar images
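A minimal sketch of feature-based retrieval, assuming each gallery image has already been converted to an embedding vector (for example, the pooled output of the encoder). Images are ranked by cosine similarity to the query embedding. The function names and the dictionary-based gallery are illustrative choices. A real system would more likely use a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, gallery, top_k=3):
    """Return the top_k (name, score) pairs from gallery, most similar first.

    gallery maps an image name to its embedding vector.
    """
    scored = [(cosine_similarity(query, emb), name) for name, emb in gallery.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]
```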

Visual Localization

Identifying and locating key regions in images
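One simple way to turn per-patch scores into a location is to map the highest-scoring patch back to its pixel box. The sketch below assumes a 384-pixel square input, 16-pixel patches, and row-major patch order. The scores themselves (for example, similarity of each patch feature to a query) are assumed to be computed elsewhere. This is an illustration, not the localization method evaluated in the SigLIP 2 paper.

```python
def patch_to_box(index, image_size=384, patch_size=16):
    """Return (left, top, right, bottom) pixel coordinates of a patch."""
    side = image_size // patch_size
    if not 0 <= index < side * side:
        raise IndexError("patch index out of range")
    row, col = divmod(index, side)
    left, top = col * patch_size, row * patch_size
    return left, top, left + patch_size, top + patch_size


def best_patch_box(scores, image_size=384, patch_size=16):
    """Return the pixel box of the highest-scoring patch."""
    side = image_size // patch_size
    if len(scores) != side * side:
        raise ValueError("expected one score per patch")
    index = max(range(len(scores)), key=scores.__getitem__)
    return patch_to_box(index, image_size, patch_size)
```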

Property	Details
Dataset	webli
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
	Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343
