vit_large_patch16_siglip_gap_384.v2_webli

Developed by timm
A Vision Transformer (ViT) image encoder based on the SigLIP 2 architecture; this Global Average Pooling (GAP) variant replaces the attention pooling head and is suited to image feature extraction tasks.
Downloads: 95
Released: 2025-02-21

Model Overview

This is a SigLIP 2 ViT image encoder packaged for timm. It pools patch tokens with global average pooling rather than an attention pooling head, making it well suited to image feature extraction in computer vision pipelines.

Model Features

SigLIP 2 Architecture
Builds on the SigLIP 2 training recipe, which improves semantic understanding and localization over the original SigLIP.
Global Average Pooling
Removes the attention pooling head and employs Global Average Pooling (GAP) for feature processing.
High-Resolution Processing
Supports 384×384 resolution input.
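What the GAP head does can be illustrated in plain NumPy. The shapes below are assumptions derived from the card: a 384×384 input with patch size 16 yields (384/16)² = 576 patch tokens, and 1024 is ViT-Large's embedding width:

```python
import numpy as np

# Dummy patch-token output of the transformer backbone: (batch, tokens, dim).
# (384 // 16) ** 2 = 576 patch tokens; 1024 = ViT-Large embedding width.
tokens = np.random.rand(2, 576, 1024)

# Global average pooling: a simple mean over the token axis replaces the
# learned attention pooling head, producing one vector per image.
gap = tokens.mean(axis=1)

print(gap.shape)  # (2, 1024)
```

The trade-off is that GAP has no learned parameters, so the pooled feature weights every patch equally instead of attending to the most informative ones.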

Model Capabilities

Image Feature Extraction
Visual Semantic Understanding
Image Localization

Use Cases

Computer Vision
Image Retrieval
Extracts image features for similar image retrieval.
Vision-Language Tasks
Serves as a visual encoder for multimodal tasks.
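The image-retrieval use case reduces to nearest-neighbor search over L2-normalized feature vectors. A toy sketch with random stand-in embeddings (real ones would come from the encoder above); the 1024-dimensional size is an assumption matching ViT-Large:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in database of 5 image embeddings, L2-normalized so that a dot
# product equals cosine similarity.
db = rng.normal(size=(5, 1024))
db /= np.linalg.norm(db, axis=1, keepdims=True)

# Query: a slightly perturbed copy of database entry 2.
query = db[2] + 0.01 * rng.normal(size=1024)
query /= np.linalg.norm(query)

# Cosine similarity against every database entry; highest score wins.
sims = db @ query
best = int(np.argmax(sims))
print(best)  # → 2
```

At scale, the same dot-product search is typically handed to an approximate nearest-neighbor index rather than a dense matrix product.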