ViT So400m Patch16 SigLIP GAP 512.v2 WebLI

Developed by timm
A SigLIP 2 ViT image encoder that uses global average pooling, suitable for image feature extraction and vision-language tasks.
Downloads 21
Release Time: 2/21/2025

Model Overview

This model is a SigLIP 2 ViT image encoder packaged for timm. The original attention pooling head has been removed and replaced with global average pooling. It is primarily used for image feature extraction and as the image encoder in vision-language tasks.
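
As a rough illustration of feature extraction, the sketch below loads the encoder through timm and returns one pooled feature vector per image. It is a minimal sketch under assumptions, not an official recipe: the identifier vit_so400m_patch16_siglip_gap_512.v2_webli is inferred from the page title, the helper name extract_features and its image_path argument are hypothetical, and the weights are assumed to be downloadable with pretrained=True. It needs timm, torch, and Pillow installed.

```python
# Model identifier inferred from the page title (an assumption, not confirmed by this page).
MODEL_NAME = "vit_so400m_patch16_siglip_gap_512.v2_webli"


def extract_features(image_path, model_name=MODEL_NAME):
    """Return one pooled feature vector (a 1-D tensor) for an image file."""
    # Heavy dependencies are imported lazily so this module imports without them.
    import timm
    import torch
    from PIL import Image

    # num_classes=0 asks timm for pooled features with no classifier head.
    model = timm.create_model(model_name, pretrained=True, num_classes=0)
    model.eval()

    # Build the preprocessing pipeline from the model's pretrained config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0)  # shape: (1, 3, H, W)

    with torch.no_grad():
        features = model(batch)  # shape: (1, feature_dim)
    return features[0]
```

Calling extract_features("photo.jpg") (a hypothetical file name) downloads the weights on first use, so in practice the model and transform would be created once and reused across images.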

Model Features

SigLIP 2 Architecture
Uses the SigLIP 2 architecture, which improves semantic understanding and localization relative to the original SigLIP.
Global Average Pooling
The attention pooling head is removed and replaced by global average pooling.
Large-scale Pretraining
Pretrained on the WebLI dataset, offering robust image feature extraction capabilities.
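
To make the pooling step concrete, here is a toy sketch of global average pooling: every patch token output by the encoder is averaged, channel by channel, into a single vector. The function name global_average_pool and the tiny 4-token, 3-channel input are illustrative assumptions; the real model does this on tensors with far more tokens and channels.

```python
def global_average_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all tokens must have the same length")
    n = len(tokens)
    # Mean over the token axis, computed separately for each channel.
    return [sum(t[i] for t in tokens) / n for i in range(dim)]


# Toy input: 4 patch tokens with 3 channels each (illustrative values only).
patch_tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 2.0, 2.0],
]
pooled = global_average_pool(patch_tokens)
print(pooled)  # [2.0, 2.0, 2.0]
```

Unlike an attention pooling head, this step has no learned parameters: every token contributes equally to the result.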

Model Capabilities

Image Feature Extraction
Vision-Language Task Processing

Use Cases

Computer Vision
Image Classification
Can be used for image classification tasks by extracting image features for categorization.
Vision-Language Tasks
Suitable as the image encoder in vision-language tasks such as image captioning.
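
For the image classification use case, one simple way to turn extracted features into labels is a nearest-centroid classifier: average the features of a few labeled examples per class, then assign a new image to the class whose centroid is most similar. The sketch below is an assumption-laden illustration, not a method described on this page; the function names, the 2-D feature vectors, and the "cat"/"dog" labels are all hypothetical stand-ins for real encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def classify(feature, centroids):
    """Return the label whose centroid is most similar to the feature."""
    return max(centroids, key=lambda label: cosine_similarity(feature, centroids[label]))


# Hypothetical 2-D features standing in for real encoder outputs.
labeled = {
    "cat": [[0.9, 0.1], [0.8, 0.2]],
    "dog": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vecs) for label, vecs in labeled.items()}
print(classify([0.7, 0.3], centroids))  # cat
```

A trained linear classifier on the same features is a common alternative when more labeled data is available.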