
ViT Base Patch16 SigLIP 384 (webli)

Developed by timm
A Vision Transformer image encoder from SigLIP, containing only the image-tower portion and using the original attention pooling head
Downloads 64
Release Time: 12/24/2024

Model Overview

This is a Vision Transformer trained with the SigLIP recipe, designed specifically for image feature extraction. The model uses a 384x384 input resolution with a 16x16 patch size, making it suitable for a wide range of computer vision tasks.

Model Features

SigLIP Architecture
A Vision Transformer trained with the SigLIP objective, dedicated to image encoding
Original Attention Pooling
Uses the original attention pooling head (rather than global average pooling) to aggregate patch tokens, preserving more image feature information
High-Resolution Processing
Accepts 384x384 inputs, capturing finer detail than lower-resolution variants
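The resolution and patch size together fix the number of tokens the transformer processes; a quick sanity check of that arithmetic:

```python
# Patch-grid arithmetic for a 384x384 input with 16x16 patches.
image_size = 384
patch_size = 16

grid = image_size // patch_size   # patches per side
num_patches = grid * grid         # tokens fed to the transformer

print(grid, num_patches)  # 24 576
```

So each image becomes a 24x24 grid of 576 patch tokens, roughly three times as many as the 196 tokens of a 224x224 variant at the same patch size, which is what makes the higher resolution useful for detailed analysis.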

Model Capabilities

Image Feature Extraction
Visual Representation Learning
Image Classification
Image Retrieval

Use Cases

Computer Vision
Image Classification
Serves as a feature-extraction backbone for image classification tasks
Image Retrieval
Extracted image features can be used for similar-image retrieval
Visual Representation Learning
Serves as a pretrained backbone for downstream vision tasks
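For the retrieval use case, the embeddings this encoder produces are typically L2-normalized and compared by cosine similarity. A minimal sketch with hypothetical precomputed embeddings (random stand-ins for real encoder outputs):

```python
import numpy as np

# Hypothetical precomputed embeddings: 5 database images plus one query,
# each a 768-dim vector as the encoder would produce.
rng = np.random.default_rng(0)
db = rng.standard_normal((5, 768))
query = db[2] + 0.01 * rng.standard_normal(768)  # near-duplicate of image 2

# L2-normalize so cosine similarity reduces to a dot product.
db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
q_n = query / np.linalg.norm(query)

scores = db_n @ q_n          # cosine similarity against every database image
best = int(np.argmax(scores))
print(best)  # 2 -- the near-duplicate ranks first
```

At larger scale the same dot-product search is usually delegated to an approximate nearest-neighbor index rather than a dense matrix product.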