# Multimodal Image Encoding

Vit So400m Patch14 Siglip Gap 384.webli

Vision Transformer model based on SigLIP that uses global average pooling to produce image features.

Image Classification
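The "global average pooling" in the description above can be illustrated with a minimal sketch. It is not the model's actual implementation: the helper names `global_average_pool` and `patch_grid` are hypothetical, the token values are made up, and the patch-count arithmetic assumes a non-overlapping patch embedding that drops leftover pixels, read off the "Patch14" and "384" parts of the model name.

```python
def global_average_pool(tokens):
    """Average a list of per-patch feature vectors into one image vector."""
    if not tokens:
        raise ValueError("expected at least one patch token")
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def patch_grid(image_size, patch_size):
    """Patches per side, assuming non-overlapping patches (remainder pixels dropped)."""
    return image_size // patch_size


# Toy input: 4 patch tokens with 3 features each (illustrative numbers only).
toy_tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [0.0, 4.0, 2.0],
    [4.0, 0.0, 2.0],
]
pooled = global_average_pool(toy_tokens)  # one vector for the whole image

# Patch size 14 at 384x384 input, as the model name suggests.
side = patch_grid(384, 14)
num_patches = side * side
```

Under these assumptions the pooled vector has the same width as a single patch token, regardless of how many patches the image was split into.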

Vit Giant Patch14 Clip 224.laion2b

Vision Transformer model based on the CLIP architecture, trained on the LAION-2B dataset and designed for image feature extraction.

Image Classification

Vit Base Patch16 Clip 224.laion2b

Vision Transformer model based on the CLIP architecture. It contains only the image encoder, which makes it suitable for image feature extraction tasks.

Image Classification

Convnext Large Mlp.clip Laion2b Ft Soup 320

ConvNeXt-Large image encoder based on the CLIP architecture, fine-tuned on the LAION-2B dataset. It supports image feature extraction at 320x320 resolution.

Image Classification

Convnext Large Mlp.clip Laion2b Augreg

ConvNeXt-Large image encoder based on the CLIP framework, trained on the LAION-2B dataset. It supports visual feature extraction.

Image Classification


© 2025 AIbase