# Multimodal alignment
## CultureCLIP

A vision-language model fine-tuned from CLIP-ViT-B/32, suited to image-text matching tasks.

Task: Text-to-Image · Library: Transformers · Author: lukahh · Downloads: 20 · Likes: 0
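A minimal sketch of the image-text matching workflow such a CLIP fine-tune supports, using the standard CLIP classes in transformers. The `openai/clip-vit-base-patch32` base checkpoint is a stand-in, since the listing does not give the CultureCLIP repo id; the assumption is that the fine-tuned model keeps the stock CLIP interfaces.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint: substitute the CultureCLIP repo id if available.
ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of two cats", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Higher logits mean a stronger image-text match; softmax gives match probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```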
## `resnet50x64_clip_gap.openai`

A CLIP image encoder built on a ResNet50 architecture with 64x width expansion, using a global average pooling (GAP) strategy. A shared usage sketch follows the next entry.

License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: timm · Downloads: 107 · Likes: 0
## `resnet50x16_clip_gap.openai`

A ResNet50x16 variant of the CLIP image encoder, focused on image feature extraction.

License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: timm · Downloads: 129 · Likes: 0
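Both timm entries follow the same feature-extraction pattern. A minimal sketch, assuming the timm model ids match the listed names (the x16 variant is shown; substituting `resnet50x64_clip_gap.openai` gives the wider encoder):

```python
import requests
import timm
import torch
from PIL import Image

# num_classes=0 strips the classifier head, so the forward pass returns
# pooled (GAP) image embeddings instead of class logits.
model = timm.create_model("resnet50x16_clip_gap.openai", pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing pipeline that matches the pretrained weights.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # shape: (1, embed_dim)
print(features.shape)
```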
## OWL-ViT Tiny Non-Contiguous Weight

OWL-ViT is an open-vocabulary object detection model based on a vision Transformer; it can detect categories that never appear in its training data.

License: MIT · Task: Text-to-Image · Library: Transformers · Author: fxmarty · Downloads: 337 · Likes: 0
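A minimal sketch of open-vocabulary detection with OWL-ViT in transformers. The `google/owlvit-base-patch32` checkpoint is used for illustration, on the assumption that the listed tiny test checkpoint exposes the same interface:

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

ckpt = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(ckpt)
model = OwlViTForObjectDetection.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Free-form text queries: categories need not appear in the training data.
queries = [["a photo of a cat", "a photo of a remote control"]]
inputs = processor(text=queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][int(label)]}: {score:.2f} at {[round(c, 1) for c in box.tolist()]}")
```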
## LanguageBind Audio

LanguageBind is a language-centric multimodal pre-training method: it extends video-language pre-training to N modalities by aligning each modality's semantics to language, yielding strong multimodal understanding and alignment.

License: MIT · Task: Multimodal Alignment · Library: Transformers · Author: LanguageBind · Downloads: 271 · Likes: 3
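To make "language-centric alignment" concrete, here is a conceptual sketch (not the LanguageBind API; all names are illustrative) of the contrastive objective that pulls a non-language modality, such as audio, toward paired text embeddings:

```python
import torch
import torch.nn.functional as F

def alignment_loss(modality_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (modality, text) embeddings."""
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(len(m))     # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random stand-ins for audio and text encoder outputs.
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(alignment_loss(audio_emb, text_emb).item())
```

Training each modality encoder against a shared language space this way is what lets all N modalities be compared to one another through language as the pivot.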