# Multimodal alignment

| Model | License | Description | Tags | Author | Downloads | Likes |
|---|---|---|---|---|---|---|
| Cultureclip | — | Vision-language model fine-tuned from CLIP-ViT-B/32, suited to image-text matching tasks | Text-to-Image, Transformers | lukahh | 20 | 0 |
| Resnet50x64 Clip Gap.openai | Apache-2.0 | CLIP image encoder based on a ResNet-50 architecture with 64x width expansion, using a Global Average Pooling (GAP) strategy | Image Classification, Transformers | timm | 107 | 0 |
| Resnet50x16 Clip Gap.openai | Apache-2.0 | A ResNet50x16 variant within the CLIP framework, focused on image feature extraction | Image Classification, Transformers | timm | 129 | 0 |
| Owlvit Tiny Non Contiguous Weight | MIT | OWL-ViT is a vision-Transformer-based open-vocabulary object detector that can detect categories absent from its training dataset | Text-to-Image, Transformers | fxmarty | 337 | 0 |
| Languagebind Audio | MIT | LanguageBind is a language-centric multimodal pre-training method that extends video-language pre-training to N modalities through language semantic alignment, achieving high-performance multimodal understanding and alignment | Multimodal Alignment, Transformers | LanguageBind | 271 | 3 |
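The CLIP-family models listed above all score image-text pairs the same way: encode each modality into a shared embedding space, then rank candidates by cosine similarity. A minimal sketch of that matching step, using toy NumPy vectors in place of real encoder outputs (the embeddings, dimensions, and temperature value here are illustrative assumptions, not taken from any of these models):

```python
import numpy as np

def match_image_to_texts(image_emb, text_embs, temperature=0.07):
    """CLIP-style image-text matching: L2-normalize both sides,
    take dot products (cosine similarity), and softmax over texts."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    exp = np.exp(logits - logits.max())  # subtract max for stability
    return exp / exp.sum()

# Toy 4-dim embeddings standing in for real encoder outputs.
image = np.array([1.0, 0.0, 0.0, 0.0])
texts = np.array([
    [0.9, 0.1, 0.0, 0.0],  # nearly parallel to the image vector
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the image vector
])
probs = match_image_to_texts(image, texts)
best = int(probs.argmax())  # -> 0, the caption closest to the image
```

With a real checkpoint the embeddings would come from the model's image and text encoders, but the ranking logic is the same.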