# Multimodal alignment
## CultureCLIP

A vision-language model fine-tuned from CLIP-ViT-B/32, suited to image-text matching tasks.

Task: Text-to-Image · Library: Transformers · Author: lukahh · Downloads: 20 · Likes: 0
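A minimal sketch of the image-text matching workflow such a CLIP fine-tune supports, using the standard CLIP classes in transformers. The `openai/clip-vit-base-patch32` base checkpoint is a stand-in, since the listing does not give the CultureCLIP repo id; the assumption is that the fine-tuned model keeps the stock CLIP interfaces.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint: substitute the CultureCLIP repo id if available.
ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of two cats", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Higher logits mean a stronger image-text match; softmax gives match probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```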
## `resnet50x64_clip_gap.openai`

A CLIP image encoder built on a ResNet50 architecture with 64x width expansion, using a global average pooling (GAP) strategy. A shared usage sketch follows the next entry.

License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: timm · Downloads: 107 · Likes: 0
## `resnet50x16_clip_gap.openai`

A ResNet50x16 variant of the CLIP image encoder, focused on image feature extraction.

License: Apache-2.0 · Task: Image Classification · Library: Transformers · Author: timm · Downloads: 129 · Likes: 0
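Both timm entries follow the same feature-extraction pattern. A minimal sketch, assuming the timm model ids match the listed names (the x16 variant is shown; substituting `resnet50x64_clip_gap.openai` gives the wider encoder):

```python
import requests
import timm
import torch
from PIL import Image

# num_classes=0 strips the classifier head, so the forward pass returns
# pooled (GAP) image embeddings instead of class logits.
model = timm.create_model("resnet50x16_clip_gap.openai", pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing pipeline that matches the pretrained weights.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # shape: (1, embed_dim)
print(features.shape)
```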
## OWL-ViT Tiny Non-Contiguous Weight

OWL-ViT is an open-vocabulary object detection model based on a vision Transformer; it can detect categories that never appear in its training data.

License: MIT · Task: Text-to-Image · Library: Transformers · Author: fxmarty · Downloads: 337 · Likes: 0
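A minimal sketch of open-vocabulary detection with OWL-ViT in transformers. The `google/owlvit-base-patch32` checkpoint is used for illustration, on the assumption that the listed tiny test checkpoint exposes the same interface:

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

ckpt = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(ckpt)
model = OwlViTForObjectDetection.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Free-form text queries: categories need not appear in the training data.
queries = [["a photo of a cat", "a photo of a remote control"]]
inputs = processor(text=queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][int(label)]}: {score:.2f} at {[round(c, 1) for c in box.tolist()]}")
```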
## LanguageBind Audio

LanguageBind is a language-centric multimodal pre-training method: it extends video-language pre-training to N modalities by aligning each modality's semantics to language, yielding strong multimodal understanding and alignment.

License: MIT · Task: Multimodal Alignment · Library: Transformers · Author: LanguageBind · Downloads: 271 · Likes: 3
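To make "language-centric alignment" concrete, here is a conceptual sketch (not the LanguageBind API; all names are illustrative) of the contrastive objective that pulls a non-language modality, such as audio, toward paired text embeddings:

```python
import torch
import torch.nn.functional as F

def alignment_loss(modality_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (modality, text) embeddings."""
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(len(m))     # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random stand-ins for audio and text encoder outputs.
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(alignment_loss(audio_emb, text_emb).item())
```

Training each modality encoder against a shared language space this way is what lets all N modalities be compared to one another through language as the pivot.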