M BERT Base ViT B
A multilingual CLIP text encoder fine-tuned from BERT-base-multilingual, aligning text in 69 languages with the CLIP ViT-B/32 visual encoder
Downloads 3,376
Release Time: 3/2/2022
Model Overview
This model fine-tunes BERT-base-multilingual so that text in 69 languages is embedded into the space of the CLIP text encoder paired with the ViT-B/32 visual encoder, enabling multilingual vision-language understanding.
Model Features
Multilingual Support
Embeds text in 69 languages into the shared CLIP visual-semantic space
Cross-modal Alignment
Maps multilingual BERT embeddings into the CLIP ViT-B/32 encoder's shared space via a linear projection (see the sketch after this list)
Translation Data Augmentation
Uses machine-translated captions from a combined GCC + MSCOCO + VizWiz dataset to build multilingual training sets
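The cross-modal alignment feature above amounts to a small head on top of a multilingual BERT encoder. Below is a minimal PyTorch sketch of that architecture, not the released implementation: the checkpoint name `bert-base-multilingual-cased`, the mean-pooling strategy, and the class name are assumptions; the 512-dimensional target matches the embedding size of CLIP ViT-B/32.

```python
import torch
from transformers import AutoModel, AutoTokenizer


class MultilingualClipTextEncoder(torch.nn.Module):
    """Multilingual BERT encoder followed by a linear projection into the
    512-d embedding space shared with CLIP ViT-B/32 (illustrative sketch)."""

    def __init__(self, bert_name: str = "bert-base-multilingual-cased", clip_dim: int = 512):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(bert_name)
        self.bert = AutoModel.from_pretrained(bert_name)  # 768-d hidden states
        self.projection = torch.nn.Linear(self.bert.config.hidden_size, clip_dim)

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = self.bert(**batch).last_hidden_state         # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pool over valid tokens
        return self.projection(pooled)                         # (B, 512) in CLIP space
```

Under the translation-based training described above, such an encoder would be fitted so that a translated caption's embedding lands close to the CLIP embedding of the original caption.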
Model Capabilities
Multilingual Text Embedding
Cross-modal Retrieval
Image-Text Matching
Multilingual Visual Semantic Understanding
Use Cases
Cross-modal Retrieval
Multilingual Image Search
Retrieve relevant images using queries in different languages
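A minimal retrieval sketch under assumptions: text embeddings come from an encoder like the one sketched earlier (or from the released checkpoint), and image embeddings come from the openai/clip-vit-base-patch32 image tower via Hugging Face transformers; the function names and ranking logic are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP ViT-B/32 supplies the image side of the shared embedding space.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def embed_images(paths: list[str]) -> torch.Tensor:
    """Encode a list of image files into normalized 512-d CLIP embeddings."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)  # (N, 512)
    return torch.nn.functional.normalize(feats, dim=-1)


def search(text_embedding: torch.Tensor, image_embeddings: torch.Tensor, top_k: int = 5):
    """Rank images by cosine similarity with a (1, 512) multilingual text embedding."""
    query = torch.nn.functional.normalize(text_embedding, dim=-1)
    scores = image_embeddings @ query.squeeze(0)
    return scores.topk(min(top_k, scores.numel()))
```

Because every query lands in the same 512-d space, a single index of image embeddings can be searched with text in any of the supported languages.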
Multilingual Content Understanding
Multilingual Image Captioning
Generate descriptive texts for images in multiple languages