MetaCLIP B16 400M
MetaCLIP is a vision-language model trained on CommonCrawl data that learns a shared image-text embedding space.
Release Time: 10/9/2023
Model Overview
This model applies the MetaCLIP curation method to 400 million image-text pairs drawn from CommonCrawl, aiming to demystify how CLIP's training data was selected, and supports cross-modal understanding between images and text.
Model Features
Public data training
Trained on the openly available CommonCrawl dataset, offering high data transparency.
Cross-modal understanding
Processes both visual and textual information to establish a shared embedding space.
Zero-shot learning
Capable of performing new tasks without task-specific training
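The zero-shot behavior can be exercised directly. Below is a minimal sketch using the Hugging Face transformers CLIP classes; the repo id facebook/metaclip-b16-400m and the file photo.jpg are assumptions to adapt to your setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Repo id is an assumption; point it at the actual MetaCLIP B16 400M checkpoint.
model = CLIPModel.from_pretrained("facebook/metaclip-b16-400m")
processor = CLIPProcessor.from_pretrained("facebook/metaclip-b16-400m")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds temperature-scaled image-text similarities;
# softmax turns them into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

No cat/dog/car classifier was trained; the label set can be swapped freely, which is what makes the model zero-shot.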
Model Capabilities
Zero-shot image classification
Text-based image retrieval
Image-based text retrieval
Cross-modal feature extraction
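For retrieval pipelines, the two encoders can be called separately to extract embeddings in the shared space. A sketch, under the same checkpoint and file-path assumptions as above:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("facebook/metaclip-b16-400m")  # repo id assumed
processor = CLIPProcessor.from_pretrained("facebook/metaclip-b16-400m")

with torch.no_grad():
    image_emb = model.get_image_features(
        **processor(images=Image.open("photo.jpg"), return_tensors="pt")
    )
    text_emb = model.get_text_features(
        **processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
    )

# L2-normalize so the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())
```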
Use Cases
Content retrieval
Image search engine
Retrieve relevant images using natural language descriptions
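A hypothetical search loop might precompute gallery embeddings once and rank them against each incoming text query; the helper functions below are illustrative, not part of the model card.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("facebook/metaclip-b16-400m")  # repo id assumed
processor = CLIPProcessor.from_pretrained("facebook/metaclip-b16-400m")

def embed_gallery(images):
    """Embed and L2-normalize a list of PIL images once, ahead of any queries."""
    with torch.no_grad():
        emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return emb / emb.norm(dim=-1, keepdim=True)

def search(query, gallery_embs, top_k=5):
    """Rank gallery images by cosine similarity to a natural-language query."""
    with torch.no_grad():
        q = model.get_text_features(
            **processor(text=[query], return_tensors="pt", padding=True)
        )
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ gallery_embs.T).squeeze(0)
    return scores.topk(min(top_k, scores.numel()))  # (values, indices)
```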
Intelligent labeling
Automatic image tagging
Generate descriptive labels for unlabeled images
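One simple way to realize this is zero-shot scoring against a fixed tag vocabulary, keeping tags whose probability clears a threshold. The vocabulary, threshold, and file path below are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("facebook/metaclip-b16-400m")  # repo id assumed
processor = CLIPProcessor.from_pretrained("facebook/metaclip-b16-400m")

CANDIDATE_TAGS = ["beach", "sunset", "people", "dog", "mountains", "city"]

def tag_image(image, threshold=0.15):
    """Return candidate tags whose softmax probability clears the threshold.
    Softmax makes tags compete, so this is a heuristic, not true multi-label scoring."""
    prompts = [f"a photo of {t}" for t in CANDIDATE_TAGS]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return [t for t, p in zip(CANDIDATE_TAGS, probs.tolist()) if p >= threshold]

print(tag_image(Image.open("photo.jpg")))  # placeholder path
```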