MetaCLIP B32 400M
The MetaCLIP base model (ViT-B/32) is a vision-language model trained on image-text data curated from CommonCrawl to build a shared image-text embedding space.
Downloads: 135.37k
Release Date: 10/7/2023
Model Overview
This model applies the MetaCLIP data-curation method to 400 million CommonCrawl image-text pairs, and supports tasks such as zero-shot image classification and text-based image retrieval.
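A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as `facebook/metaclip-b32-400m` and loads through the standard CLIP classes in `transformers`:

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed Hub ID for this checkpoint; MetaCLIP uses the CLIP architecture.
model_id = "facebook/metaclip-b32-400m"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Example image (two cats) from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score the image against free-form text labels: zero-shot classification.
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # per-label probabilities
print(probs)
```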
Model Features
Large-scale Data Training
Trained on 400 million image-text pairs curated from CommonCrawl, yielding strong generalization
Zero-shot Learning Capability
Capable of performing various vision tasks without task-specific fine-tuning
Shared Embedding Space
Maps images and text into a unified representation space, enabling cross-modal retrieval, as sketched after this list
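A sketch of what the shared space provides: both encoders project into vectors of the same dimensionality, so after L2-normalization a plain dot product is a cross-modal cosine similarity. The Hub ID and example image URL are the same assumptions as in the quickstart above.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "facebook/metaclip-b32-400m"  # assumed Hub ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # Both encoders project into the same embedding space.
    text_emb = model.get_text_features(
        **processor(text=["two cats on a couch"], return_tensors="pt", padding=True))
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))

# L2-normalize so the dot product equals cosine similarity.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(f"text-image cosine similarity: {(text_emb @ image_emb.T).item():.3f}")
```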
Model Capabilities
Zero-shot Image Classification
Text-based Image Retrieval
Image-based Text Retrieval
Cross-modal Representation Learning
Use Cases
Content Retrieval
Image Search Engine
Retrieve relevant images using natural language descriptions
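One way such an engine can work, sketched over a hypothetical local image collection (`image_paths` and the query string are placeholders): embed the collection once, then rank it against each text query's embedding.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "facebook/metaclip-b32-400m"  # assumed Hub ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Hypothetical local image collection.
image_paths = ["dog.jpg", "beach.jpg", "skyline.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Index step: embed all images once, normalize for cosine similarity.
    image_embs = model.get_image_features(**processor(images=images, return_tensors="pt"))
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

    # Query step: embed the natural-language description.
    query_emb = model.get_text_features(
        **processor(text=["a sunset over the ocean"], return_tensors="pt", padding=True))
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

# Rank images by cosine similarity to the query.
scores = (query_emb @ image_embs.T).squeeze(0)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```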
Content Classification
Zero-shot Image Classification
Classify images of new categories without training
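For this use case, the zero-shot image classification pipeline in `transformers` offers a higher-level wrapper; a minimal sketch, again assuming the `facebook/metaclip-b32-400m` Hub ID (`cat.jpg` is a placeholder path):

```python
from transformers import pipeline

# The pipeline bundles the model and processor behind one call.
classifier = pipeline(
    "zero-shot-image-classification",
    model="facebook/metaclip-b32-400m",  # assumed Hub ID
)

# Candidate labels can be arbitrary new categories; no fine-tuning needed.
results = classifier("cat.jpg", candidate_labels=["cat", "dog", "bicycle"])
print(results)  # list of {"label", "score"} dicts, sorted by score
```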