🚀 AltCLIP
A simple and efficient method to train a better bilingual CLIP model
🚀 Quick Start
AltCLIP is a bilingual CLIP model that provides the text-image backbone for the AltDiffusion model. You can get the code from the FlagAI AltCLIP repository and follow the instructions in its README to get started.
✨ Features
- Bilingual Support: AltCLIP supports both Chinese and English, making it suitable for a wide range of applications.
- High Performance: The model shows excellent performance in text-image retrieval tasks, outperforming many existing models.
- Open Source: The model code is open source on FlagAI, and the weights are available on modelhub.
📋 Model Information
| Property | Details |
|---|---|
| Name | AltCLIP |
| Task | Text-image representation |
| Language(s) | Chinese & English |
| Model | CLIP |
| Github | FlagAI |
📚 Documentation
Brief Introduction
We propose a simple and efficient method to train a better bilingual CLIP model, named AltCLIP. AltCLIP is trained by altering the text encoder of CLIP, and the training data comes from the WuDao dataset and LAION.
The AltCLIP model can support the AltDiffusion model in this project. For more information about the AltDiffusion model, please refer to this tutorial.
Training
There are two phases of training.
In the parallel knowledge distillation phase, we use only parallel corpus texts for distillation (a parallel corpus is easier to obtain and larger in quantity than image-text pairs). In the bilingual contrastive learning phase, we use a small number of Chinese-English image-text pairs (about 2 million in total) to train our text encoder to better fit the image encoder.
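To make the first phase concrete, here is a minimal PyTorch sketch of a parallel knowledge-distillation objective. It is illustrative only: `teacher_text_encoder`, `student_text_encoder`, and `proj` are hypothetical placeholders, not the project's actual modules. The student's embeddings of both sides of a parallel sentence pair are pulled toward the frozen teacher's embedding of the English side.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_text_encoder, student_text_encoder, proj,
                      english_batch, chinese_batch):
    """Sketch of parallel knowledge distillation on a (English, Chinese) text batch."""
    with torch.no_grad():
        # The frozen teacher (a CLIP text encoder) sees only the English side.
        target = teacher_text_encoder(english_batch)          # (B, d_teacher)
    # The bilingual student encodes both languages; a linear projection maps
    # its hidden size onto the teacher's embedding dimension.
    student_en = proj(student_text_encoder(english_batch))    # (B, d_teacher)
    student_zh = proj(student_text_encoder(chinese_batch))    # (B, d_teacher)
    # Mean-squared error pulls both student embeddings toward the teacher's.
    return F.mse_loss(student_en, target) + F.mse_loss(student_zh, target)
```

Because this phase needs only text, it can use far more data than paired image-text training.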
Performance
The following table shows the performance of AltCLIP and other models on text-to-image and image-to-text retrieval tasks (R@k is recall at k; MR is the mean of the six recall values):
| Language | Method | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | MR |
|---|---|---|---|---|---|---|---|---|
| English | CLIP | 65.0 | 87.1 | 92.2 | 85.1 | 97.3 | 99.2 | 87.6 |
| English | Taiyi | 25.3 | 48.2 | 59.2 | 39.3 | 68.1 | 79.6 | 53.3 |
| English | Wukong | - | - | - | - | - | - | - |
| English | R2D2 | - | - | - | - | - | - | - |
| English | CN-CLIP | 49.5 | 76.9 | 83.8 | 66.5 | 91.2 | 96.0 | 77.3 |
| English | AltCLIP | 66.3 | 87.8 | 92.7 | 85.9 | 97.7 | 99.1 | 88.3 |
| English | AltCLIP∗ | 72.5 | 91.6 | 95.4 | 86.0 | 98.0 | 99.1 | 90.4 |
| Chinese | CLIP | 0.0 | 2.4 | 4.0 | 2.3 | 8.1 | 12.6 | 5.0 |
| Chinese | Taiyi | 53.7 | 79.8 | 86.6 | 63.8 | 90.5 | 95.9 | 78.4 |
| Chinese | Wukong | 51.7 | 78.9 | 86.3 | 76.1 | 94.8 | 97.5 | 80.9 |
| Chinese | R2D2 | 60.9 | 86.8 | 92.7 | 77.6 | 96.7 | 98.9 | 85.6 |
| Chinese | CN-CLIP | 68.0 | 89.7 | 94.4 | 80.2 | 96.6 | 98.2 | 87.9 |
| Chinese | AltCLIP | 63.7 | 86.3 | 92.1 | 84.7 | 97.4 | 98.7 | 87.2 |
| Chinese | AltCLIP∗ | 69.8 | 89.9 | 94.7 | 84.8 | 97.4 | 98.8 | 89.2 |

Visualization effects
Based on AltCLIP, we have also developed the AltDiffusion model, visualized as follows:

💻 Usage Examples
Basic Usage
```python
from PIL import Image
import requests

# modeling_altclip.py and processing_altclip.py ship with the model repository.
from modeling_altclip import AltCLIP
from processing_altclip import AltCLIPProcessor

# Load the pretrained model and its processor.
model = AltCLIP.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

# Download an example image and prepare paired image-text inputs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # label probabilities per image
```
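Since the text encoder is bilingual, the same pipeline accepts Chinese prompts directly. Below is a small, hedged variation on the example above, reusing the `model`, `processor`, and `image` objects already created there:

```python
# Chinese prompts work with the same processor and model.
inputs = processor(text=["一张猫的照片", "一张狗的照片"],  # "a photo of a cat/dog"
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # label probabilities per image
print(probs)
```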
🔧 Technical Details
Training Phases
The training of AltCLIP consists of two phases:
- Parallel Knowledge Distillation Phase: We use parallel corpus texts for distillation. A parallel corpus is easier to obtain and larger in quantity than image-text pairs.
- Bilingual Contrastive Learning Phase: We use a small number (about 2 million) of Chinese-English image-text pairs to train the text encoder to better match the image encoder (a sketch of this objective is shown below).
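A minimal sketch of the second phase, assuming a standard CLIP-style symmetric contrastive (InfoNCE) objective; the encoder objects and `logit_scale` are hypothetical placeholders, not AltCLIP's actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_encoder, text_encoder, images, texts, logit_scale):
    """Symmetric InfoNCE over a batch of (image, Chinese-or-English caption) pairs."""
    img = F.normalize(image_encoder(images), dim=-1)    # (B, d)
    txt = F.normalize(text_encoder(texts), dim=-1)      # (B, d)
    logits = logit_scale * img @ txt.t()                # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)
    # Matching image/text pairs sit on the diagonal; optimize both directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```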
📄 License
This model is licensed under the CreativeML OpenRAIL-M license. By accessing the repository, you agree to the terms of the license and that your contact information (email address and username) may be shared with the model authors.
📚 Citation
If you find this work helpful, please consider citing the following paper:
```bibtex
@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi       = {10.48550/ARXIV.2211.06679},
  url       = {https://arxiv.org/abs/2211.06679},
  author    = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords  = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title     = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```