🚀 Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese
The first open-source Chinese CLIP model: the RoBERTa-large text encoder is pre-trained on 123 million image-text pairs.
🚀 Quick Start
Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese is a Chinese CLIP model for image and text feature extraction. It follows the experimental setup of CLIP to obtain strong visual-language representations.
✨ Features
- Open Source: It is the first open-source Chinese CLIP model.
- Pre-training: Pre-trained on 123 million image-text pairs.
- Model Architecture: Uses RoBERTa-large as the text encoder and ViT-H-14 as the vision encoder.
📦 Installation
The model card does not prescribe specific installation steps.
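A minimal environment for the usage example below can likely be set up with `pip install torch transformers open_clip_torch pillow requests numpy` (package names are inferred from the example's imports; versions are not pinned by the card).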
💻 Usage Examples
Basic Usage
```python
from PIL import Image
import requests
import open_clip
import torch
from transformers import BertModel, BertTokenizer
import numpy as np

query_texts = ["一只猫", "一只狗", "两只猫", "两只老虎", "一只老虎"]  # Input texts; replace with any Chinese captions
# Load the Taiyi Chinese text encoder
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese")
text_encoder = BertModel.from_pretrained("IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese").eval()
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # Replace with the URL of any image
# Load the open_clip image encoder
clip_model, _, processor = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
clip_model = clip_model.eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']
image = processor(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)
with torch.no_grad():
    image_features = clip_model.encode_image(image)
    text_features = text_encoder(text)[1]  # pooled output of the RoBERTa text encoder
    # Normalize the features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Compute cosine similarity; logit_scale is the scale coefficient
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(np.around(probs, 3))
```
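The printed array is, for the single input image, a softmax distribution over the five candidate Chinese captions; the caption with the highest probability is the model's zero-shot prediction.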
📚 Documentation
Model Taxonomy
| Property | Details |
|----------|---------|
| Demand | Special |
| Task | Multimodal |
| Series | Taiyi |
| Model | CLIP (RoBERTa) |
| Parameter | 326M |
| Extra | Uses ViT-H as the visual extractor; Chinese (ViT-H-Chinese) |
Model Information
We follow the experimental setup of CLIP to obtain powerful visual-language representations. To build a CLIP for Chinese, we employ chinese-roberta-wwm-large as the language encoder and the ViT-H-14 from open_clip as the vision encoder. We freeze the vision encoder and tune only the language encoder to speed up and stabilize pre-training. We use the Noah-Wukong dataset (100M) and the Zero dataset (23M) as pre-training data. The model was trained for 24 epochs on Wukong and Zero, which took 8 days on 32 A100 GPUs. To the best of our knowledge, our Taiyi-CLIP is currently the only open-source Chinese CLIP in the Hugging Face community.
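A minimal sketch of the pre-training step described above, assuming the standard CLIP symmetric contrastive (InfoNCE) loss with the vision tower frozen; this is not the released training code, and the embedding dimension, batch size, and learning rate below are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Symmetric InfoNCE loss over the in-batch image-text pairs.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    labels = torch.arange(image_features.size(0), device=image_features.device)
    return (F.cross_entropy(logits_per_image, labels)
            + F.cross_entropy(logits_per_image.t(), labels)) / 2

# Dummy batch to illustrate shapes: in pre-training, `img` comes from the frozen
# ViT-H-14 tower and `txt` from the trainable RoBERTa-large tower (both project
# to the same embedding dimension; 1024 and the batch size of 8 are assumptions).
img, txt = torch.randn(8, 1024), torch.randn(8, 1024)
loss = clip_contrastive_loss(img, txt, logit_scale=100.0)

# Freezing the vision tower and tuning only the text tower would look like:
# for p in clip_model.parameters(): p.requires_grad_(False)
# optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)  # lr is illustrative
```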
Performance
Zero-Shot Classification
| Model | Dataset | Top1 | Top5 |
|-------|---------|------|------|
| Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese | ImageNet1k-CN | 54.35% | 80.64% |
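Zero-shot classification with a CLIP-style model is typically done by embedding each class name in a Chinese prompt template and picking the class whose text embedding is closest to the image embedding. The sketch below reuses the objects loaded in the usage example; the template and class names are illustrative and not the exact protocol behind the ImageNet1k-CN numbers above:

```python
import torch

# Illustrative Chinese class names and a single prompt template (assumptions).
class_names = ["猫", "狗", "老虎"]
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a {name}"

with torch.no_grad():
    ids = text_tokenizer(prompts, return_tensors='pt', padding=True)['input_ids']
    class_embs = text_encoder(ids)[1]
    class_embs = class_embs / class_embs.norm(dim=1, keepdim=True)
    img_emb = clip_model.encode_image(image)          # `image` from the usage example
    img_emb = img_emb / img_emb.norm(dim=1, keepdim=True)
    pred = (img_emb @ class_embs.t()).argmax(dim=-1)  # index of the best-matching class
    print(class_names[pred.item()])
```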
Zero-Shot Text-to-Image Retrieval
| Model | Dataset | Top1 | Top5 | Top10 |
|-------|---------|------|------|-------|
| Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese | Flickr30k-CNA-test | 60.82% | 85.00% | 91.04% |
| Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese | COCO-CN-test | 60.02% | 83.95% | 93.26% |
| Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese | wukong50k | 66.85% | 92.81% | 96.69% |
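Zero-shot text-to-image retrieval amounts to ranking image embeddings by cosine similarity with the embedding of a Chinese query. Below is a minimal sketch of that procedure, reusing the clip_model, processor, text_encoder, and text_tokenizer objects loaded in the usage example above; the image gallery is a hypothetical list of PIL images:

```python
import torch

def retrieve(query_text, gallery_images, k=5):
    # Rank a gallery of images against a single Chinese text query.
    with torch.no_grad():
        ids = text_tokenizer([query_text], return_tensors='pt', padding=True)['input_ids']
        q = text_encoder(ids)[1]                      # pooled RoBERTa text feature
        q = q / q.norm(dim=1, keepdim=True)
        batch = torch.stack([processor(im) for im in gallery_images])
        g = clip_model.encode_image(batch)            # frozen ViT-H-14 image features
        g = g / g.norm(dim=1, keepdim=True)
        scores = (q @ g.t()).squeeze(0)               # cosine similarities
    return scores.topk(min(k, len(gallery_images)))   # (values, indices) of best matches
```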
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citation
If you are using the resource for your work, please cite our paper:
```bibtex
@article{fengshenbang,
  author  = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title   = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal = {CoRR},
  volume  = {abs/2209.02970},
  year    = {2022}
}
```
You can also cite our website:
```bibtex
@misc{Fengshenbang-LM,
  title        = {Fengshenbang-LM},
  author       = {IDEA-CCNL},
  year         = {2021},
  howpublished = {\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```