
Chinese CLIP ViT-Large Patch14 336px

Developed by OFA-Sys
Chinese CLIP is a simple implementation of CLIP trained on approximately 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
Downloads: 713
Release Time: 11/9/2022

Model Overview

A large-scale Chinese vision-language pre-training model that supports tasks such as image-text similarity calculation, cross-modal retrieval, and zero-shot image classification.
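A minimal usage sketch is shown below: it loads the model through Hugging Face Transformers and scores one image against several Chinese captions. The checkpoint id OFA-Sys/chinese-clip-vit-large-patch14-336px, the image file name, and the candidate captions are assumptions for illustration.

```python
# Minimal image-text similarity sketch, assuming the checkpoint is published
# on Hugging Face as "OFA-Sys/chinese-clip-vit-large-patch14-336px".
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")               # any local image (placeholder name)
texts = ["一只猫", "一只狗", "一辆自行车"]         # illustrative candidate captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarity between the image and each
# caption; softmax turns the scores into a probability distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```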

Model Features

Large-scale Chinese Pre-training
Trained on approximately 200 million Chinese image-text pairs, giving it strong understanding of Chinese-language scenarios.
High-performance Cross-modal Retrieval
Achieves state-of-the-art performance on Chinese cross-modal benchmarks such as MUGE and Flickr30K-CN.
Zero-shot Transfer Capability
Supports zero-shot image classification and cross-modal retrieval without task-specific fine-tuning.
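As a sketch of the zero-shot transfer described above, the example below classifies an image by scoring it against prompt-wrapped label texts. The label set, prompt template, and file name are hypothetical; the checkpoint id is the same assumption as before.

```python
# Zero-shot image classification sketch with prompt-wrapped Chinese labels.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

labels = ["猫", "狗", "汽车", "飞机"]                  # hypothetical label set
prompts = [f"一张{label}的照片" for label in labels]   # simple prompt template

image = Image.open("example.jpg")                      # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image         # shape: (1, num_labels)

# Pick the label whose prompt scores highest against the image.
pred = labels[logits.argmax(dim=-1).item()]
print(f"predicted label: {pred}")
```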

Model Capabilities

Image-text similarity calculation
Text-to-image retrieval (see the retrieval sketch after this list)
Image-to-text retrieval
Zero-shot image classification
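For the retrieval capabilities above, a common pattern is to embed texts and images separately, L2-normalize the embeddings, and rank by cosine similarity. The sketch below assumes a small hypothetical image gallery and text query; get_image_features and get_text_features are the feature-extraction methods of the Transformers Chinese-CLIP model class.

```python
# Text-to-image retrieval sketch: embed query and gallery separately, then rank.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg"]]  # hypothetical gallery
query = "红色的连衣裙"                                          # hypothetical text query

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_feats = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    txt_feats = model.get_text_features(**txt_inputs)

# Normalize so the dot product equals cosine similarity, then rank the gallery.
img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
scores = txt_feats @ img_feats.T            # shape: (1, num_images)
ranking = scores.argsort(dim=-1, descending=True)
print(ranking)
```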

Use Cases

E-commerce
Product Image-Text Matching
Automatically matches product images with their descriptive text.
Improves product search accuracy.
Content Moderation
Inappropriate Content Detection
Detects inconsistent or inappropriate image-text content.
Enhances moderation efficiency