Chinese-CLIP-ViT-Huge-Patch14
This is the huge version of the Chinese CLIP model, which uses ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs.
Quick Start
Features
- This is the huge version of Chinese CLIP.
- Uses ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder (see the config sketch below).
- Pretrained on a large-scale dataset of around 200 million Chinese image-text pairs.
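A quick way to confirm the two encoders from Python is to inspect the published model config (a minimal sketch; it fetches only the config, not the weights, and the values noted in the comments are what one would expect for ViT-H/14 and RoBERTa-wwm-large):

from transformers import ChineseCLIPConfig

config = ChineseCLIPConfig.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
print(config.vision_config.patch_size)   # patch size of the ViT image encoder (expected: 14)
print(config.vision_config.hidden_size)  # width of the ViT-H image encoder
print(config.text_config.hidden_size)    # width of the RoBERTa-wwm-large text encoder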
Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Candidate captions: Squirtle, Bulbasaur, Charmander, Pikachu (in Chinese)
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image features
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # probabilities over the candidate captions
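The softmax is taken over the candidate captions, so the largest probability gives the zero-shot prediction. A minimal follow-on, continuing from the variables above:

best = probs.argmax(dim=1).item()          # index of the best-matching caption
print(texts[best], probs[0, best].item())  # predicted label and its probability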
Documentation
This is the huge version of Chinese CLIP, with ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repo [https://github.com/OFA-Sys/Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP) (Welcome to star!)
If you are not satisfied with only using the API, feel free to check our GitHub repo [https://github.com/OFA-Sys/Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP) for more details about training and inference.
Results
MUGE Text-to-Image Retrieval

| Setup | Zero-shot (R@1) | Zero-shot (R@5) | Zero-shot (R@10) | Zero-shot (MR) | Finetune (R@1) | Finetune (R@5) | Finetune (R@10) | Finetune (MR) |
|---|---|---|---|---|---|---|---|---|
| Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
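R@K is the fraction of queries whose ground-truth item appears among the top-K retrieved candidates, and MR is the mean of R@1, R@5, and R@10. A minimal sketch of that computation from a query-by-candidate similarity matrix (illustrative only; the helper and toy data below are assumptions, not the benchmark's official evaluation code):

import torch

def recall_at_k(similarity: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    # similarity: [num_queries, num_candidates]; targets: correct candidate index per query
    topk = similarity.topk(k, dim=1).indices
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy data: 3 queries against 100 candidates
sim = torch.randn(3, 100)
gt = torch.tensor([0, 42, 99])
mr = sum(recall_at_k(sim, gt, k) for k in (1, 5, 10)) / 3  # mean recall (MR)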
Flickr30K-CN Retrieval

| Task | Setup | Zero-shot (R@1) | Zero-shot (R@5) | Zero-shot (R@10) | Finetune (R@1) | Finetune (R@5) | Finetune (R@10) |
|---|---|---|---|---|---|---|---|
| Text-to-Image | Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 |
| Text-to-Image | R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 |
| Text-to-Image | CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 |
| Image-to-Text | Wukong | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| Image-to-Text | R2D2 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| Image-to-Text | CN-CLIP | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |
COCO-CN Retrieval

| Task | Setup | Zero-shot (R@1) | Zero-shot (R@5) | Zero-shot (R@10) | Finetune (R@1) | Finetune (R@5) | Finetune (R@10) |
|---|---|---|---|---|---|---|---|
| Text-to-Image | Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 |
| Text-to-Image | R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 |
| Text-to-Image | CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 |
| Image-to-Text | Wukong | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| Image-to-Text | R2D2 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| Image-to-Text | CN-CLIP | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |
Zero-shot Image Classification

| Model | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
|---|---|---|---|---|---|---|---|---|---|---|
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
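The zero-shot numbers above are obtained by matching image embeddings against text embeddings of class-name prompts and picking the closest class. A minimal sketch of that procedure with this checkpoint (the Chinese prompt template, class names, and image path below are illustrative assumptions, not the exact setup used for the reported results):

import torch
from PIL import Image
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

class_names = ["飞机", "汽车", "鸟", "猫", "狗"]          # airplane, automobile, bird, cat, dog (CIFAR10-style)
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a {class}" -- assumed template

image = Image.open("example.jpg")  # placeholder path; use any local image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits_per_image.softmax(dim=1).argmax(dim=1).item()
print(class_names[pred])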
Citation
If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!
@article{chinese-clip,
title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
journal={arXiv preprint arXiv:2211.01335},
year={2022}
}