Chinese-CLIP-ViT-Large-Patch14 Open-source Model - Supports Chinese Vision-Language Tasks

Chinese CLIP ViT Large Patch14

Developed by OFA-Sys

Chinese CLIP model based on the ViT architecture; supports Chinese vision-language tasks

Image Classification

Transformers

#Chinese Multimodal Understanding #Zero-shot Image Classification #Vision-Text Alignment

Downloads 2,333

Release Time : 11/9/2022

Model Overview

This is a Chinese CLIP model based on the Vision Transformer architecture, capable of joint representation learning for images and text, suitable for cross-modal retrieval and classification tasks.

Model Features

Chinese Cross-modal Understanding

A vision-language joint representation model optimized specifically for Chinese scenarios

Efficient Visual Encoding

Based on ViT architecture, capable of efficiently processing image inputs

Zero-shot Classification Capability

Supports zero-shot image classification based on text descriptions

Model Capabilities

Image-text matching

Cross-modal retrieval

Zero-shot image classification

Chinese vision-language understanding

Use Cases

Content Moderation

Inappropriate Content Detection

Detect inappropriate image content through text descriptions

Can identify specific types of inappropriate content

E-commerce

Product Search

Search for related product images through text descriptions

Improves product search accuracy
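
Text-to-image product search of this kind usually reduces to ranking precomputed image embeddings by their cosine similarity to the query's text embedding. The following is a minimal sketch under that assumption, not the model's own retrieval pipeline. The 3-dimensional vectors and product IDs are made-up stand-ins for the features that `get_image_features` and `get_text_features` would return, and `top_k_images` is a hypothetical helper rather than a library function.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_images(text_vec, image_index, k=2):
    """Rank images by cosine similarity to the query text embedding.

    image_index maps a (hypothetical) product ID to its image embedding.
    Returns the k best (product_id, score) pairs, highest score first.
    """
    query = l2_normalize(text_vec)
    scored = []
    for product_id, image_vec in image_index.items():
        image = l2_normalize(image_vec)
        score = sum(q * i for q, i in zip(query, image))
        scored.append((product_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real model features (assumed values).
image_index = {
    "red_dress": [0.9, 0.1, 0.0],
    "blue_jeans": [0.1, 0.9, 0.1],
    "red_skirt": [0.8, 0.2, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]  # stand-in for the embedding of a text query

results = top_k_images(query_embedding, image_index, k=2)
for product_id, score in results:
    print(f"{product_id}: {score:.4f}")
```

In a real catalogue the image embeddings would be computed once and stored, so only the text query needs to be encoded at search time.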

Social Media

Content Recommendation

Recommend related image-text content based on user interests

Enhances user engagement

🚀 Chinese-CLIP-ViT-Large-Patch14

This is a large-scale Chinese image-text matching model that computes similarity scores between images and Chinese text.

🚀 Quick Start

This is the large version of Chinese CLIP, with ViT-L/14 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repo https://github.com/OFA-Sys/Chinese-CLIP (Welcome to star! 🔥🔥)

💻 Usage Examples

Basic Usage

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0066, 0.0211, 0.0031, 0.9692]]

If you need more than the API, see our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for details on training and inference.
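
The last step of the usage example above turns image-text similarity logits into probabilities with a softmax. In CLIP-style models these logits are the cosine similarities of the normalized features multiplied by a learned scale. The sketch below reproduces that step in plain Python so it can be followed without loading the model. The similarity values and the scale of 100.0 are assumptions chosen for illustration, not numbers read from this checkpoint, and `classify` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(similarities, labels, logit_scale=100.0):
    """Turn image-text cosine similarities into label probabilities.

    logit_scale is an assumed temperature; the real model applies its own
    learned value before the softmax.
    """
    logits = [logit_scale * s for s in similarities]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


labels = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
similarities = [0.21, 0.22, 0.20, 0.26]  # made-up cosine similarities

predicted, probs = classify(similarities, labels)
print(predicted, [round(p, 4) for p in probs])
```

Because the scale is large, small gaps in cosine similarity become large gaps in probability, which is why one label can dominate even when the raw similarities look close.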

📚 Documentation

Results

MUGE Text-to-Image Retrieval

Setup	Zero-shot - R@1	Zero-shot - R@5	Zero-shot - R@10	Zero-shot - MR	Finetune - R@1	Finetune - R@5	Finetune - R@10	Finetune - MR
Wukong	42.7	69.0	78.0	63.2	52.7	77.9	85.6	72.1
R2D2	49.5	75.7	83.2	69.5	60.1	82.9	89.4	77.5
CN-CLIP	63.0	84.1	89.2	78.8	68.9	88.7	93.1	83.6
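
For readers unfamiliar with the column names: R@K is the percentage of queries with a correct image among the top K retrieved results, and MR (mean recall) is the average of R@1, R@5 and R@10. The sketch below illustrates both on toy data and checks the MR formula against the CN-CLIP rows above. The query and image IDs are invented, and the sketch assumes a query counts as a hit when at least one of its relevant images appears in the top K.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Percentage of queries with at least one relevant item in the top k."""
    hits = sum(
        1
        for query, ranked in ranked_lists.items()
        if ground_truth[query] & set(ranked[:k])
    )
    return 100.0 * hits / len(ranked_lists)


def mean_recall(r1, r5, r10):
    """MR as reported in the table: the average of R@1, R@5 and R@10."""
    return (r1 + r5 + r10) / 3


# Toy retrieval output: each query maps to image IDs, best match first.
ranked_lists = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
# Relevant images per query (invented for illustration).
ground_truth = {"q1": {"a"}, "q2": {"e"}, "q3": {"z"}, "q4": {"l"}}

toy_recalls = [recall_at_k(ranked_lists, ground_truth, k) for k in (1, 2, 3)]

# Recompute MR for the CN-CLIP rows from their R@1, R@5 and R@10 values.
cn_clip_zero_shot_mr = round(mean_recall(63.0, 84.1, 89.2), 1)
cn_clip_finetune_mr = round(mean_recall(68.9, 88.7, 93.1), 1)
print(toy_recalls, cn_clip_zero_shot_mr, cn_clip_finetune_mr)
```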

Flickr30K-CN Retrieval

Task	Text-to-Image - Zero-shot - R@1	Text-to-Image - Zero-shot - R@5	Text-to-Image - Zero-shot - R@10	Text-to-Image - Finetune - R@1	Text-to-Image - Finetune - R@5	Text-to-Image - Finetune - R@10	Image-to-Text - Zero-shot - R@1	Image-to-Text - Zero-shot - R@5	Image-to-Text - Zero-shot - R@10	Image-to-Text - Finetune - R@1	Image-to-Text - Finetune - R@5	Image-to-Text - Finetune - R@10
Wukong	51.7	78.9	86.3	77.4	94.5	97.0	76.1	94.8	97.5	92.7	99.1	99.6
R2D2	60.9	86.8	92.7	84.4	96.7	98.4	77.6	96.7	98.9	95.6	99.8	100.0
CN-CLIP	71.2	91.4	95.5	83.8	96.9	98.6	81.6	97.5	98.8	95.3	99.7	100.0

COCO-CN Retrieval

Task	Text-to-Image - Zero-shot - R@1	Text-to-Image - Zero-shot - R@5	Text-to-Image - Zero-shot - R@10	Text-to-Image - Finetune - R@1	Text-to-Image - Finetune - R@5	Text-to-Image - Finetune - R@10	Image-to-Text - Zero-shot - R@1	Image-to-Text - Zero-shot - R@5	Image-to-Text - Zero-shot - R@10	Image-to-Text - Finetune - R@1	Image-to-Text - Finetune - R@5	Image-to-Text - Finetune - R@10
Wukong	53.4	80.2	90.1	74.0	94.4	98.1	55.2	81.0	90.6	73.3	94.0	98.0
R2D2	56.4	85.0	93.1	79.1	96.5	98.9	63.3	89.3	95.7	79.3	97.1	98.7
CN-CLIP	69.2	89.9	96.1	81.5	96.9	99.1	63.0	86.6	92.9	83.5	97.3	99.2

Zero-shot Image Classification

Task	CIFAR10	CIFAR100	DTD	EuroSAT	FER	FGVC	KITTI	MNIST	PC	VOC
GIT	88.5	61.1	42.9	43.4	41.4	6.7	22.1	68.9	50.0	80.2
ALIGN	94.9	76.8	66.1	52.1	50.8	25.0	41.2	74.0	55.2	83.0
CLIP	94.9	77.0	56.0	63.0	48.3	33.3	11.5	79.0	62.3	84.0
Wukong	95.4	77.1	40.9	50.3	-	-	-	-	-	-
CN-CLIP	96.0	79.7	51.2	52.0	55.1	26.2	49.9	79.4	63.5	84.9

📄 Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
