Chinese-CLIP-ViT-Huge-Patch14
This is the huge version of the Chinese CLIP model, which uses ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs.
Quick Start
Features
- This is the huge version of Chinese CLIP.
- Uses ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder (see the config sketch below).
- Pretrained on a large-scale dataset of around 200 million Chinese image-text pairs.
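A quick way to confirm the two encoders from Python is to inspect the published model config (a minimal sketch; it fetches only the config, not the weights, and the values noted in the comments are what one would expect for ViT-H/14 and RoBERTa-wwm-large):

from transformers import ChineseCLIPConfig

config = ChineseCLIPConfig.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
print(config.vision_config.patch_size)   # patch size of the ViT image encoder (expected: 14)
print(config.vision_config.hidden_size)  # width of the ViT-H image encoder
print(config.text_config.hidden_size)    # width of the RoBERTa-wwm-large text encoder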
Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Candidate captions: Squirtle, Bulbasaur, Charmander, Pikachu (in Chinese)
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image features
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # probabilities over the candidate captions
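The softmax is taken over the candidate captions, so the largest probability gives the zero-shot prediction. A minimal follow-on, continuing from the variables above:

best = probs.argmax(dim=1).item()          # index of the best-matching caption
print(texts[best], probs[0, best].item())  # predicted label and its probability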
Documentation
This is the huge version of Chinese CLIP, with ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repo [https://github.com/OFA-Sys/Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP) (Welcome to star!)
If you are not satisfied with only using the API, feel free to check our GitHub repo [https://github.com/OFA-Sys/Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP) for more details about training and inference.
Results
MUGE Text-to-Image Retrieval

| Setup | Zero-shot (R@1) | Zero-shot (R@5) | Zero-shot (R@10) | Zero-shot (MR) | Finetune (R@1) | Finetune (R@5) | Finetune (R@10) | Finetune (MR) |
|---|---|---|---|---|---|---|---|---|
| Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
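R@K is the fraction of queries whose ground-truth item appears among the top-K retrieved candidates, and MR is the mean of R@1, R@5, and R@10. A minimal sketch of that computation from a query-by-candidate similarity matrix (illustrative only; the helper and toy data below are assumptions, not the benchmark's official evaluation code):

import torch

def recall_at_k(similarity: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    # similarity: [num_queries, num_candidates]; targets: correct candidate index per query
    topk = similarity.topk(k, dim=1).indices
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy data: 3 queries against 100 candidates
sim = torch.randn(3, 100)
gt = torch.tensor([0, 42, 99])
mr = sum(recall_at_k(sim, gt, k) for k in (1, 5, 10)) / 3  # mean recall (MR)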
Flickr30K-CN Retrieval

| Task | Setup | Zero-shot (R@1) | Zero-shot (R@5) | Zero-shot (R@10) | Finetune (R@1) | Finetune (R@5) | Finetune (R@10) |
|---|---|---|---|---|---|---|---|
| Text-to-Image | Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 |
| Text-to-Image | R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 |
| Text-to-Image | CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 |
| Image-to-Text | Wukong | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| Image-to-Text | R2D2 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| Image-to-Text | CN-CLIP | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |
COCO-CN Retrieval

| Task | Setup | Zero-shot (R@1) | Zero-shot (R@5) | Zero-shot (R@10) | Finetune (R@1) | Finetune (R@5) | Finetune (R@10) |
|---|---|---|---|---|---|---|---|
| Text-to-Image | Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 |
| Text-to-Image | R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 |
| Text-to-Image | CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 |
| Image-to-Text | Wukong | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| Image-to-Text | R2D2 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| Image-to-Text | CN-CLIP | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |
Zero-shot Image Classification

| Model | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
|---|---|---|---|---|---|---|---|---|---|---|
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
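The zero-shot numbers above are obtained by matching image embeddings against text embeddings of class-name prompts and picking the closest class. A minimal sketch of that procedure with this checkpoint (the Chinese prompt template, class names, and image path below are illustrative assumptions, not the exact setup used for the reported results):

import torch
from PIL import Image
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

class_names = ["飞机", "汽车", "鸟", "猫", "狗"]          # airplane, automobile, bird, cat, dog (CIFAR10-style)
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a {class}" -- assumed template

image = Image.open("example.jpg")  # placeholder path; use any local image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits_per_image.softmax(dim=1).argmax(dim=1).item()
print(class_names[pred])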
Citation
If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!
@article{chinese-clip,
title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
journal={arXiv preprint arXiv:2211.01335},
year={2022}
}