Open-source chinese-clip-vit-base-patch16 model: supports image-text matching and retrieval applications

Chinese Clip Vit Base Patch16

Developed by OFA-Sys

The base version of Chinese CLIP, with ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on a large-scale dataset of about 200 million Chinese image-text pairs.

Text-to-Image

Transformers

#Chinese image retrieval #Zero-shot learning #Multimodal embeddings

Downloads: 49.02k

Release date: 11/9/2022

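Model Overview

Chinese CLIP is a vision-language model that computes image and text embeddings and the similarity between them, and supports Chinese image-text retrieval and classification tasks.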

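Model Features

Optimized for Chinese

Optimized specifically for Chinese-language scenarios; supports Chinese image-text retrieval and classification tasks.

Large-scale training

Trained on a large-scale dataset of about 200 million Chinese image-text pairs, which gives it strong generalization ability.

Multi-task support

Supports a range of vision-language tasks, including image-text retrieval and image classification.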

Model Capabilities

Image and text embedding computation

Image-text similarity computation

Chinese image-text retrieval

Zero-shot image classification

Use Cases

E-commerce

Product search

Retrieve relevant product images from a text description

Zero-shot R@1 of 63.0 on MUGE text-to-image retrieval
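
The product-search use case amounts to ranking precomputed image embeddings against the embedding of a text query. The following is a minimal sketch of that ranking step, assuming the embeddings have already been produced by the model (for example with `get_image_features` and `get_text_features`) and are passed in as plain Python lists. The toy 3-dimensional vectors, the file names, and the helper names `l2_normalize` and `top_k` are hypothetical and only serve the illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, catalog, k=2):
    """Return the k catalog entries most similar to the query, best first."""
    q = l2_normalize(query_vec)
    scored = []
    for name, vec in catalog.items():
        v = l2_normalize(vec)
        score = sum(a * b for a, b in zip(q, v))  # cosine similarity
        scored.append((name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy stand-ins for real image embeddings (hypothetical values and file names).
catalog = {
    "red_dress.jpg": [0.9, 0.1, 0.0],
    "blue_jeans.jpg": [0.1, 0.9, 0.1],
    "red_shoes.jpg": [0.7, 0.2, 0.6],
}
query = [1.0, 0.0, 0.1]  # stand-in for the embedding of a text query

results = top_k(query, catalog, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

For a large catalog, the image embeddings would normally be normalized once and stored, so that only the query needs to be embedded at search time.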

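Content moderation

Policy-violating content detection

Detect policy-violating images by matching them against text descriptions

Social media

Image-text matching

Find the best-matching text description for an image among a set of candidates

Zero-shot image-to-text R@1 of 81.6 on Flickr30K-CN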

🚀 Chinese-CLIP-ViT-Base-Patch16

This model is the base version of Chinese CLIP, with ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is an implementation of CLIP trained on a large-scale dataset of about 200 million Chinese image-text pairs. For details, see the technical report https://arxiv.org/abs/2211.01335 and the official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome 🔥🔥).

🚀 Quick Start

Overview

This section shows how to use the Chinese-CLIP-ViT-Base-Patch16 API to compute image and text embeddings and the similarity between them.

Code example

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Chinese names for Squirtle, Bulbasaur, Charmander, Pikachu
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]
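
To read off the prediction, take the index of the largest probability and look up the caption at that position. The sketch below does this in plain Python, using the probabilities quoted in the comment on the last line of the example. The variable names `best_index` and `best_text` are illustrative; in practice `probs` would be the tensor returned by the model rather than a hard-coded list.

```python
# Captions and probabilities copied from the example above.
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
probs = [1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]

# Index of the highest-probability caption.
best_index = max(range(len(probs)), key=lambda i: probs[i])
best_text = texts[best_index]

print(best_text, probs[best_index])  # the caption the model scores highest
```

With the tensor from the example, `probs.argmax(dim=1)` yields the same index for each image in the batch.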

Additional information

If the API alone does not cover your needs, see the official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP for details on training and inference.

✨ Key Features

Model architecture

Image encoder: ViT-B/16
Text encoder: RoBERTa-wwm-base

Dataset

Trained on a large-scale dataset of about 200 million Chinese image-text pairs.

📚 Documentation

Experimental results

MUGE Text-to-Image Retrieval

Setup	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	Zero-shot MR	Finetune R@1	Finetune R@5	Finetune R@10	Finetune MR
Wukong	42.7	69.0	78.0	63.2	52.7	77.9	85.6	72.1
R2D2	49.5	75.7	83.2	69.5	60.1	82.9	89.4	77.5
CN-CLIP	63.0	84.1	89.2	78.8	68.9	88.7	93.1	83.6
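
For readers unfamiliar with the metrics: R@K is the percentage of queries whose correct item appears among the top K results, and the MR column is consistent with the mean of R@1, R@5, and R@10 (for the zero-shot CN-CLIP row, (63.0 + 84.1 + 89.2) / 3 ≈ 78.8). The following is a minimal sketch of both computations, assuming each query has exactly one correct item. The function names and the toy ranks are hypothetical, and this is not the official evaluation script.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose correct item is ranked within the top k.

    `ranks` holds the 1-based rank of the correct item for each query.
    """
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)

def mean_recall(r1, r5, r10):
    """Average of R@1, R@5 and R@10."""
    return (r1 + r5 + r10) / 3

# Toy ranks for five queries (hypothetical data).
ranks = [1, 3, 12, 1, 7]
r1 = recall_at_k(ranks, 1)
r5 = recall_at_k(ranks, 5)
r10 = recall_at_k(ranks, 10)
print(r1, r5, r10)

# Recompute the MR value from the zero-shot CN-CLIP row of the table.
mr = mean_recall(63.0, 84.1, 89.2)
print(round(mr, 1))
```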

Flickr30K-CN Retrieval

Task	Setup	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	Finetune R@1	Finetune R@5	Finetune R@10
Text-to-Image	Wukong	51.7	78.9	86.3	77.4	94.5	97.0
Text-to-Image	R2D2	60.9	86.8	92.7	84.4	96.7	98.4
Text-to-Image	CN-CLIP	71.2	91.4	95.5	83.8	96.9	98.6
Image-to-Text	Wukong	76.1	94.8	97.5	92.7	99.1	99.6
Image-to-Text	R2D2	77.6	96.7	98.9	95.6	99.8	100.0
Image-to-Text	CN-CLIP	81.6	97.5	98.8	95.3	99.7	100.0

COCO-CN Retrieval

Task	Setup	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	Finetune R@1	Finetune R@5	Finetune R@10
Text-to-Image	Wukong	53.4	80.2	90.1	74.0	94.4	98.1
Text-to-Image	R2D2	56.4	85.0	93.1	79.1	96.5	98.9
Text-to-Image	CN-CLIP	69.2	89.9	96.1	81.5	96.9	99.1
Image-to-Text	Wukong	55.2	81.0	90.6	73.3	94.0	98.0
Image-to-Text	R2D2	63.3	89.3	95.7	79.3	97.1	98.7
Image-to-Text	CN-CLIP	63.0	86.6	92.9	83.5	97.3	99.2

Zero-shot Image Classification

Model	CIFAR10	CIFAR100	DTD	EuroSAT	FER	FGVC	KITTI	MNIST	PC	VOC
GIT	88.5	61.1	42.9	43.4	41.4	6.7	22.1	68.9	50.0	80.2
ALIGN	94.9	76.8	66.1	52.1	50.8	25.0	41.2	74.0	55.2	83.0
CLIP	94.9	77.0	56.0	63.0	48.3	33.3	11.5	79.0	62.3	84.0
Wukong	95.4	77.1	40.9	50.3	-	-	-	-	-	-
CN-CLIP	96.0	79.7	51.2	52.0	55.1	26.2	49.9	79.4	63.5	84.9
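
Zero-shot classification with a CLIP-style model is commonly done by turning each class name into one or more text prompts, embedding them, and picking the class whose text embedding is most similar to the image embedding. The following is a minimal sketch of that procedure with prompt averaging, assuming a caller-supplied `embed_text` function. The templates, the class names, and the toy `fake_embed` stand-in are hypothetical; they are not the prompts behind the numbers in the table.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def class_embedding(label, templates, embed_text):
    """Average the normalized embeddings of all prompts for one class."""
    vecs = [l2_normalize(embed_text(t.format(label))) for t in templates]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return l2_normalize(mean)

def classify(image_vec, labels, templates, embed_text):
    """Return the best-matching label and the per-label cosine similarities."""
    img = l2_normalize(image_vec)
    scores = {}
    for label in labels:
        cls = class_embedding(label, templates, embed_text)
        scores[label] = sum(a * b for a, b in zip(img, cls))
    return max(scores, key=scores.get), scores

# Hypothetical prompt templates and a toy embedder standing in for the model.
templates = ["一张{}的照片", "{}的图片"]
toy_vectors = {"猫": [1.0, 0.0], "狗": [0.0, 1.0]}

def fake_embed(prompt):
    # Toy rule: return the vector of the class word contained in the prompt.
    for word, vec in toy_vectors.items():
        if word in prompt:
            return vec
    return [1.0, 1.0]

label, scores = classify([0.9, 0.2], ["猫", "狗"], templates, fake_embed)
print(label)
```

In a real setup, `embed_text` would wrap the processor and `get_text_features` call from the quick start, and `image_vec` would come from `get_image_features`.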

Citation

If you find Chinese CLIP helpful, please cite the following paper. Thank you for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}