chinese-clip-vit-large-patch14-336px Open-Source Model - Powering Chinese Image-Text Matching Applications


Chinese Clip Vit Large Patch14 336px

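Developed by OFA-Sys

Chinese CLIP is a simple implementation of CLIP trained on a dataset of about 200 million Chinese image-text pairs. It uses ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.

Text-to-Image

Transformers

#ChineseImageTextRetrieval #ZeroShotLearning #MultimodalPretraining

Downloads: 713

Release date: 11/9/2022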

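Model Overview

A large-scale Chinese vision-language pretrained model that supports tasks such as image-text similarity scoring and cross-modal retrieval.

Model Features

Large-scale Chinese pretraining

Trained on a dataset of about 200 million Chinese image-text pairs, which gives it a better grasp of Chinese-language content

High-performance cross-modal retrieval

Reported results on Chinese benchmarks such as MUGE and Flickr30K-CN are state of the art or competitive with prior models (see the results tables below)

Zero-shot transfer

Supports zero-shot image classification and cross-modal retrieval

Model Capabilities

Image-text similarity scoring

Text-to-image retrieval

Image-to-text retrieval

Zero-shot image classification

Use Cases

E-commerce

Product image-text matching

Automatically matches product images with their text descriptions

Improves product search accuracy

Content moderation

Violation detection

Detects violating content in which the image and the text do not match

Improves moderation efficiency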

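🚀 Chinese-CLIP-ViT-Large-Patch14-336px

This is the large version of Chinese CLIP, with ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome! 🔥🔥)

🚀 Quick Start

✨ Key Features

This is the large version of Chinese CLIP. It pairs a ViT-L/14@336px image encoder with a RoBERTa-wwm-base text encoder and is trained on a large-scale dataset of about 200 million Chinese image-text pairs.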

💻 Usage Examples

Basic Usage

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["傑尼龜", "妙蛙種子", "小火龍", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0219, 0.0316, 0.0043, 0.9423]]
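
The snippet above normalizes the image and text features and then reads `logits_per_image` from the full forward pass. As a rough sketch of how the two relate in a CLIP-style model: for L2-normalized features the dot product is the cosine similarity, the similarities are multiplied by a scale factor, and a softmax over the texts yields `probs`. The example below uses plain Python and toy vectors so it runs without downloading the model. The helper names and the value `logit_scale=100.0` are assumptions for illustration only; the real model learns its own scale.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_probs(image_feature, text_features, logit_scale):
    """Cosine similarities scaled by `logit_scale`, then softmax over the texts.

    Features are assumed to be L2-normalized already, so a dot product
    equals the cosine similarity.
    """
    cosines = [
        sum(i * t for i, t in zip(image_feature, text))
        for text in text_features
    ]
    return softmax([logit_scale * c for c in cosines])


# Toy unit-length vectors standing in for real model features (hypothetical values).
image_feature = [0.6, 0.8]
text_features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

# Placeholder scale (assumption); the real model learns its own value.
probs = similarity_probs(image_feature, text_features, logit_scale=100.0)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # index of the text most similar to the image
```

Because the scale factor is large, small differences in cosine similarity turn into sharply peaked probabilities, which is consistent with the 0.9423 score for the best match in the example output above.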

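📚 Detailed Documentation

Experimental Results

MUGE Text-to-Image Retrieval

Model	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	Zero-shot MR	Fine-tuned R@1	Fine-tuned R@5	Fine-tuned R@10	Fine-tuned MR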
Wukong	42.7	69.0	78.0	63.2	52.7	77.9	85.6	72.1
R2D2	49.5	75.7	83.2	69.5	60.1	82.9	89.4	77.5
CN-CLIP	63.0	84.1	89.2	78.8	68.9	88.7	93.1	83.6
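
In the table above, R@K is recall at K, and the MR column is consistent with the mean of R@1, R@5, and R@10 (for example, (63.0 + 84.1 + 89.2) / 3 ≈ 78.8). The following is a minimal sketch of both metrics, assuming a query counts as a hit when at least one ground-truth item appears in its top K results. `recall_at_k` and `mean_recall` are illustrative helper names, not functions from the Chinese-CLIP repository.

```python
def recall_at_k(ranked_ids, ground_truth, k):
    """Percentage of queries with at least one ground-truth id in the top k.

    ranked_ids:   list of ranked candidate-id lists, one per query
    ground_truth: list of sets of correct ids, one per query
    """
    hits = sum(
        1 for ranking, truth in zip(ranked_ids, ground_truth)
        if truth & set(ranking[:k])
    )
    return 100.0 * hits / len(ranked_ids)


def mean_recall(r1, r5, r10):
    """MR as the arithmetic mean of R@1, R@5 and R@10."""
    return (r1 + r5 + r10) / 3


# Toy rankings for three queries (hypothetical ids).
ranked = [[7, 2, 9], [4, 1, 3], [5, 8, 6]]
truth = [{7}, {3}, {0}]
print(recall_at_k(ranked, truth, 1))  # one of three queries hits at rank 1
print(recall_at_k(ranked, truth, 3))  # two of three hit within the top 3

# Zero-shot CN-CLIP row from the MUGE table.
print(round(mean_recall(63.0, 84.1, 89.2), 1))  # 78.8
```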

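Flickr30K-CN Retrieval

Model	Text-to-Image Zero-shot R@1	Text-to-Image Zero-shot R@5	Text-to-Image Zero-shot R@10	Text-to-Image Fine-tuned R@1	Text-to-Image Fine-tuned R@5	Text-to-Image Fine-tuned R@10	Image-to-Text Zero-shot R@1	Image-to-Text Zero-shot R@5	Image-to-Text Zero-shot R@10	Image-to-Text Fine-tuned R@1	Image-to-Text Fine-tuned R@5	Image-to-Text Fine-tuned R@10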
Wukong	51.7	78.9	86.3	77.4	94.5	97.0	76.1	94.8	97.5	92.7	99.1	99.6
R2D2	60.9	86.8	92.7	84.4	96.7	98.4	77.6	96.7	98.9	95.6	99.8	100.0
CN-CLIP	71.2	91.4	95.5	83.8	96.9	98.6	81.6	97.5	98.8	95.3	99.7	100.0
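
The table reports retrieval in both directions. Both can be read off a single image-text similarity matrix: ranking along one axis gives image-to-text results, and ranking along the other gives text-to-image results. A small sketch under that assumption follows; the matrix values and the helper names `rank_rows` and `transpose` are hypothetical.

```python
def rank_rows(sim):
    """For each row, return column indices sorted by descending similarity."""
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in sim
    ]


def transpose(sim):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(col) for col in zip(*sim)]


# sim[i][j]: similarity of image i and text j (hypothetical values).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

image_to_text = rank_rows(sim)             # one text ranking per image
text_to_image = rank_rows(transpose(sim))  # one image ranking per text
print(image_to_text)
print(text_to_image)
```

In the usage example earlier, `outputs.logits_per_image` plays the role of `sim`; its transpose corresponds to the text-to-image direction.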

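COCO-CN Retrieval

Model	Text-to-Image Zero-shot R@1	Text-to-Image Zero-shot R@5	Text-to-Image Zero-shot R@10	Text-to-Image Fine-tuned R@1	Text-to-Image Fine-tuned R@5	Text-to-Image Fine-tuned R@10	Image-to-Text Zero-shot R@1	Image-to-Text Zero-shot R@5	Image-to-Text Zero-shot R@10	Image-to-Text Fine-tuned R@1	Image-to-Text Fine-tuned R@5	Image-to-Text Fine-tuned R@10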
Wukong	53.4	80.2	90.1	74.0	94.4	98.1	55.2	81.0	90.6	73.3	94.0	98.0
R2D2	56.4	85.0	93.1	79.1	96.5	98.9	63.3	89.3	95.7	79.3	97.1	98.7
CN-CLIP	69.2	89.9	96.1	81.5	96.9	99.1	63.0	86.6	92.9	83.5	97.3	99.2

Zero-shot Image Classification

Model	CIFAR10	CIFAR100	DTD	EuroSAT	FER	FGVC	KITTI	MNIST	PC	VOC
GIT	88.5	61.1	42.9	43.4	41.4	6.7	22.1	68.9	50.0	80.2
ALIGN	94.9	76.8	66.1	52.1	50.8	25.0	41.2	74.0	55.2	83.0
CLIP	94.9	77.0	56.0	63.0	48.3	33.3	11.5	79.0	62.3	84.0
Wukong	95.4	77.1	40.9	50.3	-	-	-	-	-	-
CN-CLIP	96.0	79.7	51.2	52.0	55.1	26.2	49.9	79.4	63.5	84.9
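
The source does not state the metric for this table; the values are presumably classification accuracies in percent. With a CLIP-style model, zero-shot classification is usually done by turning each class name into a text prompt, scoring the image against every prompt, and picking the best-scoring class. The sketch below illustrates that procedure on made-up similarity scores. The prompt template, class names, scores, and helper names are all hypothetical and are not taken from the Chinese-CLIP evaluation code.

```python
def build_prompts(class_names, template="一张{}的照片"):
    """Wrap each class name in a text prompt (the template is a hypothetical example)."""
    return [template.format(name) for name in class_names]


def predict(scores):
    """Index of the highest-scoring prompt for one image."""
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(all_scores, labels):
    """Percentage of images whose best-scoring prompt matches the label."""
    correct = sum(1 for s, y in zip(all_scores, labels) if predict(s) == y)
    return 100.0 * correct / len(labels)


class_names = ["猫", "狗", "鸟"]  # cat, dog, bird
prompts = build_prompts(class_names)

# Hypothetical image-to-prompt similarity scores for four images.
all_scores = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
]
labels = [0, 1, 0, 2]
print(prompts[0])
print(top1_accuracy(all_scores, labels))  # 3 of 4 images are classified correctly
```

In practice the prompts would be passed to the processor as `texts`, and each row of `all_scores` would come from `logits_per_image` for one image.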

📚 Citation

If you find Chinese CLIP helpful, please cite our paper. Thank you for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}