chinese-clip-vit-huge-patch14: an open-source multimodal model for Chinese vision-language tasks

Chinese Clip Vit Huge Patch14

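Developed by OFA-Sys

Chinese CLIP is a multimodal model built on the Vision Transformer architecture that supports Chinese vision-language tasks.

Image classification

Transformers

#MultimodalUnderstanding #ZeroShotClassification #ChineseVisualRecognition

Downloads: 623

Release date: 11/9/2022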

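Model overview

The model combines vision and language processing. It learns the association between Chinese text and images, which makes it suitable for cross-modal retrieval and classification tasks.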

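Model features

Chinese multimodal understanding

Optimized specifically for Chinese-language scenarios; processes image input and Chinese text input together.

Vision Transformer architecture

Uses the ViT-H/14 structure as the image encoder, which splits each image into 14x14-pixel patches.

Zero-shot classification

Classifies images from text prompts alone, with no fine-tuning required.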

Model capabilities

Image-text matching

Cross-modal retrieval

Zero-shot image classification

Chinese scene understanding

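Use cases

Content moderation

Violating-content detection

Detects violating image content from text descriptions.

Can identify sensitive content in specific scenarios.

E-commerce

Product search

Finds matching product images from natural-language descriptions.

Improves search accuracy and user experience.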
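
The product search scenario can be sketched as ranking precomputed image embeddings against a query text embedding by cosine similarity. The snippet below is a minimal pure-Python sketch under that assumption; the `rank_images` helper, the product IDs, and the toy vectors are hypothetical and stand in for embeddings that would come from the model's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_vec, gallery, top_k=2):
    """Return the top_k (image_id, score) pairs, highest similarity first."""
    scored = [(image_id, cosine(query_vec, vec)) for image_id, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional embeddings (made up for illustration only).
gallery = {
    "sku-001": [0.9, 0.1, 0.0],
    "sku-002": [0.1, 0.8, 0.3],
    "sku-003": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for a text embedding of the search phrase
results = rank_images(query, gallery, top_k=2)
result_ids = [image_id for image_id, _ in results]
```

In a real catalog the image embeddings would typically be computed once and stored, so that only the query text needs to be encoded at search time.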

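🚀 Chinese-CLIP-ViT-Huge-Patch14

This is the huge version of Chinese CLIP, with ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs.

🚀 Quick start

Using the official API

The following snippet shows how to use the Chinese CLIP API to compute image and text embeddings and their similarity.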

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["傑尼龜", "妙蛙種子", "小火龍", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[1.1419e-02, 1.0478e-02, 5.2018e-04, 9.7758e-01]]

If the API alone is not enough for your needs, see our GitHub repository https://github.com/OFA-Sys/Chinese-CLIP for more details on training and inference.
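
To make the link between the normalized features and the similarity scores in the snippet above more concrete, here is a minimal pure-Python sketch. It assumes CLIP-style scoring, in which each logit is the cosine similarity between the image and text embeddings multiplied by a temperature (logit scale), followed by a softmax over the candidate texts. The toy vectors and the scale value of 100.0 are illustrative assumptions, not values read from this checkpoint.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def image_text_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    logit_scale is a hypothetical temperature; a trained model stores its own value.
    """
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    return softmax(logits)

# Toy 3-dimensional embeddings (made up for illustration only).
toy_image = [0.9, 0.1, 0.2]
toy_texts = [[0.1, 0.9, 0.0], [0.0, 0.2, 0.9], [0.8, 0.2, 0.1]]
probs = image_text_probs(toy_image, toy_texts)
best_index = max(range(len(probs)), key=probs.__getitem__)
```

Because the embeddings are normalized first, the dot product is bounded in [-1, 1]; the temperature is what makes the softmax output as sharply peaked as the probabilities shown in the snippet above.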

✨ Main features

Model information

For more details, see our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome! 🔥🔥)

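Experimental results

MUGE text-to-image retrieval

Model	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	Zero-shot MR	Fine-tuned R@1	Fine-tuned R@5	Fine-tuned R@10	Fine-tuned MR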
Wukong	42.7	69.0	78.0	63.2	52.7	77.9	85.6	72.1
R2D2	49.5	75.7	83.2	69.5	60.1	82.9	89.4	77.5
CN-CLIP	63.0	84.1	89.2	78.8	68.9	88.7	93.1	83.6
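
For readers unfamiliar with the metrics in this table: R@K is the percentage of queries whose correct item appears among the top K retrieved results, and the MR column matches the mean of R@1, R@5, and R@10 (for example, (63.0 + 84.1 + 89.2) / 3 ≈ 78.8 for zero-shot CN-CLIP). The following is a minimal sketch of that computation; the helper names and the toy rankings are hypothetical and are not taken from the official evaluation code.

```python
def recall_at_k(rankings, ground_truth, k):
    """Percentage of queries whose ground-truth item is within the top-k results."""
    hits = sum(
        1 for query, ranked in rankings.items() if ground_truth[query] in ranked[:k]
    )
    return 100.0 * hits / len(rankings)

def mean_recall(r1, r5, r10):
    """Average of R@1, R@5, and R@10."""
    return (r1 + r5 + r10) / 3.0

# Toy retrieval output: each query maps to image IDs ordered best-first.
rankings = {
    "q1": ["img_a", "img_b", "img_c"],
    "q2": ["img_c", "img_b", "img_a"],
    "q3": ["img_b", "img_a", "img_c"],
    "q4": ["img_c", "img_a", "img_b"],
}
ground_truth = {"q1": "img_a", "q2": "img_b", "q3": "img_c", "q4": "img_c"}

r_at_1 = recall_at_k(rankings, ground_truth, 1)  # q1 and q4 hit
r_at_2 = recall_at_k(rankings, ground_truth, 2)  # q1, q2, and q4 hit

# Reproduce the zero-shot MR reported for CN-CLIP in the table above.
cn_clip_zero_shot_mr = round(mean_recall(63.0, 84.1, 89.2), 1)
```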

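Flickr30K-CN retrieval

Model	Text-to-image (zero-shot R@1)	Text-to-image (zero-shot R@5)	Text-to-image (zero-shot R@10)	Text-to-image (fine-tuned R@1)	Text-to-image (fine-tuned R@5)	Text-to-image (fine-tuned R@10)	Image-to-text (zero-shot R@1)	Image-to-text (zero-shot R@5)	Image-to-text (zero-shot R@10)	Image-to-text (fine-tuned R@1)	Image-to-text (fine-tuned R@5)	Image-to-text (fine-tuned R@10)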
Wukong	51.7	78.9	86.3	77.4	94.5	97.0	76.1	94.8	97.5	92.7	99.1	99.6
R2D2	60.9	86.8	92.7	84.4	96.7	98.4	77.6	96.7	98.9	95.6	99.8	100.0
CN-CLIP	71.2	91.4	95.5	83.8	96.9	98.6	81.6	97.5	98.8	95.3	99.7	100.0

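COCO-CN retrieval

Model	Text-to-image (zero-shot R@1)	Text-to-image (zero-shot R@5)	Text-to-image (zero-shot R@10)	Text-to-image (fine-tuned R@1)	Text-to-image (fine-tuned R@5)	Text-to-image (fine-tuned R@10)	Image-to-text (zero-shot R@1)	Image-to-text (zero-shot R@5)	Image-to-text (zero-shot R@10)	Image-to-text (fine-tuned R@1)	Image-to-text (fine-tuned R@5)	Image-to-text (fine-tuned R@10)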
Wukong	53.4	80.2	90.1	74.0	94.4	98.1	55.2	81.0	90.6	73.3	94.0	98.0
R2D2	56.4	85.0	93.1	79.1	96.5	98.9	63.3	89.3	95.7	79.3	97.1	98.7
CN-CLIP	69.2	89.9	96.1	81.5	96.9	99.1	63.0	86.6	92.9	83.5	97.3	99.2

Zero-shot image classification

Model	CIFAR10	CIFAR100	DTD	EuroSAT	FER	FGVC	KITTI	MNIST	PC	VOC
GIT	88.5	61.1	42.9	43.4	41.4	6.7	22.1	68.9	50.0	80.2
ALIGN	94.9	76.8	66.1	52.1	50.8	25.0	41.2	74.0	55.2	83.0
CLIP	94.9	77.0	56.0	63.0	48.3	33.3	11.5	79.0	62.3	84.0
Wukong	95.4	77.1	40.9	50.3	-	-	-	-	-	-
CN-CLIP	96.0	79.7	51.2	52.0	55.1	26.2	49.9	79.4	63.5	84.9

📚 Detailed documentation

Citation

If you find Chinese CLIP helpful, please cite our paper. Thank you for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}