TinyCLIP open-source cross-modal model: language-image matching that balances speed and accuracy

TinyCLIP-ViT-39M-16-Text-19M-YFCC15M

Developed by wkcn

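TinyCLIP is a novel cross-modal distillation method for large-scale language-image pre-trained models. It combines affinity mimicking with weight inheritance to reach the best trade-off between speed and accuracy.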

Zero-shot image classification

Transformers

License: MIT #zero-shot-classification #cross-modal-distillation #lightweight-CLIP

Downloads: 654

Released: 12/19/2023

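Model overview

TinyCLIP is a cross-modal distillation method that uses affinity mimicking and weight inheritance to unlock the potential of small CLIP models. It draws on both large-scale models and pre-training data, and is suited to zero-shot image classification.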

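Model features

Affinity mimicking

The small model learns to mimic the cross-modal (image-text) affinity relations of a large CLIP model, which improves its performance.

Weight inheritance

Weights are inherited from the large model, either automatically or manually, which speeds up training and improves the resulting model.

Efficient inference

Up to 2x faster inference with 50% fewer parameters (reported for TinyCLIP ResNet-19M), while keeping accuracy competitive.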
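
The affinity mimicking idea described above can be illustrated with a small, framework-free sketch. It assumes the distillation loss is a cross-entropy between the teacher's and the student's softmax-normalized image-text similarities, taken in both the image-to-text and text-to-image directions. The function names and the `temperature` parameter are illustrative choices, and this is not the official implementation.

```python
import math

def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def affinity_mimicking_loss(teacher_sim, student_sim, temperature=1.0):
    """Sum of image-to-text and text-to-image affinity cross-entropies.

    teacher_sim and student_sim are B x B similarity matrices for the
    same batch (rows = images, columns = texts). This loss form is an
    assumption made for illustration.
    """
    def one_direction(t_sim, s_sim):
        losses = [
            cross_entropy(softmax(t, temperature), softmax(s, temperature))
            for t, s in zip(t_sim, s_sim)
        ]
        return sum(losses) / len(losses)

    image_to_text = one_direction(teacher_sim, student_sim)
    text_to_image = one_direction(transpose(teacher_sim), transpose(student_sim))
    return image_to_text + text_to_image

# Toy 2 x 2 similarity matrices (made-up numbers).
teacher = [[5.0, 1.0], [0.5, 4.0]]
swapped = [[1.0, 5.0], [4.0, 0.5]]
print(affinity_mimicking_loss(teacher, teacher))  # student matches teacher
print(affinity_mimicking_loss(teacher, swapped))  # student disagrees with teacher
```

The loss is smallest when the student reproduces the teacher's affinities exactly, so minimizing it pulls the student's image-text similarity structure toward the teacher's.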

Model capabilities

Zero-shot image classification

Cross-modal retrieval

Image-text matching

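Use cases

Image classification

Animal recognition

Identify the animal category shown in an image.

This checkpoint reaches 63.5% zero-shot top-1 accuracy on ImageNet-1K; the smaller TinyCLIP ResNet-19M reaches 56.4%.

Content retrieval

Image-text matching

Retrieve relevant images from a text description.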
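
The retrieval use case above comes down to ranking images by the cosine similarity between their embeddings and the embedding of the text query. The sketch below uses made-up 3-dimensional vectors in place of real embeddings; in practice they could come from `CLIPModel.get_text_features` and `CLIPModel.get_image_features`. The helper names `l2_normalize`, `cosine_similarity`, and `retrieve` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(text_embedding, image_embeddings, top_k=3):
    """Return (image_id, score) pairs sorted by descending cosine similarity."""
    scored = [
        (image_id, cosine_similarity(text_embedding, emb))
        for image_id, emb in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for real model outputs.
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of "a photo of a cat"
results = retrieve(query, gallery, top_k=2)
print(results)
```

For a large gallery the image embeddings would normally be computed once and cached, so each new query costs only one text-encoder pass plus the similarity ranking.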

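🚀 TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

TinyCLIP is a novel cross-modal distillation method for large-scale language-image pre-trained models. It introduces two core techniques: affinity mimicking and weight inheritance. This work unleashes the capacity of small CLIP models by fully leveraging large-scale models as well as pre-training data, striking the best trade-off between speed and accuracy.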
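
To make weight inheritance concrete, here is a minimal sketch of one way a shallower student could be initialized from a deeper teacher: copy a subset of the teacher's layers. The selection rule used here (evenly spaced layer indices) is an assumption chosen for illustration, not the rule used by the authors, and the layer dictionaries are toy stand-ins for real weights.

```python
import copy

def evenly_spaced_indices(num_teacher_layers, num_student_layers):
    """Pick student-many layer indices spread evenly over the teacher depth."""
    if not 1 <= num_student_layers <= num_teacher_layers:
        raise ValueError("student depth must be between 1 and teacher depth")
    if num_student_layers == 1:
        return [0]
    last = num_teacher_layers - 1
    gaps = num_student_layers - 1
    # Integer rounding of i * last / gaps avoids floating-point surprises.
    return [(i * last + gaps // 2) // gaps for i in range(num_student_layers)]

def inherit_layers(teacher_layers, num_student_layers):
    """Initialize a student by deep-copying a subset of the teacher's layers."""
    keep = evenly_spaced_indices(len(teacher_layers), num_student_layers)
    return [copy.deepcopy(teacher_layers[i]) for i in keep]

# Toy teacher: 12 "layers", each a dict of named weights.
teacher = [{"name": f"block_{i}", "weight": [float(i)] * 4} for i in range(12)]
student = inherit_layers(teacher, 6)
print([layer["name"] for layer in student])
```

The student starts from trained weights instead of a random initialization, which is the motivation the text gives for inheritance speeding up training.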

🚀 Quick start

Use with the Transformers library

from PIL import Image
import requests
import torch

from transformers import CLIPProcessor, CLIPModel

# Load the TinyCLIP checkpoint and its matching preprocessor from the Hub.
model = CLIPModel.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")
processor = CLIPProcessor.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")

# Download a sample COCO image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the candidate captions and preprocess the image in one call.
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the captions gives label probabilities
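
The last step of the snippet, `logits_per_image.softmax(dim=1)`, turns one similarity score per caption into probabilities that sum to 1. The plain-Python sketch below reproduces that step and picks the most likely caption. The logit values are made up for illustration and are not real model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a cat", "a photo of a dog"]
example_logits = [24.5, 19.0]  # made-up scores, not real model output

probs = softmax(example_logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 4))
```

Because softmax only depends on differences between scores, a gap of a few points between two captions already yields a near-certain prediction.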

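✨ Key features

Highlights

TinyCLIP ViT-45M/32 uses only half the parameters of ViT-B/32 to achieve comparable zero-shot performance.
TinyCLIP ResNet-19M reduces the parameters by 50% while getting a 2x inference speedup, and obtains 56.4% accuracy on ImageNet.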

Model zoo

Property	Details
Model type	TinyCLIP model family, with variants of different parameter counts and architectures (e.g. ViT, ResNet)
Training data	YFCC-15M, LAION-400M, LAION+YFCC-400M

Model	Weight inheritance	Pre-training data	ImageNet-1K Acc@1 (%)	MACs (G)	Throughput (pairs/s)
TinyCLIP ViT-39M/16 Text-19M	manual	YFCC-15M	63.5	9.5	1,469
TinyCLIP ViT-8M/16 Text-3M	manual	YFCC-15M	41.1	2.0	4,150
TinyCLIP ResNet-30M Text-29M	manual	LAION-400M	59.1	6.9	1,811
TinyCLIP ResNet-19M Text-19M	manual	LAION-400M	56.4	4.4	3,024
TinyCLIP ViT-61M/32 Text-29M	manual	LAION-400M	62.4	5.3	3,191
TinyCLIP ViT-40M/32 Text-19M	manual	LAION-400M	59.8	3.5	4,641
TinyCLIP ViT-63M/32 Text-31M	auto	LAION-400M	63.9	5.6	2,905
TinyCLIP ViT-45M/32 Text-18M	auto	LAION-400M	61.4	3.7	3,682
TinyCLIP ViT-22M/32 Text-10M	auto	LAION-400M	53.7	1.9	5,504
TinyCLIP ViT-63M/32 Text-31M	auto	LAION+YFCC-400M	64.5	5.6	2,909
TinyCLIP ViT-45M/32 Text-18M	auto	LAION+YFCC-400M	62.7	1.9	3,685

Note: The configs of models with auto inheritance are generated automatically.
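
One practical way to read the table is to pick the most accurate model that fits a compute budget. The sketch below does this for the six manually inherited models, with the numbers copied from the table above. The `Entry` type and the `best_under_budget` helper are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    name: str
    top1: float       # ImageNet-1K top-1 accuracy, %
    gmacs: float      # multiply-accumulate operations, G
    throughput: int   # image-text pairs per second

# Manually inherited models, copied from the table above.
MANUAL_MODELS = [
    Entry("TinyCLIP ViT-39M/16 Text-19M", 63.5, 9.5, 1469),
    Entry("TinyCLIP ViT-8M/16 Text-3M", 41.1, 2.0, 4150),
    Entry("TinyCLIP ResNet-30M Text-29M", 59.1, 6.9, 1811),
    Entry("TinyCLIP ResNet-19M Text-19M", 56.4, 4.4, 3024),
    Entry("TinyCLIP ViT-61M/32 Text-29M", 62.4, 5.3, 3191),
    Entry("TinyCLIP ViT-40M/32 Text-19M", 59.8, 3.5, 4641),
]

def best_under_budget(models, max_gmacs) -> Optional[Entry]:
    """Return the most accurate model within the MACs budget, or None."""
    candidates = [m for m in models if m.gmacs <= max_gmacs]
    return max(candidates, key=lambda m: m.top1, default=None)

print(best_under_budget(MANUAL_MODELS, 6.0))
print(best_under_budget(MANUAL_MODELS, 4.0))
```

The same filter could use `throughput` instead of `gmacs` when latency matters more than theoretical compute.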

📚 Documentation

Official PyTorch implementation

https://github.com/microsoft/Cream/tree/main/TinyCLIP

Citation

If this repo is helpful for you, please consider citing it. Thank you!

@InProceedings{tinyclip,
    title     = {TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance},
    author    = {Wu, Kan and Peng, Houwen and Zhou, Zhenghong and Xiao, Bin and Liu, Mengchen and Yuan, Lu and Xuan, Hong and Valenzuela, Michael and Chen, Xi (Stephen) and Wang, Xinggang and Chao, Hongyang and Hu, Han},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21970-21980}
}