TinyCLIP: an open-source cross-modal model for language-image matching that balances speed and accuracy

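TinyCLIP-ViT-39M-16-Text-19M-YFCC15M

Developed by wkcn

TinyCLIP is a cross-modal distillation method for large-scale language-image pre-trained models. It combines affinity mimicking with weight inheritance to strike a trade-off between speed and accuracy.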

Zero-shot image classification

Transformers

License: MIT #zero-shot-classification #cross-modal-distillation #lightweight-CLIP

Released: 12/19/2023

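Model overview

TinyCLIP is a cross-modal distillation method that unlocks the potential of small CLIP models through affinity mimicking and weight inheritance. It draws on both large-scale teacher models and their pre-training data, and is suited to zero-shot image classification.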

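Model features

Affinity mimicking

The small model is trained to mimic the cross-modal (image-text) affinity relationships of a large CLIP model, which improves its performance.

Weight inheritance

Weights are inherited from the large model, either automatically or manually, which speeds up training and improves the resulting model.

Efficient inference

For example, TinyCLIP ResNet-19M cuts the parameter count by 50% and runs inference 2x faster while still reaching 56.4% accuracy on ImageNet.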
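
As a rough illustration of the affinity mimicking idea described above, the sketch below computes image-to-text and text-to-image affinities for a teacher and a student, then measures how far the student is from the teacher. It assumes that affinity is a temperature-scaled softmax over cosine similarities and that the student is penalized with cross-entropy against the teacher's affinity; the function names, the temperature of 0.07, and the toy embeddings are all hypothetical, and the official implementation may differ in its details.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def affinity(src_embs, dst_embs, temperature=0.07):
    """Row-wise softmax over temperature-scaled cosine similarities (src -> dst)."""
    src = [l2_normalize(v) for v in src_embs]
    dst = [l2_normalize(v) for v in dst_embs]
    rows = []
    for s in src:
        logits = [sum(a * b for a, b in zip(s, d)) / temperature for d in dst]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def mimicking_loss(teacher_aff, student_aff, eps=1e-12):
    """Mean cross-entropy of the student affinity against the teacher affinity."""
    loss = 0.0
    for t_row, s_row in zip(teacher_aff, student_aff):
        loss -= sum(t * math.log(s + eps) for t, s in zip(t_row, s_row))
    return loss / len(teacher_aff)

# Hypothetical toy embeddings: a batch of 2 images and 2 captions.
teacher_img = [[1.0, 0.0], [0.0, 1.0]]
teacher_txt = [[0.9, 0.1], [0.2, 0.8]]
student_img = [[0.8, 0.3], [0.1, 0.9]]
student_txt = [[1.0, 0.2], [0.3, 0.7]]

t_aff = affinity(teacher_img, teacher_txt)   # image -> text, teacher
s_aff = affinity(student_img, student_txt)   # image -> text, student
i2t = mimicking_loss(t_aff, s_aff)
t2i = mimicking_loss(affinity(teacher_txt, teacher_img),
                     affinity(student_txt, student_img))
loss = i2t + t2i
print(round(loss, 4))
```

Because the affinity matrices are batch-by-batch in size, the teacher and student embeddings do not need to share the same dimensionality for this comparison to work.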

Model capabilities

Zero-shot image classification

Cross-modal retrieval

Image-text matching

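Use cases

Image classification

Animal recognition

Identify the animal category shown in an image.

Depending on the variant, the TinyCLIP models in the model zoo reach 41.1% to 64.5% zero-shot top-1 accuracy on ImageNet-1K; this checkpoint reaches 63.5%.

Content retrieval

Image-text matching

Retrieve relevant images from a text description.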
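
Text-to-image retrieval of this kind usually comes down to ranking image embeddings by their similarity to a text embedding. The sketch below shows that ranking step with cosine similarity. The 3-dimensional vectors, file names, and the `retrieve` helper are hypothetical stand-ins; with Transformers, real embeddings could be obtained from `CLIPModel.get_text_features` and `CLIPModel.get_image_features`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(text_emb, image_embs, top_k=2):
    """Return the top_k (image_id, score) pairs, most similar first."""
    scored = [(image_id, cosine(text_emb, emb)) for image_id, emb in image_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical image embeddings; real ones would come from the image encoder.
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Stands in for the text embedding of "a photo of a cat".
query = [0.8, 0.2, 0.1]

results = retrieve(query, gallery)
for image_id, score in results:
    print(image_id, round(score, 3))
```

For a large gallery, the image embeddings would normally be computed once and cached, so that only the text query needs to be encoded at search time.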

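🚀 TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

TinyCLIP is a novel cross-modal distillation method for large-scale language-image pre-trained models. It introduces two core techniques: affinity mimicking and weight inheritance. The work unlocks the capacity of small CLIP models by fully leveraging both large-scale models and their pre-training data, with the goal of the best trade-off between speed and accuracy.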
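
To picture what weight inheritance means in practice, the toy sketch below builds a smaller student by copying a subset of a teacher's weights instead of initializing from scratch. The selection rule used here (keep the first layers and the first channels) is purely an illustrative assumption, as are the helper names `inherit_layers` and `inherit_channels`; it is not the rule used by the authors, whose manual and automatic procedures are defined in the paper and the official implementation.

```python
def inherit_layers(teacher_layers, keep):
    """Copy the weights of the first `keep` teacher layers into a shallower student."""
    if not 0 < keep <= len(teacher_layers):
        raise ValueError("keep must be between 1 and the number of teacher layers")
    # Copy each row so later student updates do not modify the teacher.
    return [[row[:] for row in layer] for layer in teacher_layers[:keep]]

def inherit_channels(layer, width):
    """Keep the first `width` rows and columns of one square weight matrix."""
    return [row[:width] for row in layer[:width]]

# Hypothetical teacher: 4 layers, each a 4x4 weight matrix with recognizable values.
teacher = [
    [[float(layer_idx * 100 + r * 10 + c) for c in range(4)] for r in range(4)]
    for layer_idx in range(4)
]

# Student: half the depth and half the width, initialized from teacher weights.
student = [inherit_channels(layer, width=2) for layer in inherit_layers(teacher, keep=2)]
print(len(student), len(student[0]), len(student[0][0]))
```

The inherited student then continues training (for example, with a distillation objective), starting from weights that already encode part of what the teacher learned.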

🚀 Quick start

Use with the Transformers library

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")
processor = CLIPProcessor.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
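
To make the last two lines of the example concrete, the sketch below reproduces the softmax step without loading the model and then picks the most likely label. The similarity scores are hypothetical numbers chosen for illustration, not real model outputs.

```python
import math

def softmax(logits):
    """Convert similarity scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a cat", "a photo of a dog"]
# Hypothetical image-text similarity scores for a single image.
logits_per_image = [24.6, 19.3]

probs = softmax(logits_per_image)
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 3))
```

Zero-shot classification then amounts to writing one text prompt per candidate class and returning the prompt with the highest probability.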

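✨ Key features

Highlights

TinyCLIP ViT-45M/32 uses only half the parameters of ViT-B/32 to achieve comparable zero-shot performance.
TinyCLIP ResNet-19M reduces the parameter count by 50% while delivering a 2x inference speedup, and obtains 56.4% accuracy on ImageNet.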

Model zoo

Property	Details
Model type	TinyCLIP family, with variants at different parameter scales and architectures (ViT, ResNet)
Training data	YFCC-15M, LAION-400M, LAION+YFCC-400M

Model	Weight inheritance	Pre-training data	ImageNet-1K Acc@1 (%)	MACs (G)	Throughput (pairs/s)	Link
TinyCLIP ViT-39M/16 Text-19M	manual	YFCC-15M	63.5	9.5	1,469	Model
TinyCLIP ViT-8M/16 Text-3M	manual	YFCC-15M	41.1	2.0	4,150	Model
TinyCLIP ResNet-30M Text-29M	manual	LAION-400M	59.1	6.9	1,811	Model
TinyCLIP ResNet-19M Text-19M	manual	LAION-400M	56.4	4.4	3,024	Model
TinyCLIP ViT-61M/32 Text-29M	manual	LAION-400M	62.4	5.3	3,191	Model
TinyCLIP ViT-40M/32 Text-19M	manual	LAION-400M	59.8	3.5	4,641	Model
TinyCLIP ViT-63M/32 Text-31M	auto	LAION-400M	63.9	5.6	2,905	Model
TinyCLIP ViT-45M/32 Text-18M	auto	LAION-400M	61.4	3.7	3,682	Model
TinyCLIP ViT-22M/32 Text-10M	auto	LAION-400M	53.7	1.9	5,504	Model
TinyCLIP ViT-63M/32 Text-31M	auto	LAION+YFCC-400M	64.5	5.6	2,909	Model
TinyCLIP ViT-45M/32 Text-18M	auto	LAION+YFCC-400M	62.7	1.9	3,685	Model

Note: the configurations of models with auto inheritance are generated automatically.
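
One way to read the table is as a menu for a deployment budget: pick the most accurate variant that still meets a throughput or compute constraint. The sketch below transcribes the rows as listed and applies such a filter. The helper `best_under_budget` and the example thresholds are hypothetical, and the throughput figures presumably depend on the hardware used for the original measurements.

```python
# (name, pre-training data, ImageNet-1K Acc@1 %, MACs in G, throughput in pairs/s)
MODELS = [
    ("TinyCLIP ViT-39M/16 Text-19M", "YFCC-15M", 63.5, 9.5, 1469),
    ("TinyCLIP ViT-8M/16 Text-3M", "YFCC-15M", 41.1, 2.0, 4150),
    ("TinyCLIP ResNet-30M Text-29M", "LAION-400M", 59.1, 6.9, 1811),
    ("TinyCLIP ResNet-19M Text-19M", "LAION-400M", 56.4, 4.4, 3024),
    ("TinyCLIP ViT-61M/32 Text-29M", "LAION-400M", 62.4, 5.3, 3191),
    ("TinyCLIP ViT-40M/32 Text-19M", "LAION-400M", 59.8, 3.5, 4641),
    ("TinyCLIP ViT-63M/32 Text-31M", "LAION-400M", 63.9, 5.6, 2905),
    ("TinyCLIP ViT-45M/32 Text-18M", "LAION-400M", 61.4, 3.7, 3682),
    ("TinyCLIP ViT-22M/32 Text-10M", "LAION-400M", 53.7, 1.9, 5504),
    ("TinyCLIP ViT-63M/32 Text-31M", "LAION+YFCC-400M", 64.5, 5.6, 2909),
    ("TinyCLIP ViT-45M/32 Text-18M", "LAION+YFCC-400M", 62.7, 1.9, 3685),
]

def best_under_budget(models, max_macs=None, min_throughput=None):
    """Return the most accurate model that satisfies the given limits, or None."""
    candidates = [
        m for m in models
        if (max_macs is None or m[3] <= max_macs)
        and (min_throughput is None or m[4] >= min_throughput)
    ]
    return max(candidates, key=lambda m: m[2]) if candidates else None

# Example: require at least 3,000 image-text pairs per second.
choice = best_under_budget(MODELS, min_throughput=3000)
print(choice[0], choice[1], choice[2])
```

Swapping `min_throughput` for `max_macs` gives the same selection under a compute budget instead of a speed requirement.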

📚 Documentation

Official PyTorch implementation

https://github.com/microsoft/Cream/tree/main/TinyCLIP

Citation

If this repository is helpful to you, please consider citing it. Thank you!

@InProceedings{tinyclip,
    title     = {TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance},
    author    = {Wu, Kan and Peng, Houwen and Zhou, Zhenghong and Xiao, Bin and Liu, Mengchen and Yuan, Lu and Xuan, Hong and Valenzuela, Michael and Chen, Xi (Stephen) and Wang, Xinggang and Chao, Hongyang and Hu, Han},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21970-21980}
}