CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft: an open-source model for image classification and text retrieval


CLIP Convnext Large D 320.laion2B S29b B131k Ft

Developed by LAION

A CLIP model with a ConvNeXt-Large image tower, trained on the LAION-2B dataset, for zero-shot image classification and image-text retrieval.


License: MIT #zero-shot image classification #high-resolution visual understanding #multimodal retrieval

Downloads: 3,810

Released: 2/11/2023

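Model overview

The model uses ConvNeXt-Large as its vision encoder, with extra text depth and an MLP head on the vision tower. It was fine-tuned at 320x320 resolution and is suited to zero-shot image classification and cross-modal retrieval.

Model features

High-resolution processing

Fine-tuned at 320x320 resolution. At that resolution it is more efficient than OpenAI's L/14 model fine-tuned at 336x336, which has 2.5x the GMACs, 2.8x the activations, and 1.22x the parameters.

Vision MLP head

The vision tower uses an MLP (fc-gelu-drop-fc) head instead of a single projection, which is meant to improve feature expressiveness.

Large-scale training data

Trained on LAION-2B, the 2-billion-sample English subset of LAION-5B, which covers a wide range of visual concepts.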

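Model capabilities

Zero-shot image classification

Image-text retrieval

Cross-modal representation learning

Use cases

Image understanding

Zero-shot image classification

Classifies images into new categories without fine-tuning.

Reaches 76.6% zero-shot top-1 accuracy on ImageNet-1k.

Cross-modal retrieval

Image-text retrieval systems

Builds image retrieval systems driven by natural-language queries.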

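🚀 CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft model card

CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft is a CLIP model for zero-shot image classification that pairs a ConvNeXt-Large image tower with a transformer text tower. It can be used for image and text retrieval, image generation guidance, and similar tasks, and gives researchers a tool for exploring zero-shot image classification.

🚀 Quick start

This model is mainly intended for zero-shot image classification and image-text retrieval. To run zero-shot image classification:

Make sure the open_clip library is installed.
Load the model and run inference.

A simple example: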

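import open_clip
import torch
from PIL import Image

# Load the model, preprocessing transform, and tokenizer from the Hugging Face Hub
model_id = 'hf-hub:laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft'
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()  # inference mode: disables dropout and stochastic depth

# Prepare one image and the candidate label prompts
image = preprocess(Image.open("your_image.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

# Encode, L2-normalize, then softmax over scaled cosine similarities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # model(image, text) returns features rather than logits, so build the logits here;
    # 100.0 is the scale used in the open_clip README example
    logits_per_image = 100.0 * image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)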
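
To make the scoring step concrete, here is a minimal standard-library sketch of how CLIP-style label probabilities follow from embeddings: normalize, take scaled cosine similarities, and apply a softmax. The 3-dimensional vectors are made-up toy values, and the scale of 100.0 is an assumption borrowed from common open_clip usage, not a value read from this checkpoint.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several prompts."""
    img = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_vecs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (hypothetical): the image points mostly along the first prompt.
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = zero_shot_probs(image_vec, text_vecs)
print(probs)
```

Because the similarities are multiplied by a large scale before the softmax, even a modest gap in cosine similarity produces a sharply peaked distribution.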

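✨ Key features

Multimodal: handles both images and text, enabling zero-shot image classification and image-text retrieval.
Efficient: at 320x320 resolution, the ConvNext-Large-D model is more efficient than OpenAI's fine-tuned L/14-336 model, with fewer GMACs, activations, and parameters.
Broadly applicable: usable for downstream tasks such as image classification fine-tuning, linear-probe image classification, and image generation guidance.

📦 Installation

To use this model, install the open_clip library: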

pip install open_clip_torch

💻 Usage examples

Basic usage

The basic zero-shot image classification example is identical to the one in Quick start above.

Advanced usage

For image and text retrieval, see the following example:

import open_clip
import torch
from PIL import Image

# Load the model, preprocessing transform, and tokenizer from the Hugging Face Hub
model_id = 'hf-hub:laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft'
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()  # inference mode

# Encode the image collection as one batch of L2-normalized embeddings
image_paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
images = torch.stack([preprocess(Image.open(path)) for path in image_paths])
with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Encode the text query the same way
query_text = tokenizer(["a photo of a beautiful landscape"])
with torch.no_grad():
    text_features = model.encode_text(query_text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every image, highest first
similarities = (image_features @ text_features.T).squeeze(1)
sorted_indices = torch.argsort(similarities, descending=True).tolist()

# Print the retrieval results
for i in sorted_indices:
    print(f"Image: {image_paths[i]}, Similarity: {similarities[i].item():.4f}")

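📚 Detailed documentation

Model details

This model belongs to a series of CLIP ConvNeXt-Large models trained with OpenCLIP on LAION-2B, the English subset of LAION-5B. The models use:

Image tower: the ConvNeXt-Large model (convnext_large) from [timm](https://github.com/rwightman/pytorch-image-models).
Vision tower head: an MLP (fc-gelu-drop-fc) head rather than the single projection used in other CLIP models.
Text tower: same width as the ViT-L / RN50x16 models but 4 layers deeper (depth 16, embedding dimension 768).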
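
As an illustration of the fc-gelu-drop-fc ordering named above, the following is a minimal pure-Python sketch of such a head. It is not the OpenCLIP or timm implementation; the layer sizes, the random initialization, and the helper names (`linear`, `gelu`, `dropout`, `mlp_head`) are all hypothetical and chosen only to keep the example small.

```python
import math
import random

def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]

def dropout(x, p, training, rng):
    """Inverted dropout; acts as the identity at inference time."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

def mlp_head(x, fc1, fc2, drop_p=0.0, training=False, rng=random):
    """Apply fc -> gelu -> drop -> fc to a pooled feature vector."""
    hidden = gelu(linear(x, *fc1))
    hidden = dropout(hidden, drop_p, training, rng)
    return linear(hidden, *fc2)

def init_layer(n_out, n_in, rng):
    """Hypothetical toy initialization: small uniform weights, zero bias."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

# Toy sizes (assumed): 4 pooled features -> 8 hidden units -> 3 embedding dims.
seeded = random.Random(0)
fc1 = init_layer(8, 4, seeded)
fc2 = init_layer(3, 8, seeded)
pooled = [0.2, -1.0, 0.5, 0.7]
embedding = mlp_head(pooled, fc1, fc2)
print(len(embedding))
```

A single-projection head would be just one `linear` call; the extra nonlinearity between two fully connected layers is what distinguishes the MLP variant.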

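This 320x320 model is a higher-resolution fine-tune of [CLIP-convnext_large_d.laion2B-s26B-b102K-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg). It was fine-tuned from the final checkpoint of the original 256x256 training run with an additional ~2.5B samples and a lower learning rate.

At 320x320, the ConvNext-Large-D model is more efficient than OpenAI's L/14 model fine-tuned at 336x336: the L/14-336 model has 2.5x the GMACs, 2.8x the activations, and 1.22x the parameters.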
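
Read the other way around, those ratios give the relative cost of this model. A small sketch of the arithmetic, using only the three ratios quoted above (the absolute GMAC, activation, and parameter counts are not given here, so none are assumed):

```python
# How many times larger L/14-336 is, per the ratios stated in the text.
l14_336_ratios = {"GMACs": 2.5, "activations": 2.8, "parameters": 1.22}

# Invert each ratio to express ConvNext-Large-D at 320x320 as a fraction of L/14-336.
relative_cost = {name: 1.0 / ratio for name, ratio in l14_336_ratios.items()}

for name, fraction in relative_cost.items():
    print(f"{name}: {fraction:.0%} of L/14-336")
```

So the compute and activation memory savings are large, while the parameter count is only modestly smaller.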

| Model | Dataset | Resolution | AugReg | Top-1 ImageNet zero-shot (%) |
| --- | --- | --- | --- | --- |
| [convnext_large_d.laion2b_s26b_b102k-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1) | 75.9 |
| [convnext_large_d_320.laion2b_s29b_b131k-ft](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0) | 76.6 |
| [convnext_large_d_320.laion2b_s29b_b131k-ft-soup](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0) | 76.9 |

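RRC = Random Resize Crop (crop percentages), RE = Random Erasing (probability), SD = Stochastic Depth (probability), image tower only, D = Dropout (probability), image tower head only.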
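
The RRC and RE values for the 320x320 models appear to correspond to the `scale` and `re_prob` entries passed through `--aug-cfg` in the training command in the Training procedure section. The sketch below parses that argument string into a dictionary so the correspondence can be checked. `parse_aug_cfg` is a hypothetical helper written for this illustration, not OpenCLIP's own argument parser.

```python
import ast
import shlex

def parse_aug_cfg(arg_string):
    """Parse space-separated key=value pairs into a dict of Python literals."""
    cfg = {}
    for token in shlex.split(arg_string):  # shlex strips the shell-style quotes
        key, _, raw = token.partition("=")
        try:
            cfg[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            cfg[key] = raw  # keep values that are not Python literals as strings
    return cfg

# The --aug-cfg value from the training command.
cfg = parse_aug_cfg("use_timm=True scale='(0.5, 1.0)' re_prob=0.4")
print(cfg)

# Compare against the table row for the 320x320 fine-tune: RRC (0.5, 1.0), RE (0.4).
matches_table = cfg["scale"] == (0.5, 1.0) and cfg["re_prob"] == 0.4
print(matches_table)
```

SD and D do not appear in `--aug-cfg`; the text above describes them as properties of the image tower and its head rather than of the data augmentation.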

LAION-A is a ~900M-sample subset of LAION-2B, filtered by aesthetic score and deduplicated with pHash.

Model training was done by Ross Wightman on the stability.ai cluster.

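Uses

Direct use

Zero-shot image classification
Image and text retrieval

Downstream use

Fine-tuning for image classification and other image tasks
Linear-probe image classification
Guiding and conditioning image generation

Out-of-scope use

Deployment: any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless the model has been thoroughly tested in-domain with a specific, fixed class taxonomy. Our safety assessment showed a strong need for task-specific testing, given how much CLIP's performance varies across class taxonomies. Untested and unconstrained deployment of the model in any use case is therefore currently potentially harmful.
Surveillance and facial recognition: use cases in the domain of surveillance and facial recognition are always out of scope. Using artificial intelligence for such tasks may be premature, given the current lack of testing norms and checks to ensure fair use.
Non-English languages: because the model was trained and evaluated only on English, its use should be limited to English-language use cases.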

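Training details

Training data

This model was trained with LAION-2B, a 2-billion-sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/).

IMPORTANT NOTE: the dataset was created to advance research and experimentation on large-scale multimodal model training and on handling uncurated, large-scale datasets crawled from the public internet. We therefore recommend using the dataset for research purposes. Be aware that this large-scale dataset is uncurated, and the collected links may lead to strongly discomforting and disturbing content. Please use the demo links with caution and at your own risk. A "safe" subset can be extracted by filtering samples on the safety tags (produced by a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility that harmful content is still present in safe mode, so the warning applies there as well. We believe that providing the dataset openly to the broad research community and other interested communities allows transparent investigation of the benefits of training large-scale models, as well as of the pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets. However, we do not recommend using the dataset to create ready-to-go industrial products, because basic research on the general properties and safety of such large-scale models is still in progress.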

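Training procedure

All 320x320 model fine-tunes used a global batch size of 131072 and ran for 10-16 checkpoint intervals of 203.7M samples each, for a total of ~2-3B samples seen during fine-tuning.

The 320x320 models were trained on 64 8-GPU (A100 40GB) nodes (Stability) using the following slurm script with srun: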

/opt/slurm/sbin/srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "convnext_large_320" \
    --pretrained "/runs/convnext_large_256/epoch_128.pt" \
    --resume 'latest' \
    --train-data="pipe:aws s3 cp s3://mybucket/path/laion{00000..xxxxx}.tar -" \
    --train-num-samples 203666042 \
    --dataset-type webdataset \
    --precision amp_bfloat16 \
    --beta2 0.98 \
    --warmup 2000 \
    --batch-size=256 \
    --epochs=12 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.5, 1.0)' re_prob=0.4 \
    --clip-grad-norm 5.0 \
    --lr 5e-5 \
    --workers=6 \
    --model "convnext_large_d_320" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing
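
The numbers in the command can be cross-checked against the stated global batch size and sample totals. The sketch below assumes that `--batch-size` is the per-GPU batch size (as in OpenCLIP) and that each epoch corresponds to one checkpoint interval of `--train-num-samples` samples; the variable names are illustrative only.

```python
# Values taken from the training description and command.
nodes = 64
gpus_per_node = 8
per_gpu_batch = 256                  # --batch-size (assumed to be per GPU)
samples_per_interval = 203_666_042   # --train-num-samples
intervals = 12                       # --epochs

# Global batch size is the per-GPU batch multiplied by the total GPU count.
global_batch = nodes * gpus_per_node * per_gpu_batch
print(global_batch)

# Total samples seen during this fine-tuning run.
total_samples = samples_per_interval * intervals
print(f"{total_samples / 1e9:.2f}B samples")
```

With 12 intervals the run lands inside the stated 10-16 interval range and the ~2-3B sample total.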

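Evaluation

Testing data, factors and metrics

Testing data

Classification testing uses VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets); retrieval testing uses COCO and Flickr.

Results

The models achieve between 75.9% and 76.9% zero-shot top-1 accuracy on ImageNet-1k.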
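
For reference, top-1 accuracy is the percentage of samples whose highest-scoring class matches the true label. A minimal sketch with made-up scores and labels (the actual evaluation pipeline is not described here, so `top1_accuracy` is only an illustration of the metric):

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the true label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)

# Hypothetical per-class scores for four images over three classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
]
labels = [0, 2, 1, 1]
accuracy = top1_accuracy(scores, labels)
print(accuracy)
```

In the zero-shot setting, each row of scores would come from comparing one image embedding with the text embeddings of all class prompts.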

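Zero-shot curve of the original from-scratch 256x256 training (figure not reproduced here).

An initial round of benchmarks has been run on a wider range of datasets; the results are available at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb.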

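🔧 Technical details

The model follows the CLIP architecture with a ConvNeXt-Large image tower from the timm library and an MLP head on top of the image features. The text tower keeps the same width but adds depth to strengthen the text representation. Training used the large-scale LAION-2B dataset with augmentations such as random resize crop and random erasing. Fine-tuning used a lower learning rate and a large global batch size (131072) to improve performance.

📄 License

This model is released under the MIT license.

Acknowledgements

Thanks to stability.ai for providing the compute used to train this model.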

Citation

BibTeX:

LAION-5B

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}