CLIP-convnext_xxlarge: an open-source model for zero-shot image classification

CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Rewind

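Developed by laion

A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, aimed at zero-shot image classification.

License: MIT #zero-shot image classification #large-scale vision-language model #high-accuracy ConvNeXt architecture

Downloads: 63

Released: 2/26/2023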

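Model overview

This is a large vision-language model that pairs a ConvNeXt-XXLarge image encoder with a text encoder for zero-shot image classification and image-text retrieval.

Model highlights

Large-scale ConvNeXt architecture

The image encoder is a ConvNeXt-XXLarge with 847M parameters, the largest released pretrained ConvNeXt model.

Strong zero-shot classification

Reaches 79.3% top-1 zero-shot accuracy on ImageNet-1k, which places it between ViT-g and ViT-G.

Efficient training

Trained with large-scale distributed training on up to 1024 GPUs for the main run (1088 for the final rewind), with a global batch size of 81920 to 95744.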

Model capabilities

Zero-shot image classification

Image-text retrieval

Image feature extraction

Text feature extraction

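Use cases

Computer vision

Image classification

Classify images without task-specific training

Reaches 79.3% accuracy on ImageNet-1k

Image-text retrieval

Search for images that match a text description, or for text that matches an image

Research

Multimodal learning research

Study representation learning in vision-language models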

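🚀 Model card for CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind

A series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models, intended to help researchers better understand and explore zero-shot, arbitrary image classification. They can also be used for related interdisciplinary research.

🚀 Quick start

The model is intended primarily for research. To use it for zero-shot image classification or image and text retrieval, see the model details below.

✨ Key features

Large-scale training: trained on LAION-2B (English), a subset of LAION-5B.
Strong results: 79.1% to 79.4% top-1 zero-shot accuracy on ImageNet, depending on the variant.
A first: the largest released pretrained ConvNeXt model, and the first CLIP model with a non-ViT image tower to exceed 79% ImageNet top-1 zero-shot accuracy.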

📦 Installation

The source model card gives no installation steps.

💻 Usage examples

The source model card gives no code examples.
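As a minimal sketch that is not taken from the model card, zero-shot classification with OpenCLIP could look like the following. It assumes the `open_clip_torch`, `torch`, and `Pillow` packages are installed, and that your OpenCLIP version ships the pretrained tag `laion2b_s34b_b82k_augreg_rewind` for the `convnext_xxlarge` architecture. The prompt template and the logit scale of 100 are illustrative choices, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_image(image_path, labels):
    """Zero-shot classify one image against free-form text labels.

    Heavy dependencies are imported lazily so softmax() stays usable
    without them. The model and tag names are assumptions; check
    open_clip.list_pretrained() for what your version provides.
    """
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "convnext_xxlarge", pretrained="laion2b_s34b_b82k_augreg_rewind"
    )
    tokenizer = open_clip.get_tokenizer("convnext_xxlarge")
    model.eval()

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([f"a photo of a {label}" for label in labels])

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Cosine similarity: L2-normalize both embeddings first.
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        logits = (100.0 * image_features @ text_features.T)[0].tolist()

    # Map each label to its probability under the softmax of the similarities.
    return dict(zip(labels, softmax(logits)))
```

A call such as `classify_image("cat.jpg", ["cat", "dog"])` (with a hypothetical image path) would return one probability per label. Note that the first call downloads the weights of a model with roughly 1.2B parameters.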

📚 Documentation

🔍 Model details

Model description

A series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models trained on LAION-2B (English) using OpenCLIP.

| Model | Dataset | Resolution | AugReg | ImageNet top-1 zero-shot accuracy (%) |
| --- | --- | --- | --- | --- |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1) | 79.1 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |

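RRC = Random Resize Crop (crop percentages), RE = Random Erasing (probability), SD = Stochastic Depth (probability); these apply to the image tower only.

The core training ran in stages over roughly 2 months at a global batch size of 81920. The last ~10% of training was then redone at a global batch size of 95744 with a higher learning rate and stronger augmentation, and the two results were finally averaged together.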
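The final averaging step can be pictured as an element-wise average of matching parameters from two checkpoints, in the spirit of the model-soup paper listed in the citations. The sketch below uses plain Python lists as stand-ins for tensors; the parameter names and values are made up, and the source does not state the exact weighting that was used, so the uniform 0.5 default is an assumption.

```python
def average_checkpoints(state_a, state_b, weight_a=0.5):
    """Element-wise weighted average of two checkpoints with identical keys."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    weight_b = 1.0 - weight_a
    return {
        name: [weight_a * a + weight_b * b for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }


# Toy stand-ins for the original finish and the rewind (hypothetical values).
original = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
rewind = {"proj.weight": [3.0, 4.0], "proj.bias": [1.0]}

soup = average_checkpoints(original, rewind)
print(soup)
```

Because the averaged model has the same shape as either input, it costs nothing extra at inference time.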

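The goal was to push the largest convolutional CLIP image tower into the ViT-g to ViT-G performance range, with improved image-size scaling for downstream use.

Model characteristics:

Uses the [timm](https://github.com/rwightman/pytorch-image-models) ConvNeXt-XXLarge model as the image tower.
A standard projection at the end of the image tower.
A text tower of the same size as in the ViT-H-14 and ViT-g-14 models (width 1024, 16 heads, depth 24).

The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B parameters, requires 222 GMACs, and produces 146 MActs (million activations).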

| Model | Image size | Embedding dim | GMACs | MActs | Params (M) | Image GMACs | Image MActs | Image params (M) | Text GMACs | Text MActs | Text params (M) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-H-16 | 224 | 1024 | 150.96 | 122.01 | 986.26 | 127.4 | 100.81 | 632.23 | 23.57 | 21.2 | 354.03 |
| ViT-H-14 | 224 | 1024 | 190.97 | 160.61 | 986.11 | 167.4 | 139.41 | 632.08 | 23.57 | 21.2 | 354.03 |
| ViT-L-14-336 | 336 | 768 | 197.76 | 278.19 | 427.94 | 191.1 | 270.24 | 304.29 | 6.66 | 7.95 | 123.65 |
| convnext_xxlarge | 256 | 1024 | 221.66 | 145.66 | 1200.58 | 198.09 | 124.45 | 846.54 | 23.57 | 21.2 | 354.03 |
| RN50x64 | 448 | 1024 | 276.8 | 249.73 | 623.26 | 265.02 | 239.13 | 420.38 | 11.78 | 10.6 | 202.88 |
| ViT-g-14 | 224 | 1024 | 290.74 | 213.84 | 1366.68 | 267.18 | 192.64 | 1012.65 | 23.57 | 21.2 | 354.03 |
| convnext_xxlarge_320 | 320 | 1024 | 333.08 | 215.66 | 1200.58 | 309.52 | 194.46 | 846.54 | 23.57 | 21.2 | 354.03 |
| ViT-H-14-336 | 336 | 1024 | 414.53 | 428.74 | 986.52 | 390.97 | 407.54 | 632.49 | 23.57 | 21.2 | 354.03 |
| ViT-bigG-14 | 224 | 1280 | 532.92 | 310.71 | 2539.57 | 483.96 | 275.37 | 1844.91 | 48.96 | 35.34 | 694.66 |
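As a quick consistency check on the table, each model's total should equal the sum of its image-tower and text-tower figures. The sketch below copies two rows from the table and reports the residual, which should stay within rounding error. The dictionary layout and function name are only illustrative.

```python
# (total, image tower, text tower), copied from the table rows.
ROWS = {
    "convnext_xxlarge": {
        "GMACs": (221.66, 198.09, 23.57),
        "Params (M)": (1200.58, 846.54, 354.03),
    },
    "ViT-g-14": {
        "GMACs": (290.74, 267.18, 23.57),
        "Params (M)": (1366.68, 1012.65, 354.03),
    },
}


def tower_residual(total, image, text):
    """Difference between the reported total and the sum of both towers."""
    return total - (image + text)


for model, metrics in ROWS.items():
    for metric, (total, image, text) in metrics.items():
        residual = tower_residual(total, image, text)
        print(f"{model:18s} {metric:11s} residual = {abs(residual):.2f}")
```

The residuals come out at about 0.01 or less, so the totals are simply the two towers added together, with the image tower dominating both compute and parameter count.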

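The model was trained by Ross Wightman on both the stability.ai cluster and the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer.

🔍 Uses

Direct use

Zero-shot image classification, image and text retrieval, among others.

Downstream use

Fine-tuning for image classification and other image tasks, linear-probe image classification, image generation guiding and conditioning, among others.

Out-of-scope use

Any deployed use case of the model, commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless the model has been thoroughly tested in-domain with a specific, fixed class taxonomy. Safety assessments have shown that CLIP's performance varies widely across class taxonomies, which makes untested and unconstrained deployment potentially harmful.

Use cases in surveillance and facial recognition are always out of scope. Using AI for such tasks is currently premature, given the lack of testing norms and checks to ensure fair use.

Because the model has been trained and evaluated only on English data, its use should be limited to English-language use cases.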

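🔍 Training details

Training data

The model was trained on LAION-2B, a 2-billion-sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/).

⚠️ Important note

The dataset was created to support research and experimentation on large-scale multimodal model training and on handling uncurated, large-scale datasets. It is recommended to use the dataset for research purposes only. This large-scale dataset is uncurated, and the collected links may lead to strongly discomforting and disturbing content. A "safe" subset can be extracted by filtering samples on their safety tags (using a custom-trained NSFW classifier), but the presence of harmful content cannot be entirely ruled out. Using the dataset to build industrial products is not recommended, because research on the basic properties and safety of such large-scale models is still in progress.

Training procedure

The main training run used a global batch size of 81920 for 256 checkpoint intervals of 135.6M samples each, for a total of ~34B samples seen over training.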
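The sample budget can be reproduced from the numbers in the `srun` command further down. This is a small arithmetic sketch: the constants are taken from this document, and the step count ignores any per-worker rounding that the training code may apply.

```python
SAMPLES_PER_INTERVAL = 135_646_078  # --train-num-samples in the srun command
INTERVALS = 256                     # --epochs in the srun command
GLOBAL_BATCH = 81_920               # main-run global batch size

total_samples = SAMPLES_PER_INTERVAL * INTERVALS
steps_per_interval = SAMPLES_PER_INTERVAL / GLOBAL_BATCH

print(f"total samples seen: {total_samples:,} (~{total_samples / 1e9:.1f}B)")
print(f"optimizer steps per interval: ~{steps_per_interval:.0f}")
```

The total lands at roughly 34.7B samples, which matches the `s34B` part of the model name.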

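Training ran into issues with model numerical stability as well as cluster stability and performance. Initial attempts with float16 AMP and the default Adam beta2 produced loss spikes and eventually NaN blow-ups. Reducing beta2 to 0.97 helped, but the loss and zero-shot curves still did not track as expected. After switching to PyTorch nightly builds, training with bfloat16 + AMP became possible; beta2 was returned to 0.98 and metrics improved.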
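One way to read the beta2 changes: Adam's second-moment estimate is an exponential moving average with decay beta2, so it effectively averages over roughly 1 / (1 - beta2) recent steps. A smaller beta2 therefore lets the estimate react faster when gradient magnitudes shift. The sketch below only illustrates that rule of thumb; it assumes "default" means PyTorch's Adam default of 0.999, which the source does not state.

```python
def ema_horizon(beta2):
    """Approximate number of recent steps an EMA with decay beta2 averages over."""
    return 1.0 / (1.0 - beta2)


# 0.999 is an assumed default; 0.98 and 0.97 are the values named in the text.
for beta2 in (0.999, 0.98, 0.97):
    print(f"beta2={beta2}: averages over ~{ema_horizon(beta2):.0f} steps")
```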

| Checkpoint interval | Cluster | # GPUs | # Nodes | GPU | Local batch size | Samples/sec | Samples/sec/GPU | Precision | Adam beta2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1-2 | Stability | 1024 | 128 | A100 40GB | 80 | 37-40k | 36-39 | amp + fp16 | 0.97 |
| 3-32 | Stability | 512 | 64 | A100 80GB | 160 | 27-32k | 52-62 | amp + fp16 | 0.97 |
| 33-75 | Booster | 1024 | 256 | A100 40GB | 80 | 48k | 47 | amp + fp16 | 0.97 |
| 76-165 | Booster | 1024 | 256 | A100 40GB | 80 | 51k | 50 | amp + bf16 | 0.98 |
| 166-232 | Stability | 320 | 40 | A100 80GB | 256 | 18-19k | 56-59 | amp + bf16 | 0.98 |
| 233-249 | Booster | 1024 | 256 | A100 40GB | 80 | 51k | 50 | amp + bf16 | 0.98 |
| 250-256 | Stability | 1024 | 128 | A100 40GB | 80 | 27-31k | 26-30 | amp + bf16 | 0.98 |

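JUWELS Booster has 4 A100 GPUs per node with 4 HDR-200 IB adapters per node (200 Gbit/s per GPU). The Stability setup used 8 A100 GPUs per node with 400 Gbit/s EFA networking per node (50 Gbit/s per GPU). Training efficiency (throughput per GPU) varied significantly across configurations, and the 1024-GPU configurations on both clusters were particularly prone to crashing.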
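The configurations differ in GPU count and local batch size, yet they are chosen so that the global batch size stays fixed. The sketch below recomputes the global batch and the per-GPU throughput for a few rows of the main-run table. Values are copied from the table, with "k" figures read as thousands, so the per-GPU results only approximately match the table's rounded column.

```python
# (checkpoint intervals, GPUs, local batch, samples/sec low, samples/sec high)
MAIN_RUN = [
    ("1-2", 1024, 80, 37_000, 40_000),
    ("3-32", 512, 160, 27_000, 32_000),
    ("33-75", 1024, 80, 48_000, 48_000),
    ("166-232", 320, 256, 18_000, 19_000),
]


def global_batch(gpus, local_batch):
    """Global batch size when every GPU processes local_batch samples per step."""
    return gpus * local_batch


def per_gpu_throughput(samples_per_sec, gpus):
    """Samples per second handled by each GPU."""
    return samples_per_sec / gpus


for intervals, gpus, local_batch, low, high in MAIN_RUN:
    print(
        f"intervals {intervals:8s} global batch {global_batch(gpus, local_batch)} "
        f"per-GPU {per_gpu_throughput(low, gpus):.1f} to "
        f"{per_gpu_throughput(high, gpus):.1f} samples/sec"
    )
```

Every main-run row multiplies out to 81920. The same product for the later rewind configuration, 1088 GPUs at a local batch size of 88, gives 95744.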

The slurm srun command line for the configuration with 128 8-GPU (40GB A100) nodes:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=80 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 1e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"

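For the rewind of the last 10% of training, a higher global batch size of 95744 was used, together with a higher learning rate and slightly stronger augmentation.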

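| Checkpoint interval | Cluster | # GPUs | # Nodes | GPU | Local batch size | Samples/sec | Samples/sec/GPU | Precision | Adam beta2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 231-256 | Stability | 1088 | 136 | A100 40GB | 88 | 32-35k | 29-32 | amp + bf16 | 0.98 |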

The slurm srun command line for 136 8-GPU (40GB A100) nodes:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-r-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=88 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.3, 1.0)' re_prob=0.4 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 2e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"

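🔍 Evaluation

Evaluation was done with code from the [LAION CLIP Benchmark suite](https://github.com/LAION-AI/CLIP_benchmark).

Testing data, factors and metrics

Testing data

Classification uses VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets); retrieval uses COCO and Flickr.

Results

The models achieve between 79.1% and 79.4% top-1 zero-shot accuracy on ImageNet-1k.

[Figure: ConvNeXt XXLarge zero-shot accuracy]

A zoomed-in view of the final 10% of training:

[Figure: ConvNeXt XXLarge zero-shot accuracy, zoomed in]

An initial round of benchmarks has been run on a wider range of datasets; the results can be viewed at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb.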

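🔍 Acknowledgements

Thanks to stability.ai and the Gauss Centre for Supercomputing (http://gauss-centre.eu) for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS supercomputer JUWELS Booster at the Jülich Supercomputing Centre (JSC).

🔍 Citation

BibTeX:

LAION-5B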

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

@InProceedings{pmlr-v162-wortsman22a,
  title = 	 {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author =       {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {23965--23998},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/wortsman22a.html}
}