CLIP-convnext_xxlarge: an open-source model for zero-shot image classification

CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Rewind

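Developed by laion

A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, focused on zero-shot image classification.

Category: text-to-image | License: MIT | Tags: #zero-shot image classification #large-scale vision-language model #high-accuracy ConvNeXt architecture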

Released: 2/26/2023

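Model Overview

A large vision-language model that pairs a ConvNeXt-XXLarge image encoder with a text encoder for zero-shot image classification and image-text retrieval.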

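Model Features

Large-scale ConvNeXt architecture

Uses the 847M-parameter ConvNeXt-XXLarge as the image encoder, the largest released pretrained ConvNeXt model.

Strong zero-shot classification

Reaches 79.3% top-1 zero-shot accuracy on ImageNet-1k, placing it between ViT-g and ViT-G.

Large-scale training

Trained with large-scale distributed training on up to 1088 GPUs, with a global batch size of 81920 to 95744.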

Model Capabilities

Zero-shot image classification

Image-text retrieval

Image feature extraction

Text feature extraction

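Use Cases

Computer vision

Image classification

Classifies images without task-specific training

Reaches 79.3% top-1 zero-shot accuracy on ImageNet-1k

Image-text retrieval

Finds images that match a text description, or text that matches an image

Research

Multimodal learning research

Supports research on representation learning in vision-language models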

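🚀 Model Card for CLIP-ConvNeXt_XXLarge-LAION2B-S34B-B82K-AugReg-Rewind

This model belongs to a series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models. It is intended to help researchers better understand and explore zero-shot, arbitrary image classification, and it can also support related interdisciplinary research.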

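🚀 Quick Start

The model is intended mainly for research. If you want to use it for zero-shot image classification, image and text retrieval, or similar tasks, the sections below describe the model in detail.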

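✨ Key Features

Large-scale training: trained on LAION-2B (English), a subset of LAION-5B.
Strong performance: 79.1% to 79.4% top-1 zero-shot accuracy on ImageNet-1k.
A first for non-ViT towers: the largest released pretrained ConvNeXt model, and the first CLIP model with a non-ViT image tower to exceed 79% ImageNet top-1 zero-shot accuracy.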

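📦 Installation

The source documentation does not give installation steps.

💻 Usage Examples

The source documentation does not include code examples.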
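As a minimal sketch, assuming the `open_clip_torch`, `torch`, and `Pillow` packages are installed and that the checkpoint is loaded through OpenCLIP's `hf-hub:` prefix with the repository id listed in the model table, zero-shot classification might look like the following. The prompt template and the `build_prompts` and `classify_image` helpers are illustrative assumptions, not part of the model card.

```python
MODEL_ID = "hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind"


def build_prompts(labels, template="a photo of a {}"):
    """Turn class names into text prompts (the template is an assumption)."""
    return [template.format(label) for label in labels]


def classify_image(image_path, labels, model_id=MODEL_ID):
    """Return {label: probability} for one image; downloads weights on first use."""
    # Imported lazily so build_prompts works without the heavy dependencies.
    import torch
    import open_clip
    from PIL import Image

    model, preprocess = open_clip.create_model_from_pretrained(model_id)
    tokenizer = open_clip.get_tokenizer(model_id)
    model.eval()

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(build_prompts(labels))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Cosine similarity: L2-normalise both embeddings before the dot product.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

    return dict(zip(labels, probs.tolist()))
```

A call such as `classify_image("example.jpg", ["cat", "dog", "bird"])` (with a placeholder file name) would return one probability per label. The model has about 1.2B parameters, so loading it needs several gigabytes of memory.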

📚 Documentation

🔍 Model Details

Model Description

A series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models trained on LAION-2B (English) using OpenCLIP.

Model	Dataset	Resolution	AugReg	ImageNet Top-1 Zero-Shot Accuracy (%)
[convnext_xxlarge.laion2b_s34b_b82k-augreg](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg)	LAION-2B	256x256	RRC (0.33, 1.0), RE (0.35), SD (0.1)	79.1
[convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind)	LAION-2B	256x256	RRC (0.3, 1.0), RE (0.4), SD (0.1)	79.3
[convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup)	LAION-2B	256x256	N/A	79.4

RRC = Random Resized Crop (crop scale range), RE = Random Erasing (probability), SD = Stochastic Depth (probability), applied to the image tower only.
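To make the RRC scale range concrete, here is a simplified sketch of how a random resized crop box could be sampled, in the style of common image-augmentation libraries. The aspect-ratio range of (3/4, 4/3), the ten retries, and the whole-image fallback are assumptions for illustration; this is not the timm implementation used in training.

```python
import math
import random


def sample_crop_box(width, height, scale=(0.3, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    """Sample (left, top, crop_w, crop_h) for a random resized crop.

    `scale` bounds the crop area as a fraction of the image area;
    `ratio` bounds the crop aspect ratio (assumed default).
    """
    area = width * height
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(10):  # retry a few times before falling back
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Simplified fallback: keep the whole image.
    return 0, 0, width, height


rng = random.Random(0)
box = sample_crop_box(256, 256, scale=(0.3, 1.0), rng=rng)
```

With `scale=(0.3, 1.0)`, each crop covers roughly 30% to 100% of the image area before being resized back to the training resolution.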

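The core training run was carried out in stages over roughly two months with a global batch size of 81920. The last ~10% of training was then redone at a global batch size of 95744 with a higher learning rate and stronger augmentation, and the two versions were averaged together.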
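Averaging two finishes of the same run amounts to averaging their weights parameter by parameter. The sketch below illustrates the idea on toy dictionaries of plain floats; the equal weighting and the `average_checkpoints` helper are assumptions, since the card only states that the two were averaged.

```python
def average_checkpoints(state_dicts, weights=None):
    """Average parameter values key by key across checkpoints.

    Each state dict maps a parameter name to a flat list of floats.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(weights) != len(state_dicts):
        raise ValueError("one weight per checkpoint is required")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints must share the same parameter names")

    averaged = {}
    for name in keys:
        # Walk the checkpoints in lockstep, one parameter element at a time.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [
            sum(w * value for w, value in zip(weights, column)) for column in columns
        ]
    return averaged


# Toy stand-ins for the original finish and the re-done ("rewind") finish.
original_finish = {"layer.weight": [0.2, -0.4], "layer.bias": [0.1]}
rewind_finish = {"layer.weight": [0.4, -0.2], "layer.bias": [0.3]}
soup = average_checkpoints([original_finish, rewind_finish])
```

Because the averaged model has the same shape as either input, inference cost is unchanged.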

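The goal is to push the largest convolutional CLIP image tower into the ViT-g to ViT-G performance range, with improved image-size scaling for downstream use.

Model characteristics:

The image tower is the ConvNeXt-XXLarge model from [timm](https://github.com/rwightman/pytorch-image-models).
A standard projection sits at the end of the image tower.
The text tower is the same size as in the ViT-H-14 and ViT-g-14 models (width 1024, 16 heads, depth 24).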
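As a rough check on the text tower size, the parameter count of a CLIP-style text transformer can be estimated from its width and depth. The vocabulary size (49408), context length (77), MLP ratio (4), and output embedding dimension (1024) are assumptions based on the standard CLIP text encoder, not values stated in this card.

```python
def text_tower_params(width=1024, depth=24, vocab_size=49408, context_length=77,
                      embed_dim=1024, mlp_ratio=4):
    """Estimate parameters of a CLIP-style text transformer (weights + biases)."""
    hidden = mlp_ratio * width
    attention = 4 * width * width + 4 * width      # q, k, v and output projections
    mlp = 2 * width * hidden + hidden + width      # two linear layers with biases
    layer_norms = 2 * 2 * width                    # two LayerNorms per block
    per_block = attention + mlp + layer_norms

    token_embedding = vocab_size * width
    positional_embedding = context_length * width
    final_layer_norm = 2 * width
    text_projection = width * embed_dim
    return (depth * per_block + token_embedding + positional_embedding
            + final_layer_norm + text_projection)


total = text_tower_params()
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the estimate comes to about 354.03M parameters. The number of attention heads does not change the count, because the heads split the same projection matrices.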

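The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B parameters, with a compute cost of 222 GMACs and 146 MActs of activations.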

Model	Image size	Embed dim	GMACs	MActs	M params	Image GMACs	Image MActs	Image M params	Text GMACs	Text MActs	Text M params
ViT-H-16	224	1024	150.96	122.01	986.26	127.4	100.81	632.23	23.57	21.2	354.03
ViT-H-14	224	1024	190.97	160.61	986.11	167.4	139.41	632.08	23.57	21.2	354.03
ViT-L-14-336	336	768	197.76	278.19	427.94	191.1	270.24	304.29	6.66	7.95	123.65
convnext_xxlarge	256	1024	221.66	145.66	1200.58	198.09	124.45	846.54	23.57	21.2	354.03
RN50x64	448	1024	276.8	249.73	623.26	265.02	239.13	420.38	11.78	10.6	202.88
ViT-g-14	224	1024	290.74	213.84	1366.68	267.18	192.64	1012.65	23.57	21.2	354.03
convnext_xxlarge_320	320	1024	333.08	215.66	1200.58	309.52	194.46	846.54	23.57	21.2	354.03
ViT-H-14-336	336	1024	414.53	428.74	986.52	390.97	407.54	632.49	23.57	21.2	354.03
ViT-bigG-14	224	1280	532.92	310.71	2539.57	483.96	275.37	1844.91	48.96	35.34	694.66
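The table's total columns should be close to the sum of the image-tower and text-tower columns. The short check below copies three rows from the table and reports the largest gap; the `max_gap` helper is a hypothetical name used for illustration.

```python
# (image tower, text tower, reported total) copied from the table above.
ROWS = {
    "convnext_xxlarge": {
        "GMACs": (198.09, 23.57, 221.66),
        "MActs": (124.45, 21.2, 145.66),
        "M params": (846.54, 354.03, 1200.58),
    },
    "ViT-H-14": {
        "GMACs": (167.4, 23.57, 190.97),
        "MActs": (139.41, 21.2, 160.61),
        "M params": (632.08, 354.03, 986.11),
    },
    "ViT-bigG-14": {
        "GMACs": (483.96, 48.96, 532.92),
        "MActs": (275.37, 35.34, 310.71),
        "M params": (1844.91, 694.66, 2539.57),
    },
}


def max_gap(rows):
    """Largest absolute difference between image + text and the reported total."""
    return max(
        abs(image + text - total)
        for metrics in rows.values()
        for image, text, total in metrics.values()
    )


print(f"largest gap: {max_gap(ROWS):.2f}")
```

The gaps stay within about 0.01, which is what rounding to two decimals would produce (any small shared parameters counted only in the total would have a similar effect).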

The model was trained by Ross Wightman on the stability.ai cluster and the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer.

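🔍 Uses

Direct Use

Zero-shot image classification, image and text retrieval, among others.

Downstream Use

Fine-tuning for image classification and other image tasks, linear-probe image classification, guiding and conditioning image generation, among others.

Out-of-Scope Use

Any deployed use case of the model, commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless the model has been thoroughly tested in-domain with a specific, fixed class taxonomy. Safety assessments have shown that CLIP's performance varies widely across class taxonomies, so untested and unconstrained deployment of the model may be harmful.

Use cases in surveillance and facial recognition are always out of scope, because testing norms and checks to ensure fair use are currently lacking and the use of AI for such tasks is premature.

Because the model was trained and evaluated only on English data, its use should be limited to English-language use cases.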

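🔍 Training Details

Training Data

The model was trained on LAION-2B, the 2-billion-sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/).

⚠️ Important Note

The dataset was created to advance research and experimentation on large-scale multimodal model training and on handling uncurated, large-scale datasets. It is recommended to use the dataset for research purposes only. This large-scale dataset is uncurated, and the collected links may lead to discomforting and disturbing content. A "safe" subset can be extracted by filtering samples on their safety tags (produced by a custom-trained NSFW classifier), but the presence of harmful content cannot be ruled out entirely. Using the dataset to build industrial products is not recommended, because research into the general properties and safety of large-scale models is still in progress.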

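Training Procedure

The main training run used a global batch size of 81920 over 256 checkpoint intervals of 135.6M samples each, for a total of about 34B samples seen.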
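The sample budget follows directly from the numbers above. The sketch below uses the per-interval sample count from the `--train-num-samples` flag of the srun command; the step count uses floor division and is only an approximation of what the training loop actually runs.

```python
SAMPLES_PER_INTERVAL = 135_646_078  # --train-num-samples in the srun command
NUM_INTERVALS = 256                 # --epochs in the srun command
GLOBAL_BATCH = 81_920               # global batch size of the main run

total_samples = SAMPLES_PER_INTERVAL * NUM_INTERVALS
steps_per_interval = SAMPLES_PER_INTERVAL // GLOBAL_BATCH  # approximate
total_steps = steps_per_interval * NUM_INTERVALS

print(f"total samples seen: {total_samples / 1e9:.1f}B")
print(f"approx. optimizer steps: {total_steps:,}")
```

The total comes to 34,725,395,968 samples, which is the "34B" in the `s34B` part of the model name.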

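Training ran into many difficulties with model numerical stability and with cluster stability and performance. Initial attempts with float16 AMP and the default Adam beta2 produced loss spikes and eventually NaN blow-ups. Reducing beta2 to 0.97 helped, but the loss and zero-shot curves still did not track as expected. After switching to PyTorch nightly builds, training with bfloat16 + AMP became possible; beta2 was returned to 0.98 and the metrics improved.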
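One common way to read the beta2 change is through the effective averaging window of Adam's second-moment estimate, roughly 1 / (1 - beta2) steps. The sketch below assumes the "default" refers to the PyTorch Adam default of 0.999; the card does not state the value.

```python
def ema_window(beta2):
    """Approximate number of recent steps averaged by Adam's second moment."""
    return 1.0 / (1.0 - beta2)


for beta2 in (0.999, 0.98, 0.97):
    print(f"beta2={beta2}: ~{ema_window(beta2):.0f} steps")
```

This gives windows of roughly 1000, 50, and 33 steps. A shorter window lets the per-parameter step size react faster to a sudden rise in gradient magnitude, which is a plausible reason the lower values helped with loss spikes.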

Checkpoint interval	Cluster	# GPUs	# Nodes	GPU	Local batch size	Samples/sec	Samples/sec/GPU	Precision	Adam beta2
1-2	Stability	1024	128	A100 40GB	80	37-40k	36-39	amp + fp16	0.97
3-32	Stability	512	64	A100 80GB	160	27-32k	52-62	amp + fp16	0.97
33-75	Booster	1024	256	A100 40GB	80	48k	47	amp + fp16	0.97
76-165	Booster	1024	256	A100 40GB	80	51k	50	amp + bf16	0.98
166-232	Stability	320	40	A100 80GB	256	18-19k	56-59	amp + bf16	0.98
233-249	Booster	1024	256	A100 40GB	80	51k	50	amp + bf16	0.98
250-256	Stability	1024	128	A100 40GB	80	27-31k	26-30	amp + bf16	0.98

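JUWELS Booster has 4 A100 GPUs per node and 4 HDR-200 IB adapters per node (200 Gbit/s per GPU). The Stability setup has 8 A100 GPUs per node with 400 Gbit/s EFA networking per node (50 Gbit/s per GPU). Training efficiency (throughput per GPU) varied significantly across configurations, and the 1024-GPU configurations on both clusters were particularly prone to crashing.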
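The configurations in the table above all keep the same global batch size by trading GPU count against local batch size. A small sketch with four rows copied from the table shows the relationship; the helper names are hypothetical.

```python
# (checkpoint intervals, GPUs, nodes, local batch, samples/sec range in thousands)
MAIN_RUN = [
    ("1-2", 1024, 128, 80, (37, 40)),
    ("3-32", 512, 64, 160, (27, 32)),
    ("33-75", 1024, 256, 80, (48, 48)),
    ("166-232", 320, 40, 256, (18, 19)),
]


def global_batch(gpus, local_batch):
    """Global batch size = number of GPUs x per-GPU (local) batch size."""
    return gpus * local_batch


def per_gpu_throughput(gpus, ksamples_per_sec):
    """Convert a cluster-wide samples/sec range (in thousands) to per-GPU values."""
    low, high = ksamples_per_sec
    return low * 1000 / gpus, high * 1000 / gpus


for intervals, gpus, nodes, local_batch, rate in MAIN_RUN:
    low, high = per_gpu_throughput(gpus, rate)
    print(f"{intervals}: {gpus // nodes} GPUs/node, "
          f"global batch {global_batch(gpus, local_batch)}, "
          f"{low:.0f}-{high:.0f} samples/sec/GPU")
```

Every row multiplies out to 81920, and the per-GPU figures approximately reproduce the Samples/sec/GPU column, which makes the efficiency gap between configurations easy to see.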

A slurm srun command line for 128 8-GPU (40GB A100) nodes:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=80 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 1e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"

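The last 10% of training was redone with a higher global batch size of 95744, a higher learning rate, and slightly stronger augmentation.

Checkpoint interval	Cluster	# GPUs	# Nodes	GPU	Local batch size	Samples/sec	Samples/sec/GPU	Precision	Adam beta2
231-256	Stability	1088	136	A100 40GB	88	32-35k	29-32	amp + bf16	0.98

A slurm srun command line for 136 8-GPU (40GB A100) nodes: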

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-r-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=88 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.3, 1.0)' re_prob=0.4 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 2e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"

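🔍 Evaluation

Evaluation was done with code from the [LAION CLIP Benchmark suite](https://github.com/LAION-AI/CLIP_benchmark).

Testing Data, Factors, and Metrics

Testing Data

Classification uses VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets); retrieval uses COCO and Flickr.

Results

The models achieve 79.1% to 79.4% top-1 zero-shot accuracy on ImageNet-1k.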
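For reference, top-1 zero-shot accuracy counts an image as correct when the class prompt with the highest image-text similarity is the true class. The sketch below shows the metric on made-up similarity scores; the numbers are toy values, not benchmark results, and the actual evaluation uses the CLIP Benchmark code.

```python
def top1_accuracy(similarities, targets):
    """Fraction of images whose highest-scoring class prompt is the true class.

    similarities[i][c] is the similarity between image i and the prompt for class c.
    """
    if len(similarities) != len(targets):
        raise ValueError("one target per image is required")
    correct = 0
    for scores, target in zip(similarities, targets):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == target
    return correct / len(targets)


# Toy similarity scores for 4 images and 3 classes.
scores = [
    [0.31, 0.12, 0.05],
    [0.08, 0.27, 0.22],
    [0.10, 0.25, 0.19],
    [0.02, 0.11, 0.40],
]
labels = [0, 1, 2, 2]
accuracy = top1_accuracy(scores, labels)  # 3 of the 4 predictions match
```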

[Figure: ConvNeXt-XXLarge zero-shot accuracy]

Zoomed-in view of the last 10% of training:

[Figure: ConvNeXt-XXLarge zero-shot accuracy, zoomed in]

Initial benchmarks have been run on a wider range of datasets; the results can be viewed at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb.

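🔍 Acknowledgements

Thanks to stability.ai and the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS supercomputer JUWELS Booster at the Jülich Supercomputing Centre (JSC).

🔍 Citation

BibTeX:

LAION-5B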

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

@InProceedings{pmlr-v162-wortsman22a,
  title = 	 {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author =       {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {23965--23998},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/wortsman22a.html}
}