CLIP-convnext_large_d_320 open-source model - free zero-shot image classification and text-image retrieval

CLIP Convnext Large D 320.laion2B S29b B131k Ft Soup

Developed by laion

A CLIP model based on the ConvNeXt-Large architecture and trained on the LAION-2B dataset. It supports zero-shot image classification and image-text retrieval.

Open-source license: MIT #Zero-shot image classification #High-resolution vision model #Multimodal retrieval

Downloads: 83.56k

Release date: 2/11/2023

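Model overview

This is a CLIP model based on the ConvNeXt-Large architecture, trained on the LAION-2B dataset with the OpenCLIP framework. It supports tasks such as zero-shot image classification and image-text retrieval, and it has strong image understanding capability.

Model features

High-resolution processing

Accepts 320x320 input and resolves finer detail than the standard 256x256 models.

Weight-averaging optimization

Averages the weights of several fine-tuned checkpoints (a model "soup") to improve performance.

Efficient architecture

At 320x320 resolution, the ConvNeXt-Large-D architecture is more efficient than comparable models.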
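
The weight-averaging ("soup") idea listed above can be sketched in a few lines. This is a minimal illustration only: it assumes a uniform average, and the `StateDict` type and toy checkpoints are hypothetical stand-ins for real model weights. The page does not say which soup variant was actually used.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def uniform_soup(checkpoints: List[StateDict]) -> StateDict:
    """Average several fine-tuned checkpoints parameter by parameter."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    if any(ckpt.keys() != keys for ckpt in checkpoints):
        raise ValueError("all checkpoints must share the same parameter names")
    n = len(checkpoints)
    soup: StateDict = {}
    for name in keys:
        # zip(*...) walks the same position of this parameter in every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(values) / n for values in columns]
    return soup


# Three toy "fine-tunes" of the same two-parameter model.
fine_tunes = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [3.0]},
    {"w": [5.0, 6.0], "b": [6.0]},
]
souped = uniform_soup(fine_tunes)
print(souped)  # {'w': [3.0, 4.0], 'b': [3.0]}
```

Because the result is a single set of weights, inference cost is the same as for any one of the fine-tuned checkpoints.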

Model capabilities

Zero-shot image classification

Image-text retrieval

Cross-modal understanding

Image feature extraction

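Use cases

Image classification

Zero-shot image classification

Classifies images without any task-specific training.

Achieves 76.9% zero-shot top-1 accuracy on ImageNet-1k.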
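
As a rough illustration of how zero-shot classification works once embeddings are available, the sketch below scores one image embedding against one text embedding per label and applies a softmax. The three-dimensional vectors are made-up values, not model outputs, and the scale of 100 is assumed to mirror the usage example on this page.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb: Sequence[float],
                    text_embs: Sequence[Sequence[float]],
                    scale: float = 100.0) -> List[float]:
    """Softmax over scaled cosine similarities between one image and each label."""
    img = l2_normalize(image_emb)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings; a real model would produce these.
labels = ["a photo of a cat", "a photo of a dog"]
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = zero_shot_probs(image_embedding, text_embeddings)
best = labels[max(range(len(probs)), key=probs.__getitem__)]
print(best)  # a photo of a cat
```

No classifier is trained here: the label set is defined entirely by the text prompts, which is what makes the approach "zero-shot".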

Information retrieval

Image-text retrieval

Retrieves relevant images for a text query, or relevant text for an image.
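
Retrieval in either direction reduces to ranking stored embeddings by similarity to a query embedding. The following is a minimal sketch under that assumption; the file names and vectors are hypothetical, and a real system would use embeddings produced by the model and typically an approximate nearest-neighbour index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float],
             index: Dict[str, Sequence[float]],
             k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries of `index` most similar to `query`, best first."""
    scored = [(name, cosine(query, emb)) for name, emb in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical image embeddings keyed by file name.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
text_query = [1.0, 0.0, 0.0]  # stand-in for an encoded text query

top = retrieve(text_query, image_index, k=2)
print([name for name, _ in top])  # ['beach.jpg', 'forest.jpg']
```

Swapping the roles (an image embedding as the query, text embeddings in the index) gives image-to-text retrieval with the same code.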

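🚀 Model card for CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

This is a CLIP-based model aimed at zero-shot image classification. It was trained on the LAION-2B dataset and is suited to tasks such as image-text matching and zero-shot image classification.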

🚀 Quick start

This model is provided as a research output for the research community. It can be used for tasks such as zero-shot image classification and image-text retrieval.

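✨ Key features

Zero-shot image classification: classifies images into classes the model was never explicitly trained on.
Image and text retrieval: ranks images against text (or text against images) by embedding similarity.
Downstream fine-tuning: can be fine-tuned for downstream tasks such as image classification or image generation.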

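📦 Installation

This model was trained with OpenCLIP, so OpenCLIP must be installed to use it. The library is published on PyPI as open_clip_torch (pip install open_clip_torch). The examples below also require PyTorch and Pillow.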

💻 Usage examples

Basic usage

# Zero-shot classification with OpenCLIP.
# The Hub ID below is assumed from the model card title and the "laion" organization.
import torch
from PIL import Image
from open_clip import create_model_and_transforms, get_tokenizer

MODEL_ID = 'hf-hub:laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup'
model, _, preprocess = create_model_and_transforms(MODEL_ID)
tokenizer = get_tokenizer(MODEL_ID)
model.eval()  # inference mode: disables dropout and stochastic depth

image = preprocess(Image.open("image.jpg")).unsqueeze(0)  # add a batch dimension
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(similarity)  # one probability per text prompt

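Advanced usage

# Sketch of downstream fine-tuning: the image tower is trained to classify
# images against frozen text-prompt embeddings (one prompt per class).
import torch
from open_clip import create_model_and_transforms, get_tokenizer
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# The Hub ID is assumed from the model card title and the "laion" organization.
MODEL_ID = 'hf-hub:laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup'
model, _, preprocess = create_model_and_transforms(MODEL_ID)
tokenizer = get_tokenizer(MODEL_ID)

# Prepare the dataset (ImageFolder expects one sub-folder per class)
train_dataset = ImageFolder(root='path/to/train', transform=preprocess)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Encode the class prompts once; the text tower stays frozen
prompts = tokenizer([f"a photo of a {name}" for name in train_dataset.classes])
model.eval()
with torch.no_grad():
    class_features = model.encode_text(prompts)
    class_features /= class_features.norm(dim=-1, keepdim=True)
    logit_scale = model.logit_scale.exp()

# Optimizer and loss function; only the image tower is updated
optimizer = torch.optim.Adam(model.visual.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# Training loop
model.train()
for epoch in range(10):
    for images, labels in train_dataloader:
        optimizer.zero_grad()
        image_features = model.encode_image(images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        logits = logit_scale * image_features @ class_features.T  # [batch, num_classes]
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1} completed')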

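📚 Documentation

Model details

This model belongs to a series of CLIP models based on ConvNeXt-Large. The image tower is the timm ConvNeXt-Large model, and the text tower is 4 layers deeper than that of the ViT-L / RN50x16 models.

Intended use

This model is provided as a research output for the research community. It can be used for tasks such as zero-shot image classification and image-text retrieval. Deployment and commercial use are currently out of scope.

Training details

The model was trained on the LAION-2B dataset. Training used 64 nodes with 8 GPUs each (A100 40GB) and a global batch size of 131072.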
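
To put the figures above in perspective, the short calculation below derives the total GPU count and the per-GPU batch size. It assumes the global batch is split evenly across GPUs with no gradient accumulation, which the text does not state.

```python
# Figures quoted in the training details above.
nodes = 64
gpus_per_node = 8
global_batch_size = 131072

total_gpus = nodes * gpus_per_node

# Assumption: even split across GPUs and no gradient accumulation.
per_gpu_batch, remainder = divmod(global_batch_size, total_gpus)

print(f"{total_gpus} GPUs, {per_gpu_batch} samples per GPU per step")
# 512 GPUs, 256 samples per GPU per step
```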

Evaluation

The model was evaluated with the LAION CLIP Benchmark suite. The models in this series achieve between 75.9% and 76.9% top-1 zero-shot accuracy on ImageNet-1k.
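
For readers unfamiliar with the metric, top-1 accuracy is the fraction of images whose highest-scoring class matches the ground-truth label. The sketch below computes it on made-up scores; it is not the benchmark suite's implementation.

```python
from typing import List, Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Fraction of rows whose arg-max class index equals the target index."""
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
    return correct / len(targets)


# Hypothetical per-class scores for four images over three classes.
scores: List[List[float]] = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],  # predicted class 1, but the target is class 0
    [0.2, 0.2, 0.6],
]
targets = [0, 1, 0, 2]

accuracy = top1_accuracy(scores, targets)
print(f"{accuracy:.2%}")  # 75.00%
```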

🔧 Technical details

Model architecture

Image tower: the timm ConvNeXt-Large model (convnext_large).
Text tower: 4 layers deeper than the ViT-L / RN50x16 models (depth 16, embedding dimension 768).
Vision tower head: an MLP (fc - gelu - drop - fc).
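
The "fc - gelu - drop - fc" head named above is a two-layer MLP with a GELU activation and dropout in between. The plain-Python sketch below only illustrates that order of operations; the layer sizes, weights, and dropout rate are hypothetical, and the real head is a PyTorch module inside the model.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weight rows, bias)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x: Sequence[float], layer: Layer) -> List[float]:
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def dropout(x: Sequence[float], p: float, training: bool, rng: random.Random) -> List[float]:
    """Inverted dropout: identity at inference time."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def mlp_head(x: Sequence[float], fc1: Layer, fc2: Layer, p: float = 0.0,
             training: bool = False, rng: Optional[random.Random] = None) -> List[float]:
    """fc -> gelu -> drop -> fc, in that order."""
    rng = rng or random.Random(0)
    hidden = [gelu(h) for h in linear(x, fc1)]
    hidden = dropout(hidden, p, training, rng)
    return linear(hidden, fc2)


# Toy layers: 2 inputs -> 3 hidden units -> 1 output.
fc1: Layer = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
fc2: Layer = ([[1.0, 1.0, 1.0]], [0.0])

out = mlp_head([1.0, -1.0], fc1, fc2)
print(out)
```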

Training configuration

Dataset: LAION-2B
Resolution: 320x320
Global batch size: 131072
Learning rate: 5e-5
Epochs: 12

📄 License

This model is released under the MIT license.

Acknowledgements

The compute resources used to train this model were provided by stability.ai.

Citation

BibTeX

LAION-5B

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

@InProceedings{pmlr-v162-wortsman22a,
  title = 	 {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author =       {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {23965--23998},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/wortsman22a.html}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}