ViT-gopt-16-SigLIP2-384: an open-source vision-language model


Published by timm

A SigLIP 2 vision-language model trained on the WebLI dataset, with support for zero-shot image classification.

Zero-shot image classification

Safetensors

Open-source license: Apache-2.0 #zero-shot-image-classification #multilingual-vision-language #sigmoid-loss

Downloads: 1,953

Release date: February 21, 2025

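Model overview

This is a contrastive image-text model designed for zero-shot image classification. It encodes an image and a set of candidate text descriptions, then scores how well the image matches each description.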

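Model features

SigLIP 2 architecture

Vision-language pre-training built on the pairwise sigmoid loss introduced by SigLIP, with the SigLIP 2 training recipe aimed at improved semantic understanding.

Zero-shot classification

Can be applied directly to image classification tasks without task-specific fine-tuning.

Multilingual support

According to the SigLIP 2 paper, the model supports multilingual text understanding at inference time (this has not been independently verified here).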
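
To make the sigmoid loss mentioned above more concrete, here is a minimal pure-Python sketch of the pairwise sigmoid loss as described in the SigLIP paper: every image-text pair in a batch is treated as an independent binary classification, with matching pairs labeled +1 and all other pairs labeled -1. The function names, the toy 2-D embeddings, and the values of `t` (temperature) and `b` (bias) are assumptions chosen for illustration; this is not the training code used for this checkpoint.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_sigmoid_loss(image_embs, text_embs, t, b):
    """Pairwise sigmoid loss over a batch.

    image_embs[i] and text_embs[i] are assumed to be L2-normalized and to
    form the matching pair; every other combination is a negative.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0          # +1 for matches, -1 otherwise
            logit = t * dot(image_embs[i], text_embs[j]) + b
            total += log_sigmoid(z * logit)
    return -total / n


# Toy unit-length embeddings (illustrative values only).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # deliberately mismatched

loss_aligned = pairwise_sigmoid_loss(images, aligned_texts, t=10.0, b=-5.0)
loss_swapped = pairwise_sigmoid_loss(images, swapped_texts, t=10.0, b=-5.0)
print("aligned:", loss_aligned, "swapped:", loss_swapped)
```

The aligned batch yields a much smaller loss than the swapped one, which is the behavior the loss is meant to reward. Because each pair is scored on its own, no normalization across the batch (as in a softmax-based contrastive loss) is needed.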

Model capabilities

Image-text contrastive learning

Zero-shot image classification

Image semantic understanding

Multimodal feature extraction

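Use cases

Image understanding

Food recognition

Identifies the type of food in an image (for example, a donut or a beignet).

In the usage example, "a beignet" is correctly identified with the highest probability.

Animal recognition

Identifies the animal species in an image (for example, a cat or a dog).

Content moderation

Inappropriate content detection

Automatically flags images that may contain inappropriate content.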
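
As a rough illustration of the moderation use case, the sketch below applies a threshold to per-label probabilities of the kind a zero-shot classifier produces. The label strings, the probabilities, the 0.5 threshold, and the `flag_labels` helper are all hypothetical; the model card does not specify moderation prompts or thresholds, and any real deployment would need to calibrate them on labeled data.

```python
def flag_labels(label_probs, threshold=0.5):
    """Return (label, probability) pairs at or above the threshold, highest first.

    label_probs maps a candidate text label to the probability assigned to it
    for one image. The default threshold is an arbitrary placeholder.
    """
    flagged = [(label, p) for label, p in label_probs.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical scores for a single image (not produced by the real model).
example_probs = {
    "a violent scene": 0.72,
    "a graphic injury": 0.55,
    "a landscape": 0.03,
}

flagged = flag_labels(example_probs, threshold=0.5)
needs_review = bool(flagged)  # route the image to human review if anything is flagged
print(flagged, needs_review)
```

Since each label gets its own independent probability, several labels can be flagged for the same image, or none at all.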

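🚀 Model card for ViT-gopt-16-SigLIP2-384

This is a SigLIP 2 vision-language model trained on the WebLI dataset. It has been converted from the original JAX checkpoint for use with OpenCLIP.

🚀 Quick start

The model can be used for zero-shot image classification. The code example below shows how to load it and score an image against a list of candidate labels.

💻 Usage example

Basic usage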

import torch
from urllib.request import urlopen
from PIL import Image
# Requires open-clip-torch >= 2.31.0 and timm >= 1.0.15
from open_clip import create_model_from_pretrained, get_tokenizer

# Use a GPU when one is available; mixed precision is only enabled on CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model, its image preprocessing transform, and the matching tokenizer.
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-gopt-16-SigLIP2-384')
model = model.to(device).eval()
tokenizer = get_tokenizer('hf-hub:timm/ViT-gopt-16-SigLIP2-384')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0).to(device)  # add a batch dimension

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length).to(device)

with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    # SigLIP scores each image-text pair independently with a sigmoid (not a softmax).
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

# Report each label's probability as a percentage, rounded to one decimal place.
zipped_list = list(zip(labels_list, [round(100 * p.item(), 1) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
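
When reading the output of the example above, note that the probabilities come from a sigmoid applied to each label separately, so they do not have to sum to 100%. The following pure-Python sketch contrasts that with a softmax over the same logits. The cosine similarities and the `scale` and `bias` values are made-up stand-ins for `image_features @ text_features.T`, `model.logit_scale.exp()`, and `model.logit_bias`; they are not values read from the actual checkpoint.

```python
import math


def sigmoid(x):
    """Logistic function: maps a logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


labels = ["a dog", "a cat", "a donut", "a beignet"]

# Hypothetical values, for illustration only.
cosine_sims = [0.02, 0.01, 0.08, 0.14]
scale = 100.0   # stand-in for model.logit_scale.exp()
bias = -12.0    # stand-in for model.logit_bias

logits = [scale * s + bias for s in cosine_sims]
sigmoid_probs = [sigmoid(z) for z in logits]   # one independent score per label
softmax_probs = softmax(logits)                # forced to sum to 1 across labels

best_label = labels[sigmoid_probs.index(max(sigmoid_probs))]
print("sigmoid:", sigmoid_probs, "sum =", sum(sigmoid_probs))
print("softmax:", softmax_probs, "sum =", sum(softmax_probs))
print("top label:", best_label)
```

One practical consequence: if none of the candidate labels describes the image, all sigmoid scores can be close to zero, whereas a softmax would still distribute a total probability of 1 among the labels.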

📚 Documentation

Model details

A SigLIP 2 vision-language model trained on the WebLI dataset. The model has been converted from the original Big Vision JAX checkpoint for use with OpenCLIP.

Attribute	Details
Model type	Contrastive image-text, zero-shot image classification
Original	https://github.com/google-research/big_vision
Dataset	WebLI
Papers	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (https://arxiv.org/abs/2502.14786); Sigmoid loss for language image pre-training (https://arxiv.org/abs/2303.15343)

📄 License

This model is released under the Apache 2.0 license.

📖 Citation

@article{tschannen2025siglip,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and H{\'e}naff, Olivier and Harmsen, Jeremiah and Steiner, Andreas and Zhai, Xiaohua},
  year={2025},
  journal={arXiv preprint arXiv:2502.14786}
}

@article{zhai2023sigmoid,
  title={Sigmoid loss for language image pre-training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@misc{big_vision,
  author = {Beyer, Lucas and Zhai, Xiaohua and Kolesnikov, Alexander},
  title = {Big Vision},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/google-research/big_vision}}
}