SigLIP 2 open-source vision-language encoder: improved semantic understanding, localization, and dense feature extraction


Siglip2 Large Patch16 384

Developed by Google

SigLIP 2 is a multilingual vision-language encoder that builds on SigLIP, with improved semantic understanding, localization, and dense feature extraction.


Transformers

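Open-source license: Apache-2.0 #zero-shot image classification #image-text retrieval #multilingual vision encoding

Downloads: 6,525

Release date: 2/17/2025

Model overview

SigLIP 2 is a vision-language model. It can be used for tasks such as zero-shot image classification and image-text retrieval, or as the vision encoder for other vision tasks.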

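Model features

Unified training recipe

Combines several techniques, including a decoder loss and global-local and masked prediction losses, into a single training recipe

Adaptive training

Supports training that adapts to aspect ratio and resolution

Multi-task capability

Offers semantic understanding, localization, and dense feature extraction in one model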

Model capabilities

Zero-shot image classification

Image-text retrieval

Visual feature extraction

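Use cases

Image understanding

Zero-shot image classification

Classifies images from new categories without task-specific training

Supports classification with custom labels

Visual encoding

Serves as the vision encoder for other vision tasks

Provides high-quality image feature representations

Cross-modal applications

Image-text retrieval

Enables cross-modal retrieval between images and text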

🚀 SigLIP 2 Large

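SigLIP 2 extends the SigLIP pretraining objective with prior, independently developed techniques, combining them into a unified recipe that improves semantic understanding, localization, and dense features.

🚀 Quick start

The SigLIP 2 Large model can be used for tasks such as zero-shot image classification and image-text retrieval. The code examples below show how to use it.

💻 Usage examples

Basic usage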

from transformers import pipeline

# load pipeline
ckpt = "google/siglip2-large-patch16-384"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# image URL and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference; the pipeline accepts an image URL directly
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
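
The printed result can be post-processed in plain Python. The sketch below assumes the pipeline returns a list of dictionaries with "score" and "label" keys; the helper name `top_prediction`, the `min_score` threshold, and the scores in `example_outputs` are hypothetical and are not real model output.

```python
def top_prediction(outputs, min_score=0.0):
    """Return the highest-scoring entry, or None if it falls below min_score."""
    if not outputs:
        return None
    best = max(outputs, key=lambda item: item["score"])
    return best if best["score"] >= min_score else None


# Made-up scores for illustration only; they are not real model output.
example_outputs = [
    {"score": 0.91, "label": "2 cats"},
    {"score": 0.02, "label": "a plane"},
    {"score": 0.35, "label": "a remote"},
]

best = top_prediction(example_outputs, min_score=0.5)
print(best["label"] if best else "no confident match")
```

A threshold can be useful here because, as far as we understand the Transformers pipeline, SigLIP-family models score each label independently, so the scores are not guaranteed to sum to 1.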

Advanced usage

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-large-patch16-384"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
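
Embeddings like these are typically compared by cosine similarity for image-text retrieval. The following is a minimal, dependency-free sketch of that ranking step, assuming both the query and the candidates have already been embedded into the same space. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real model embeddings; in practice the same computation would normally be done with tensor operations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for a text query and three image embeddings.
toy_text_query = [1.0, 0.0, 0.0]
toy_image_embeddings = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.1, 0.0],   # nearly parallel to the query
    [-1.0, 0.0, 0.0],  # opposite direction
]

ranking = rank_by_similarity(toy_text_query, toy_image_embeddings)
print([index for index, _ in ranking])
```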

For more code examples, see the siglip documentation.

📚 Documentation

Intended uses

The model can be used for tasks such as zero-shot image classification and image-text retrieval. It can also serve as the vision encoder for VLMs (and other vision tasks).

Training procedure

SigLIP 2 adds several training objectives on top of SigLIP:

Decoder loss
Global-local and masked prediction losses
Aspect ratio and resolution adaptivity
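
The objectives listed above are additions; the base they build on is SigLIP's pairwise sigmoid loss, which this page does not spell out. As a rough illustration of that base loss only, here is a sketch that treats every image-text pair in a batch as an independent binary decision (matching pairs labelled +1, all others -1). It does not implement any of the three added objectives and is not taken from the SigLIP 2 training code. The function names and the default `temperature` and `bias` values are assumptions chosen for illustration, and the embeddings are assumed to be L2-normalized.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of per-pair binary log-losses, averaged over the batch size.

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    temperature and bias are illustrative values, not tuned settings.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


# Toy unit vectors: aligned pairs should give a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```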

Training data

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

Compute

The model was trained on up to 2048 TPU-v5e chips.

Evaluation results

SigLIP 2 evaluation results are shown below (taken from the paper).

[Image: Evaluation Table]

BibTeX entry and citation info

@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features}, 
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786}, 
}