SigLIP 2 Open-Source Vision-Language Model - Deploy for Free with Stronger Semantic Understanding and Feature Extraction

Siglip2 So400m Patch16 512

Developed by Google

SigLIP 2 is a vision-language model built on SigLIP, with improved semantic understanding, localization, and dense feature extraction.

Zero-Shot Image Classification

Transformers

Open Source License: Apache-2.0 #Zero-shot image classification #Multimodal semantic understanding #Dense feature extraction

Downloads 46.46k

Release Time : 2/17/2025

Model Overview

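This model can be used for tasks such as zero-shot image classification and image-text retrieval, and it can also serve as the vision encoder of a vision-language model.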

Model Features

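Enhanced semantic understanding

Integrates several techniques to improve semantic understanding

Localization

Improves localization of objects within images

Dense feature extraction

Extracts richer image features

Unified training recipe

Combines multiple training objectives into a single recipe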

Model Capabilities

Zero-shot image classification

Image-text retrieval

Visual feature extraction

Use Cases

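Image classification

Zero-shot image classification

Classifies images without additional training

Supports custom candidate labels

Vision-language tasks

Vision encoder

Can serve as the vision encoder for other vision-language models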

🚀 SigLIP 2 So400m

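SigLIP 2 extends the pre-training objective of SigLIP with prior, independently developed techniques, combined into a unified recipe, for improved semantic understanding, localization, and dense features.

🚀 Quick Start

Intended uses

You can use the raw model for tasks such as zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Here is how to use this model to perform zero-shot image classification: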

from transformers import pipeline

# load pipeline
ckpt = "google/siglip2-so400m-patch16-512"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# image URL and candidate labels (the pipeline accepts a URL and fetches the image itself)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
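
The pipeline returns a list of dictionaries, each with a `score` and a `label` key. Since SigLIP-family models are trained with a sigmoid objective, the scores are typically independent per-label probabilities and need not sum to 1, so a threshold can be more useful than simply taking the maximum. The sketch below shows one way to post-process the result; `top_label`, the `min_score` default, and the sample scores are all hypothetical and chosen only for illustration.

```python
from typing import Optional


def top_label(outputs: list[dict], min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if no score reaches min_score."""
    if not outputs:
        return None
    best = max(outputs, key=lambda item: item["score"])
    return best["label"] if best["score"] >= min_score else None


# Hypothetical pipeline output; these scores are made up, not real model results.
sample_outputs = [
    {"score": 0.82, "label": "2 cats"},
    {"score": 0.31, "label": "a remote"},
    {"score": 0.01, "label": "a plane"},
]

print(top_label(sample_outputs))                 # best label above the default threshold
print(top_label(sample_outputs, min_score=0.9))  # None: nothing is confident enough
```

Returning `None` when nothing clears the threshold lets the caller treat "none of the candidate labels fit" as a separate outcome.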

You can also encode an image using the Vision Tower:

import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-so400m-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
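
Embeddings like these are what image-text retrieval builds on: an image embedding is compared against text embeddings (which could be obtained from `model.get_text_features`), and the candidates are ranked by similarity. The sketch below shows only the ranking step, in plain Python with cosine similarity. The 3-dimensional vectors and captions are toy values chosen for illustration, not real model outputs, and `cosine_similarity` and `rank_captions` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(
    image_vec: list[float], caption_vecs: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (caption, similarity) pairs sorted from most to least similar."""
    scored = [
        (caption, cosine_similarity(image_vec, vec))
        for caption, vec in caption_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumption): real embeddings have far more dimensions.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "caption A": [1.0, 0.0, 0.0],
    "caption B": [0.0, 1.0, 0.0],
    "caption C": [0.0, 0.0, 1.0],
}

ranking = rank_captions(image_vec, caption_vecs)
for caption, score in ranking:
    print(f"{caption}: {score:.3f}")
```

In practice the same ranking would be done with batched tensor operations on the model's device; the pure-Python version only makes the arithmetic explicit.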

For more code examples, see the SigLIP documentation.

🔧 Technical Details

Training procedure

SigLIP 2 adds several training objectives on top of SigLIP:

Decoder loss
Global-local and masked prediction loss
Aspect ratio and resolution adaptability
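
For context, the base SigLIP objective that these additions build on is a pairwise sigmoid loss: every image-text pair in a batch is treated as an independent binary classification, positive when the image and text belong together and negative otherwise. The sketch below is a plain-Python illustration of that base loss only. It does not cover the decoder, global-local, or masked prediction losses listed above, whose details are not given here. The `temperature` and `bias` values and the toy similarity matrices are assumptions for illustration, not trained parameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(
    sim: list[list[float]], temperature: float = 10.0, bias: float = -10.0
) -> float:
    """Sigmoid loss over an n x n matrix of image-text similarities.

    sim[i][j] is the similarity between image i and text j; matching pairs
    are assumed to sit on the diagonal.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * (temperature * sim[i][j] + bias))
    return -total / n


# Toy similarity matrices (assumption): perfectly aligned vs. fully mismatched.
aligned = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[0.0, 1.0], [1.0, 0.0]]

print(pairwise_sigmoid_loss(aligned))
print(pairwise_sigmoid_loss(mismatched))
```

The aligned batch yields the lower loss, which is the behavior the objective rewards during pre-training.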

Training data

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

Compute

The model was trained on up to 2048 TPU-v5e chips.

📚 Documentation

Evaluation results

Evaluation results for SigLIP 2 are shown below (taken from the paper).

Evaluation Table

BibTeX entry and citation info

@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features}, 
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786}, 
}