fg-clip-base open-source model: fine-grained visual and textual alignment for accurate image-text matching

Developed by qihoo360

FG-CLIP is a fine-grained visual and textual alignment model. Its two-stage training achieves image-text alignment at both the global level and the region level.

Text-to-image

Transformers

Language: English

Open-source license: Apache-2.0

Tags: #fine-grained visual-text alignment #zero-shot image classification #region-level description

Downloads: 692

Release date: 5/8/2025

Model overview

FG-CLIP focuses on fine-grained visual and textual alignment; its two-stage training yields more accurate image-text matching.

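Model features

Two-stage training

The first stage aligns images with global-level captions; the second stage adds region-level captions to refine the alignment.

Fine-grained alignment

Handles fine-grained visual-text alignment tasks, including those that involve region-level descriptions.

Dense feature extraction

Can extract dense image features for more detailed visual analysis.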

Model capabilities

Zero-shot image classification

Image-text matching

Fine-grained visual analysis

Dense feature extraction

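Use cases

Image retrieval

Image classification

Classifies images based on text descriptions.

In the example, the cat image is identified correctly.

Visual analysis

Region feature analysis

Analyzes the features of specific regions within an image.

Can generate region-level similarity heatmaps.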

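🚀 FG-CLIP: Fine-Grained Visual and Textual Alignment

FG-CLIP is a model for fine-grained alignment between images and text. It is trained in two stages so that it captures image-text relationships precisely.

🚀 Quick start

Load the model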

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# Hugging Face model ID. trust_remote_code=True lets Transformers run the
# model code shipped in the repository.
model_root = "qihoo360/fg-clip-base"
image_size = 224
# .cuda() requires a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()

device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

Retrieval

# Load the example image and resize it to the model's input resolution.
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
# Tokenize the captions into fixed-length ID tensors on the model's device.
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    # L2-normalize so the dot product below is a cosine similarity.
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

    # Scale the similarities by the model's logit scale, then softmax over the captions.
    logits_per_image = image_feature @ text_feature.T
    logits_per_image = model.logit_scale.exp() * logits_per_image
    probs = logits_per_image.softmax(dim=1)

print(probs)
# [[9.9997e-01, 3.3485e-05]]
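
To make the scoring step above concrete, here is a minimal, framework-free sketch of the same arithmetic: L2-normalize the embeddings, take dot products, multiply by a scale factor, and apply softmax. The toy 3-dimensional vectors and the scale of 100.0 are illustrative assumptions, not values taken from FG-CLIP, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(image_vec, text_vecs, scale):
    """Cosine similarity of one image against each text, scaled, then softmax."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy embeddings (assumed for illustration only).
toy_image = [0.9, 0.1, 0.2]
toy_texts = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]
toy_probs = match_probabilities(toy_image, toy_texts, scale=100.0)
print(toy_probs)
```

Because the scale multiplies cosine values that lie in [-1, 1], even a modest gap in similarity becomes a near one-hot distribution after softmax, which is consistent with the sharply peaked probabilities printed in the example above.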

Dense feature visualization

import math
import matplotlib
matplotlib.use('Agg') 
import matplotlib.pyplot as plt


img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    # Dense (per-patch) image features rather than a single global vector.
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

# Cosine similarity between each dense feature and the caption (1-D result).
similarity = dense_image_feature.squeeze() @ text_feature.squeeze()
similarity = similarity.cpu().numpy()
# The code assumes a square grid, so the side length is sqrt(number of features).
patch_size = int(math.sqrt(similarity.shape[0]))

original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
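
The similarity map above holds one value per grid cell, so it is much coarser than the input image. The following is a minimal sketch of two common post-processing steps that could be applied before overlaying it on the image: min-max normalization to [0, 1] and nearest-neighbor upsampling. The helper names and the 2×2 toy grid are hypothetical; the original example only reshapes and plots the raw values.

```python
def min_max_normalize(grid):
    """Rescale a 2D list of floats to the range [0, 1]."""
    flat = [v for row in grid for v in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        # A constant map carries no contrast; return all zeros.
        return [[0.0 for _ in row] for row in grid]
    return [[(v - low) / span for v in row] for row in grid]


def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes."""
    out = []
    for row in grid:
        wide_row = [v for v in row for _ in range(factor)]
        out.extend([list(wide_row) for _ in range(factor)])
    return out


# Toy similarity grid (assumed for illustration only).
toy_similarity = [[0.10, 0.30], [0.20, 0.50]]
normalized = min_max_normalize(toy_similarity)
heatmap = upsample_nearest(normalized, factor=2)
```

As a usage note, and purely as an assumption about the grid size: if the model produced a 14×14 grid for a 224×224 input, `factor=16` would bring the heatmap back to the input resolution.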

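✨ Key features

FG-CLIP is trained in two stages. The first stage uses global-level caption-image pairs to achieve an initial fine-grained alignment. The second stage supplements these with additional region-level captions, including detailed region captions and positive/negative region descriptions, to refine the alignment further.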
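
For intuition about what alignment over caption-image pairs means, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a small similarity matrix. This is the generic objective commonly used for global image-text alignment and is assumed here for illustration only; the text above does not specify FG-CLIP's exact losses, and the region-level and positive/negative description terms of the second stage are not modeled. All names and the toy matrices are hypothetical.

```python
import math


def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target_index]


def symmetric_contrastive_loss(sim):
    """Average of image-to-text and text-to-image cross-entropy.

    sim[i][j] is the scaled similarity of image i and caption j;
    matching pairs sit on the diagonal.
    """
    n = len(sim)
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = [[10.0, 0.0], [0.0, 10.0]]     # matched pairs score highest
misaligned = [[0.0, 10.0], [10.0, 0.0]]  # matched pairs score lowest
loss_aligned = symmetric_contrastive_loss(aligned)
loss_misaligned = symmetric_contrastive_loss(misaligned)
```

The loss is small when each image scores highest against its own caption and large when it scores higher against another caption, which is the behavior a global-level alignment stage would be expected to optimize.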

📚 Documentation

FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)

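📄 License

This project uses certain datasets and checkpoints that are subject to their own original licenses. Users must comply with all terms and conditions of those licenses. The content of this project itself is licensed under the Apache License 2.0.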

📖 Citation

If you find FG-CLIP useful for your research or applications, please cite it using the following BibTeX.

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}