fg-clip-base open-source model: fine-grained visual and textual alignment for precise image-text matching

Fg Clip Base

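Developed by qihoo360

FG-CLIP is a fine-grained visual and textual alignment model. It uses two-stage training to align images and text at both the global and the region level.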

Text-to-Image

Transformers

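Language: English · License: Apache-2.0 · Tags: #fine-grained-visual-text-alignment #zero-shot-image-classification #region-level-description

Downloads: 692

Released: 5/8/2025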

Model Overview

FG-CLIP focuses on fine-grained visual and textual alignment, using two-stage training to match images and text more precisely.

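Model Features

Two-stage training

The first stage aligns images with global-level captions; the second stage adds region-level captions to refine the alignment.

Fine-grained alignment

Handles fine-grained visual and textual alignment tasks, including region-level descriptions.

Dense feature extraction

Supports extracting dense image features, which can be used for finer-grained visual analysis.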

Model Capabilities

Zero-shot image classification

Image-text matching

Fine-grained visual analysis

Dense feature extraction

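Use Cases

Image retrieval

Image classification

Classifies images based on text descriptions.

In the quick-start example, the cat image is matched to "a photo of a cat" with a probability above 0.9999.

Visual analysis

Region feature analysis

Analyzes the features of specific regions in an image.

Can produce region-level similarity heatmaps.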

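🚀 FG-CLIP: Fine-Grained Visual and Textual Alignment

FG-CLIP is a model for fine-grained visual and textual alignment. It is trained in stages on global-level and region-level image-text pairs, each stage refining the alignment further, and it performs well on tasks such as image recognition and image-text matching.

🚀 Quick Start

Load the model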

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)


model_root = "qihoo360/fg-clip-base"
image_size = 224
# trust_remote_code=True lets Transformers run the custom model code shipped in the repository
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()

device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

Retrieval

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# Note: for short captions, use max_length=77 and walk_short_pos=True
walk_short_pos = True
captions=["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

# Note: for long captions, use max_length=248 and walk_short_pos=False
# ......

with torch.no_grad():
  image_feature = model.get_image_features(image_input)
  text_feature = model.get_text_features(caption_input,walk_short_pos=walk_short_pos)
  image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
  text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1) 
print(probs)
# [[9.9997e-01, 3.3485e-05]]
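
The scoring step at the end of the retrieval snippet (L2-normalize, dot product, multiply by the scale, softmax) can be traced by hand on toy data. The following is a minimal, dependency-free sketch, not FG-CLIP code: the vectors are made up, `zero_shot_probs` is a hypothetical helper name, and `scale=100.0` is an assumed stand-in for `model.logit_scale.exp()`.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_vec, text_vecs, scale):
    """Softmax over scaled cosine similarities between one image and several captions."""
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cosine)
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (assumed values, not real model outputs).
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]

probs = zero_shot_probs(image_vec, text_vecs, scale=100.0)
print(probs)
```

A large scale turns a moderate gap in cosine similarity into a near one-hot distribution, which is why the probabilities printed above are so close to 1 and 0. With `scale=1.0` the same toy vectors give a much flatter result.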

Dense feature visualization

import math
import matplotlib
matplotlib.use('Agg') 
import matplotlib.pyplot as plt


img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input,walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

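# One similarity score per patch: (num_patches, dim) @ (dim,) -> (num_patches,)
# .T is dropped because it is a no-op on a 1-D tensor and PyTorch deprecates it there
similarity = dense_image_feature.squeeze() @ text_feature.squeeze()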
similarity = similarity.cpu().numpy()
patch_size = int(math.sqrt(similarity.shape[0]))


original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape) 


plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')  
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
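
The reshape in the snippet above treats the dense similarity vector as a square, row-major grid of patch scores. Under that same assumption, the highest-scoring patch can be mapped back to a pixel box in the resized image. This is a minimal sketch on toy data: `to_grid` and `best_patch_box` are hypothetical helpers, and the 14x14 grid is an assumed example size, not a value confirmed by the model card.

```python
import math

def to_grid(scores):
    """Reshape a flat list of patch scores into a square, row-major grid."""
    side = math.isqrt(len(scores))
    if side * side != len(scores):
        raise ValueError("expected a square number of patch scores")
    return [scores[r * side:(r + 1) * side] for r in range(side)]

def best_patch_box(scores, image_size=224):
    """Return (left, top, right, bottom) in pixels for the highest-scoring patch."""
    grid = to_grid(scores)
    side = len(grid)
    cell = image_size / side  # pixel width and height covered by one patch
    best_row, best_col = max(
        ((r, c) for r in range(side) for c in range(side)),
        key=lambda rc: grid[rc[0]][rc[1]],
    )
    left, top = best_col * cell, best_row * cell
    return (left, top, left + cell, top + cell)

# Toy scores: a 14x14 grid with a single hot patch at row 5, column 3.
scores = [0.0] * (14 * 14)
scores[5 * 14 + 3] = 1.0
print(best_patch_box(scores))
```

With real model output, `scores` would be `similarity.tolist()` from the snippet above. The box can then be drawn on the resized image to check that the heatmap peak falls on the described object.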

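✨ Key Features

FG-CLIP is trained in two stages. The first stage uses global-level image-text pairs to achieve an initial fine-grained alignment. The second stage adds region-level captions, including detailed region captions and positive/negative region descriptions, to refine the alignment further.

📚 Documentation

FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal contribution, †Corresponding author)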

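📄 License

This project uses several datasets and checkpoints, each subject to its own original license. Users must comply with all terms and conditions of those original licenses. The content of this project itself is licensed under the Apache License 2.0.

📚 Citation

If you find FG-CLIP helpful for your research or applications, please cite it using the following BibTeX: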

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}