fg-clip-large open-source model: fine-grained image-text alignment for better visual understanding

Fg Clip Large

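Developed by qihoo360

FG-CLIP is a fine-grained vision-text alignment model. It is trained in two stages to align images and text at both the global and the region level, which improves fine-grained visual understanding.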

Multimodal alignment

Transformers

Language: English

License: Apache-2.0

Tags: fine-grained alignment, zero-shot classification, region-level description

Downloads: 538

Released: 4/29/2025

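Model overview

FG-CLIP uses a two-stage training strategy. The first stage uses global-level image-text pairs to reach an initial fine-grained alignment; the second stage adds region-level descriptions to refine that alignment. The model is intended for fine-grained vision-text alignment tasks.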

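Model features

Two-stage training

Training at the global level and then at the region level yields finer vision-text alignment.

Fine-grained alignment

Captures detailed regions in an image and aligns them precisely with text descriptions.

Dense feature visualization

Supports generating similarity heatmaps over image regions, which show directly where the model is focusing.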

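Model capabilities

Fine-grained image classification

Vision-text alignment

Image region feature extraction

Zero-shot image classification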

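Use cases

Image understanding

Fine-grained image classification

Classify images that differ only in subtle ways, such as different breeds of cats and dogs.

Can accurately distinguish visually similar categories.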
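The sketch below shows one way to prepare a zero-shot setup for fine-grained classes. Each class name is turned into a short caption in the same "a photo of a ..." style used by the captions in the quick start on this page, and the predicted class is the one with the highest probability. The helper names `build_prompts` and `pick_label`, the breed names, and the probability values are illustrative assumptions and are not part of FG-CLIP.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn class names into short captions for zero-shot classification."""
    return [template.format(name) for name in class_names]


def pick_label(class_names, probs):
    """Return the (class name, probability) pair with the highest probability."""
    if len(class_names) != len(probs):
        raise ValueError("class_names and probs must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs[best]


# Hypothetical fine-grained classes. The probabilities stand in for the
# softmax output of the model and are made up for illustration.
breeds = ["Siamese cat", "Persian cat", "Maine Coon cat"]
prompts = build_prompts(breeds)
example_probs = [0.15, 0.70, 0.15]
label, score = pick_label(breeds, example_probs)
print(prompts)
print(label, score)
```

In practice the prompts would be tokenized and encoded by the model, and `example_probs` would be replaced by the probabilities it returns for one image.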

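Visual search

Description-based image retrieval

Retrieve relevant images from a text description.

Can interpret fine-grained descriptions and return precisely matching images.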
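Description-based retrieval comes down to ranking image embeddings by their cosine similarity to the embedding of the text query. The sketch below illustrates that ranking step with plain Python lists. The function names, file names, and three-dimensional vectors are assumptions made up for illustration; real embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_images(text_embedding, image_embeddings):
    """Return (image_id, cosine similarity) pairs sorted from best to worst match."""
    query = l2_normalize(text_embedding)
    scored = []
    for image_id, emb in image_embeddings.items():
        unit = l2_normalize(emb)
        scored.append((image_id, sum(q * v for q, v in zip(query, unit))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings, made up for illustration only.
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for an encoded text query
ranking = rank_images(query_embedding, gallery)
print(ranking[0][0])
```

For a large gallery the image embeddings would normally be normalized once and stored, so that each query only costs one matrix product.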

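🚀 FG-CLIP: Fine-Grained Visual and Textual Alignment

FG-CLIP is a project aimed at fine-grained alignment between vision and text. The model is optimized through two-stage training and performs well on tasks such as image retrieval, supporting related research and applications.

🚀 Quick start

Load the model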

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)


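model_root = "qihoo360/fg-clip-large"
image_size = 336  # images are resized to image_size x image_size below

# trust_remote_code=True runs the model code shipped with the checkpoint.
# .cuda() requires a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()

device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)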

Retrieval

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    # L2-normalize so the dot product below is a cosine similarity
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

# Scale the similarities, then take a softmax over the captions
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
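The retrieval snippet multiplies the cosine similarities by `model.logit_scale.exp()` before the softmax. The sketch below illustrates why that matters, using plain Python instead of tensors. The two similarity values and the scale of 100.0 are assumptions chosen for illustration; the actual scale is whatever the checkpoint stores.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed values for illustration: cosine similarities of one image against
# two captions, and a stand-in for model.logit_scale.exp().
cosine_sims = [0.30, 0.22]
assumed_scale = 100.0

unscaled = softmax(cosine_sims)
scaled = softmax([assumed_scale * s for s in cosine_sims])
print(unscaled)
print(scaled)
```

Without the scale the two captions are nearly tied. With it, the small gap in cosine similarity turns into a confident prediction for the first caption.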

Dense feature visualization

import math
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; the figure is written to disk
import matplotlib.pyplot as plt


img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

# One cosine similarity per image patch against the single caption
similarity = dense_image_feature.squeeze() @ text_feature.squeeze()
similarity = similarity.cpu().numpy()

# The patches form a square grid; recover its side length
grid_size = int(math.sqrt(similarity.shape[0]))
show_image = similarity.reshape((grid_size, grid_size))

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
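Each cell of the heatmap corresponds to one image patch, so a cell can be traced back to a pixel region of the resized input. The sketch below maps a flat patch index to its pixel box, assuming a square, row-major grid (matching the reshape above) and an image size that divides evenly by the grid side. `patch_box` and `hottest_patch` are hypothetical helpers, and the toy scores are made up. A grid side of 24 for a 336-pixel input, which would mean 14-pixel patches, is an assumption about this checkpoint rather than something stated on this page.

```python
def patch_box(index, grid_size, image_size):
    """Map a flat patch index to its (left, top, right, bottom) pixel box."""
    if not 0 <= index < grid_size * grid_size:
        raise IndexError("patch index out of range")
    cell = image_size // grid_size  # pixels covered by one side of a patch
    row, col = divmod(index, grid_size)
    return (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)


def hottest_patch(similarity):
    """Return the flat index of the patch with the highest similarity."""
    return max(range(len(similarity)), key=lambda i: similarity[i])


# Toy 2x2 grid over a 28-pixel image; the scores are made up for illustration.
toy_similarity = [0.1, 0.4, 0.9, 0.2]
best = hottest_patch(toy_similarity)
box = patch_box(best, grid_size=2, image_size=28)
print(best, box)
```

The returned box could be passed to `Image.crop` on the resized image to inspect the region the caption matches most strongly.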


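📚 Documentation

Model framework

FG-CLIP is trained in two stages. The first stage uses global-level image-text pairs to reach an initial fine-grained alignment. The second stage adds region-level text descriptions, including detailed region captions and positive/negative region descriptions, to refine the alignment further.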
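Global-level alignment pairs each image with its own caption and pushes mismatched pairs apart. One way to picture that objective is the generic CLIP-style symmetric contrastive loss sketched below. This is an assumption for illustration: this page does not specify FG-CLIP's loss functions, and the region-level descriptions and positive/negative region samples of the second stage are not modeled here. The function names and the toy matrices are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]


def symmetric_contrastive_loss(logits):
    """Average image-to-text and text-to-image cross-entropy.

    logits[i][j] is the scaled similarity between image i and caption j;
    matching pairs sit on the diagonal.
    """
    n = len(logits)
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy scaled similarities for 3 image-caption pairs, made up for illustration.
aligned = [[9.0, 1.0, 0.5], [0.8, 8.5, 1.2], [0.3, 0.9, 9.2]]
shuffled = [[1.0, 9.0, 0.5], [8.5, 0.8, 1.2], [0.3, 0.9, 9.2]]
loss_aligned = symmetric_contrastive_loss(aligned)
loss_shuffled = symmetric_contrastive_loss(shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is low when the largest similarities sit on the diagonal and rises when images score highest against the wrong captions, which is the behavior the first training stage relies on.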


📄 Citation

If you find FG-CLIP useful for your research and applications, please cite it using the following BibTeX entry:

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}

Code: https://github.com/360CVGroup/FG-CLIP