fg-clip-large open-source model: fine-grained image-text alignment for stronger visual understanding

Fg Clip Large

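Developed by qihoo360

FG-CLIP is a fine-grained visual and textual alignment model. Two-stage training aligns images and text at both the global and the region level, which improves fine-grained visual understanding.

Multimodal alignment

Transformers

Language: English | License: Apache-2.0 | #fine-grained-alignment #zero-shot-classification #region-level-description

Downloads: 538

Released: 4/29/2025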

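Model overview

FG-CLIP uses a two-stage training strategy. The first stage uses global-level image-text pairs to reach an initial fine-grained alignment. The second stage adds region-level descriptions to refine that alignment. The model is intended for fine-grained visual and textual alignment tasks.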

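Model features

Two-stage training

Training at the global level and then at the region level yields finer alignment between images and text.

Fine-grained alignment

Captures detailed regions of an image and aligns them precisely with the text that describes them.

Dense feature visualization

Can produce a similarity heatmap over image regions, showing where the model focuses for a given text.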

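Model capabilities

Fine-grained image classification

Visual and textual alignment

Image region feature extraction

Zero-shot image classification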

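Use cases

Image understanding

Fine-grained image classification

Classify images that differ only in subtle ways, for example telling cat or dog breeds apart.

Accurately separates visually similar categories.

Visual search

Description-based image retrieval

Retrieve relevant images from a text description.

Understands fine-grained descriptions and returns closely matching images.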

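🚀 FG-CLIP: Fine-Grained Visual and Textual Alignment

FG-CLIP is a project aimed at fine-grained visual and textual alignment. The model is optimized through two-stage training and performs well on tasks such as image retrieval, supporting related research and applications.

🚀 Quick Start

Load the model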

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)


model_root = "qihoo360/fg-clip-large"
image_size = 336
# trust_remote_code=True runs the model code shipped in the model repository.
# .cuda() requires a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()

device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

Retrieval

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions=["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    # L2-normalize so the dot product below is a cosine similarity
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

# Scale the cosine similarities, then turn them into probabilities over the captions
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
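
To make the scoring step above easier to follow without loading the model, here is a minimal plain-Python sketch of the same computation: L2-normalize the features, take cosine similarities, multiply by a scale, and apply a softmax over the captions. The 3-dimensional feature vectors and the scale of 100.0 are made-up illustrative values, not outputs or parameters of FG-CLIP, and `caption_probabilities` is a hypothetical helper name. Note that the scale here is passed in already exponentiated, whereas the snippet above calls `.exp()` on the model's stored value.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def caption_probabilities(image_feature, text_features, logit_scale):
    """Return one probability per caption for a single image (hypothetical helper)."""
    image_unit = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text_unit = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy features (assumed values for illustration only)
image_feature = [0.9, 0.1, 0.2]
text_features = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]
probs = caption_probabilities(image_feature, text_features, logit_scale=100.0)
print(probs)
```

Because the scale is large, small differences in cosine similarity become sharp differences in probability, which is why the best-matching caption usually receives almost all of the mass.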

Dense feature visualization

import math
import matplotlib
matplotlib.use('Agg') 
import matplotlib.pyplot as plt


img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input,walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)


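# Similarity between every image patch and the caption
similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()
# Number of patches per side; assumes the patch features form a square grid
grid_size = int(math.sqrt(similarity.shape[0]))


# Reshape the flat per-patch scores into a 2-D map for plotting
show_image = similarity.reshape(grid_size, grid_size)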


plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('similarity Visualization')
plt.axis('off')  
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
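
The reshape step above relies on the number of patch scores being a perfect square. As a minimal sketch of that step in plain Python, the example below rebuilds the grid, raises an error if the count is not a perfect square, and reports the patch with the highest similarity. The nine scores are made-up values, and `to_square_grid` and `peak_patch` are hypothetical helper names, not part of FG-CLIP.

```python
import math


def to_square_grid(flat_scores):
    """Reshape a flat list of per-patch scores into a square list of rows."""
    count = len(flat_scores)
    side = math.isqrt(count)
    if side * side != count:
        raise ValueError(f"{count} patch scores do not form a square grid")
    return [flat_scores[r * side:(r + 1) * side] for r in range(side)]


def peak_patch(grid):
    """Return (row, col) of the highest score in the grid."""
    positions = ((r, c) for r in range(len(grid)) for c in range(len(grid[r])))
    return max(positions, key=lambda rc: grid[rc[0]][rc[1]])


# Toy scores for a 3 x 3 patch grid (assumed values for illustration only)
flat = [0.1, 0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.2, 0.1]
grid = to_square_grid(flat)
row, col = peak_patch(grid)
print(grid)
print(row, col)
```

The explicit check is a design choice for the sketch: if the dense features ever included an extra non-patch token, a silent truncating square root would produce a reshape error further downstream that is harder to trace.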

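📚 Documentation

Model framework

FG-CLIP is trained in two stages. The first stage uses global-level image-text pairs to reach an initial fine-grained alignment. The second stage adds region-level text descriptions, including detailed region captions and positive/negative region descriptions, to refine the alignment further.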
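
This page does not state the training objective. As a rough illustration of what "aligning" image-text pairs means in CLIP-style models generally, the sketch below computes a symmetric contrastive loss over a small similarity matrix in which entry (i, j) is the similarity between image i and text j, and matching pairs sit on the diagonal. This is an assumption for illustration, not FG-CLIP's confirmed loss; the temperature of 0.07, the toy matrices, and the function names are all hypothetical.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of one row of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(similarity, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Assumes a square matrix whose diagonal holds the matching pairs.
    """
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    # Image -> text: each row should peak on its diagonal entry
    image_to_text = -sum(log_softmax_row(scaled[i])[i] for i in range(n)) / n
    # Text -> image: each column should peak on its diagonal entry
    columns = [[scaled[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax_row(columns[i])[i] for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy similarity matrices (assumed values for illustration only)
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
print(symmetric_contrastive_loss(aligned))
print(symmetric_contrastive_loss(shuffled))
```

The loss is low when each image is most similar to its own text and high when the pairs are mismatched. In principle the same kind of objective can be applied to whole images with captions or to image regions with region descriptions.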

📄 Citation

If you find FG-CLIP useful for your research or applications, please cite it using the following BibTeX entry:

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}

Code: https://github.com/360CVGroup/FG-CLIP