FG-Clip-Large Open-Source Model - Achieve Fine-Grained Image-Text Alignment and Enhance Visual Comprehension Ability

FG-CLIP Large

Developed by qihoo360

FG-CLIP is a fine-grained vision and text alignment model. Two-stage training gives it both global and region-level image-text alignment, which improves fine-grained visual understanding.

Multimodal Alignment

Transformers

English

Open Source License: Apache-2.0

#Fine-grained alignment #Zero-shot classification #Region-level description

Downloads 538

Release Time: 4/29/2025

Model Overview

FG-CLIP is trained in two stages. The first stage uses global-level image-text pairs to establish an initial fine-grained alignment. The second stage refines that alignment with additional region-level descriptions. The resulting model is suited to fine-grained vision and text alignment tasks.

Model Features

Two-stage training

Achieves more precise vision and text alignment by training in two stages, first at the global level and then at the region level.

Fine-grained alignment

Capable of capturing detailed regions in images and precisely aligning them with text descriptions.

Dense feature visualization

Supports generating similarity heatmaps over image regions, showing which parts of the image the model associates with a text query.

Model Capabilities

Fine-grained image classification

Vision and text alignment

Image region feature extraction

Zero-shot image classification

Use Cases

Image understanding

Fine-grained image classification

Classify images with subtle differences, such as identifying different breeds of cats and dogs.

Able to accurately distinguish visually similar categories.
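As a rough illustration of how zero-shot classification over similar categories works with a CLIP-style model, the sketch below builds one prompt per class name and converts scaled cosine similarities into probabilities. The helper names `build_prompts` and `classify`, the toy embeddings, the breed names, and the scale of 100.0 are assumptions made for illustration. With the real model, the embeddings would come from `get_image_features` and `get_text_features`, and the scale from `model.logit_scale.exp()`, as in the Quick Start.

```python
import math


def build_prompts(class_names, template="a photo of a {}"):
    """Turn class names into one text prompt per class."""
    return [template.format(name) for name in class_names]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, text_embs, scale=100.0):
    """Return one probability per class from scaled cosine similarities."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)


class_names = ["Siamese cat", "Persian cat", "Beagle"]
prompts = build_prompts(class_names)

# Hypothetical toy embeddings standing in for real model outputs.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.1], [0.7, 0.6, 0.1], [0.0, 0.2, 1.0]]

probs = classify(image_emb, text_embs)
best = class_names[probs.index(max(probs))]
print(prompts)
print(best)
```

The larger the scale, the more sharply the softmax favors the closest prompt, which is why small similarity gaps between near-identical categories can still yield a confident prediction.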

Visual search

Description-based image retrieval

Retrieve relevant images based on text descriptions.

Able to understand fine-grained descriptions and return precisely matched images.
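Description-based retrieval can be sketched as ranking a gallery of image embeddings by cosine similarity to the embedding of the text query. The example below is a minimal pure-Python sketch under stated assumptions: the `retrieve` helper, the file names, and the three-dimensional toy embeddings are hypothetical, and in practice the embeddings would be produced by the model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, gallery, top_k=2):
    """Return the names of the top_k gallery images most similar to the query."""
    scored = [(cosine(query_emb, emb), name) for name, emb in gallery.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical precomputed image embeddings keyed by file name.
gallery = {
    "white_cat.jpg": [0.9, 0.1, 0.0],
    "black_dog.jpg": [0.1, 0.9, 0.1],
    "red_car.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text query such as "a white cat".
query_emb = [0.8, 0.2, 0.1]

results = retrieve(query_emb, gallery)
print(results)
```

Because image embeddings do not depend on the query, they can be computed once and reused across searches; only the text query needs to be encoded at search time.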

🚀 FG-CLIP: Fine-Grained Visual and Textual Alignment

FG-CLIP is a model that achieves fine-grained visual and textual alignment through a two-stage training process, enhancing performance in zero-shot image classification.

FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)

✨ Features

FG-CLIP’s training proceeds in two stages: the first stage leverages global-level caption-image pairs to achieve initial fine-grained alignment, while the second stage supplements these with additional region-level captions, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
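To make the role of positive and negative region descriptions more concrete, here is a minimal sketch of a generic contrastive objective in which a region embedding must score its positive caption above a set of negative captions. This is not taken from the FG-CLIP code and may differ from the loss actually used. The function name `hard_negative_loss`, the scale of 100.0, and the toy unit vectors are all assumptions.

```python
import math


def dot(a, b):
    """Dot product of two vectors; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def hard_negative_loss(region_emb, pos_text_emb, neg_text_embs, scale=100.0):
    """Cross-entropy over [positive, negatives...] with the positive as target.

    All embeddings are assumed to be L2-normalized already.
    """
    logits = [scale * dot(region_emb, pos_text_emb)]
    logits += [scale * dot(region_emb, neg) for neg in neg_text_embs]
    # Stable log-sum-exp, then subtract the positive logit.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[0]


region = [1.0, 0.0]    # hypothetical region embedding
positive = [1.0, 0.0]  # caption that matches the region

# A negative that is easy to tell apart from the positive.
easy_loss = hard_negative_loss(region, positive, [[0.0, 1.0]])
# A negative that is indistinguishable from the positive.
hard_loss = hard_negative_loss(region, positive, [[1.0, 0.0]])

print(easy_loss, hard_loss)
```

The loss is near zero when negatives are easy to separate and grows as a negative description becomes as similar to the region as the positive one, which is the pressure that pushes a model toward finer distinctions.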

🚀 Quick Start

Load Model

import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)


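model_root = "qihoo360/fg-clip-large"
image_size = 336

# trust_remote_code=True loads the custom FG-CLIP model code from the model repository.
# .cuda() requires a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()

device = model.device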

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

Retrieval

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)

# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    # L2-normalize so the matrix product below yields cosine similarities
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

    # Scale the similarities by the learned temperature, then softmax over the captions
    logits_per_image = image_feature @ text_feature.T
    logits_per_image = model.logit_scale.exp() * logits_per_image
    probs = logits_per_image.softmax(dim=1)

print(probs)
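The two NOTE comments above pair each caption type with a `max_length` and a `walk_short_pos` value. One way to keep those pairs consistent is a small helper, sketched below. It assumes the long-caption path differs from the short one only in these two settings, as the notes suggest. The names `caption_settings` and `encode_captions` are hypothetical and not part of the FG-CLIP API; `return_tensors="pt"` asks the tokenizer to return PyTorch tensors directly.

```python
def caption_settings(is_long_caption):
    """Return the settings from the short/long caption notes."""
    if is_long_caption:
        return {"max_length": 248, "walk_short_pos": False}
    return {"max_length": 77, "walk_short_pos": True}


def encode_captions(model, tokenizer, captions, is_long_caption, device):
    """Tokenize captions and return text features with matching settings.

    Hypothetical helper: for inference, call it inside torch.no_grad().
    """
    cfg = caption_settings(is_long_caption)
    input_ids = tokenizer(
        captions,
        max_length=cfg["max_length"],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(device)
    return model.get_text_features(input_ids, walk_short_pos=cfg["walk_short_pos"])


print(caption_settings(False))
print(caption_settings(True))
```

With the objects from the Load Model step, a call would look like `encode_captions(model, tokenizer, ["a photo of a cat"], False, device)`.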

Dense feature visualization

import math
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: figures are written to file, not shown
import matplotlib.pyplot as plt


img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size,image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids, dtype=torch.long, device=device)
    text_feature = model.get_text_features(caption_input,walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)


# One cosine similarity per image patch: (num_patches, dim) @ (dim,) -> (num_patches,)
similarity = dense_image_feature.squeeze() @ text_feature.squeeze()
similarity = similarity.cpu().numpy()

# The patches form a square grid; recover its side length and reshape to 2D
grid_size = int(math.sqrt(similarity.shape[0]))
show_image = similarity.reshape(grid_size, grid_size)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
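The saved map has one value per patch, so it is much coarser than the input image. To compare it with the image, one option is to rescale the scores to [0, 1] and upsample the grid. The sketch below does this in pure Python on a toy 2x2 grid. The helper names and the toy scores are assumptions, and nearest-neighbor upsampling is just one simple choice, not necessarily what the authors use.

```python
def minmax_normalize(grid):
    """Scale a 2D list of similarity scores to the [0, 1] range."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span for v in row] for row in grid]


def upsample_nearest(grid, factor):
    """Repeat each cell `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Hypothetical 2x2 patch similarity grid standing in for `show_image`.
patch_scores = [[0.10, 0.30], [0.20, 0.50]]

heatmap = upsample_nearest(minmax_normalize(patch_scores), factor=2)
print(len(heatmap), len(heatmap[0]))
```

With a real map, `factor` would be the image side length divided by the grid side length, and the result could be drawn over the image with a second `plt.imshow(heatmap, alpha=0.5)` call.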

📚 Citation

If you find FG-CLIP useful for your research and applications, please cite using this BibTeX:

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}

Code: https://github.com/360CVGroup/FG-CLIP

📄 License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.
