FG-Clip-Base Open-Source Model: Achieving Fine-Grained Visual and Text Alignment for Precise Image-Text Matching

FG-CLIP Base

Developed by qihoo360

FG-CLIP is a fine-grained visual and text alignment model that achieves global and region-level image-text alignment through two-stage training.

Text-to-Image

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Fine-grained Visual-Text Alignment #Zero-shot Image Classification #Region-level Description

Downloads 692

Release Time: 5/8/2025

Model Overview

FG-CLIP focuses on fine-grained visual and text alignment, achieving more precise image-text matching capabilities through two-stage training.

Model Features

Two-stage Training

The first stage achieves global-level caption-image alignment, while the second stage supplements region-level captions to optimize alignment.

Fine-grained Alignment

Capable of handling fine-grained visual and text alignment tasks, including region-level descriptions.

Dense Feature Extraction

Supports obtaining dense features of images for more detailed visual analysis.

Model Capabilities

Zero-shot Image Classification

Image-Text Matching

Fine-grained Visual Analysis

Dense Feature Extraction

Use Cases

Image Retrieval

Image Classification

Classify images based on text descriptions

Correctly identifies images of cats in examples

Visual Analysis

Region Feature Analysis

Analyze features of specific regions in an image

Can generate region-level similarity heatmaps

🚀 FG-CLIP: Fine-Grained Visual and Textual Alignment

FG-CLIP is a model that achieves fine-grained visual and textual alignment through a two-stage training process, enhancing performance in zero-shot image classification.

🚀 Quick Start

✨ Features

FG-CLIP's training proceeds in two stages: the first stage leverages global-level caption-image pairs to achieve initial fine-grained alignment, while the second stage supplements these with additional region-level captions, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
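The first-stage idea of aligning whole images with whole captions can be illustrated with a generic CLIP-style symmetric contrastive loss. The sketch below is not FG-CLIP's actual training objective, which per the description above also draws on region-level captions and positive/negative region descriptions. The function names, the temperature of 0.07, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss; row i of each matrix is assumed to be a matched pair."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape: (batch, batch)
    # Matched pairs sit on the diagonal; score them in both directions.
    loss_image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    loss_text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return float((loss_image_to_text + loss_text_to_image) / 2)


# Toy data (hypothetical): "image" embeddings that nearly equal their captions,
# versus the same embeddings paired with the wrong captions.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = symmetric_contrastive_loss(text + 0.01 * rng.normal(size=(4, 8)), text)
shuffled = symmetric_contrastive_loss(text[::-1], text)
print(aligned, shuffled)
```

Correctly paired embeddings should give a much lower loss than mismatched ones, which is the signal a contrastive stage optimizes.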

📦 Installation

No dedicated installation steps are provided. The usage code depends on torch, transformers, and Pillow (PIL); the dense-feature example additionally uses matplotlib.

💻 Usage Examples

Basic Usage

# Load Model
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_root = "qihoo360/fg-clip-base"
image_size = 224

# trust_remote_code=True runs the model code shipped in the checkpoint repository.
# .cuda() requires a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)

Advanced Usage - Retrieval

# Retrieval
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors="pt")["pixel_values"].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(
    tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids,
    dtype=torch.long,
    device=device,
)

# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)
    # L2-normalize so the dot product below is a cosine similarity.
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

# Scale the similarities by the learned logit scale, then softmax over the captions.
logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
# [[9.9997e-01, 3.3485e-05]]
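The retrieval snippet above doubles as a zero-shot classifier once each class name is wrapped in a caption template. The following sketch reproduces only the scoring step on toy vectors so that it runs without downloading the model. `build_prompts`, `classify`, the fixed `logit_scale` of 100.0, and the three-dimensional features are assumptions for illustration; in real use the features come from `model.get_image_features` and `model.get_text_features`, and the scale from `model.logit_scale.exp()`.

```python
import numpy as np


def build_prompts(labels, template="a photo of a {}"):
    """Turn class names into captions (hypothetical helper)."""
    return [template.format(label) for label in labels]


def classify(image_feature, text_features, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N captions."""
    image_feature = image_feature / np.linalg.norm(image_feature)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = logit_scale * (text_features @ image_feature)  # shape: (N,)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


labels = ["cat", "dog"]
prompts = build_prompts(labels)

# Toy stand-ins for real model outputs: one row per caption, one image vector.
text_features = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
image_feature = np.array([0.9, 0.1, 0.0])

probs = classify(image_feature, text_features)
best = labels[int(np.argmax(probs))]
print(prompts, best)
```

Because the logit scale multiplies cosine similarities before the softmax, even modest similarity gaps become near one-hot probabilities, which is consistent with the sharply peaked output shown in the retrieval example.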

Advanced Usage - Dense Feature Visualization

import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file without a display
import matplotlib.pyplot as plt

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors="pt")["pixel_values"].to(device)

with torch.no_grad():
    dense_image_feature = model.get_image_dense_features(image_input)
    captions = ["white cat"]
    caption_input = torch.tensor(
        tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids,
        dtype=torch.long,
        device=device,
    )
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

# Cosine similarity between each dense image feature and the caption feature.
# The caption feature is 1-D after squeeze(), so no transpose is needed.
similarity = dense_image_feature.squeeze() @ text_feature.squeeze()
similarity = similarity.cpu().numpy()

# Reshape the flat similarity vector into a square grid for display.
grid_size = int(math.sqrt(similarity.shape[0]))
show_image = similarity.reshape(grid_size, grid_size)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title("Similarity Visualization")
plt.axis("off")
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
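The heatmap saved above has one value per grid cell, so it is much coarser than the input image. To overlay it on the picture, the grid can be normalized and upsampled to the image resolution. The sketch below does this with plain NumPy on a synthetic similarity vector; `upsample_patch_map` is a hypothetical helper, and the 196-element input (a 14x14 grid for a 224-pixel image) is an assumed shape, not a documented property of the model.

```python
import numpy as np


def upsample_patch_map(similarity: np.ndarray, image_size: int) -> np.ndarray:
    """Reshape a flat similarity vector to a square grid, min-max normalize it,
    and upsample to (image_size, image_size) by repeating each cell."""
    grid = int(round(np.sqrt(similarity.shape[0])))
    if grid * grid != similarity.shape[0]:
        raise ValueError("expected a square number of grid cells")
    if image_size % grid != 0:
        raise ValueError("image_size must be a multiple of the grid side")
    scale = image_size // grid
    patch_map = similarity.reshape(grid, grid).astype(np.float64)
    span = patch_map.max() - patch_map.min()
    patch_map = (patch_map - patch_map.min()) / (span if span > 0 else 1.0)
    # The Kronecker product with a block of ones repeats every cell scale x scale times.
    return np.kron(patch_map, np.ones((scale, scale)))


# Synthetic stand-in for the `similarity` array computed above.
toy_similarity = np.arange(196, dtype=np.float64)
heatmap = upsample_patch_map(toy_similarity, image_size=224)
print(heatmap.shape)
```

The result can then be drawn over the resized image, for example with `plt.imshow(image)` followed by `plt.imshow(heatmap, alpha=0.5)`. Cell repetition keeps the grid boundaries visible; a smoother overlay would need an interpolating resize instead.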

📚 Documentation

FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)

📄 License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

📚 Citation

If you find FG-CLIP useful for your research and applications, please cite using this BibTeX:

@article{xie2025fgclip,
      title={FG-CLIP: Fine-Grained Visual and Textual Alignment}, 
      author={Chunyu Xie and Bin Wang and Fanjing Kong and Jincheng Li and Dawei Liang and Gengshen Zhang and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2505.05071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05071}, 
}
