🚀 A fine-tune of Long-CLIP
This project is a fine-tune of Long-CLIP, with BeichenZhang/LongCLIP-L as the original model. It aims to enhance the performance of Long-CLIP in specific scenarios.
✨ Features
- Datasets: Utilizes datasets such as SPRIGHT-T2I/spright_coco.
- Base Model: Built upon the BeichenZhang/LongCLIP-L model.
- Fine-tuning: Offers fine-tuning code with a custom loss and label smoothing, which can bring performance improvements across different datasets.
📦 Installation
No specific installation steps are provided.
💻 Usage Examples
Using Long-CLIP as the Text Encoder with Flux.1, SDXL, or Stable Diffusion
Get the ComfyUI Long-CLIP nodes here: https://github.com/SeaArtLab/ComfyUI-Long-CLIP. If you don't use ComfyUI, the nodes can still serve as a starting point for adapting the integration to your own code.
Loading with HuggingFace Transformers
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
However, loading this way may raise an error, because the checkpoint's 248 position embeddings do not match the 77 tokens that the Transformers CLIP config defines by default. You have two options:
Option 1 (simple & worse)
Truncate to 77 tokens:
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)  # stays at the default 77-token limit
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 👎
Option 2, proper integration (RECOMMENDED)
Solution for a proper integration with 248 tokens (thanks @kk3dmax):
import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

# Raise the text encoder's position limit to 248 before loading the weights
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248

clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# Swap Long-CLIP into an existing diffusers pipeline ("pipe")
pipe.tokenizer = clip_processor.tokenizer
pipe.text_encoder = clip_model.text_model
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
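For reference, here is a minimal, self-contained sketch of how image-text similarity scores like the ones above might be computed with the 248-token configuration; the image path and captions are placeholders, not taken from the original card:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248  # full 248-token text length

model = CLIPModel.from_pretrained(model_id, config=config).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a long, highly detailed scene description that may exceed 77 tokens",
    "a diagram",
]  # placeholder captions

inputs = processor(text=texts, images=image, return_tensors="pt",
                   padding="max_length", max_length=248, truncation=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between the image embedding and each text embedding
sims = F.cosine_similarity(out.image_embeds, out.text_embeds, dim=-1)
print(sims)
```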
📚 Documentation
Update 12/AUG/2024
A new BEST model with a custom loss and label smoothing is introduced. It can bring small gains on diverse, large, high-quality datasets, and significant relative gains for fine-tunes prone to overfitting (e.g., small batch size, 1 GPU, narrow datasets like 'sneakers'). You can fine-tune your own model with the provided GmP-Smooth code: https://github.com/zer0int/Long-CLIP.
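As a rough illustration of the idea (not the repository's exact implementation), a CLIP-style contrastive loss with label smoothing can look like the following sketch; the function name and smoothing value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_smoothing(image_features, text_features, logit_scale, smoothing=0.1):
    # Normalize embeddings and build the image-text similarity matrix
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matched pairs sit on the diagonal; label smoothing softens the one-hot targets,
    # which counteracts overfitting on small batches / narrow datasets.
    targets = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=smoothing)
    loss_t = F.cross_entropy(logits_per_text, targets, label_smoothing=smoothing)
    return 0.5 * (loss_i + loss_t)
```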
Performance
The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81).
Geometric Parametrization (GmP)
The model uses Geometric Parametrization (GmP) to decompose the MLP weights into radial and angular components, preserving the directionality and magnitude of the pre-trained weight vectors; a rough sketch of such a layer follows the diagrams below.
"Normal" CLIP MLP (multi-layer perceptron):
(mlp): Sequential(
  (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  (gelu): QuickGELU()
  (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
)
|
|-- visual.transformer.resblocks.0.mlp.c_fc.weight
|-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|-- visual.transformer.resblocks.0.mlp.c_proj.weight
|-- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude
(mlp): Sequential(
  (c_fc): GeometricLinear()
  (gelu): QuickGELU()
  (c_proj): GeometricLinear()
)
|
|-- visual.transformer.resblocks.0.mlp.c_fc.r
|-- visual.transformer.resblocks.0.mlp.c_fc.theta
|-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|-- visual.transformer.resblocks.0.mlp.c_proj.r
|-- visual.transformer.resblocks.0.mlp.c_proj.theta
|-- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
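A hedged sketch of what such a GeometricLinear layer can look like (the actual implementation lives in https://github.com/zer0int/Long-CLIP and may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer whose weight is stored as a per-row radius 'r' and direction 'theta'."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                        # [out_features, in_features]
        norm = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norm)                   # radial component: norm of pre-trained rows
        self.theta = nn.Parameter(w / norm)           # angular component: normalized direction
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x):
        # Reconstruct the weight as r * theta / ||theta||, so magnitude and direction
        # are optimized as separate parameters during fine-tuning.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

At initialization this reproduces the pre-trained layer exactly, since r * theta / ||theta|| equals the original weight matrix; after training, the decomposed weights can presumably be folded back into standard Linear weights, which would explain why the shared state_dict is usable like any ordinary CLIP state_dict (see Model Usage below).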
Model Usage
The shared model / state_dict can be used in the same manner as any state_dict, for example as the SDXL / SD3 text encoder in ComfyUI via the SeaArtLab/ComfyUI-Long-CLIP custom nodes.
Training and Evaluation Details
For details on training, evaluation numbers, or fine-tuning the model yourself, see: https://github.com/zer0int/Long-CLIP.
🔧 Technical Details
The fine-tuning process uses a custom loss with label smoothing and Geometric Parametrization (GmP) for weight decomposition, which together improve the model's performance and generalization.
📄 License
The pre-trained CLIP model by OpenAI is licensed under the MIT License.
Citation for the original Long-CLIP paper:
@article{zhang2024longclip,
title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
journal={arXiv preprint arXiv:2403.15378},
year={2024}
}
⚠️ Important Note
When loading the model with HuggingFace Transformers, an error may occur because the checkpoint's 248 position embeddings do not match the 77-token default of the Transformers CLIP config. Refer to the options above for solutions.
💡 Usage Tip
If you like this CLIP, you can help feed it if possible. All code for fine-tuning and more is available on my GitHub.