🚀 A fine-tune of Long-CLIP
This project is a fine-tune of Long-CLIP, with BeichenZhang/LongCLIP-L as the original model. It aims to enhance the performance of Long-CLIP in specific scenarios.
✨ Features
- Datasets: Utilizes datasets such as SPRIGHT-T2I/spright_coco.
- Base Model: Built upon the BeichenZhang/LongCLIP-L model.
- Fine-tuning: Offers fine-tuning code with a custom loss and label smoothing, which can bring performance improvements across different datasets.
📦 Installation
No specific installation steps are provided.
💻 Usage Examples
Using Long-CLIP as the Text Encoder with Flux.1, SDXL, or Stable Diffusion
Get the ComfyUI Long-CLIP nodes here: https://github.com/SeaArtLab/ComfyUI-Long-CLIP. If you don't use ComfyUI, the nodes can still serve as a starting point for adapting the integration to your own code.
Loading with HuggingFace Transformers
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
However, loading this way may raise an error, because the checkpoint's 248 position embeddings do not match the 77 tokens that the Transformers CLIP config defines by default. You have two options:
Option 1 (simple & worse)
Truncate to 77 tokens:
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)  # stays at the default 77-token limit
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 👎
Option 2, proper integration (RECOMMENDED)
Solution for a proper integration with 248 tokens (thanks @kk3dmax):
import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

# Raise the text encoder's position limit to 248 before loading the weights
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248

clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# Swap Long-CLIP into an existing diffusers pipeline ("pipe")
pipe.tokenizer = clip_processor.tokenizer
pipe.text_encoder = clip_model.text_model
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
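For reference, here is a minimal, self-contained sketch of how image-text similarity scores like the ones above might be computed with the 248-token configuration; the image path and captions are placeholders, not taken from the original card:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248  # full 248-token text length

model = CLIPModel.from_pretrained(model_id, config=config).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a long, highly detailed scene description that may exceed 77 tokens",
    "a diagram",
]  # placeholder captions

inputs = processor(text=texts, images=image, return_tensors="pt",
                   padding="max_length", max_length=248, truncation=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between the image embedding and each text embedding
sims = F.cosine_similarity(out.image_embeds, out.text_embeds, dim=-1)
print(sims)
```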
📚 Documentation
Update 12/AUG/2024
A new BEST model with a custom loss and label smoothing is introduced. It can bring small gains on diverse, large, high-quality datasets, and significant relative gains for fine-tunes prone to overfitting (e.g., small batch size, 1 GPU, narrow datasets like 'sneakers'). You can fine-tune your own model with the provided GmP-Smooth code: https://github.com/zer0int/Long-CLIP.
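As a rough illustration of the idea (not the repository's exact implementation), a CLIP-style contrastive loss with label smoothing can look like the following sketch; the function name and smoothing value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_smoothing(image_features, text_features, logit_scale, smoothing=0.1):
    # Normalize embeddings and build the image-text similarity matrix
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matched pairs sit on the diagonal; label smoothing softens the one-hot targets,
    # which counteracts overfitting on small batches / narrow datasets.
    targets = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=smoothing)
    loss_t = F.cross_entropy(logits_per_text, targets, label_smoothing=smoothing)
    return 0.5 * (loss_i + loss_t)
```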
Performance
The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81).
Geometric Parametrization (GmP)
The model uses Geometric Parametrization (GmP) to decompose the MLP weights into radial and angular components, preserving the directionality and magnitude of the pre-trained weight vectors; a rough sketch of such a layer follows the diagrams below.
"Normal" CLIP MLP (multi-layer perceptron):
(mlp): Sequential(
  (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  (gelu): QuickGELU()
  (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
)
|
|-- visual.transformer.resblocks.0.mlp.c_fc.weight
|-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|-- visual.transformer.resblocks.0.mlp.c_proj.weight
|-- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude
(mlp): Sequential(
  (c_fc): GeometricLinear()
  (gelu): QuickGELU()
  (c_proj): GeometricLinear()
)
|
|-- visual.transformer.resblocks.0.mlp.c_fc.r
|-- visual.transformer.resblocks.0.mlp.c_fc.theta
|-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|-- visual.transformer.resblocks.0.mlp.c_proj.r
|-- visual.transformer.resblocks.0.mlp.c_proj.theta
|-- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
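A hedged sketch of what such a GeometricLinear layer can look like (the actual implementation lives in https://github.com/zer0int/Long-CLIP and may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer whose weight is stored as a per-row radius 'r' and direction 'theta'."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                        # [out_features, in_features]
        norm = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norm)                   # radial component: norm of pre-trained rows
        self.theta = nn.Parameter(w / norm)           # angular component: normalized direction
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x):
        # Reconstruct the weight as r * theta / ||theta||, so magnitude and direction
        # are optimized as separate parameters during fine-tuning.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

At initialization this reproduces the pre-trained layer exactly, since r * theta / ||theta|| equals the original weight matrix; after training, the decomposed weights can presumably be folded back into standard Linear weights, which would explain why the shared state_dict is usable like any ordinary CLIP state_dict (see Model Usage below).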
Model Usage
The shared model / state_dict can be used in the same manner as any state_dict, for example as the SDXL / SD3 text encoder in ComfyUI via the SeaArtLab/ComfyUI-Long-CLIP custom nodes.
Training and Evaluation Details
For details on training, evaluation numbers, or fine-tuning the model yourself, see: https://github.com/zer0int/Long-CLIP.
🔧 Technical Details
The fine-tuning process uses a custom loss with label smoothing and Geometric Parametrization (GmP) for weight decomposition, which together improve the model's performance and generalization.
📄 License
The pre-trained CLIP model by OpenAI is licensed under the MIT License.
Citation for the original Long-CLIP paper:
@article{zhang2024longclip,
title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
journal={arXiv preprint arXiv:2403.15378},
year={2024}
}
⚠️ Important Note
When loading the model with HuggingFace Transformers, an error may occur because the checkpoint's 248 position embeddings do not match the 77-token default of the Transformers CLIP config. Refer to the options above for solutions.
💡 Usage Tip
If you like this CLIP, you can help feed it if possible. All code for fine-tuning and more is available on my GitHub.