🚀 A fine-tune of CLIP-L
This is a fine-tuned version of CLIP-L; the original model is openai/clip-vit-large-patch14. The fine-tune aims to improve performance on specific downstream tasks.
✨ Features
- High Accuracy: Achieves an ImageNet/ObjectNet accuracy of ~0.90, compared to the original pre-trained model's ~0.85.
- Multiple Versions: Available as a text-encoder-only .safetensors, a full-model .safetensors, a state_dict pickle, and a full-model pickle.
- New Loss Function: Implements a custom loss function with label smoothing for better fine-tuning (see the sketch below).
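The exact loss used for this fine-tune is defined in the author's GitHub code. Purely as an illustration of the idea, here is a minimal sketch of CLIP's standard symmetric contrastive loss using PyTorch's built-in label_smoothing option; the smoothing value and function name are assumptions, not the author's implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale, label_smoothing=0.1):
    """Symmetric CLIP loss with label smoothing (illustrative sketch only).

    image_embeds, text_embeds: (batch, dim) L2-normalized embeddings.
    logit_scale: learned temperature, already exponentiated.
    """
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=label_smoothing)
    loss_t = F.cross_entropy(logits_per_text, targets, label_smoothing=label_smoothing)
    return (loss_i + loss_t) / 2
```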
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
Basic Usage
from transformers import CLIPModel, CLIPProcessor, CLIPConfig

model_id = "zer0int/CLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)          # full CLIP (vision + text encoders)
processor = CLIPProcessor.from_pretrained(model_id)  # tokenizer + image preprocessing
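Once loaded, the model behaves like any other Hugging Face CLIP model. The snippet below is a minimal, hypothetical zero-shot example; the image path and candidate captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Hypothetical inputs: replace with your own image and candidate captions.
image = Image.open("example.jpg")
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text similarities; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```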
Advanced Usage
If you want to create your own fine-tune, refer to the code on GitHub. You can use "exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py" to replicate the exact model fine-tune.
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | A fine-tuned version of OpenAI CLIP ViT-L/14 |
| Training Data | SPRIGHT-T2I/spright_coco |
Updates
Update 23/SEP/2024
- Hugging Face Transformers / Diffusers pipeline is now implemented.
- See Integrating my CLIP-L with Flux.1 for an example script.
- Otherwise, use it as a normal HF model as shown in the basic usage example above.
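The actual integration script is on the author's GitHub. As a hedged sketch of the idea only, the following swaps this model in as the CLIP-L text encoder of a Diffusers Flux pipeline; the base checkpoint id, dtype, and prompt are assumptions, and recent diffusers/transformers versions are required.

```python
import torch
from transformers import CLIPTextModel
from diffusers import FluxPipeline

# Swap in this fine-tune as the CLIP-L text encoder (Flux's second, T5 encoder is untouched).
text_encoder = CLIPTextModel.from_pretrained(
    "zer0int/CLIP-GmP-ViT-L-14", torch_dtype=torch.bfloat16
)

# Assumed base checkpoint; any Flux.1 pipeline with a CLIP-L text_encoder slot should work.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a sign that says 'hello world'", num_inference_steps=28).images[0]
image.save("out.png")
```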
Update 03/SEP/2024 / edit 05/AUG
If you are looking for a text encoder for Flux.1 (or SD3, SDXL, SD, ...) to replace CLIP-L:
- The "TEXT" model has superior prompt following, especially for text, but also for other details. DOWNLOAD
- The "SMOOTH" model can sometimes have better details (when there is no text in the image). DOWNLOAD
- The "GmP" initial fine-tune is deprecated / inferior to the above models, but you can still DOWNLOAD it.
Update 11/AUG/2024
A new best-performing CLIP ViT-L/14 'GmP-smooth' model has been added. You can simply download the files named BEST, or create a fine-tune yourself by following the steps on GitHub.
Model Performance
- The TEXT model has a modality gap of 0.80 (OpenAI pre-trained: 0.82). It is trained with a high temperature of 0.1 + tinkering.
- ImageNet/ObjectNet accuracy is ~0.91 for both the "SMOOTH" and "TEXT" models (pre-trained: ~0.84).
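The card does not define how the modality gap is measured. A common operationalization, assumed here purely for illustration, is the Euclidean distance between the centroids of L2-normalized image and text embeddings over a paired evaluation set:

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
    """Distance between the centroids of normalized image and text embeddings.

    image_embeds, text_embeds: (N, dim) embeddings for N paired image-text examples.
    This is one common definition of the modality gap, assumed here for illustration.
    """
    img_centroid = F.normalize(image_embeds, dim=-1).mean(dim=0)
    txt_centroid = F.normalize(text_embeds, dim=-1).mean(dim=0)
    return (img_centroid - txt_centroid).norm().item()
```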
🔧 Technical Details
Geometric Parametrization (GmP)
"Normal" CLIP MLP (multi - layer perceptron):
(mlp): Sequential(
|-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
| (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude
(mlp): Sequential(
|-(c_fc): GeometricLinear()
| (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
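The author's GeometricLinear implementation lives in the GitHub repository. The following is only a hedged PyTorch sketch of the decomposition described above; the class name mirrors the listing, but the implementation details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch of Geometric Parametrization: W is stored as r (row norms) and theta
    (row directions) and recomposed as W = r * theta / ||theta|| on each forward pass.
    Illustrative only; see the author's GitHub code for the actual implementation."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                        # (out_features, in_features), pre-trained weights
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms.clone())          # radial component 'r'
        self.theta = nn.Parameter(w / norms)          # angular component 'theta'
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-normalize theta so only its direction matters, then rescale rows by r.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)

# Example: wrap one MLP projection of a loaded CLIP block, e.g.
# block.mlp.c_fc = GeometricLinear(block.mlp.c_fc)
```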
📄 License
The pre-trained CLIP model by OpenAI is under the MIT License.