🚀 A fine-tune of CLIP-L
This is a fine-tuned version of CLIP-L; the original model is openai/clip-vit-large-patch14. The fine-tune aims to improve performance on specific downstream tasks.
✨ Features
- High Accuracy: Achieves an ImageNet/ObjectNet accuracy of ~0.90, compared to the original pre-trained model's ~0.85.
- Multiple Versions: Available as a text-encoder-only .safetensors, a full-model .safetensors, a state_dict pickle, and a full-model pickle.
- New Loss Function: Implements a custom loss function with label smoothing for better fine-tuning (see the sketch below).
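The exact loss used for this fine-tune is defined in the author's GitHub code. Purely as an illustration of the idea, here is a minimal sketch of CLIP's standard symmetric contrastive loss using PyTorch's built-in label_smoothing option; the smoothing value and function name are assumptions, not the author's implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale, label_smoothing=0.1):
    """Symmetric CLIP loss with label smoothing (illustrative sketch only).

    image_embeds, text_embeds: (batch, dim) L2-normalized embeddings.
    logit_scale: learned temperature, already exponentiated.
    """
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=label_smoothing)
    loss_t = F.cross_entropy(logits_per_text, targets, label_smoothing=label_smoothing)
    return (loss_i + loss_t) / 2
```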
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
Basic Usage
from transformers import CLIPModel, CLIPProcessor, CLIPConfig

model_id = "zer0int/CLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)          # full CLIP (vision + text encoders)
processor = CLIPProcessor.from_pretrained(model_id)  # tokenizer + image preprocessing
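Once loaded, the model behaves like any other Hugging Face CLIP model. The snippet below is a minimal, hypothetical zero-shot example; the image path and candidate captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Hypothetical inputs: replace with your own image and candidate captions.
image = Image.open("example.jpg")
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text similarities; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```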
Advanced Usage
If you want to create your own fine-tune, refer to the code on GitHub. You can use "exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py" to replicate the exact model fine-tune.
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | A fine-tuned version of OpenAI CLIP ViT-L/14 |
| Training Data | SPRIGHT-T2I/spright_coco |
Updates
Update 23/SEP/2024
- Hugging Face Transformers / Diffusers pipeline is now implemented.
- See Integrating my CLIP-L with Flux.1 for an example script.
- Otherwise, use it as a normal HF model as shown in the basic usage example above.
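The actual integration script is on the author's GitHub. As a hedged sketch of the idea only, the following swaps this model in as the CLIP-L text encoder of a Diffusers Flux pipeline; the base checkpoint id, dtype, and prompt are assumptions, and recent diffusers/transformers versions are required.

```python
import torch
from transformers import CLIPTextModel
from diffusers import FluxPipeline

# Swap in this fine-tune as the CLIP-L text encoder (Flux's second, T5 encoder is untouched).
text_encoder = CLIPTextModel.from_pretrained(
    "zer0int/CLIP-GmP-ViT-L-14", torch_dtype=torch.bfloat16
)

# Assumed base checkpoint; any Flux.1 pipeline with a CLIP-L text_encoder slot should work.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a sign that says 'hello world'", num_inference_steps=28).images[0]
image.save("out.png")
```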
Update 03/SEP/2024 / edit 05/AUG
If you are looking for a text encoder for Flux.1 (or SD3, SDXL, SD, ...) to replace CLIP-L:
- The "TEXT" model has superior prompt following, especially for text, but also for other details. DOWNLOAD
- The "SMOOTH" model can sometimes have better details (when there is no text in the image). DOWNLOAD
- The "GmP" initial fine-tune is deprecated / inferior to the above models, but you can still DOWNLOAD it.
Update 11/AUG/2024
A new best-performing CLIP ViT-L/14 'GmP-smooth' model has been added. You can simply download the files named BEST, or create a fine-tune yourself by following the steps on GitHub.
Model Performance
- The TEXT model has a modality gap of 0.80 (OpenAI pre-trained: 0.82). It is trained with a high temperature of 0.1 + tinkering.
- ImageNet/ObjectNet accuracy is ~0.91 for both the "SMOOTH" and "TEXT" models (pre-trained: ~0.84).
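The card does not define how the modality gap is measured. A common operationalization, assumed here purely for illustration, is the Euclidean distance between the centroids of L2-normalized image and text embeddings over a paired evaluation set:

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
    """Distance between the centroids of normalized image and text embeddings.

    image_embeds, text_embeds: (N, dim) embeddings for N paired image-text examples.
    This is one common definition of the modality gap, assumed here for illustration.
    """
    img_centroid = F.normalize(image_embeds, dim=-1).mean(dim=0)
    txt_centroid = F.normalize(text_embeds, dim=-1).mean(dim=0)
    return (img_centroid - txt_centroid).norm().item()
```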
🔧 Technical Details
Geometric Parametrization (GmP)
"Normal" CLIP MLP (multi - layer perceptron):
(mlp): Sequential(
|-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
| (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude
(mlp): Sequential(
|-(c_fc): GeometricLinear()
| (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
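The author's GeometricLinear implementation lives in the GitHub repository. The following is only a hedged PyTorch sketch of the decomposition described above; the class name mirrors the listing, but the implementation details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch of Geometric Parametrization: W is stored as r (row norms) and theta
    (row directions) and recomposed as W = r * theta / ||theta|| on each forward pass.
    Illustrative only; see the author's GitHub code for the actual implementation."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                        # (out_features, in_features), pre-trained weights
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms.clone())          # radial component 'r'
        self.theta = nn.Parameter(w / norms)          # angular component 'theta'
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-normalize theta so only its direction matters, then rescale rows by r.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)

# Example: wrap one MLP projection of a loaded CLIP block, e.g.
# block.mlp.c_fc = GeometricLinear(block.mlp.c_fc)
```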
📄 License
The pre-trained CLIP model by OpenAI is under the MIT License.