🚀 CLIP ViT-L/14 finetune: SAE-informed adversarial training
This project fine-tunes the CLIP ViT-L/14 model with SAE-informed adversarial training, improving zero-shot image classification accuracy over the original OpenAI model.
✨ Features
- Model Definition:
  - SAE stands for Sparse Autoencoder.
  - The model uses `openai/clip-vit-large-patch14` as the base model.
  - It is designed for the `zero-shot-image-classification` pipeline and uses the `transformers` library.
- Performance Comparison:
  - Accuracy on ImageNet/ObjectNet: the SAE-trained model (this project) reaches 89%, better than the OpenAI pre-trained model's 84.5% but slightly below the 91% of [my GmP model](https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14).
- Usage and Applications:
  - It pairs well with tools like Flux.1: download the [Text-Encoder (TE) only version](https://huggingface.co/zer0int/CLIP-SAE-ViT-L-14/resolve/main/ViT-L-14-GmP-SAE-TE-only.safetensors?download=true) and try it out.
  - This SAE CLIP achieves the best linear-probe results on [LAION-AI/CLIP_benchmark](https://github.com/LAION-AI/CLIP_benchmark).
  - It is also the best CLIP to use for HunyuanVideo, but it requires the [zer0int/ComfyUI-HunyuanVideo-Nyan](https://github.com/zer0int/ComfyUI-HunyuanVideo-Nyan) node.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
No code examples are provided in the original document.
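The following is therefore a minimal, hypothetical sketch of how the checkpoint could be used with the `zero-shot-image-classification` pipeline mentioned in the Features section; the image path and label set are placeholders, not part of the original card.

```python
# Minimal zero-shot classification sketch (assumes: pip install transformers torch pillow).
# The repo id comes from this model card; labels and image path are illustrative only.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="zer0int/CLIP-SAE-ViT-L-14",
)

image = Image.open("bwcat_cat.png")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a text"]

for result in classifier(image, candidate_labels=labels):
    print(f"{result['label']}: {result['score']:.3f}")
```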
📚 Documentation
Dataset
- The model is trained on the following datasets:
  - [zer0int/CLIP-adversarial-typographic-attack_text-image](https://huggingface.co/datasets/zer0int/CLIP-adversarial-typographic-attack_text-image)
  - [SPRIGHT-T2I/spright_coco](https://huggingface.co/datasets/SPRIGHT-T2I/spright_coco)
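As an illustrative, unverified sketch, both datasets should be loadable with the Hugging Face `datasets` library (split names and streaming behavior are assumptions, not taken from the dataset cards):

```python
# Hypothetical sketch: inspect the training datasets via the `datasets` library.
from datasets import load_dataset

adv = load_dataset(
    "zer0int/CLIP-adversarial-typographic-attack_text-image",
    split="train",
    streaming=True,
)
coco = load_dataset("SPRIGHT-T2I/spright_coco", split="train", streaming=True)

print(next(iter(adv)))   # first adversarial typographic-attack sample
print(next(iter(coco)))  # first SPRIGHT-COCO sample
```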
Model Download
- You can directly download the Text-Encoder-only (TE) safetensors file from [here](https://huggingface.co/zer0int/CLIP-SAE-ViT-L-14/resolve/main/ViT-L-14-GmP-SAE-TE-only.safetensors?download=true).
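Alternatively (a sketch, not taken from the original card), the same file can be fetched programmatically with `huggingface_hub`:

```python
# Hypothetical sketch: fetch the TE-only safetensors file via huggingface_hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="zer0int/CLIP-SAE-ViT-L-14",
    filename="ViT-L-14-GmP-SAE-TE-only.safetensors",
)
print(path)  # local cache path of the downloaded checkpoint
```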
Visual Demonstration
- You can view a video demonstration [here](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/g0vO1N4JalPp8oIAq5v38.mp4).
- There are also some related images:
  - [Image 1](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/m6Qty30oeS7A8cDYvLWme.png)
  - [Image 2](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/CN7xMe5ZPfLVWST-RF6Qn.png)
  - [Image 3](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/_Bp8DoxgkOjhau5EnShtW.png)
Adversarial Robustness Experiment
- You can right-click and download individual images for adversarial robustness experiments:
  - [Image 1](https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/refs/heads/CLIP-vision/bwcat_cat.png)
  - [Image 2](https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/refs/heads/CLIP-vision/bwcat_dog.png)
  - [Image 3](https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/refs/heads/CLIP-vision/bwcat_notext.png)
- Then upload each image into the zero-shot classification widget (hopefully available soon on the right here ->) and try labels like "a photo of a cat", "a photo of a dog", "a photo of a text". You can also repeat the experiment with [my GmP models](https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14) to compare the results; a local reproduction sketch follows below.
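If the hosted widget is unavailable, the experiment can be reproduced locally. The sketch below is an assumption for convenience, not part of the original instructions; it downloads one of the test images and scores the three suggested labels:

```python
# Hypothetical local reproduction of the adversarial-robustness check.
from io import BytesIO

import requests
from PIL import Image
from transformers import pipeline

URL = ("https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/"
       "refs/heads/CLIP-vision/bwcat_cat.png")
image = Image.open(BytesIO(requests.get(URL, timeout=30).content))

classifier = pipeline("zero-shot-image-classification", model="zer0int/CLIP-SAE-ViT-L-14")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a text"]

# To compare, swap in "zer0int/CLIP-GmP-ViT-L-14" as the model and rerun.
for result in classifier(image, candidate_labels=labels):
    print(f"{result['label']}: {result['score']:.3f}")
```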
Other Information
- All training info & code can be found at [github.com/zer0int/CLIP-SAE-finetune](https://github.com/zer0int/CLIP-SAE-finetune).
- If you like this project, you can [Buy me a coffee](https://ko-fi.com/zer0int).
🔧 Technical Details
No specific technical details are provided in the original document.
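For readers unfamiliar with the "SAE" part of the name: a sparse autoencoder is a small auxiliary model trained to reconstruct activations through an overcomplete, mostly-inactive latent layer. The sketch below is a generic PyTorch illustration of that idea only; it is an assumption for exposition, not the training code used for this checkpoint (see the GitHub repository above for the actual code).

```python
# Generic sparse-autoencoder sketch (illustrative only; not this project's training code).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 1024, d_hidden: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(latents)          # reconstruction of the input activations
        return recon, latents

sae = SparseAutoencoder()
acts = torch.randn(32, 1024)                   # stand-in for CLIP activations
recon, latents = sae(acts)
l1_weight = 1e-3
loss = nn.functional.mse_loss(recon, acts) + l1_weight * latents.abs().mean()
loss.backward()
```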
📄 License
This project is licensed under the MIT license.