🚀 CLIP ViT-L/14 finetune: SAE-informed adversarial training
This project fine-tunes the CLIP ViT-L/14 model with SAE-informed adversarial training, improving zero-shot image classification accuracy over the original OpenAI model.
✨ Features
- Model Definition:
  - SAE stands for Sparse Autoencoder.
  - The model uses `openai/clip-vit-large-patch14` as the base model.
  - It is designed for the `zero-shot-image-classification` pipeline and uses the `transformers` library.
- Performance Comparison:
  - Accuracy on ImageNet/ObjectNet: the SAE-trained model (this project) reaches 89%, better than the OpenAI pre-trained model's 84.5% but slightly below the 91% of [my GmP model](https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14).
- Usage and Applications:
  - It pairs well with tools like Flux.1: download the [Text-Encoder (TE) only version](https://huggingface.co/zer0int/CLIP-SAE-ViT-L-14/resolve/main/ViT-L-14-GmP-SAE-TE-only.safetensors?download=true) and try it out.
  - This SAE CLIP achieves the best linear-probe results on [LAION-AI/CLIP_benchmark](https://github.com/LAION-AI/CLIP_benchmark).
  - It is also the best CLIP to use for HunyuanVideo, but it requires the [zer0int/ComfyUI-HunyuanVideo-Nyan](https://github.com/zer0int/ComfyUI-HunyuanVideo-Nyan) node.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
No code examples are provided in the original document.
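The following is therefore a minimal, hypothetical sketch of how the checkpoint could be used with the `zero-shot-image-classification` pipeline mentioned in the Features section; the image path and label set are placeholders, not part of the original card.

```python
# Minimal zero-shot classification sketch (assumes: pip install transformers torch pillow).
# The repo id comes from this model card; labels and image path are illustrative only.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="zer0int/CLIP-SAE-ViT-L-14",
)

image = Image.open("bwcat_cat.png")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a text"]

for result in classifier(image, candidate_labels=labels):
    print(f"{result['label']}: {result['score']:.3f}")
```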
📚 Documentation
Dataset
- The model is trained on the following datasets:
  - [zer0int/CLIP-adversarial-typographic-attack_text-image](https://huggingface.co/datasets/zer0int/CLIP-adversarial-typographic-attack_text-image)
  - [SPRIGHT-T2I/spright_coco](https://huggingface.co/datasets/SPRIGHT-T2I/spright_coco)
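As an illustrative, unverified sketch, both datasets should be loadable with the Hugging Face `datasets` library (split names and streaming behavior are assumptions, not taken from the dataset cards):

```python
# Hypothetical sketch: inspect the training datasets via the `datasets` library.
from datasets import load_dataset

adv = load_dataset(
    "zer0int/CLIP-adversarial-typographic-attack_text-image",
    split="train",
    streaming=True,
)
coco = load_dataset("SPRIGHT-T2I/spright_coco", split="train", streaming=True)

print(next(iter(adv)))   # first adversarial typographic-attack sample
print(next(iter(coco)))  # first SPRIGHT-COCO sample
```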
Model Download
- You can directly download the Text-Encoder-only (TE) safetensors file from [here](https://huggingface.co/zer0int/CLIP-SAE-ViT-L-14/resolve/main/ViT-L-14-GmP-SAE-TE-only.safetensors?download=true).
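Alternatively (a sketch, not taken from the original card), the same file can be fetched programmatically with `huggingface_hub`:

```python
# Hypothetical sketch: fetch the TE-only safetensors file via huggingface_hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="zer0int/CLIP-SAE-ViT-L-14",
    filename="ViT-L-14-GmP-SAE-TE-only.safetensors",
)
print(path)  # local cache path of the downloaded checkpoint
```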
Visual Demonstration
- You can view a video demonstration [here](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/g0vO1N4JalPp8oIAq5v38.mp4).
- There are also some related images:
  - [Image 1](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/m6Qty30oeS7A8cDYvLWme.png)
  - [Image 2](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/CN7xMe5ZPfLVWST-RF6Qn.png)
  - [Image 3](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/_Bp8DoxgkOjhau5EnShtW.png)
Adversarial Robustness Experiment
- You can right-click and download individual images for adversarial robustness experiments:
  - [Image 1](https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/refs/heads/CLIP-vision/bwcat_cat.png)
  - [Image 2](https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/refs/heads/CLIP-vision/bwcat_dog.png)
  - [Image 3](https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/refs/heads/CLIP-vision/bwcat_notext.png)
- Then upload each image into the zero-shot classification widget (hopefully available soon on the right here ->) and try labels like "a photo of a cat", "a photo of a dog", "a photo of a text". You can also repeat the experiment with [my GmP models](https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14) to compare the results; a local reproduction sketch follows below.
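If the hosted widget is unavailable, the experiment can be reproduced locally. The sketch below is an assumption for convenience, not part of the original instructions; it downloads one of the test images and scores the three suggested labels:

```python
# Hypothetical local reproduction of the adversarial-robustness check.
from io import BytesIO

import requests
from PIL import Image
from transformers import pipeline

URL = ("https://raw.githubusercontent.com/zer0int/CLIP-SAE-finetune/"
       "refs/heads/CLIP-vision/bwcat_cat.png")
image = Image.open(BytesIO(requests.get(URL, timeout=30).content))

classifier = pipeline("zero-shot-image-classification", model="zer0int/CLIP-SAE-ViT-L-14")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a text"]

# To compare, swap in "zer0int/CLIP-GmP-ViT-L-14" as the model and rerun.
for result in classifier(image, candidate_labels=labels):
    print(f"{result['label']}: {result['score']:.3f}")
```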
Other Information
- All training info & code can be found at [github.com/zer0int/CLIP-SAE-finetune](https://github.com/zer0int/CLIP-SAE-finetune).
- If you like this project, you can [Buy me a coffee](https://ko-fi.com/zer0int).
🔧 Technical Details
No specific technical details are provided in the original document.
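For readers unfamiliar with the "SAE" part of the name: a sparse autoencoder is a small auxiliary model trained to reconstruct activations through an overcomplete, mostly-inactive latent layer. The sketch below is a generic PyTorch illustration of that idea only; it is an assumption for exposition, not the training code used for this checkpoint (see the GitHub repository above for the actual code).

```python
# Generic sparse-autoencoder sketch (illustrative only; not this project's training code).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 1024, d_hidden: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(latents)          # reconstruction of the input activations
        return recon, latents

sae = SparseAutoencoder()
acts = torch.randn(32, 1024)                   # stand-in for CLIP activations
recon, latents = sae(acts)
l1_weight = 1e-3
loss = nn.functional.mse_loss(recon, acts) + l1_weight * latents.abs().mean()
loss.backward()
```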
📄 License
This project is licensed under the MIT license.