Longclip SAE ViT L 14
A Long-CLIP model fine-tuned with a Sparse Autoencoder (SAE), supporting long text inputs and optimized for text-image alignment
Release Time: 12/19/2024
Model Overview
This model is a fine-tuned version of Long-CLIP ViT-L/14, enhanced with sparse autoencoder (SAE) training for improved handling of long text prompts, and is particularly suited for use with Tencent's HunyuanVideo system.
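As a rough starting point, the checkpoint can be loaded like any CLIP model in Hugging Face transformers, assuming it is distributed in the standard CLIP format with extended position embeddings. The path below is a placeholder, not a confirmed repository id; substitute the actual location of the checkpoint.

```python
# Minimal loading sketch (assumptions: checkpoint in Hugging Face CLIP format,
# placeholder path; replace with the real repo id or local directory).
from transformers import CLIPModel, CLIPProcessor

MODEL_PATH = "path/to/LongCLIP-SAE-ViT-L-14"  # placeholder, not a confirmed repo id

model = CLIPModel.from_pretrained(MODEL_PATH).eval()
processor = CLIPProcessor.from_pretrained(MODEL_PATH)
```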
Model Features
Long-text support
Breaks the original CLIP's 77-token limit, allowing substantially longer text inputs to be encoded (see the encoding sketch after this list)
Sparse Autoencoder fine-tuning
Refines the model's representations through sparse autoencoder (SAE) fine-tuning, improving text-image alignment
Tencent Hunyuan Video compatibility
Specifically optimized for seamless integration with the HunyuanVideo system
Adversarial training
Trained on adversarial typographic attack datasets for enhanced robustness
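To illustrate the long-text feature, the sketch below encodes a prompt well beyond 77 tokens. It reuses the model and processor from the loading sketch above; the 248-token maximum used here is the context length reported for Long-CLIP and is an assumption about this particular checkpoint.

```python
# Encoding a long prompt (assumes model/processor from the loading sketch and a
# 248-token context length, which may differ for this checkpoint).
import torch

long_prompt = (
    "A rainy neon-lit street at night, reflections on wet asphalt, a lone cyclist "
    "in a yellow raincoat, shallow depth of field, anamorphic lens flare, "
    "cinematic colour grading, 35mm film grain, slow shutter motion blur, "
    "steam rising from a food stall, distant thunder, handheld camera sway"
)

inputs = processor(
    text=[long_prompt],
    return_tensors="pt",
    padding="max_length",
    max_length=248,       # assumed extended context length
    truncation=True,
)
with torch.no_grad():
    text_embeds = model.get_text_features(**inputs)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(text_embeds.shape)  # (1, 768) for a ViT-L/14 projection head
```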
Model Capabilities
Long-text guided image generation
Zero-shot image classification (see the sketch after this list)
Cross-modal retrieval
Text-image alignment
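For zero-shot classification, the standard CLIP recipe applies: embed the image and a set of candidate captions, then compare similarities. The sketch below reuses the model and processor from the loading example; the image path and label set are placeholders.

```python
# Zero-shot classification sketch (placeholder image path and labels; reuses
# the model/processor loaded earlier).
import torch
from PIL import Image

image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```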
Use Cases
Creative content generation
Complex scene image generation
Generates images that follow long text prompts containing many distinct details
Can handle complex scene descriptions of up to 69 tokens
Atypical concept visualization
Transforms abstract or unconventional concepts into visual representations
Maintains strong consistency and prompt-following capability (a CLIP-score-style check is sketched after these use cases)
Film production assistance
Storyboard design
Generates visual references based on detailed technical descriptions
Accurately understands cinematographic parameters and artistic styles
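One way to quantify prompt following and text-image alignment in these use cases is a CLIP-score-style check: embed the long prompt and a generated frame with this model and compare their cosine similarity. This is a sketch under the same assumptions as the earlier examples; the file name and the 248-token limit are placeholders.

```python
# CLIP-score-style alignment check (assumptions: model/processor from the
# loading sketch, placeholder file name, 248-token context length).
import torch
from PIL import Image

prompt = ("A slow dolly-in on a rain-soaked alley at dusk, teal-and-orange grade, "
          "35mm anamorphic, shallow focus on a flickering neon sign")
frame = Image.open("generated_frame.png")  # placeholder output frame

inputs = processor(
    text=[prompt], images=frame, return_tensors="pt",
    padding="max_length", max_length=248, truncation=True,
)
with torch.no_grad():
    out = model(**inputs)

text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
score = (text_emb @ image_emb.T).item()
print(f"text-image alignment score: {score:.3f}")  # higher means closer alignment
```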