Longclip SAE ViT L 14

Developed by zer0int
A Long-CLIP model fine-tuned with a sparse autoencoder (SAE), supporting long-text input and optimized for text-image alignment.
Downloads 290
Release Time: 12/19/2024

Model Overview

This model is a fine-tuned version of Long-CLIP ViT-L/14. The fine-tuning uses a sparse autoencoder (SAE) to improve the handling of long text prompts, and the model is particularly suited to use with the Tencent HunyuanVideo system.

Model Features

Long-text support
Extends the text input beyond the original CLIP's 77-token limit (Long-CLIP accepts up to 248 tokens), so longer prompts are encoded without truncation
Sparse Autoencoder fine-tuning
Uses SAE-based fine-tuning to improve the model's learned representations and its text-image alignment
Tencent Hunyuan Video compatibility
Optimized for integration with the HunyuanVideo system
Adversarial training
Trained on adversarial typographic attack datasets for enhanced robustness
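
To make the long-text point concrete, here is a minimal sketch of what a context window does to a prompt. It is not the model's tokenizer: the token ids are placeholders, the function name fit_to_context is hypothetical, and special start/end tokens are ignored for simplicity. The 248-token figure is the context length reported for Long-CLIP.

```python
CLIP_CONTEXT = 77         # context length of the original CLIP text encoder
LONG_CLIP_CONTEXT = 248   # context length reported for Long-CLIP


def fit_to_context(token_ids, context_length, pad_id=0):
    """Truncate or right-pad token ids to exactly context_length entries.

    Returns the fitted list and the number of tokens that were cut off.
    Hypothetical helper for illustration; real tokenizers also handle
    start/end tokens, which this sketch leaves out.
    """
    kept = token_ids[:context_length]
    dropped = len(token_ids) - len(kept)
    padded = kept + [pad_id] * (context_length - len(kept))
    return padded, dropped


# Stand-in for a tokenized 120-token prompt (the ids are placeholders).
prompt_ids = list(range(1, 121))

short_ids, short_dropped = fit_to_context(prompt_ids, CLIP_CONTEXT)
long_ids, long_dropped = fit_to_context(prompt_ids, LONG_CLIP_CONTEXT)

print(f"77-token window drops {short_dropped} tokens")
print(f"248-token window drops {long_dropped} tokens")
```

With the shorter window, everything past position 77 never reaches the text encoder, which is why details late in a long prompt tend to be lost with a standard CLIP encoder.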

Model Capabilities

Long-text guided image generation
Zero-shot image classification
Cross-modal retrieval
Text-image alignment
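
Zero-shot classification and cross-modal retrieval with CLIP-style models both reduce to comparing embeddings by cosine similarity. The sketch below shows that scoring step in plain Python. The three-dimensional vectors are made-up stand-ins for real image and text embeddings, the function names are hypothetical, and the scale of 100 is an assumption based on the logit scale commonly used by CLIP models.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label."""
    sims = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(scale * (s - top)) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings (assumed values, not produced by the model).
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best)
```

Retrieval uses the same scores in the other direction: embed one text query, compute its cosine similarity against many image embeddings, and sort the images by score.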

Use Cases

Creative content generation
Complex scene image generation
Generates corresponding images from long-text prompts containing multiple details
Can process complex scene descriptions up to 69 tokens
Atypical concept visualization
Transforms abstract or unconventional concepts into visual representations
Maintains excellent consistency and prompt-following capability
Film production assistance
Storyboard design
Generates visual references based on detailed technical descriptions
Accurately understands cinematographic parameters and artistic styles