CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K
A vision-language model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval
Downloads: 534
Release date: April 26, 2023
Model Overview
This model is a variant of the CLIP family: it pairs a Vision Transformer (ViT) image encoder with a contrastive image-text training objective, learns the semantic relationships between images and text, and is suited to zero-shot image classification and cross-modal retrieval tasks.
Model Features
Zero-shot learning capability
Performs image classification on unseen categories without any task-specific fine-tuning (see the sketch after this list)
Cross-modal understanding
Capable of processing and understanding semantic relationships between images and text simultaneously
Large-scale pretraining
Pretrained on CLIP-score-filtered CommonPool.XL data; the "s13B" in the model name denotes roughly 13B training samples seen, and "b90K" a global batch size of about 90K
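As a minimal sketch of zero-shot classification, the example below uses the open_clip library. The Hugging Face hub ID, the image path, and the label set are assumptions for illustration; verify the exact identifier against the model repository.

```python
import torch
import open_clip
from PIL import Image

# Assumed hub ID based on the model name above; verify before use.
MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K"

model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Unseen categories, phrased as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image
text = tokenizer(labels)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    # L2-normalize so the dot product is cosine similarity.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```

Because the labels are just tokenized text prompts, swapping in new categories requires no retraining, which is what the zero-shot claim above refers to.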
Model Capabilities
Zero-shot image classification
Image-text matching
Cross-modal retrieval
Multimodal feature extraction (see the sketch below)
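To illustrate multimodal feature extraction, the sketch below (reusing the model, preprocess, and tokenizer loaded in the previous example) maps images and texts into one shared embedding space; retrieval, matching, and clustering all operate on these vectors.

```python
import torch
from PIL import Image

def embed_images(paths, model, preprocess):
    """Return L2-normalized image embeddings, one row per image path."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_texts(texts, model, tokenizer):
    """Return L2-normalized text embeddings, one row per string."""
    with torch.no_grad():
        feats = model.encode_text(tokenizer(texts))
    return feats / feats.norm(dim=-1, keepdim=True)

# Since both encoders project into the same space, one matrix multiply
# yields all pairwise image-text cosine similarities:
# sims = embed_images(paths, ...) @ embed_texts(queries, ...).T
```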
Use Cases
Content moderation
Inappropriate content detection
Flag inappropriate images by comparing them against text descriptions of prohibited content
Can identify many types of inappropriate content; accuracy depends on the specific application scenario
E-commerce
Visual search
Retrieve relevant product images from text queries
Improves product search relevance and user experience (see the search sketch below)
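A hedged sketch of how such a text-to-image product search could be built on the embedding helpers above; the catalog paths and query string are hypothetical, and a real deployment would cache the catalog embeddings.

```python
# Hypothetical product catalog; embed once and reuse.
catalog_paths = ["shoe_001.jpg", "shoe_002.jpg", "bag_001.jpg"]
catalog_emb = embed_images(catalog_paths, model, preprocess)   # (N, D)

def search(query, k=2):
    """Rank catalog images by cosine similarity to a text query."""
    query_emb = embed_texts([query], model, tokenizer)         # (1, D)
    scores = (catalog_emb @ query_emb.T).squeeze(1)            # (N,)
    top = scores.topk(min(k, len(catalog_paths)))
    return [(catalog_paths[int(i)], float(s))
            for s, i in zip(top.values, top.indices)]

print(search("red running shoes"))
```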
Media analysis
Caption scoring and ranking
Score and rank candidate text descriptions for an image; as a contrastive model, CLIP matches captions to images rather than generating free-form text
Selects the most semantically relevant description from a candidate set (see the sketch below)
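Since CLIP scores image-text pairs rather than generating text, a captioning pipeline typically pairs it with a generator and uses CLIP to pick the best candidate, as in this sketch (candidates and image path are illustrative; the helpers come from the feature-extraction example above).

```python
# Candidate captions, e.g. produced by a separate captioning model.
candidates = [
    "a dog playing fetch in a park",
    "a city skyline at night",
    "a plate of pasta on a table",
]

img = embed_images(["photo.jpg"], model, preprocess)  # hypothetical image
cap = embed_texts(candidates, model, tokenizer)
best = int((img @ cap.T).argmax())
print("best caption:", candidates[best])
```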