CLIP-ViT-L-14-CommonPool.XL-s13B-b90K
A vision-language model pretrained with the CLIP architecture, supporting zero-shot image classification and cross-modal image-text retrieval
Downloads 4,255
Release Time: 4/26/2023
Model Overview
This model is a variant of the CLIP series that uses ViT-L/14 as its visual encoder and was trained on the CommonPool.XL dataset, giving it strong cross-modal understanding. Following the usual OpenCLIP naming convention, s13B in the model name denotes roughly 13 billion training samples seen and b90K a global batch size of about 90,000.
Model Features
Zero-shot learning capability
Can perform image classification tasks without task-specific fine-tuning
Cross-modal understanding
Capable of understanding semantic relationships between images and text
Large-scale pretraining
Trained on the CommonPool.XL dataset (13B samples) with extensive knowledge coverage
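Zero-shot classification with a CLIP-style model boils down to embedding the image and a set of candidate label prompts into a shared space, then picking the label whose text embedding is most similar to the image embedding. The sketch below shows only that similarity-and-softmax step with toy NumPy vectors standing in for real encoder outputs; in practice the embeddings would come from this model's image and text encoders (e.g. loaded via the open_clip library), and the logit scale of 100 mirrors CLIP's learned temperature.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding best matches the image.

    image_emb / text_embs are placeholders for the model's encoder
    outputs; only the CLIP scoring step is implemented here.
    """
    img = image_emb / np.linalg.norm(image_emb)                    # L2-normalize image
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)                                   # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                           # softmax over labels
    return labels[int(np.argmax(probs))], probs

# Toy embeddings (hypothetical data, not real CLIP features).
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
image_emb = text_embs[1] + 0.05 * rng.normal(size=8)               # image closest to label 1
label, probs = zero_shot_classify(image_emb, text_embs, ["cat", "dog", "car"])
```

Because no fine-tuning is involved, changing the task is as simple as changing the list of label prompts.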
Model Capabilities
Zero-shot image classification
Image-text matching
Cross-modal retrieval
Multimodal feature extraction
Use Cases
Content retrieval
Text-based image search
Retrieve relevant images using natural language queries
Matches image content to free-text descriptions with high accuracy
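Text-based image search with this kind of model is typically implemented by pre-computing image embeddings for the whole collection, embedding the query text once, and ranking images by cosine similarity. The following sketch uses toy NumPy vectors in place of real encoder outputs to show the ranking step.

```python
import numpy as np

def retrieve(query_emb, image_embs, k=2):
    """Return indices of the k images most similar to the text query.

    query_emb / image_embs stand in for the model's text and image
    encoder outputs (toy vectors below, not real CLIP features).
    """
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                        # cosine similarity per image
    return np.argsort(-sims)[:k]           # best-first indices

rng = np.random.default_rng(1)
image_embs = rng.normal(size=(5, 8))                    # 5-image "collection"
query_emb = image_embs[3] + 0.05 * rng.normal(size=8)   # query closest to image 3
top = retrieve(query_emb, image_embs, k=2)
```

Since image embeddings are query-independent, they can be indexed once and reused for every search.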
Automatic tagging
Automatic image tagging
Generate descriptive labels for images
Can produce semantic labels relevant to image content
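One simple way to turn the model's image-text similarity scores into tags is to compare the image embedding against a fixed vocabulary of tag embeddings and keep every tag above a similarity threshold. The vectors and the 0.5 threshold below are illustrative choices, not values from the model.

```python
import numpy as np

def tag_image(image_emb, tag_embs, tags, threshold=0.5):
    """Return all tags whose cosine similarity to the image exceeds threshold.

    Toy vectors below are placeholders for real encoder outputs.
    """
    img = image_emb / np.linalg.norm(image_emb)
    t = tag_embs / np.linalg.norm(tag_embs, axis=1, keepdims=True)
    sims = t @ img
    return [tag for tag, s in zip(tags, sims) if s > threshold]

# Toy embeddings where "beach" and "sunset" match the image, "snow" does not.
beach = np.array([1.0, 0.0, 0.0, 0.0])
sunset = np.array([0.8, 0.6, 0.0, 0.0])
snow = np.array([0.0, 0.0, 1.0, 0.0])
image_emb = beach + sunset                 # image contains both concepts
tag_embs = np.stack([beach, sunset, snow])
labels = tag_image(image_emb, tag_embs, ["beach", "sunset", "snow"])
```

Unlike single-label classification, thresholding lets one image receive several tags at once.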
© 2025 AIbase