CLIP-ViT-B-32-laion2B-e16
A vision-language pretrained model implemented with OpenCLIP, supporting zero-shot image classification tasks
Release Time: 5/17/2023
Model Overview
This model is an implementation of the CLIP architecture, pairing a Vision Transformer (ViT) image encoder with a text encoder. It learns the correlation between images and text, making it suitable for cross-modal tasks such as zero-shot image classification.
Model Features
Zero-shot learning capability
Can perform image classification tasks without task-specific fine-tuning
Cross-modal understanding
Capable of processing and understanding both visual and textual information
Large-scale pretraining
Pretrained on the LAION-2B dataset, giving it strong generalization capabilities
Model Capabilities
Zero-shot image classification
Image-text matching
Cross-modal retrieval
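All three capabilities above rest on the same scoring step: the image and each candidate text are embedded into a shared space, and cosine similarity between the embeddings ranks the matches. A minimal NumPy sketch of that step, using random placeholder vectors in place of the model's actual encoder outputs (the 512-dimensional size and temperature value are illustrative assumptions):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    # L2-normalize so dot products equal cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * text_embs @ image_emb
    # softmax over the candidate labels
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# toy embeddings standing in for the encoders' outputs
rng = np.random.default_rng(0)
image = rng.normal(size=512)
texts = rng.normal(size=(3, 512))  # three candidate labels
probs = zero_shot_scores(image, texts)
print(probs)  # probability distribution over the 3 candidates
```

For zero-shot classification the highest-probability label is returned; for retrieval the same similarities rank a gallery of images or texts instead.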
Use Cases
Content moderation
Inappropriate content detection
Automatically identify potentially inappropriate content in images
E-commerce
Product categorization
Automatically classify product images based on descriptions
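In practice, category names are usually wrapped in natural-language prompt templates before being fed to the text encoder, since the model was trained on caption-like text. A small sketch of that preprocessing step (the template wording and example labels are assumptions for illustration, not prescribed by the model card):

```python
def build_prompts(labels, template="a photo of a {}"):
    # one caption-style prompt per candidate product category
    return [template.format(label) for label in labels]

prompts = build_prompts(["sneaker", "backpack", "wristwatch"])
print(prompts[0])  # -> "a photo of a sneaker"
```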
Media analysis
Image captioning
Generate descriptive labels for images