CLIP ViT-B-32 DataComp.M s128M b4K
A vision-language model based on the CLIP architecture, trained on the DataComp.M dataset and suited to zero-shot image classification tasks
Release Date: 4/26/2023
Model Overview
This is a vision-language model pretrained using the CLIP architecture. It learns the correlation between images and text, which makes it particularly suitable for zero-shot image classification tasks.
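As a minimal sketch of zero-shot classification with this checkpoint, class names are written as short captions and the image is assigned to the caption with the highest similarity. The `open_clip` usage and the Hugging Face hub id `laion/CLIP-ViT-B-32-DataComp.M-s128M-b4K` are assumptions based on the model name above; verify them against the actual release.

```python
# Zero-shot classification sketch (assumed hub id and example file name).
import torch
import open_clip
from PIL import Image

hub_id = "hf-hub:laion/CLIP-ViT-B-32-DataComp.M-s128M-b4K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # (1, 3, 224, 224)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product equals cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```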
Model Features
Zero-shot Learning Capability
Can perform image classification tasks without task-specific fine-tuning
Multimodal Understanding
Simultaneously understands visual and textual information, establishing cross-modal associations
Efficient Architecture
Based on the ViT-B/32 vision transformer architecture, balancing performance and efficiency
Model Capabilities
Zero-shot Image Classification
Image-Text Matching
Cross-modal Retrieval
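The image-text matching and cross-modal retrieval capabilities listed above can be sketched in the same way: embed a small image gallery once, then rank the images against a free-form text query by cosine similarity. File names, the query, and the hub id are illustrative assumptions.

```python
# Text-to-image retrieval sketch over a small, hypothetical image gallery.
import torch
import open_clip
from PIL import Image

hub_id = "hf-hub:laion/CLIP-ViT-B-32-DataComp.M-s128M-b4K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

image_paths = ["beach.jpg", "office.jpg", "forest.jpg"]      # hypothetical files
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

with torch.no_grad():
    image_emb = model.encode_image(images)
    image_emb /= image_emb.norm(dim=-1, keepdim=True)

    query = tokenizer(["people working at desks with laptops"])
    text_emb = model.encode_text(query)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)

    scores = (text_emb @ image_emb.T).squeeze(0)             # one score per image

for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```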
Use Cases
Content Management
Automatic Image Tagging
Automatically generates descriptive tags for unlabeled images
Improves content management efficiency and reduces manual labeling costs
E-commerce
Product Categorization
Classifies product images based on natural language descriptions
Enables categorization of newly listed products without any labeled training data
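A hedged sketch of this product-categorization workflow follows, using the same assumed open_clip setup as above; the category names, prompt template, and file name are illustrative only.

```python
# Zero-shot product categorization sketch: wrap category names in a prompt
# template and assign the highest-scoring category (all names are assumptions).
import torch
import open_clip
from PIL import Image

hub_id = "hf-hub:laion/CLIP-ViT-B-32-DataComp.M-s128M-b4K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

categories = ["running shoes", "wrist watch", "backpack", "coffee maker"]
prompts = [f"a product photo of a {c}" for c in categories]

image = preprocess(Image.open("listing_photo.jpg")).unsqueeze(0)
text = tokenizer(prompts)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img /= img.norm(dim=-1, keepdim=True)
    txt /= txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)[0]

best = probs.argmax().item()
print(f"predicted category: {categories[best]} ({probs[best]:.2f})")
```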