CLIP ViT-B-32 CommonPool.M s128M b4K
Zero-shot image classification model based on CLIP architecture, supporting general vision-language tasks
Downloads: 79
Release Date: April 26, 2023
Model Overview
This model is part of the OpenCLIP project. It uses the ViT-B-32 architecture and was trained with contrastive learning on the CommonPool.M dataset to produce a joint representation of images and text, making it suitable for tasks such as zero-shot image classification and cross-modal retrieval.
Model Features
Zero-shot Learning Capability
Can be applied directly to recognizing new categories, without task-specific fine-tuning
Cross-modal Understanding
Processes visual and textual inputs jointly, enabling image-text matching
Large-scale Pretraining
Trained on 128M samples (s128M) with a batch size of 4K (b4K), as encoded in the model name, giving strong generalization
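The zero-shot classification this model performs reduces to comparing a normalized image embedding against normalized text embeddings of candidate class prompts. A minimal sketch of that scoring step is below; the 512-dimensional vectors and the 100x logit scale follow common CLIP conventions, and the random vectors are placeholders for real encoder outputs, so treat this as illustrative rather than as the model's exact API.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Score one image embedding against N class-prompt embeddings.

    Returns a probability distribution over the N candidate classes.
    """
    # L2-normalize, as CLIP does before computing similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)          # scaled cosine similarities
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Dummy embeddings stand in for real encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))     # e.g. 3 candidate class prompts
probs = zero_shot_classify(image_emb, text_embs)
```

The predicted class is simply `probs.argmax()`; with real embeddings, the text prompts are usually templated strings such as "a photo of a {label}".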
Model Capabilities
Zero-shot Image Classification
Cross-modal Retrieval
Image-Text Matching
Multimodal Feature Extraction
Use Cases
Content Moderation
Inappropriate Content Detection
Detect inappropriate image content via text descriptions
E-commerce
Product Image Search
Match product images using natural language queries
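The retrieval use cases above follow the same pattern: embed the text query once, then rank a catalog of precomputed image embeddings by cosine similarity. A minimal sketch, again using random vectors as placeholders for real CLIP encoder outputs:

```python
import numpy as np

def rank_images(query_emb, image_embs, top_k=3):
    """Rank catalog image embeddings against one text-query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                       # cosine similarity per image
    order = np.argsort(-sims)[:top_k]     # indices of best matches first
    return order, sims[order]

# Dummy embeddings stand in for real encoder outputs.
rng = np.random.default_rng(1)
query_emb = rng.normal(size=512)          # e.g. "red leather handbag"
catalog = rng.normal(size=(10, 512))      # 10 product images
order, scores = rank_images(query_emb, catalog)
```

Because image embeddings can be computed offline and stored, only the short text query needs encoding at search time, which is what makes this practical for e-commerce search and moderation pipelines.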