CLIP-ViT-B-32-DataComp.S-s13M-b4K
A zero-shot image classification model based on the CLIP architecture, trained on the DataComp.S dataset pool (13M samples seen, batch size 4K), supporting a range of vision tasks.
Downloads: 92
Release date: April 26, 2023
Model Overview
This model is a vision-language model based on the CLIP architecture, capable of performing zero-shot image classification and cross-modal retrieval tasks.
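Zero-shot classification in CLIP works by embedding the image and a set of candidate text prompts into a shared space, then taking a softmax over the scaled cosine similarities. A minimal NumPy sketch of that scoring step, using made-up embedding vectors in place of real encoder outputs:

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """Softmax over cosine similarities between one image embedding
    and a stack of text-prompt embeddings (one prompt per row)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)    # scaled cosine similarities
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for encoder outputs (hypothetical values)
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],   # prompt: "a photo of a dog"
    [0.0, 1.0, 0.0],   # prompt: "a photo of a cat"
])
probs = zero_shot_scores(image_emb, text_embs)
print(probs.argmax())  # index of the best-matching prompt
```

In the real model the embeddings come from the ViT-B/32 image encoder and the text transformer; only the similarity-plus-softmax step above is shown here.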
Model Features
Zero-shot Learning Capability
Can perform new vision tasks without task-specific fine-tuning
Cross-modal Understanding
Capable of understanding the relationship between images and text
Efficient Visual Encoding
Uses Vision Transformer architecture for efficient image processing
Model Capabilities
Zero-shot Image Classification
Image-Text Matching
Cross-modal Retrieval
Visual Feature Extraction
Use Cases
Content Retrieval
Text-based Image Search
Retrieve relevant images using natural language descriptions
High-precision cross-modal retrieval performance
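Text-based image search ranks a gallery of precomputed image embeddings by cosine similarity to the query's text embedding. A self-contained sketch with toy vectors (a real system would obtain these from the CLIP encoders):

```python
import numpy as np

def search_images(query_emb, gallery_embs, top_k=3):
    """Return indices of the top_k gallery images most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery image
    return np.argsort(-sims)[:top_k]  # highest-similarity indices first

# Hypothetical 4-image gallery and a text-query embedding
gallery = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.7],
    [0.0, 1.0],
])
query = np.array([1.0, 0.1])
print(search_images(query, gallery, top_k=2))
```

Because gallery embeddings can be computed once and cached, each query costs only one text-encoder pass plus a dot product against the index.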
Automatic Tagging
Automatic Image Tagging
Generate descriptive labels for unlabeled images
Reduces manual labeling workload
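Automatic tagging can be framed as multi-label zero-shot classification: score the image against one prompt per candidate tag and keep every tag whose similarity clears a threshold. A toy sketch (the threshold and embeddings are illustrative, not tuned values):

```python
import numpy as np

def auto_tag(image_emb, tag_embs, tag_names, threshold=0.5):
    """Return every tag whose cosine similarity to the image exceeds threshold."""
    img = image_emb / np.linalg.norm(image_emb)
    tags = tag_embs / np.linalg.norm(tag_embs, axis=1, keepdims=True)
    sims = tags @ img
    return [name for name, s in zip(tag_names, sims) if s > threshold]

# Hypothetical embeddings for a photo and three candidate tags
image_emb = np.array([0.8, 0.6, 0.0])
tag_embs = np.array([
    [1.0, 0.0, 0.0],   # prompt for tag "outdoor"
    [0.0, 1.0, 0.0],   # prompt for tag "people"
    [0.0, 0.0, 1.0],   # prompt for tag "food"
])
print(auto_tag(image_emb, tag_embs, ["outdoor", "people", "food"]))
```

Unlike the single-label softmax case, tags are scored independently, so an image can receive several labels or none.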