CLIP-ViT-B-32-CommonPool.M.text-s128M-b4K
A vision-language model based on the CLIP architecture, supporting zero-shot image classification tasks
Release Time: 4/26/2023
Model Overview
This model is a variant of the CLIP architecture that pairs a Vision Transformer (ViT) image encoder with a Transformer text encoder, training both to map related images and captions to nearby points in a shared embedding space. This makes it suitable for cross-modal retrieval and zero-shot classification. Per the DataComp naming convention, the name indicates pretraining on the CommonPool.M dataset with text-based filtering, for 128M samples seen at a batch size of 4,096.
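For illustration, below is a minimal zero-shot classification sketch using the open_clip library. The hf-hub model id follows this model's name; the image path and label set are hypothetical placeholders.

```python
import torch
import open_clip
from PIL import Image

MODEL_ID = "hf-hub:laion/CLIP-ViT-B-32-CommonPool.M.text-s128M-b4K"

# Load model, preprocessing transform, and tokenizer from the Hugging Face Hub
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image file
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # illustrative labels
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so that dot products become cosine similarities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```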
Model Features
Zero-shot Learning Capability
Can classify images into new categories without task-specific fine-tuning
Cross-modal Understanding
Encodes both images and text into a shared embedding space, allowing the two modalities to be compared directly
Efficient Architecture
Vision encoder based on ViT-B/32 (a base-size Vision Transformer with 32×32-pixel patches), balancing accuracy and computational cost
Model Capabilities
Image Classification
Cross-modal Retrieval (see the retrieval sketch after this list)
Zero-shot Learning
Image-Text Matching
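Cross-modal retrieval can be sketched as ranking a set of candidate images against a free-text query. The sketch below assumes `model`, `preprocess`, and `tokenizer` were loaded as in the earlier example; the image paths and query are hypothetical.

```python
import torch
from PIL import Image

paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # hypothetical image files
images = torch.stack([preprocess(Image.open(p)) for p in paths])
query = tokenizer(["a sunny beach with palm trees"])  # illustrative query

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity of the query against every candidate image
    scores = (text_features @ image_features.T).squeeze(0)

# Print candidates from best to worst match
for idx in scores.argsort(descending=True).tolist():
    print(f"{paths[idx]}: {scores[idx]:.3f}")
```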
Use Cases
Content Retrieval
Text-based Image Search
Search for relevant images using natural language descriptions
Automatic Tagging
Image Auto-tagging
Generate descriptive labels for images by scoring them against a candidate tag vocabulary, as sketched below
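A minimal auto-tagging sketch, again assuming `model`, `preprocess`, and `tokenizer` from the loading example: the image is scored against a candidate tag vocabulary and the top matches are kept. The tag list and file name are illustrative assumptions.

```python
import torch
from PIL import Image

tags = ["dog", "cat", "beach", "mountain", "food", "car", "portrait", "sunset"]
prompts = tokenizer([f"a photo of a {t}" for t in tags])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(0)

# Keep the three highest-scoring tags as labels
top = scores.topk(3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tags[idx]}: {score:.3f}")
```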