CLIP-ViT-B-32-CommonPool.M.laion-s128M-b4K
A vision-language model based on the CLIP architecture that supports zero-shot image classification.
Release date: 4/26/2023
Model Overview
This model is a variant of the CLIP architecture that pairs a Vision Transformer (ViT-B/32) image encoder with a Transformer text encoder, enabling image classification without any task-specific training.
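A minimal sketch of how CLIP-style zero-shot classification works: each candidate label is embedded as text, the image embedding is compared to every label embedding by cosine similarity, and a softmax over the scaled similarities yields label probabilities. The embeddings below are random stand-ins; in practice they would come from this model's image and text encoders (e.g. loaded via the open_clip library), and the 100.0 scale stands in for CLIP's learned logit scale.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot scoring: L2-normalize image and label
    embeddings, take cosine similarities, softmax over labels."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * (txt @ img)          # 100.0 mimics CLIP's logit scale
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Stand-in 512-d embeddings (ViT-B/32 CLIP uses a 512-d joint space);
# real embeddings would come from the model's encoders.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 512))    # e.g. "a dog", "a cat", "a car"
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # image close to label 1

probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # index of the best-matching label
```

Because label sets are just lists of text prompts, new classes can be added at inference time without retraining.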
Model Features
Zero-shot Learning Capability
Can perform image classification tasks without task-specific fine-tuning
Multimodal Understanding
Simultaneously understands visual and textual information, establishing cross-modal associations
Large-scale Pretraining
Pretrained on the LAION-filtered medium-scale CommonPool subset (CommonPool.M.laion); the s128M and b4K in the model name denote roughly 128M training samples seen and a batch size of 4K.
Model Capabilities
Zero-shot Image Classification
Image-Text Matching
Cross-modal Retrieval
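The retrieval capability above reduces to nearest-neighbor search in the shared embedding space: embed the text query once, then rank a gallery of precomputed image embeddings by cosine similarity. The gallery and query below are random stand-ins for embeddings the model's encoders would produce.

```python
import numpy as np

def retrieve(query_emb, image_embs, k=2):
    """Rank a gallery of image embeddings against a text query
    embedding by cosine similarity; return top-k gallery indices."""
    q = query_emb / np.linalg.norm(query_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per image
    return np.argsort(sims)[::-1][:k]

# Stand-in embeddings; real ones would come from the model's encoders.
rng = np.random.default_rng(1)
gallery = rng.normal(size=(5, 512))              # 5 indexed images
query = gallery[3] + 0.05 * rng.normal(size=512)  # query resembles image 3

top = retrieve(query, gallery, k=2)
print(top[0])  # image 3 should rank first
```

For large galleries the same dot-product ranking is typically served by an approximate nearest-neighbor index rather than a full scan.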
Use Cases
Content Management
Automatic Image Tagging
Automatically generates descriptive labels for unlabeled images
E-commerce
Product Image Search
Searches for relevant product images using natural language queries