
CLIP-ViT-B-16-laion2B-s34B-b88K

Developed by LAION
A multimodal vision-language model trained with the OpenCLIP framework on the English subset of LAION-2B, supporting zero-shot image classification and other cross-modal tasks
Downloads 251.02k
Release Date: 1/3/2023

Model Overview

This CLIP model adopts the ViT-B/16 architecture and learns a joint representation of images and text through contrastive training, making it applicable to cross-modal tasks such as zero-shot image classification and image-text retrieval

Model Features

Large-scale training data
Trained on a 2-billion-sample English subset of LAION-5B (LAION-2B), covering a wide range of visual concepts
Zero-shot learning capability
Can be directly applied to new category recognition tasks without fine-tuning
Cross-modal alignment
Aligns image and text features in a shared representation space through contrastive learning
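The cross-modal alignment described above can be sketched with toy numbers: images and texts map to the same unit-normalized embedding space, where cosine similarity (a dot product of unit vectors) scores how well a caption matches an image. The three-dimensional vectors below are made-up stand-ins for illustration; the real model produces 512-dimensional embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_sim(a, b):
    """Dot product of two unit vectors equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Hand-made stand-in embeddings, not real model outputs.
image_emb = normalize([0.9, 0.1, 0.2])       # an image of a cat
text_emb_cat = normalize([0.8, 0.2, 0.1])    # "a photo of a cat"
text_emb_car = normalize([0.1, 0.9, 0.3])    # "a photo of a car"

sims = {"cat": cosine_sim(image_emb, text_emb_cat),
        "car": cosine_sim(image_emb, text_emb_car)}
best = max(sims, key=sims.get)  # the caption closest to the image
```

Because both modalities live in one space, the same similarity function serves classification, retrieval, and ranking.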

Model Capabilities

Zero-shot image classification
Image-text similarity calculation
Cross-modal retrieval
Image feature extraction
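The zero-shot classification capability above follows a simple recipe: label names are wrapped in text prompts, prompts and the image are embedded, and the scaled similarities are turned into probabilities with a softmax. A minimal sketch, using hand-made similarity scores rather than real model outputs:

```python
import math

def softmax(scores, logit_scale=100.0):
    """CLIP multiplies cosine similarities by a learned logit scale (~100)."""
    exps = [math.exp(s * logit_scale) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog"]
# The standard CLIP prompt template; richer templates often help accuracy.
prompts = [f"a photo of a {label}" for label in labels]

# Stand-in cosine similarities between the image and each prompt embedding.
sims = [0.31, 0.24]

probs = softmax(sims)
prediction = labels[probs.index(max(probs))]
```

New categories only require new prompt strings, which is why no fine-tuning is needed.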

Use Cases

Computer vision
Open-domain image classification
Classifies images directly from natural language descriptions, without a predefined category system
Achieves 70.2% zero-shot top-1 accuracy on ImageNet-1k
Information retrieval
Cross-modal image-text retrieval
Enables bidirectional retrieval from text to image or image to text
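The bidirectional retrieval above reduces to reading one image-text similarity matrix in both directions: an argmax down a column retrieves the best image for a caption, and an argmax across a row retrieves the best caption for an image. A minimal sketch with made-up scores:

```python
# sim[i][j]: stand-in similarity between image i and caption j.
sim = [
    [0.9, 0.1, 0.3],   # image 0
    [0.2, 0.8, 0.1],   # image 1
    [0.1, 0.2, 0.7],   # image 2
]

def text_to_image(sim, j):
    """Index of the image best matching caption j (argmax over a column)."""
    col = [row[j] for row in sim]
    return col.index(max(col))

def image_to_text(sim, i):
    """Index of the caption best matching image i (argmax over a row)."""
    return sim[i].index(max(sim[i]))
```

In practice the matrix is computed once from normalized embeddings, so both retrieval directions come at no extra encoding cost.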