
CLIP-ViT-H-14-laion2B-s32B-b79K

Developed by ModelsLab
A vision-language model built with the OpenCLIP framework and trained on the English LAION-2B subset; it excels at zero-shot image classification and cross-modal retrieval.
Downloads 132
Release Time: 1/16/2025

Model Overview

The model adopts the CLIP architecture, mapping images and text into a shared embedding space through contrastive learning, supporting tasks such as zero-shot image classification and image-text retrieval.
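As a sketch of how this works in practice, the snippet below loads the checkpoint with the open_clip library (published under the open_clip names "ViT-H-14" / "laion2b_s32b_b79k") and scores an image against a few candidate text labels; the image path and the label prompts are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load the pretrained model, its image preprocessing pipeline, and tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

# Placeholder image and candidate labels for zero-shot classification.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    # Map image and text into the shared embedding space and L2-normalize.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities turned into class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because the class set is defined purely by the text prompts, swapping in new labels requires no retraining.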

Model Features

Large-scale training data
Trained on LAION-2B, the roughly 2-billion-sample English subset of LAION-5B, covering a wide range of visual concepts
Zero-shot capability
Can perform image classification tasks for new categories without fine-tuning
Cross-modal understanding
Simultaneously understands images and text, supporting image-text matching and retrieval

Model Capabilities

Zero-shot image classification
Image-text retrieval (see the retrieval sketch after this list)
Cross-modal embedding learning
Image content understanding
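A minimal text-to-image retrieval sketch using the same checkpoint follows; the gallery file names and the query are placeholders, and a real system would precompute and index the gallery embeddings rather than encode them per query.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

# Placeholder image gallery to search over.
gallery_paths = ["beach.jpg", "city.jpg", "forest.jpg"]

with torch.no_grad():
    # Encode and normalize all gallery images.
    gallery = torch.stack([preprocess(Image.open(p)) for p in gallery_paths])
    image_emb = model.encode_image(gallery)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Encode the text query into the same embedding space.
    query = tokenizer(["a sandy beach at sunset"])
    text_emb = model.encode_text(query)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Rank gallery images by cosine similarity to the query.
    scores = (text_emb @ image_emb.T).squeeze(0)
    ranking = scores.argsort(descending=True)

for idx in ranking.tolist():
    print(f"{gallery_paths[idx]}: {scores[idx]:.3f}")
```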

Use Cases

Computer Vision
Zero-shot image classification
Classify images without training data
Achieves 78.0% zero-shot top-1 accuracy on ImageNet-1k
Image retrieval
Retrieve relevant images based on text queries
Performs well on COCO and Flickr retrieval benchmarks
Research Applications
Multimodal research
Used for studying vision-language representation learning
Model fine-tuning foundation
Serves as a pretrained model for downstream tasks
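One common way to use the checkpoint as a pretrained backbone is a linear probe: freeze the image encoder and train only a small classification head on its embeddings. The sketch below assumes a hypothetical downstream task with 10 classes and batches of already-preprocessed image tensors with integer labels.

```python
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
model.eval()
for p in model.parameters():  # freeze the pretrained encoder
    p.requires_grad = False

num_classes = 10  # hypothetical downstream task
# Infer the embedding width from a dummy forward pass (1024 for ViT-H-14).
with torch.no_grad():
    embed_dim = model.encode_image(torch.zeros(1, 3, 224, 224)).shape[-1]

head = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on frozen CLIP features; images are already preprocessed."""
    with torch.no_grad():
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    logits = head(feats)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the small head is trained, this approach is cheap and works well when labelled data is limited; full fine-tuning of the encoder is also possible at higher cost.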