CLIP ViT-Large Patch14 336

Developed by OpenAI
A large-scale pretrained vision-language model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text.
Downloads 5.9M
Release Date: 4/22/2022

Model Overview

This model is an implementation of the OpenAI CLIP architecture. It uses ViT-Large as the visual encoder, accepts 336x336-pixel image input, and performs image-text matching and zero-shot classification.
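
As a minimal sketch of zero-shot classification with the Hugging Face transformers library: the checkpoint ID openai/clip-vit-large-patch14-336 is assumed to correspond to this model on the Hugging Face Hub, and cat.jpg is a hypothetical local image path.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("cat.jpg")  # hypothetical image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# The processor tokenizes the prompts and resizes the image for the model.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one image-text similarity score per candidate label.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))

Because the candidate labels are plain text prompts, new categories can be swapped in at inference time without retraining.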

Model Features

Cross-modal Understanding
Capable of processing both visual and textual information, establishing semantic relationships between the two modalities
Zero-shot Learning
Can classify images into new categories without task-specific fine-tuning
High-resolution Processing
Supports an input resolution of 336x336 pixels, providing finer-grained visual understanding than standard CLIP models (224x224); see the configuration check below
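
As a quick check of the resolution claim, the processor configuration can be inspected. The values in the comments are what this checkpoint is expected to report, based on CLIP's standard preprocessing config; they are an assumption, not verified output.

from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
# For this checkpoint the image processor is expected to resize the shortest
# edge to 336 and center-crop to 336x336 (vs. 224 for standard CLIP).
print(processor.image_processor.size)       # expected: {'shortest_edge': 336}
print(processor.image_processor.crop_size)  # expected: {'height': 336, 'width': 336}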

Model Capabilities

Image-text similarity calculation
Zero-shot image classification
Multimodal feature extraction
Cross-modal retrieval (see the embedding sketch after this list)
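
Under the same assumptions as above (transformers API, hypothetical file paths), the sketch below extracts image and text features in the shared embedding space and computes their cosine similarity, the primitive behind both similarity calculation and cross-modal retrieval.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Encode one image and one caption into the shared embedding space.
image_inputs = processor(images=Image.open("photo.jpg"), return_tensors="pt")  # placeholder path
text_inputs = processor(text=["a dog playing in a park"], return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# L2-normalize so the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(f"cosine similarity: {(image_emb @ text_emb.T).item():.4f}")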

Use Cases

Content Moderation
Inappropriate Content Detection
Detect non-compliant image content through text descriptions
E-commerce
Product Search
Match relevant product images to natural language queries (see the retrieval sketch after this list)
Media Analysis
Caption Scoring
Rank candidate descriptions for an image by image-text similarity; CLIP scores text against images rather than generating captions itself
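
One hedged sketch of the product-search pattern: embed the catalog images once (in a real system these vectors would live in a vector index), then embed each text query and rank products by cosine similarity. The catalog file names and the query string are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Hypothetical product catalog; embed it once up front.
catalog = ["shoe_red.jpg", "shoe_blue.jpg", "handbag.jpg", "jacket.jpg"]
image_inputs = processor(images=[Image.open(p) for p in catalog], return_tensors="pt")
with torch.no_grad():
    image_embs = model.get_image_features(**image_inputs)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

# Embed the natural language query and rank products by cosine similarity.
text_inputs = processor(text=["red running shoes"], return_tensors="pt", padding=True)
with torch.no_grad():
    query_emb = model.get_text_features(**text_inputs)
query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

scores = (image_embs @ query_emb.T).squeeze(1)
for idx in scores.argsort(descending=True):
    print(f"{catalog[int(idx)]}: {scores[idx].item():.3f}")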