CLIP ViT-Large Patch14-336
Developed by OpenAI
A large-scale vision-language model based on the Vision Transformer architecture, pretrained to support cross-modal understanding between images and text
Downloads 5.9M
Release Time: 4/22/2022
Model Overview
This model implements the OpenAI CLIP architecture with ViT-Large as the visual encoder. It accepts 336x336 image input and performs image-text matching and zero-shot classification, as sketched below.
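A minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP. The checkpoint id matches this model's published name; the image path and label set are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the 336x336 ViT-Large CLIP checkpoint
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Hypothetical input image and candidate labels
image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because the labels are free-form text prompts, swapping in a new label set requires no retraining, which is what makes the classification zero-shot.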
Model Features
Cross-modal Understanding
Capable of processing both visual and textual information, establishing semantic relationships between the two modalities
Zero-shot Learning
Classifies images into unseen categories without task-specific fine-tuning
High-resolution Processing
Accepts 336x336-pixel input, providing finer-grained visual understanding than standard 224x224 CLIP models
Model Capabilities
Image-text similarity calculation
Zero-shot image classification
Multimodal feature extraction
Cross-modal retrieval
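The capabilities above all reduce to comparing embeddings from the two encoders in a shared space. A sketch of extracting L2-normalized features with the transformers API; the image path and text queries are assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Encode an image and some text into the shared embedding space
image_inputs = processor(images=Image.open("example.jpg"), return_tensors="pt")
text_inputs = processor(text=["a red bicycle", "a mountain lake"],
                        return_tensors="pt", padding=True)

with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)
    txt_emb = model.get_text_features(**text_inputs)

# L2-normalize so that dot products equal cosine similarity
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each text query
similarity = img_emb @ txt_emb.T
print(similarity)
```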
Use Cases
Content Moderation
Inappropriate Content Detection
Detect non-compliant images by scoring them against textual descriptions of prohibited content
E-commerce
Product Search
Match relevant product images to natural language queries (see the retrieval sketch below)
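A retrieval sketch under the same assumptions as above: product images are embedded once offline, and a text query is ranked against the cached index. The file names and query string are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Offline step: embed the product catalog (hypothetical file names)
catalog = ["bag.jpg", "shoes.jpg", "watch.jpg"]
images = [Image.open(path) for path in catalog]
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

# Online step: embed the shopper's query and rank the catalog
query = processor(text=["red leather handbag"], return_tensors="pt", padding=True)
with torch.no_grad():
    txt_emb = model.get_text_features(**query)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

scores = (img_emb @ txt_emb.T).squeeze(-1)  # cosine similarity per product
for idx in scores.argsort(descending=True):
    print(catalog[idx], f"{scores[idx].item():.3f}")
```

Precomputing the image embeddings is the key design choice: at query time only one text forward pass and a matrix multiply are needed, which keeps search latency low even for large catalogs.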
Media Analysis
Caption Ranking
Score candidate captions against an image and select the best match (CLIP compares image-text pairs rather than generating text)