Vit Huge Patch14 Clip Quickgelu 378.dfn5b
ViT-Huge image encoder based on the CLIP architecture, trained on the DFN5B dataset, using the QuickGELU activation
Downloads 27
Release Time: 12/26/2024
Model Overview
This model is the visual encoder of a CLIP model. It uses the Vision Transformer (ViT) architecture and is designed for image feature extraction.
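To make the ViT input concrete, here is a minimal sketch of how many tokens the encoder would process per image. It assumes, from the model name only, that "378" is the square input resolution in pixels and "Patch14" is the patch size, and that one class token is prepended; the helper `vit_token_count` is hypothetical and not part of any library.

```python
def vit_token_count(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Count the tokens a ViT sees for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token.
    return patches_per_side ** 2 + (1 if class_token else 0)


# Assumed from the model name: 378-pixel input, 14-pixel patches.
print(vit_token_count(378, 14))  # prints 730 (27 * 27 patches + 1 class token)
```

Under these assumptions each image becomes a 27 x 27 grid of patches, so the sequence is considerably longer than at 224 pixels, which is worth keeping in mind when budgeting memory.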
Model Features
Large-scale ViT architecture
Uses the ViT-Huge architecture, whose larger capacity gives stronger feature extraction than smaller ViT variants
Quick GELU activation
Uses the QuickGELU activation function, x * sigmoid(1.702 * x), a cheaper approximation of GELU
CLIP-compatible design
As the visual encoder of a CLIP model, it can be paired with the matching CLIP text encoder
Large-scale pre-training
Pre-trained on the DFN5B dataset to learn general-purpose visual representations
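The QuickGELU feature listed above can be illustrated with a short, dependency-free sketch. It compares QuickGELU, commonly defined as x * sigmoid(1.702 * x), with exact GELU on a few scalar inputs. The function names are illustrative, and this is not the model's actual implementation, which operates on tensors.

```python
import math


def quick_gelu(x: float) -> float:
    """QuickGELU: x * sigmoid(1.702 * x), a sigmoid-based approximation of GELU."""
    return x / (1.0 + math.exp(-1.702 * x))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare the two activations on a handful of sample inputs.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  quick_gelu={quick_gelu(x):+.4f}  gelu={gelu(x):+.4f}")
```

The two functions are close but not identical, so weights trained with one activation should generally be run with the same one at inference time.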
Model Capabilities
Image feature extraction
Visual representation learning
Cross-modal alignment
Use Cases
Computer vision
Image classification
Extract image features for classification tasks
Image retrieval
Generate image embeddings for similarity search
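As a rough illustration of the retrieval use case, the sketch below ranks a small gallery by cosine similarity to a query embedding. The 4-dimensional vectors are made-up placeholders standing in for real image embeddings from the encoder, and the helper names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def top_k(query, gallery, k=2):
    """Return (index, score) pairs for the k gallery embeddings closest to the query."""
    scores = [(i, cosine_similarity(query, g)) for i, g in enumerate(gallery)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings; real ones would come from the image encoder.
gallery = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
query = [1.0, 0.05, 0.0, 0.0]

for index, score in top_k(query, gallery):
    print(index, round(score, 4))
```

For large galleries the same idea is usually served by a vector index rather than a linear scan; the linear scan is used here only to keep the example self-contained.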
Multimodal applications
Image-text matching
Pair with a compatible text encoder to match images and text across modalities
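Image-text matching in CLIP-style models is typically done by comparing an image embedding against several text embeddings and applying a softmax to the scaled cosine similarities. The sketch below shows that step in isolation. The embeddings are made-up placeholders, the `logit_scale` of 100 is an assumed typical value rather than this model's learned parameter, and the text embeddings would have to come from a separate, compatible text encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def match_probabilities(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings: one image and three candidate captions.
image_emb = [0.8, 0.6, 0.0]
text_embs = [
    [0.7, 0.7, 0.0],   # caption A, closest to the image
    [0.0, 1.0, 0.0],   # caption B
    [0.0, 0.0, 1.0],   # caption C, unrelated
]

probs = match_probabilities(image_emb, text_embs)
best = max(range(len(probs)), key=probs.__getitem__)
print("best caption index:", best)
```

The caption with the highest probability is taken as the match; the same scores can also be read column-wise to find the best image for a given caption.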