Vit Huge Patch14 Clip Quickgelu 378.dfn5b

Developed by timm
ViT-Huge image encoder from the CLIP architecture, trained on the DFN5B dataset, with QuickGELU activation
Downloads 27
Release Time: 12/26/2024

Model Overview

This model is the visual encoder of a CLIP model. It uses the Vision Transformer (ViT) architecture and is designed for image feature extraction.
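
As a rough illustration of what the ViT input looks like, the sketch below computes the patch grid for this model. It assumes the model name follows the usual timm naming convention, where "patch14" means 14x14-pixel patches and "378" means a 378x378-pixel input; the helper `patch_grid` is a hypothetical name used only for this example.

```python
# Assumption: "patch14" and "378" in the model name mean 14x14-pixel patches
# and a 378x378-pixel input, following the usual timm naming convention.
IMAGE_SIZE = 378
PATCH_SIZE = 14


def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(IMAGE_SIZE, PATCH_SIZE)
print(per_side, total)  # 27 729
```

Under these assumptions each image becomes a 27 x 27 grid, so the transformer processes 729 patch tokens per image, plus any special tokens the model adds.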

Model Features

Large-scale ViT architecture: Uses the ViT-Huge backbone, a high-capacity variant suited to feature extraction.
Quick GELU activation: Uses the QuickGELU activation, x * sigmoid(1.702 x), a cheaper approximation of GELU.
CLIP-compatible design: As the visual encoder of a CLIP model, it can be paired with a CLIP text encoder.
Large-scale pre-training: Trained on the DFN5B dataset to learn general-purpose visual representations.
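
To make the QuickGELU feature concrete, here is a minimal sketch comparing it with exact GELU using only the Python standard library. The function names `gelu` and `quick_gelu` are chosen for this example and are not taken from timm; the formula x * sigmoid(1.702 x) is the commonly used definition of QuickGELU.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quick_gelu(x: float) -> float:
    """QuickGELU: x * sigmoid(1.702 * x), a sigmoid-based approximation."""
    return x / (1.0 + math.exp(-1.702 * x))


# Print both activations side by side for a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  quick_gelu={quick_gelu(x):+.4f}")
```

The two curves are close but not identical, which is why weights trained with QuickGELU are normally loaded into a model that also uses QuickGELU rather than standard GELU.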

Model Capabilities

Image feature extraction
Visual representation learning
Cross-modal alignment
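
Image feature extraction with this model might look like the sketch below. It assumes the timm identifier is `vit_huge_patch14_clip_quickgelu_378.dfn5b` (inferred from the page title, not confirmed by it) and that `timm`, `torch`, and `Pillow` are installed; `extract_features` is a hypothetical helper name.

```python
# Assumption: the timm model identifier is inferred from this page's title.
MODEL_NAME = "vit_huge_patch14_clip_quickgelu_378.dfn5b"


def extract_features(image_path: str):
    """Return one pooled feature vector (a torch.Tensor) for a single image."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import timm
    import torch
    from PIL import Image

    # num_classes=0 removes the classifier head so the model returns pooled features.
    model = timm.create_model(MODEL_NAME, pretrained=True, num_classes=0)
    model.eval()

    # Build the preprocessing pipeline from the model's pretrained configuration.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        features = model(transform(image).unsqueeze(0))  # shape: (1, feature_dim)
    return features[0]
```

With `pretrained=True` the weights are downloaded on first use. For more than a handful of images it would make sense to create the model once and reuse it instead of rebuilding it on every call.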

Use Cases

Computer vision
Image classification: Extract image features for classification tasks.
Image retrieval: Generate image embeddings for similarity search.
Multimodal applications
Image-text matching: Work with a text encoder to match images and text across modalities.
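
The retrieval and matching use cases above both reduce to comparing embedding vectors. The sketch below ranks a small gallery by cosine similarity to a query. The 3-dimensional vectors and file names are made-up stand-ins for real embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors stand in for real image embeddings (hypothetical data).
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]
print(rank_by_similarity(query, gallery))  # ['cat.jpg', 'dog.jpg', 'car.jpg']
```

For image-text matching the same comparison applies, with the query vector coming from a text encoder that was trained jointly with this image encoder.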