vit_base_patch32_clip_224.laion400m_e31
Vision Transformer model trained on the LAION-400M dataset, compatible with both OpenCLIP and timm frameworks
Downloads: 10.90k
Release Time: 10/23/2024
Model Overview
This is a vision-language model based on the Vision Transformer architecture, primarily used for zero-shot image classification. The model uses a 32x32 patch size with a 224x224 input resolution and was trained with the QuickGELU activation function.
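For context, QuickGELU is the sigmoid-based approximation of GELU used in CLIP-style transformers, and the 32x32 patch size at 224x224 resolution fixes the vision tower's sequence length. A minimal PyTorch sketch of both (the function and variable names are illustrative):

```python
import torch

def quick_gelu(x: torch.Tensor) -> torch.Tensor:
    # QuickGELU: a fast sigmoid-based approximation of GELU.
    return x * torch.sigmoid(1.702 * x)

# Patch arithmetic for this configuration:
# a 224x224 image split into 32x32 patches yields (224 // 32) ** 2 = 49
# patches, plus 1 class token, for a sequence length of 50.
num_patches = (224 // 32) ** 2  # 49
seq_len = num_patches + 1       # 50
```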
Model Features
Dual Framework Compatibility
Supports both the OpenCLIP and timm frameworks, offering flexible usage options (see the loading sketch at the end of this section)
QuickGELU Activation Function
Uses the QuickGELU activation function, a faster sigmoid-based approximation of GELU (sketched in the overview above), during training
Large-scale Training Data
Trained on the LAION-400M dataset of roughly 400 million image-text pairs
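A sketch of the dual-framework usage mentioned above. The OpenCLIP architecture and pretrained tag names are assumptions based on OpenCLIP's pretrained registry; verify them with open_clip.list_pretrained():

```python
import open_clip
import timm

# OpenCLIP: loads the full image+text CLIP model.
# 'ViT-B-32-quickgelu' / 'laion400m_e31' are assumed registry names.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32-quickgelu', pretrained='laion400m_e31'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32-quickgelu')

# timm: loads the image tower only, usable as a vision backbone.
vision_tower = timm.create_model(
    'vit_base_patch32_clip_224.laion400m_e31',
    pretrained=True,
    num_classes=0,  # no classifier head -> pooled image embeddings
)
```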
Model Capabilities
Zero-shot Image Classification
Image Feature Extraction (see the sketch after this list)
Cross-modal Representation Learning
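A minimal image feature extraction sketch via timm, assuming a recent timm version; 'img.jpg' is a placeholder path:

```python
import timm
import torch
from PIL import Image

model = timm.create_model(
    'vit_base_patch32_clip_224.laion400m_e31', pretrained=True, num_classes=0
)
model = model.eval()

# Build the preprocessing pipeline from the model's pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

image = Image.open('img.jpg').convert('RGB')  # placeholder path
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))
print(features.shape)  # (1, 768) for a ViT-Base tower
```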
Use Cases
Computer Vision
Image Classification
Classify images into novel categories without task-specific training, using text prompts as class labels
Image Retrieval
Retrieve relevant images based on text descriptions
Multimodal Applications
Image-Text Matching
Evaluate the alignment between images and text descriptions
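A sketch of zero-shot image-text matching with OpenCLIP, under the same naming assumptions as above; the image path and candidate labels are placeholders:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32-quickgelu', pretrained='laion400m_e31'  # assumed registry names
)
tokenizer = open_clip.get_tokenizer('ViT-B-32-quickgelu')
model.eval()

labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)  # placeholder path
text = tokenizer(labels)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    # L2-normalize so the dot product is cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # 100.0 approximates CLIP's learned logit scale.
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f'{label}: {p:.3f}')
```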