X-CLIP Base Patch32
X-CLIP extends CLIP to general video-language understanding. It is trained on (video, text) pairs via contrastive learning and suits tasks such as video classification and video-text retrieval.
Downloads: 309.80k
Release Date: 8/25/2022
Model Overview
The X-CLIP model (base size, 32x32 patch resolution) was trained in a fully supervised manner on the Kinetics-400 dataset. It can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.
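As a minimal sketch of zero-shot video classification with this model, the snippet below uses the Hugging Face `transformers` library. The checkpoint name `microsoft/xclip-base-patch32`, the candidate labels, and the random dummy frames are illustrative assumptions; a real application would decode actual video frames.

```python
import numpy as np
import torch
from transformers import AutoProcessor, AutoModel

# Assumed checkpoint name for the model described above.
ckpt = "microsoft/xclip-base-patch32"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# This checkpoint expects 8 frames per clip; random frames stand in
# for a real decoded video here.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

labels = ["playing basketball", "cooking", "walking the dog"]  # hypothetical label set
inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Video-to-text similarity logits -> probabilities over the label set.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is supplied at inference time as free-form text, no retraining is needed to classify against a new set of categories, which is what makes the zero-shot setting possible.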
Model Features
Video-Language Understanding
Extends CLIP's contrastive image-text framework to (video, text) pairs, enabling general video-language understanding.
Multi-Task Support
Supports zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
Efficient Training
Uses 8 frames per video at 224x224 resolution during training, keeping training computationally efficient (see the frame-sampling sketch below).
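To illustrate the 8-frame input format, the helper below uniformly samples 8 frames from a decoded video array. The function name and the uniform-sampling strategy are assumptions for illustration; the released preprocessing pipeline may sample frames differently, but the checkpoint expects 8 frames at 224x224.

```python
import numpy as np

def sample_8_frames(frames: np.ndarray) -> np.ndarray:
    """Uniformly pick 8 frames from a (T, H, W, 3) video array.

    Hypothetical helper: shown only to make the expected
    8-frame, 224x224 input shape concrete.
    """
    t = frames.shape[0]
    indices = np.linspace(0, t - 1, num=8).round().astype(int)
    return frames[indices]

# Example: a 120-frame dummy clip reduced to the 8-frame input.
clip = np.zeros((120, 224, 224, 3), dtype=np.uint8)
print(sample_8_frames(clip).shape)  # (8, 224, 224, 3)
```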
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning
Use Cases
Video Understanding
Video Classification
Classify video content to identify actions or scenes.
Achieves 80.4% top-1 accuracy and 95.0% top-5 accuracy on the Kinetics-400 dataset.
Video-Text Retrieval
Retrieve relevant videos from text queries, or retrieve matching text descriptions for a given video.
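A sketch of text-to-video retrieval with the same checkpoint: embed the videos and the text query separately, then rank videos by cosine similarity. The dummy video collection and the query string are placeholders, not part of the original model card.

```python
import numpy as np
import torch
from transformers import AutoProcessor, AutoModel

ckpt = "microsoft/xclip-base-patch32"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# Two dummy 8-frame clips stand in for a real decoded video collection.
videos = [list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
          for _ in range(2)]

with torch.no_grad():
    video_inputs = processor(videos=videos, return_tensors="pt")
    video_emb = model.get_video_features(**video_inputs)

    text_inputs = processor(text=["a person dancing"],
                            return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# Normalize and rank videos by cosine similarity to the text query.
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ video_emb.T).squeeze(0)
print(scores.argsort(descending=True).tolist())  # best-matching video first
```

In practice, video embeddings would be precomputed and indexed once, so each text query only requires a single text-encoder forward pass followed by a similarity search.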