X-CLIP Base Patch16
X-CLIP is an extension of CLIP for general video-language understanding. It is trained via contrastive learning on (video, text) pairs and is suitable for tasks such as video classification and video-text retrieval.
Downloads: 1,647
Release Time: 9/7/2022
Model Overview
The X-CLIP model (base-sized, 16x16 patch resolution) was trained in a fully supervised manner on Kinetics-400 and is suitable for zero-shot, few-shot, or fully supervised video classification tasks.
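Below is a minimal zero-shot classification sketch using the Hugging Face transformers library. The checkpoint id microsoft/xclip-base-patch16, the 8-frame input length, and the label prompts are assumptions made for illustration; the video here is random dummy data standing in for real sampled frames.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# Assumed checkpoint id for this model on the Hugging Face Hub.
ckpt = "microsoft/xclip-base-patch16"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# Dummy video: 8 RGB frames (assumed frame count for this checkpoint).
# In practice, sample 8 frames evenly from a real clip.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

labels = ["playing guitar", "riding a bike", "cooking"]
inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the video and each candidate text prompt.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

Because classification is phrased as matching the video against free-form text prompts, the label set can be changed at inference time without retraining, which is what enables the zero-shot and few-shot settings listed below.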
Model Features
Video-language understanding
Trained via contrastive learning on (video, text) pairs, supporting video-text matching and joint video-language understanding.
Multi-task support
Suitable for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
High performance
Achieves top-1 accuracy of 83.8% and top-5 accuracy of 95.7% on the Kinetics-400 dataset.
Model Capabilities
Video classification
Video-text retrieval
Zero-shot learning
Few-shot learning
Use Cases
Video analysis
Video content classification
Classify video content to recognize actions or scenes in videos.
Achieves 83.8% top-1 accuracy on the Kinetics-400 dataset.
Video-text retrieval
Retrieve relevant videos based on text descriptions or generate descriptive text from video content.
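As a sketch of how text-to-video retrieval could be built on the model's shared embedding space, the snippet below pre-encodes a small video index and ranks it against a text query by cosine similarity. It assumes the same microsoft/xclip-base-patch16 checkpoint and uses the get_video_features / get_text_features helpers from the transformers X-CLIP implementation; the two indexed videos are dummy data.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch16"  # assumed checkpoint id
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

def encode_video(frames):
    """Embed one video (list of 8 RGB frames) into the shared space."""
    inputs = processor(videos=frames, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_video_features(pixel_values=inputs["pixel_values"])
    return emb / emb.norm(dim=-1, keepdim=True)

def encode_text(query):
    """Embed a text query into the shared space."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Dummy index of two videos (8 random frames each); replace with real clips.
videos = [list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
          for _ in range(2)]
video_index = torch.cat([encode_video(v) for v in videos])

query_emb = encode_text("a person playing guitar")
scores = (query_emb @ video_index.T).squeeze(0)  # cosine similarities
ranking = scores.argsort(descending=True).tolist()
print("videos ranked by relevance to the query:", ranking)
```

In a real system the video embeddings would be computed once and stored, so a new text query only requires one text encoding plus a similarity lookup.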