Xclip Large Patch14
X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Downloads 1,698
Release Time: 9/7/2022
Model Overview
The X-CLIP model (large size, 14×14 patch resolution) was trained in a fully supervised fashion on the Kinetics-400 dataset. It can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.
Model Features
Video-Language Understanding
Trained via contrastive learning on (video, text) pairs, so videos and text descriptions can be matched against each other.
High Accuracy
Achieves Top-1 accuracy of 87.1% and Top-5 accuracy of 97.6% on the Kinetics-400 dataset.
Multi-task Support
Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
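To make the matching idea above concrete, here is a minimal sketch of how a contrastively trained video-text model can be used for zero-shot classification: each candidate label is treated as a text prompt, and the label whose embedding is most similar to the video embedding wins. The embeddings below are made-up toy vectors, and `classify_video`, `top_k_correct`, and the `temperature` value are hypothetical names and settings chosen for illustration; none of this is X-CLIP's actual code or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_video(video_embedding, label_embeddings, temperature=0.07):
    """Return a {label: probability} dict for one video.

    temperature is an assumed scaling factor, not a value taken from the model.
    """
    labels = list(label_embeddings)
    scaled = [
        cosine_similarity(video_embedding, label_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(scaled)))


def top_k_correct(probabilities, true_label, k):
    """True if true_label is among the k highest-probability labels."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return true_label in ranked[:k]


# Toy embeddings (assumptions): a real model would produce these from frames and text.
video_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "playing guitar": [1.0, 0.0, 0.0],
    "riding a bike": [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}

probabilities = classify_video(video_embedding, label_embeddings)
best_label = max(probabilities, key=probabilities.get)
print(best_label, probabilities)
```

Top-1 and Top-5 accuracy figures such as those quoted above are the fraction of test videos for which a check like `top_k_correct` succeeds with `k=1` and `k=5` respectively.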
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
Use Cases
Video Analysis
Video Classification
Classify video content, such as recognizing actions, scenes, etc.
Reported accuracy on Kinetics-400: Top-1 87.1%, Top-5 97.6%.
Video-Text Retrieval
Retrieve relevant video clips based on text descriptions.
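As a rough illustration of the retrieval use case, the sketch below ranks a small set of clips against a text query by cosine similarity in a shared embedding space. It assumes the video and text embeddings have already been computed by some model; the clip ids, vectors, and the `retrieve_clips` helper are hypothetical and exist only for this example.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def retrieve_clips(query_embedding, clip_embeddings, top_k=2):
    """Return the top_k (clip_id, score) pairs, best match first."""
    query = normalize(query_embedding)
    scored = []
    for clip_id, embedding in clip_embeddings.items():
        clip = normalize(embedding)
        score = sum(q * c for q, c in zip(query, clip))
        scored.append((clip_id, score))
    # Higher cosine similarity means a closer match to the text query.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (assumptions): stand-ins for embeddings a video-text model would produce.
clip_embeddings = {
    "clip_dog_running": [0.8, 0.2, 0.1],
    "clip_cooking": [0.1, 0.9, 0.2],
    "clip_surfing": [0.2, 0.1, 0.9],
}
query_embedding = [0.1, 0.2, 0.95]  # e.g. the embedding of a surfing-related query

results = retrieve_clips(query_embedding, clip_embeddings, top_k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

For a large video library, the same ranking would typically be done with precomputed, normalized clip embeddings and a vector index rather than a Python loop.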