X-CLIP Base Patch32 16 Frames
X-CLIP is an extension of CLIP for general video-language understanding. It is trained on (video, text) pairs via contrastive learning and suits tasks such as video classification and video-text retrieval.
Downloads 901
Release Time: 9/7/2022
Model Overview
The X-CLIP model (base size, 32×32 patch resolution, 16 frames per video) was trained in a fully supervised fashion on the Kinetics-400 dataset. It supports zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
Model Features
Video-Language Understanding
Trained on (video, text) pairs with a contrastive objective, so videos and text descriptions can be matched in a shared embedding space.
High Accuracy
Achieves 81.1% top-1 accuracy and 95.5% top-5 accuracy on the Kinetics-400 dataset.
Multi-task Support
Suitable for zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
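The contrastive matching behind these tasks can be sketched numerically: the video and each candidate text prompt are embedded into a shared space, and zero-shot classification reduces to a softmax over scaled cosine similarities. The embeddings, dimensions, and temperature below are made-up stand-ins for illustration, not real X-CLIP outputs.

```python
import numpy as np

def zero_shot_scores(video_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one video embedding
    and several text-prompt embeddings (CLIP-style matching)."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = t @ v / temperature          # scaled cosine similarities
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

# Toy example: 3 candidate class prompts, 4-dim embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
video = rng.normal(size=4)
texts = rng.normal(size=(3, 4))
probs = zero_shot_scores(video, texts)
```

The prompt with the highest probability is taken as the predicted class; swapping the roles of video and text embeddings gives text-to-video retrieval scores instead.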
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
Use Cases
Video Analysis
Video Classification
Classify video content to identify actions or scenes in videos.
Achieves 81.1% top-1 accuracy on the Kinetics-400 dataset.
Video-Text Retrieval
Retrieve relevant videos from text queries, or rank candidate text descriptions against a given video.
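A minimal usage sketch with the Hugging Face `transformers` X-CLIP API, assuming the `microsoft/xclip-base-patch32-16-frames` checkpoint; the random frames below stand in for a real 16-frame clip sampled from a video:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoProcessor

ckpt = "microsoft/xclip-base-patch32-16-frames"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# A real pipeline would sample 16 frames from a video file;
# random uint8 frames are used here purely for illustration.
video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))
labels = ["playing basketball", "cooking", "playing guitar"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video has shape (num_videos, num_texts);
# softmax over texts gives per-label probabilities for the clip.
probs = outputs.logits_per_video.softmax(dim=1)
```

The same `outputs` also expose `logits_per_text`, which scores videos against each text query and supports the text-to-video retrieval direction.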