Xclip Base Patch16 16 Frames
Developed by Microsoft
X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Downloads 1,034
Release Time: 9/7/2022
Model Overview
This model can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
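As a rough illustration of the zero-shot case, the sketch below turns video-to-text similarity scores into class probabilities and picks the best label. The scores and label names are made up for the example; in practice they would come from the model's video and text encoders, which this snippet does not call.

```python
import math

def softmax(scores):
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(similarities, labels):
    """Pick the label whose text prompt scored highest against the video."""
    probs = softmax(similarities)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical similarity scores between one video and three text prompts.
labels = ["playing guitar", "riding a bike", "cooking"]
similarities = [24.1, 18.3, 17.9]
label, probs = zero_shot_predict(similarities, labels)
print(label)
```

No class-specific training is involved here: changing the candidate labels only changes the text prompts that the video is compared against.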
Model Features
Video-Language Understanding
Trained via contrastive learning on (video, text) pairs, supporting video-text matching.
Multi-Task Support
Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
Efficient Training
Trained with 16 frames per video at 224x224 resolution.
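The 16-frame input means a clip has to be reduced to 16 frames before it reaches the model. The sketch below shows one common way to do that, evenly spaced sampling. This strategy and the helper name `sample_frame_indices` are assumptions for illustration; the page does not say how frames were selected during training.

```python
def sample_frame_indices(total_frames, num_samples=16):
    """Return num_samples frame indices spread evenly across the video."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Split the video into num_samples equal segments and take each midpoint.
    segment = total_frames / num_samples
    return [
        min(total_frames - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]

# Example: a 300-frame clip reduced to 16 frame indices.
indices = sample_frame_indices(total_frames=300)
print(len(indices))
```

Clips shorter than 16 frames still yield 16 indices under this scheme, with some frames repeated. Each selected frame would then be resized to 224x224 before being passed to the model.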
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning
Use Cases
Video Analysis
Video Classification
Classify video content, such as action recognition, scene recognition, etc.
Achieves 84.7% top-1 accuracy and 96.8% top-5 accuracy on the Kinetics-400 dataset.
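For readers unfamiliar with the metric, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below computes it on toy scores. The numbers are invented for the example and are not model output.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

# Toy scores for 4 videos over 3 classes (made-up values).
scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.50, 0.40],
    [0.35, 0.45, 0.20],
    [0.20, 0.30, 0.50],
]
targets = [0, 2, 0, 2]

print(top_k_accuracy(scores, targets, k=1))  # 0.5
print(top_k_accuracy(scores, targets, k=2))  # 1.0
```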
Video-Text Retrieval
Retrieve relevant videos from a text description, or rank candidate text descriptions by how well they match a given video.
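Text-to-video retrieval of this kind typically reduces to ranking videos by the similarity between a text embedding and each video embedding. The sketch below shows that ranking step with cosine similarity. The three-dimensional embeddings and clip names are hypothetical placeholders; real embeddings would come from the model's text and video encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_videos(text_embedding, video_embeddings):
    """Return video ids sorted from most to least similar to the text query."""
    scored = [
        (cosine_similarity(text_embedding, emb), vid)
        for vid, emb in video_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [vid for _, vid in scored]

# Hypothetical embeddings; real ones are much higher-dimensional.
video_embeddings = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.9, 0.2],
    "clip_c": [0.5, 0.5, 0.5],
}
query = [1.0, 0.0, 0.0]
print(rank_videos(query, video_embeddings))  # ['clip_a', 'clip_c', 'clip_b']
```

Swapping the roles, with one video embedding as the query and several text embeddings as candidates, gives the video-to-text direction.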