Xclip Base Patch16 Kinetics 600 16 Frames
X-CLIP is an extension of CLIP for general video-language understanding, supporting zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.
Downloads 393
Release Time: 9/8/2022
Model Overview
This X-CLIP model (base size, 16x16 patch resolution) was trained in a fully supervised fashion on the Kinetics-600 dataset. It is intended primarily for video classification and video-text retrieval.
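The "16 Frames" in the model name suggests that each clip is reduced to 16 frames before it is encoded. This page does not say how those frames are chosen, so the following is only a minimal sketch of one common approach, evenly spaced sampling. The function name `sample_frame_indices` and its parameters are illustrative assumptions, not part of the model's published preprocessing.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 16) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip.

    Assumption: uniform sampling. The model's actual sampling strategy
    is not described in this document.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Split the clip into `num_frames` equal segments and take the frame at
    # the centre of each one. Clips shorter than `num_frames` repeat frames.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_frames))
        for i in range(num_frames)
    ]


# Example: a 160-frame clip yields one index per 10-frame segment.
indices = sample_frame_indices(160)
```

Sampling segment centres rather than the first 16 frames keeps the whole clip represented, which matters for actions that unfold over several seconds.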
Model Features
Video-Language Understanding
Trained contrastively on (video, text) pairs to support video-text matching tasks.
High Accuracy
Achieves 85.8% top-1 accuracy and 97.3% top-5 accuracy on the Kinetics-600 dataset.
Multi-Task Support
Can be used for zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.
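Because the model is trained contrastively on (video, text) pairs, classification can be framed as matching: embed the video, embed one text prompt per class, and turn the similarities into probabilities. The sketch below illustrates that idea with made-up 3-dimensional vectors standing in for real embeddings. The function names, the toy values, and the temperature of 0.07 are assumptions for illustration and do not come from the model itself.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def classify(video_emb, label_embs, temperature=0.07):
    """Return softmax probabilities over labels from video-text similarity.

    `temperature` is an assumed value; a trained model uses its own scale.
    """
    v = l2_normalize(video_emb)
    logits = []
    for emb in label_embs:
        t = l2_normalize(emb)
        logits.append(sum(a * b for a, b in zip(v, t)) / temperature)
    # Numerically stable softmax: subtract the max logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical): a real pipeline would get these from the
# model's video and text encoders.
labels = ["playing guitar", "riding a bike", "cooking"]
label_embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
video_emb = [0.8, 0.2, 0.1]

probs = classify(video_emb, label_embs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
```

The same scoring works for zero-shot use: adding a class only requires embedding a new text prompt, with no change to the video side.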
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning
Use Cases
Video Analysis
Video Classification
Classify video content to recognize actions or scenes in videos.
Reported accuracy: 85.8% top-1, 97.3% top-5.
Video-Text Retrieval
Retrieve relevant videos from a text description, or rank candidate text descriptions by how well they match a given video.
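Text-to-video retrieval with a contrastive model comes down to ranking stored video embeddings by their similarity to the embedding of the query text. The sketch below shows that ranking step only. The file names, the 3-dimensional vectors, and the helper names `cosine` and `retrieve` are hypothetical placeholders, and in practice the embeddings would come from the model's encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_emb, video_embs, top_k=2):
    """Return the `top_k` (name, score) pairs most similar to the query."""
    scored = [(cosine(query_emb, emb), name) for name, emb in video_embs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical precomputed video embeddings, keyed by file name.
library = {
    "clip_guitar.mp4": [0.9, 0.1, 0.0],
    "clip_bike.mp4": [0.0, 1.0, 0.2],
    "clip_kitchen.mp4": [0.1, 0.0, 1.0],
}
# Stand-in for the text embedding of a query such as "someone riding a bike".
query_emb = [0.1, 0.9, 0.1]

results = retrieve(query_emb, library)
```

Swapping the roles, with one video embedding as the query and a set of text embeddings as the library, gives the video-to-text direction with the same code.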