X-CLIP Base Patch16 (Kinetics-600)
X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Release date: 9/8/2022
Model Overview
This is the base-sized X-CLIP model with a 16x16 patch size, trained with full supervision on the Kinetics-600 dataset. It is suitable for video classification and video-text retrieval tasks.
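As a sketch of how this checkpoint might be used for video classification, the snippet below loads it through the Hugging Face `transformers` library and scores a clip against a few candidate labels. The checkpoint name `microsoft/xclip-base-patch16-kinetics-600` and the random dummy frames are assumptions for illustration; in practice you would sample 8 frames from a real video.

```python
# Minimal usage sketch. Assumes the `transformers` and `torch` packages are
# installed and that the checkpoint name below is correct for this model.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch16-kinetics-600"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# X-CLIP consumes a short clip of sampled frames; here we fake an
# 8-frame, 224x224 RGB clip with random pixels just to show the shapes.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

texts = ["playing basketball", "cooking pasta", "walking a dog"]
inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# One similarity score per (video, text) pair; softmax turns the scores
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_video.softmax(dim=1)
print(probs.shape)  # one row (the video) by three columns (the labels)
```

With real frames, the highest-probability label is the model's prediction for the clip.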
Model Features
Video-Language Understanding
Trained on (video, text) pairs via contrastive learning, enabling the model to judge whether a video and a text description match.
High Accuracy
Achieves 85.3% top-1 accuracy and 97.1% top-5 accuracy on the Kinetics-600 dataset.
Zero-Shot and Few-Shot Learning
Supports zero-shot, few-shot, or fully supervised video classification tasks.
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning
Use Cases
Video Analysis
Video Content Classification
Classify video content to identify actions or scenes.
Performs strongly on the Kinetics-600 benchmark.
Video-Text Matching
Determine whether a given text matches the video content.
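One way to make this match judgment is to compare the model's video and text embeddings directly. The sketch below uses the `get_video_features` and `get_text_features` methods from `transformers` and a cosine-similarity threshold; the checkpoint name, the dummy frames, and the 0.2 threshold are all illustrative assumptions, not values from the model card.

```python
# Video-text matching sketch via embedding similarity. Assumes the
# `transformers` and `torch` packages and the checkpoint name below.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch16-kinetics-600"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# Dummy 8-frame clip standing in for a real video.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
inputs = processor(text=["a person riding a bike"], videos=video,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    video_emb = model.get_video_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity lies in [-1, 1]; higher means a better match.
sim = torch.nn.functional.cosine_similarity(video_emb, text_emb).item()
# The 0.2 cutoff is a hypothetical threshold you would tune on your data.
is_match = sim > 0.2
```

Thresholding the similarity gives a yes/no match decision; ranking similarities across many texts instead gives video-text retrieval.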