X-CLIP (base-patch16) UCF101 2-Shot
X-CLIP is a minimalist extension of CLIP for general video-language understanding. The model is trained on (video, text) pairs through contrastive learning.
Released: 9/7/2022
Model Overview
The X-CLIP model (base size, 16x16 patch resolution) was trained in a few-shot setting (K=2) on the UCF101 dataset and is suited to video classification and video-text retrieval tasks.
Model Features
Few-shot Learning Capability
This model was trained with only 2 examples per class (K=2) on the UCF101 dataset, demonstrating strong few-shot learning capability.
Video-Language Understanding
Trained on (video, text) pairs through contrastive learning, supporting joint understanding of video and text.
General Video Recognition
Applicable to various video recognition tasks, including zero-shot, few-shot, and fully supervised video classification.
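The contrastive training described above pulls matched (video, text) pairs together and pushes mismatched pairs apart. A minimal sketch of that objective, using mock 2-D embeddings in place of X-CLIP's learned video and text encoders (the `contrastive_loss` helper and toy vectors below are illustrative, not the model's actual implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched pairs sit on the diagonal
    of the similarity matrix, and each row/column is treated as a
    softmax classification problem over the batch."""
    n = len(video_embs)
    logits = [[cosine(v, t) / temperature for t in text_embs]
              for v in video_embs]

    def cross_entropy(row, target):
        m = max(row)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in row]
        return -math.log(exps[target] / sum(exps))

    # video -> text direction
    loss_v2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> video direction (columns of the same matrix)
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2v = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return (loss_v2t + loss_t2v) / 2

# Toy batch: matched pairs are nearly aligned, mismatched pairs are not.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
print(contrastive_loss(videos, texts))
```

Shuffling the text batch so pairs no longer match raises the loss, which is the signal that drives the encoders toward a shared video-text embedding space.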
Model Capabilities
Video Classification
Video-Text Retrieval
Few-shot Learning
Use Cases
Video Analysis
Video Classification
Classify video content into action categories.
Achieves 76.4% top-1 accuracy on the UCF101 dataset.
Video-Text Retrieval
Retrieve relevant videos based on text descriptions, or generate descriptive text based on video content.
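Both use cases reduce to the same operation: embed the video and the candidate texts, then rank by similarity. A minimal sketch of classification-as-retrieval, with mock embeddings standing in for the model's encoder outputs (the label prompts and vectors below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(video_emb, prompt_embs, labels):
    """Rank class-prompt embeddings by similarity to a video embedding
    and return the best label plus all scores."""
    scores = {label: cosine(video_emb, emb)
              for label, emb in zip(labels, prompt_embs)}
    best = max(scores, key=scores.get)
    return best, scores

# Mock text embeddings for three action-class prompts.
labels = ["playing guitar", "swimming", "archery"]
prompts = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
# Mock embedding of an input video, closest to the first prompt.
video = [0.75, 0.2, 0.05]

best, scores = classify(video, prompts, labels)
print(best)  # -> playing guitar
```

Swapping the direction of the ranking (one text query scored against many video embeddings) gives text-to-video retrieval with the same machinery.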