X-CLIP Base Patch16 UCF 4-Shot
X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning with (video, text) pairs.
Release Time: 9/7/2022
Model Overview
The X-CLIP model (base-sized, 16×16 patch resolution) was trained on UCF101 in a few-shot setting (K=4). It can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval.
Model Features
Few-shot Learning
The model is trained on the UCF101 dataset in a few-shot setting (K=4 labeled examples per class), making it suitable for scenarios with limited labeled data.
Video-Text Contrastive Learning
Trained via contrastive learning with (video, text) pairs, supporting video-text matching tasks.
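The matching step behind this contrastive setup can be illustrated with a minimal NumPy sketch: both the video and the candidate texts are assumed to already be encoded into embeddings, which are L2-normalized and scored with a temperature-scaled softmax (CLIP-style; the temperature value and all embedding values below are toy assumptions, not model outputs).

```python
import numpy as np

def contrastive_similarity(video_emb, text_embs, temperature=0.07):
    """Cosine similarity between one video embedding and a batch of
    text embeddings, turned into a probability distribution via a
    temperature-scaled softmax (CLIP-style scoring)."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = t @ v / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings: the video is closest to the first text description.
video = np.array([1.0, 0.0, 0.0])
texts = np.array([[0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
probs = contrastive_similarity(video, texts)
print(probs.argmax())  # → 0
```

For classification, the text embeddings would come from prompts naming each action category, and the argmax picks the predicted class.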
General Video Recognition
The model can be used for zero-shot, few-shot, or fully supervised video classification and video-text retrieval tasks.
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
Use Cases
Video Understanding
Video Classification
Classify video content, applicable to the 101 action categories in the UCF101 dataset.
Top-1 accuracy reaches 83.4%
Video-Text Retrieval
Retrieve relevant videos based on text descriptions or generate matching text descriptions based on video content.
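The text-to-video direction of this retrieval task can be sketched as a nearest-neighbor search over embeddings: given a text query embedding and a gallery of video embeddings (all values below are toy assumptions), rank the gallery by cosine similarity and return the best matches.

```python
import numpy as np

def retrieve_videos(text_emb, video_embs, top_k=2):
    """Rank video embeddings by cosine similarity to a text query
    embedding; return the indices of the top_k best matches."""
    q = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity per video
    return np.argsort(-sims)[:top_k]  # highest similarity first

# Toy gallery of three video embeddings; the query is nearest to index 2.
query = np.array([0.0, 1.0, 1.0])
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.7, 0.7]])
print(retrieve_videos(query, gallery))  # → [2 1]
```

The video-to-text direction is symmetric: swap the roles of the query and the gallery.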