Xclip Base Patch16 Zero Shot
X-CLIP is a minimal extension of CLIP for general video-language understanding, trained with contrastive learning on paired videos and texts.
Downloads 22
Release Time: 11/8/2023
Model Overview
This X-CLIP model (base size, 16x16 patch resolution) was trained on Kinetics-400. It can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.
Model Features
Zero-Shot Video Classification
Classifies videos without task-specific fine-tuning by comparing each video against candidate text labels.
Video-Text Matching
Scores how well a text description matches the content of a given video.
Multi-Task Support
Supports various tasks including video classification and video-text retrieval.
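The features above share one mechanism: a video and each candidate text are mapped to embeddings, and the similarity between them decides the match. The sketch below illustrates that idea in plain Python. It is not the model's own code: the three-dimensional embeddings are made-up placeholders standing in for real model outputs, and the function names and the temperature value are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(video_embedding, label_embeddings, temperature=0.01):
    """Return a probability per label from video-text similarity.

    `temperature` is an assumed value for this sketch, not a parameter
    taken from the model.
    """
    labels = list(label_embeddings)
    logits = [
        cosine_similarity(video_embedding, label_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Placeholder embeddings; a real pipeline would obtain these from the model.
video = [0.9, 0.1, 0.2]
labels = {
    "playing guitar": [1.0, 0.0, 0.1],
    "riding a bike": [0.0, 1.0, 0.0],
    "swimming": [0.1, 0.2, 1.0],
}

probs = zero_shot_classify(video, labels)
best = max(probs, key=probs.get)
print(best)
```

Because the labels are ordinary text, changing the set of categories only requires new label embeddings, which is what makes the zero-shot setting possible.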
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-Shot Learning
Use Cases
Video Understanding
Video Classification
Classify video content, for example for action recognition or scene recognition.
Zero-shot top-1 accuracy: 44.6% on HMDB-51, 72.0% on UCF-101, and 65.2% on Kinetics-600.
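Top-1 accuracy is the fraction of videos whose highest-scoring predicted label equals the ground-truth label. A minimal sketch of that calculation follows; the label lists are invented toy data and `top1_accuracy` is a hypothetical helper, unrelated to the benchmark figures above.

```python
def top1_accuracy(predicted, actual):
    """Fraction of positions where the top prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy data for illustration only.
predicted = ["run", "jump", "run", "wave", "jump"]
actual = ["run", "jump", "walk", "wave", "run"]

accuracy = top1_accuracy(predicted, actual)
print(f"{accuracy:.1%}")  # prints "60.0%"
```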
Video-Text Retrieval
Retrieve relevant video content based on text descriptions.
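Retrieval can be pictured as ranking a library of video embeddings by their similarity to the embedding of a text query. The sketch below assumes the embeddings are already available; the vectors, the clip identifiers, and the `retrieve` function are all hypothetical and only illustrate the ranking step.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def retrieve(query_embedding, video_embeddings, top_k=2):
    """Return the top_k (video_id, score) pairs, most similar first."""
    query = normalize(query_embedding)
    scored = []
    for video_id, embedding in video_embeddings.items():
        video = normalize(embedding)
        score = sum(q * v for q, v in zip(query, video))
        scored.append((video_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder embeddings; real ones would come from the text and video encoders.
query = [0.2, 0.9, 0.1]
library = {
    "clip_a": [0.1, 1.0, 0.0],
    "clip_b": [1.0, 0.1, 0.0],
    "clip_c": [0.3, 0.7, 0.6],
}

results = retrieve(query, library, top_k=2)
for video_id, score in results:
    print(video_id, round(score, 3))
```

For a large library, the video embeddings would typically be computed once and stored, so that each new query only costs one text encoding plus the similarity search.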