Xclip Base Patch16 Zero Shot

Developed by Microsoft
X-CLIP is a minimal extension of CLIP for general video-language understanding. It is trained contrastively on (video, text) pairs and can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.
Downloads 5,045
Release Time: 9/7/2022

Model Overview

This X-CLIP model (base size, 16x16 patch resolution) was trained on Kinetics-400 and is suited to video classification and video-text retrieval tasks.
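As a rough illustration of how such a model might be used for zero-shot classification, here is a minimal sketch built on the Hugging Face transformers classes XCLIPProcessor and XCLIPModel. The Hub identifier in MODEL_ID is inferred from the model name and is an assumption, as is the helper name classify_video. The number of frames the checkpoint expects is not stated here, so the caller is assumed to pass a correctly sampled list of RGB frames.

```python
# Assumed Hub identifier, inferred from the model name; verify before use.
MODEL_ID = "microsoft/xclip-base-patch16-zero-shot"


def classify_video(frames, labels, model_id=MODEL_ID):
    """Score candidate text labels against one video (hypothetical helper).

    frames: list of HxWx3 uint8 arrays sampled from a single video.
    labels: list of candidate label strings.
    """
    # Imported lazily so the module loads even without these packages.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)

    inputs = processor(
        text=labels, videos=list(frames), return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has shape (num_videos, num_labels); one video here.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}
```

Calling classify_video(frames, ["playing guitar", "swimming"]) would return one probability per label. The labels can be any free-form text, which is what makes the setup zero-shot.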

Model Features

Zero-shot video classification
Supports video classification tasks without fine-tuning.
Video-text contrastive learning
Trained contrastively to understand the relationship between videos and text.
Multi-dataset applicability
Performs well on multiple datasets including HMDB-51, UCF101, and Kinetics-600.
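The zero-shot and contrastive features above rest on the same idea: a video embedding is compared with the embeddings of candidate text labels, and the most similar label wins. The following is a minimal sketch of that scoring step in plain Python. The toy embeddings, the function names, and the logit_scale value of 100 are illustrative assumptions, not values taken from the model.

```python
import math


def l2_normalize(vec):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(video_emb, text_embs, logit_scale=100.0):
    # Cosine similarity between the video and each label, scaled by an
    # assumed temperature, then turned into probabilities.
    v = l2_normalize(video_emb)
    sims = [
        sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_embs
    ]
    return softmax([logit_scale * s for s in sims])


# Toy 3-dimensional embeddings standing in for real encoder outputs.
video = [0.9, 0.1, 0.0]
texts = {
    "playing guitar": [1.0, 0.0, 0.0],
    "swimming": [0.0, 1.0, 0.0],
    "cooking": [0.0, 0.0, 1.0],
}
probs = zero_shot_probs(video, list(texts.values()))
best = max(zip(texts, probs), key=lambda pair: pair[1])[0]
print(best, probs)
```

Because the labels are supplied only at inference time, the label set can be swapped without retraining.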

Model Capabilities

Video classification
Video-text retrieval
Zero-shot learning

Use Cases

Video understanding
Action recognition
Identify action categories in videos.
Achieves 72.0% zero-shot top-1 accuracy on UCF101.
Video content retrieval
Retrieve relevant video content based on text descriptions.
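Text-based retrieval can be sketched in a similar way: embed the query text, embed each video, and rank the videos by cosine similarity to the query. The example below uses toy 2-dimensional vectors in place of real encoder outputs. The names rank_videos, library, and the clip identifiers are hypothetical.

```python
import math


def cosine(a, b):
    # Cosine similarity between two non-zero vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_emb, video_embs, top_k=2):
    # Sort video ids by similarity to the query, most similar first.
    scored = sorted(
        video_embs.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in scored[:top_k]]


# Toy embeddings; a real system would store video-encoder outputs here.
library = {
    "clip_a": [0.9, 0.1],
    "clip_b": [0.1, 0.9],
    "clip_c": [0.7, 0.7],
}
query = [1.0, 0.0]  # stands in for the embedded text description
results = rank_videos(query, library)
print(results)
```

In practice the video embeddings would be computed once and cached, so that each new query costs only one text encoding plus the similarity ranking.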