
Xclip Base Patch16

Developed by Microsoft
X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs, suitable for tasks like video classification and video-text retrieval.
Downloads 1,647
Release Time: 9/7/2022

Model Overview

This X-CLIP model (base size, 16x16 patch resolution) was trained in a fully supervised fashion on Kinetics-400. It is suitable for zero-shot, few-shot, or fully supervised video classification tasks.
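
To illustrate how a contrastively trained model can classify a video without task-specific training, here is a minimal sketch of zero-shot scoring. It assumes the video and each candidate label prompt have already been encoded into embeddings of the same dimension. The toy vectors, the function names, and the temperature value are hypothetical and do not come from the X-CLIP release.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(video_emb, label_embs, temperature=0.01):
    """Return {label: probability} from a softmax over scaled similarities.

    The temperature is an assumed value for illustration; real models
    typically learn their own logit scale.
    """
    labels = list(label_embs)
    logits = [cosine(video_emb, label_embs[name]) / temperature for name in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(labels, exps)}

# Toy 3-dimensional embeddings, made up for illustration only.
video_emb = [0.9, 0.1, 0.0]
label_embs = {
    "playing guitar": [1.0, 0.0, 0.0],
    "riding a bike": [0.0, 1.0, 0.0],
    "cooking": [0.0, 0.0, 1.0],
}

probs = zero_shot_classify(video_emb, label_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The label set is supplied only at inference time, which is why the same scoring step can serve classes the model was never explicitly trained to predict.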

Model Features

Video-language understanding
Trained via contrastive learning on (video, text) pairs, supporting video-text matching and understanding.
Multi-task support
Suitable for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
High performance
Achieves top-1 accuracy of 83.8% and top-5 accuracy of 95.7% on the Kinetics-400 dataset.
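
For readers unfamiliar with the metrics, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below shows the calculation on made-up scores; the function name and data are hypothetical, and this is not the evaluation code behind the reported Kinetics-400 numbers.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

# Toy per-class scores for four clips and three classes (illustrative only).
scores = [
    [0.7, 0.2, 0.1],  # true class 0: correct at top-1
    [0.3, 0.5, 0.2],  # true class 0: correct only at top-2
    [0.1, 0.3, 0.6],  # true class 2: correct at top-1
    [0.5, 0.4, 0.1],  # true class 2: missed even at top-2
]
targets = [0, 0, 2, 2]

top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
print(top1, top2)
```

Top-5 accuracy is always at least as high as top-1, since every top-1 hit is also a top-5 hit.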

Model Capabilities

Video classification
Video-text retrieval
Zero-shot learning
Few-shot learning

Use Cases

Video analysis
Video content classification
Classify video content to recognize actions or scenes in videos.
Achieves 83.8% top-1 accuracy on the Kinetics-400 dataset.
Video-text retrieval
Retrieve relevant videos based on text descriptions, or retrieve the best-matching text descriptions for a given video.
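
Text-to-video retrieval with a contrastive model usually reduces to ranking stored video embeddings by their similarity to a query text embedding. The following is a minimal sketch under that assumption; the file names, vectors, and helper functions are hypothetical placeholders rather than outputs of the actual model.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_videos(query_emb, video_embs):
    """Return video ids sorted by descending cosine similarity to the query."""
    q = normalize(query_emb)
    scored = []
    for video_id, emb in video_embs.items():
        # Dot product of unit vectors equals cosine similarity.
        sim = sum(a * b for a, b in zip(q, normalize(emb)))
        scored.append((sim, video_id))
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored]

# Assumed embedding of a query such as "a person riding a bike".
query_emb = [0.2, 0.9, 0.1]

# Assumed precomputed embeddings for a tiny video library.
video_embs = {
    "clip_guitar.mp4": [1.0, 0.1, 0.0],
    "clip_bike.mp4": [0.1, 1.0, 0.1],
    "clip_cooking.mp4": [0.0, 0.2, 1.0],
}

ranking = rank_videos(query_emb, video_embs)
print(ranking)
```

Swapping the roles of the query and the library (one video embedding against many text embeddings) gives the video-to-text direction.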