Xclip Large Patch14

Developed by Microsoft
X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Downloads 1,698
Release Time: 9/7/2022

Model Overview

This X-CLIP model (large size, 14×14 patch resolution) was trained in a fully supervised fashion on the Kinetics-400 dataset. It can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.

Model Features

Video-Language Understanding
Trained via contrastive learning on (video, text) pairs, so it can score how well a video matches a text description.
High Accuracy
Achieves Top-1 accuracy of 87.1% and Top-5 accuracy of 97.6% on the Kinetics-400 dataset.
Multi-task Support
Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
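
The Top-1 and Top-5 figures above are instances of top-k accuracy: a prediction counts as correct when the true label is among the k highest-scoring classes. The following is a minimal sketch of that metric in plain Python. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not part of the X-CLIP release or its evaluation code.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class index is among the k best scores.

    scores  -- list of per-class score lists, one list per sample
    targets -- list of true class indices, one per sample
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy data (hypothetical): 4 samples, 3 classes.
scores = [
    [0.70, 0.20, 0.10],  # true class 0: ranked first
    [0.35, 0.40, 0.25],  # true class 0: ranked second
    [0.10, 0.30, 0.60],  # true class 0: ranked last
    [0.20, 0.50, 0.30],  # true class 1: ranked first
]
targets = [0, 0, 0, 1]

top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
print(top1, top2)
```

With 400 classes, as in Kinetics-400, the same function would be called with k=1 and k=5.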

Model Capabilities

Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
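
Zero-shot classification with a contrastively trained model is commonly done by comparing one video embedding against the text embedding of each candidate label and turning the similarities into probabilities. The sketch below illustrates that scoring step only. The embeddings are hand-written toy vectors standing in for real model outputs, and the function names and the `scale` value of 100 are assumptions chosen for illustration, not values taken from this checkpoint.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(video_emb, label_embs, scale=100.0):
    """Return a {label: probability} dict for one video embedding.

    scale is an assumed temperature multiplier applied to the similarities.
    """
    labels = list(label_embs)
    logits = [scale * cosine(video_emb, label_embs[name]) for name in labels]
    return dict(zip(labels, softmax(logits)))


# Toy embeddings (hypothetical), standing in for encoder outputs.
video_emb = [0.9, 0.1, 0.2]
label_embs = {
    "playing guitar": [1.0, 0.0, 0.1],
    "riding a bike": [0.0, 1.0, 0.0],
    "cooking": [0.1, 0.2, 1.0],
}

probs = zero_shot_classify(video_emb, label_embs)
best = max(probs, key=probs.get)
print(best)
```

Because labels enter only as text embeddings, new classes can be added without retraining, which is what makes the zero-shot setting possible.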

Use Cases

Video Analysis
Video Classification
Classify video content, such as recognizing actions, scenes, etc.
Reaches 87.1% Top-1 and 97.6% Top-5 accuracy on Kinetics-400.
Video-Text Retrieval
Retrieve relevant video clips based on text descriptions.
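
Text-to-video retrieval over a shared embedding space usually amounts to ranking stored clip embeddings by their similarity to the query embedding. The following is a rough sketch of that ranking step. The clip IDs, the toy vectors, and the `retrieve` helper are hypothetical. A real system would obtain the embeddings from the model's video and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, clip_embs, top_k=2):
    """Return the top_k (clip_id, similarity) pairs, best match first."""
    scored = [(clip_id, cosine(query_emb, emb)) for clip_id, emb in clip_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index (hypothetical): precomputed clip embeddings keyed by clip ID.
clip_embs = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.9, 0.1],
    "clip_c": [0.7, 0.3, 0.1],
}
query_emb = [1.0, 0.0, 0.0]  # stand-in for an encoded text description

results = retrieve(query_emb, clip_embs, top_k=2)
print([clip_id for clip_id, _ in results])
```

Clip embeddings can be computed once and stored, so only the text query needs to be encoded at search time.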