Xclip Base Patch32 16 Frames

Developed by Microsoft
X-CLIP is an extended version of CLIP for general video-language understanding, trained on video-text pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.
Downloads 901
Release Time: 9/7/2022

Model Overview

The X-CLIP model (base size, 32-pixel patch resolution, 16 frames per video) was trained in a fully supervised fashion on the Kinetics-400 dataset. It can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.
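As a minimal sketch of how this checkpoint might be used for classification, the example below assumes the Hugging Face transformers library and the Hub id microsoft/xclip-base-patch32-16-frames. The helper names sample_frame_indices and classify_video are hypothetical, and the even-spacing frame sampler is one simple assumed strategy rather than the sampling used in training.

```python
from typing import List, Sequence

MODEL_ID = "microsoft/xclip-base-patch32-16-frames"  # assumed Hub id for this checkpoint
NUM_FRAMES = 16  # this variant takes 16 frames per video


def sample_frame_indices(total_frames: int, num_frames: int = NUM_FRAMES) -> List[int]:
    """Pick num_frames indices spread evenly across a video (assumed strategy)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Use the midpoint of each of num_frames equal segments.
    # Videos shorter than num_frames simply repeat some frames.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_frames))
        for i in range(num_frames)
    ]


def classify_video(frames: Sequence, labels: List[str]) -> List[float]:
    """Score one video (16 RGB frames as HxWx3 arrays) against candidate text labels."""
    # Imported lazily so the sampler above works without these heavy dependencies.
    import torch
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    inputs = processor(text=labels, videos=list(frames), return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_video holds video-text similarity scores, one row per video.
    return outputs.logits_per_video.softmax(dim=1)[0].tolist()
```

Decoding the video into frames is left out; any decoder that yields RGB arrays at the sampled indices should work, and the returned list gives one probability per candidate label.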

Model Features

Video-Language Understanding
Trained on video-text pairs via contrastive learning, supporting matching and understanding between videos and text.
High Accuracy
Achieves 81.1% top-1 accuracy and 95.5% top-5 accuracy on the Kinetics-400 dataset.
Multi-task Support
Suitable for zero-shot, few-shot, or fully supervised video classification and video-text retrieval tasks.
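To make the top-1 and top-5 figures above concrete, here is a small sketch of how top-k accuracy is usually computed from per-class scores. The function name top_k_accuracy and the toy scores are illustrative assumptions, not the evaluation code used for this model.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], targets: Sequence[int], k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    if len(scores) != len(targets) or not scores:
        raise ValueError("scores and targets must be non-empty and the same length")
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy data: 4 videos, 3 classes (made-up numbers for illustration only).
toy_scores = [
    [0.1, 0.7, 0.2],  # true class 1, ranked 1st
    [0.5, 0.3, 0.2],  # true class 1, ranked 2nd
    [0.2, 0.3, 0.5],  # true class 0, ranked 3rd
    [0.6, 0.3, 0.1],  # true class 0, ranked 1st
]
toy_targets = [1, 1, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_targets, k=1)  # 0.5
top2 = top_k_accuracy(toy_scores, toy_targets, k=2)  # 0.75
```

Top-5 accuracy is the same computation with k=5, which is why it is always at least as high as top-1.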

Model Capabilities

Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning

Use Cases

Video Analysis
Video Classification
Classify video content to identify actions or scenes in videos.
Achieves 81.1% top-1 accuracy on the Kinetics-400 dataset.
Video-Text Retrieval
Retrieve relevant videos from a text description, or find the best-matching text description for a given video.
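Retrieval with a contrastively trained model typically comes down to ranking candidates by embedding similarity. The sketch below illustrates that ranking step with hypothetical 3-dimensional embeddings. In practice the vectors would come from the model itself (with Hugging Face transformers, likely via get_text_features and get_video_features), and the names cosine_similarity and rank_videos are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def rank_videos(
    text_embedding: Sequence[float],
    video_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (video name, similarity) pairs, most similar to the text first."""
    scored = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in video_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings for illustration only; real ones are much higher-dimensional.
toy_videos = {
    "surfing.mp4": [0.9, 0.1, 0.0],
    "cooking.mp4": [0.0, 1.0, 0.2],
    "skiing.mp4": [0.7, 0.0, 0.7],
}
query = [1.0, 0.0, 0.1]  # stands in for the embedding of a text query

ranking = rank_videos(query, toy_videos)
```

The reverse direction (finding text for a video) uses the same ranking with the roles swapped: one video embedding scored against a set of candidate text embeddings.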