X-CLIP Base Patch16 (16 Frames)

Developed by Microsoft
X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Downloads: 1,034
Release date: 9/7/2022

Model Overview

This model can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
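As a minimal usage sketch (not part of the original card), zero-shot video classification with the Hugging Face transformers X-CLIP classes might look like the following; the checkpoint id microsoft/xclip-base-patch16-16-frames, the placeholder frames, and the candidate labels are assumptions.

```python
# Minimal sketch, assuming the checkpoint id "microsoft/xclip-base-patch16-16-frames"
# and a clip already decoded into 16 RGB frames (H x W x 3 uint8 numpy arrays).
import numpy as np
from transformers import XCLIPProcessor, XCLIPModel

model_id = "microsoft/xclip-base-patch16-16-frames"  # assumed checkpoint name
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

# Placeholder video: 16 random frames; replace with real decoded frames.
video = list(np.random.randint(0, 255, (16, 360, 640, 3), dtype=np.uint8))
labels = ["playing basketball", "cooking", "playing guitar"]  # example label set

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Video-to-text similarity logits; softmax over labels gives zero-shot probabilities.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```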

Model Features

Video-Language Understanding
Trained via contrastive learning on (video, text) pairs, supporting video-text matching.
Multi-Task Support
Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
Efficient Training
Uses 16 frames per video at 224x224 resolution during training, keeping the computational cost of video encoding modest.
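Because the model consumes a fixed 16-frame clip, a video must be reduced to 16 frames before it reaches the processor (which then handles the 224x224 resize and normalization). A minimal sampling sketch follows; the use of decord and of uniform sampling are assumptions, not requirements of the model.

```python
# Minimal sketch, assuming decord for decoding; any decoder works as long as
# 16 frames are sampled evenly across the clip.
import numpy as np
from decord import VideoReader

def sample_frames(path, num_frames=16):
    vr = VideoReader(path)
    # Spread num_frames indices uniformly over the whole video.
    indices = np.linspace(0, len(vr) - 1, num=num_frames).astype(int)
    return list(vr.get_batch(indices).asnumpy())  # list of H x W x 3 uint8 frames

frames = sample_frames("example.mp4")  # hypothetical local file
```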

Model Capabilities

Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning

Use Cases

Video Analysis
Video Classification
Classify video content for tasks such as action recognition and scene recognition.
Achieves 84.7% top-1 accuracy and 96.8% top-5 accuracy on the Kinetics-400 dataset.
Video-Text Retrieval
Retrieve relevant videos based on text descriptions or generate matching text descriptions based on video content.
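For retrieval, a hedged sketch (not spelled out in the original card) is to embed candidate videos and text queries separately and rank candidates by cosine similarity; the get_video_features / get_text_features calls and the checkpoint id are assumptions based on the transformers X-CLIP API.

```python
# Minimal retrieval sketch, assuming the same checkpoint and that each candidate
# clip has already been sampled down to 16 frames.
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_id = "microsoft/xclip-base-patch16-16-frames"  # assumed checkpoint name
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

def embed_videos(clips):
    # clips: list of videos, each a list of 16 frames (H x W x 3 numpy arrays).
    inputs = processor(videos=clips, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_video_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(queries):
    inputs = processor(text=queries, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Rank candidate clips for a query by cosine similarity (dot product of unit vectors):
# scores = embed_text(["a dog catching a frisbee"]) @ embed_videos(candidate_clips).T
```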