
Xclip Base Patch16 Kinetics 600 16 Frames

Developed by Microsoft
X-CLIP is an extension of CLIP for general video-language understanding, supporting zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.
Downloads 393
Release Time: 9/8/2022

Model Overview

This X-CLIP model (base size, 16x16 patch resolution, 16 frames per video) was trained in a fully supervised fashion on the Kinetics-600 dataset. It is intended primarily for video classification and video-text retrieval.
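The "16 Frames" in the model name means each clip is represented by 16 frames. Below is a minimal sketch of one common way to pick them, evenly spaced across the clip. The helper name `sample_frame_indices` and the even-spacing strategy are assumptions for illustration; this page does not say how frames were sampled during training.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 16) -> list[int]:
    """Return num_frames indices spread evenly over [0, total_frames - 1].

    Hypothetical helper: clips shorter than num_frames simply repeat frames.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [(i * (total_frames - 1)) // (num_frames - 1) for i in range(num_frames)]


indices = sample_frame_indices(300)  # e.g. a 300-frame clip
print(indices)
```

The resulting indices can then be used to select frames from a decoded video before passing them to the model's preprocessor.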

Model Features

Video-Language Understanding
Trained contrastively on (video, text) pairs to support video-text matching tasks.
High Accuracy
Achieves 85.8% top-1 accuracy and 97.3% top-5 accuracy on the Kinetics-600 dataset.
Multi-Task Support
Can be used for zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.
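Contrastive training places videos and texts in a shared embedding space, so classification reduces to comparing one video embedding against the embeddings of candidate label texts. The sketch below illustrates only that scoring step, using toy vectors: cosine similarity followed by a softmax. The embeddings, the `temperature` value, and the function names are made up for illustration and are not taken from X-CLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_probabilities(video_emb, text_embs, temperature=0.07):
    """Softmax over scaled cosine similarities between one video and each text."""
    logits = [cosine(video_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (assumed): three candidate labels and hand-written 3-d embeddings.
labels = ["playing guitar", "riding a bike", "cooking"]
video = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = match_probabilities(video, texts)
best = labels[probs.index(max(probs))]
print(best, probs)
```

Because the label set is just a list of strings, swapping in unseen class names requires no retraining, which is what makes the zero-shot setting possible.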

Model Capabilities

Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning

Use Cases

Video Analysis
Video Classification
Classify video content to recognize actions or scenes in videos.
85.8% top-1 accuracy, 97.3% top-5 accuracy
Video-Text Retrieval
Retrieve relevant videos from text descriptions, or rank candidate text descriptions against a given video.
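For the classification use case, a hedged sketch using the Hugging Face `transformers` X-CLIP classes might look like the following. It assumes `torch` and `transformers` are installed, that the checkpoint is available on the Hub under the ID in `MODEL_ID` (inferred from the model name, not stated on this page), and that `frames` is a list of 16 RGB frames as NumPy arrays. The function name `classify_video` is illustrative.

```python
# Assumed Hub ID, inferred from the model name on this page.
MODEL_ID = "microsoft/xclip-base-patch16-kinetics-600-16-frames"


def classify_video(frames, labels, model_id=MODEL_ID):
    """Score candidate text labels against one video clip.

    frames: list of 16 RGB frames (H x W x 3 arrays); labels: list of strings.
    Returns a dict mapping each label to its probability.
    """
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)

    inputs = processor(
        text=labels, videos=list(frames), return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has shape (num_videos, num_texts); softmax over the texts.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}
```

For retrieval, the same forward pass also exposes `outputs.logits_per_text`, which scores every video against each text query and can be sorted to rank videos.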