
Xclip Large Patch14 16 Frames

Developed by Microsoft
X-CLIP is an extension of CLIP for general video-language understanding. It is trained with contrastive learning and supports video classification and video-text retrieval.
Downloads 678
Release Time: 9/7/2022

Model Overview

The X-CLIP model (large, patch size of 14 pixels) was trained in a fully supervised fashion on Kinetics-400. It can be used for zero-shot, few-shot, or fully supervised video classification, and for video-text retrieval.

Model Features

Video-Language Contrastive Learning
Trained via contrastive learning with (video, text) pairs, supporting video-text matching tasks.
High-Resolution Processing
Trained with 16 frames per video clip at 336x336 resolution, which helps preserve fine visual detail.
General Video Understanding
Applicable to various video understanding tasks, including classification and retrieval.
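
Because the model consumes a fixed 16 frames per clip, a longer video has to be reduced to 16 frames before inference. The sketch below shows one simple way to do that with evenly spaced indices. The function name `sample_frame_indices` and the uniform-spacing strategy are assumptions for illustration; the sampling used during training may differ.

```python
def sample_frame_indices(total_frames: int, clip_len: int = 16) -> list[int]:
    """Pick clip_len frame indices spread evenly across a video.

    Hypothetical helper: uniform spacing is one common choice,
    not necessarily the strategy used to train X-CLIP.
    """
    if total_frames <= 0 or clip_len <= 0:
        raise ValueError("total_frames and clip_len must be positive")
    if clip_len == 1:
        return [0]
    last = total_frames - 1
    # Evenly spaced positions from the first to the last frame.
    # Videos shorter than clip_len end up repeating some frames.
    return [round(i * last / (clip_len - 1)) for i in range(clip_len)]
```

The returned indices can then be used to pull frames from whatever video decoder is in use, and the selected frames passed to the model's preprocessor for resizing.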

Model Capabilities

Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
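
As a rough illustration of zero-shot classification, the sketch below scores one clip against a list of candidate text labels using the X-CLIP classes in Hugging Face Transformers (`XCLIPProcessor`, `XCLIPModel`). The checkpoint ID `microsoft/xclip-large-patch14-16-frames` is an assumption inferred from the model name on this page, and `classify_video` and `top_k` are hypothetical helper names, not part of any library.

```python
def classify_video(frames, labels, model_id="microsoft/xclip-large-patch14-16-frames"):
    """Return one probability per label for a single 16-frame clip.

    frames: list of 16 RGB frames (H x W x 3 arrays).
    labels: list of candidate text labels.
    model_id is an assumed checkpoint name; adjust it if yours differs.
    """
    # Imported lazily so top_k below works without torch/transformers installed.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)
    inputs = processor(text=labels, videos=list(frames),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_video has shape (num_videos, num_labels);
    # softmax over the label axis turns the scores into probabilities.
    return outputs.logits_per_video.softmax(dim=1)[0].tolist()


def top_k(probs, labels, k=5):
    """Pair labels with probabilities and keep the k most likely."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

A call such as `top_k(classify_video(frames, labels), labels, k=5)` would then list the five most likely labels. Because the labels are free text, new classes can be tried without retraining.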

Use Cases

Video Content Analysis
Video Classification
Classify video content, such as recognizing actions, scenes, etc.
Reported accuracy on Kinetics-400: 87.7% top-1, 97.4% top-5.
Video-Text Retrieval
Retrieve relevant video clips based on text descriptions.
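
Retrieval with a contrastive model usually comes down to ranking video embeddings by their similarity to a text embedding. The sketch below shows that ranking step on plain Python lists. It assumes the embeddings were computed beforehand (in Transformers, `XCLIPModel` exposes `get_text_features` and `get_video_features` for this); `cosine` and `rank_videos` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_videos(text_embedding, video_embeddings):
    """Rank videos by similarity to a text query, best match first.

    video_embeddings: dict mapping a video ID to its embedding vector.
    Returns a list of (video_id, score) tuples.
    """
    scores = {vid: cosine(text_embedding, emb)
              for vid, emb in video_embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For a large video library, the embeddings would typically be computed once and stored, so that each new text query only requires one text encoding plus the ranking step.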