
Xclip Base Patch16 Kinetics 600

Developed by Microsoft
X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Downloads 294
Release Time: 9/8/2022

Model Overview

This is the base-size X-CLIP model. It uses a 16x16 patch resolution and was trained with full supervision on the Kinetics-600 dataset. It is suitable for video classification and video-text retrieval tasks.
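As a rough illustration of video classification with this model, the sketch below scores candidate text labels against one clip using the Hugging Face `transformers` classes `XCLIPProcessor` and `XCLIPModel`. The checkpoint id `microsoft/xclip-base-patch16-kinetics-600`, the default of 8 frames per clip, and the helper names `sample_frame_indices` and `classify_clip` are assumptions made for this example and are not stated on this page. Check the checkpoint's configuration for the frame count it expects.

```python
from typing import Dict, List, Sequence


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick `num_frames` evenly spaced frame indices from a video of `total_frames`."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Take the centre of each of `num_frames` equal-width segments.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_frames))
        for i in range(num_frames)
    ]


def classify_clip(
    frames: Sequence,
    labels: List[str],
    checkpoint: str = "microsoft/xclip-base-patch16-kinetics-600",  # assumed id
) -> Dict[str, float]:
    """Score each candidate label against one clip; returns {label: probability}.

    `frames` is a sequence of RGB frames (H x W x 3 uint8 arrays), already sampled.
    """
    # Imported lazily so the sampling helper works without these packages.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(checkpoint)
    model = XCLIPModel.from_pretrained(checkpoint)
    inputs = processor(
        text=labels, videos=list(frames), return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_video has shape (num_videos, num_labels); softmax over labels.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}
```

A typical call would decode a video, keep the frames at `sample_frame_indices(len(video))`, and pass them to `classify_clip` together with a list of label strings such as `["playing guitar", "riding a bike"]`.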

Model Features

Video-Language Understanding
Trained on (video, text) pairs via contrastive learning, so it can judge whether a video and a text match.
High Accuracy
Achieves 85.3% top-1 accuracy and 97.1% top-5 accuracy on the Kinetics-600 dataset.
Zero-Shot and Few-Shot Learning
Supports zero-shot, few-shot, or fully supervised video classification tasks.
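For readers unfamiliar with the top-1 and top-5 figures quoted above, the following minimal sketch shows how top-k accuracy is usually computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative only; they are not model output and do not reproduce the reported numbers.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    if len(scores) != len(targets) or not targets:
        raise ValueError("scores and targets must be non-empty and the same length")
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Toy scores for 4 clips over 3 classes (made-up numbers for illustration).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
]
targets = [0, 2, 1, 0]
top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
```

Top-1 counts a clip as correct only when the highest-scoring class is the true one, while top-5 accepts the true class anywhere among the five best guesses, which is why the top-5 figure is always at least as high.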

Model Capabilities

Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning

Use Cases

Video Analysis
Video Content Classification
Classify video content to identify actions or scenes in videos.
Reports 85.3% top-1 accuracy on the Kinetics-600 dataset.
Video-Text Matching
Determine whether a given text matches the video content.
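Video-text matching in a contrastively trained model typically comes down to comparing a video embedding with text embeddings. The sketch below illustrates that idea with cosine similarity followed by a softmax. The two-dimensional embeddings, the function names, and the `logit_scale` value of 100.0 are assumptions chosen for the example, not values taken from this model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def match_probabilities(
    video_emb: Sequence[float],
    text_embs: Sequence[Sequence[float]],
    logit_scale: float = 100.0,  # illustrative temperature, an assumption
) -> List[float]:
    """Softmax over scaled cosine similarities between one video and several texts."""
    logits = [logit_scale * cosine_similarity(video_emb, t) for t in text_embs]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings: the first text points in nearly the same direction as the video.
video = [1.0, 0.0]
texts = [[0.9, 0.1], [0.0, 1.0]]
probs = match_probabilities(video, texts)
best_match = max(range(len(probs)), key=lambda i: probs[i])
```

For a yes/no matching decision on a single text, one option is to threshold the raw cosine similarity instead of using the softmax, since the softmax only ranks the candidate texts relative to each other.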