
X-CLIP Base Patch16 UCF 2-Shot

Developed by Microsoft
X-CLIP is a minimalist extension of CLIP for general video-language understanding. The model is trained on (video, text) pairs through contrastive learning.
Downloads: 51
Release date: 9/7/2022

Model Overview

The X-CLIP model (base size, 16x16 patch resolution) was trained in a few-shot manner (K=2) on the UCF101 dataset, making it suitable for video classification and video-text retrieval tasks.
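Below is a minimal sketch (not taken from the official model card) of how such a checkpoint can be loaded and run for video classification with the Hugging Face transformers library. The 32-frame clip length, the 224x224 frame size, and the label prompts are assumptions; check the model's configuration and preprocessor for the exact values it expects.

```python
# Hedged sketch: classify a video clip against a few text prompts with X-CLIP.
import numpy as np
import torch
from transformers import AutoProcessor, AutoModel

model_id = "microsoft/xclip-base-patch16-ucf-2-shot"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Dummy clip: 32 RGB frames of 224x224 (replace with frames decoded from a real video).
video = list(np.random.randint(0, 255, (32, 224, 224, 3), dtype=np.uint8))

# Candidate class names written as short text prompts (placeholders).
labels = ["playing basketball", "playing violin", "surfing"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video holds the similarity of the clip to each text prompt.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```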

Model Features

Few-shot Learning Capability
This model was trained with only 2 labeled examples per class (K=2) on the UCF101 dataset, demonstrating strong few-shot learning capability.
Video-Language Understanding
Trained on (video, text) pairs through contrastive learning, supporting joint understanding of video and text.
General Video Recognition
Applicable to various video recognition tasks, including zero-shot, few-shot, and fully supervised video classification.

Model Capabilities

Video Classification
Video-Text Retrieval
Few-shot Learning

Use Cases

Video Analysis
Video Classification
Classify a video clip to identify its action category.
Achieves 76.4% top-1 accuracy on the UCF101 dataset.
Video-Text Retrieval
Retrieve relevant videos from a text description, or rank candidate text descriptions against a given video, using the model's video-text similarity scores.
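A hedged sketch of text-to-video retrieval follows: several clips are scored against one query and ranked by similarity. The query string, the clip contents, and the 32-frame clip length are placeholders, not values from the model card.

```python
# Hedged sketch: rank video clips by their similarity to a text query with X-CLIP.
import numpy as np
import torch
from transformers import AutoProcessor, AutoModel

model_id = "microsoft/xclip-base-patch16-ucf-2-shot"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

query = "a person playing the drums"
# Three dummy clips of 32 frames each (replace with real decoded videos).
clips = [list(np.random.randint(0, 255, (32, 224, 224, 3), dtype=np.uint8))
         for _ in range(3)]

scores = []
for clip in clips:
    inputs = processor(text=[query], videos=clip, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text[0, 0] is the similarity between the query and this clip.
    scores.append(out.logits_per_text[0, 0].item())

ranking = sorted(range(len(clips)), key=lambda i: scores[i], reverse=True)
print("Clips ranked by similarity to the query:", ranking)
```

The same similarity matrix supports the reverse direction: encode one video against several candidate descriptions and rank the texts instead.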