X-CLIP Base Patch16 HMDB 2-Shot
X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on video-text pairs, supporting zero-shot, few-shot, and fully supervised video classification tasks.
Release Date: 9/7/2022
Model Overview
The X-CLIP model (base size, 16x16 patch resolution) is trained in a few-shot manner (K=2) on HMDB-51, suitable for tasks like video classification and video-text retrieval.
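A minimal usage sketch with the Hugging Face transformers library, assuming the checkpoint is available on the Hub as microsoft/xclip-base-patch16-hmdb-2-shot; the label strings are illustrative, the expected frame count is read from the model config, and the random frames stand in for a real decoded clip:

```python
# Hedged sketch of action classification with this checkpoint via transformers.
# The Hub id and label strings are assumptions for illustration.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_id = "microsoft/xclip-base-patch16-hmdb-2-shot"  # assumed Hub id
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

# Placeholder clip: a list of frames (H, W, C) sampled from one video.
num_frames = model.config.vision_config.num_frames
video = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(num_frames)]

# Candidate HMDB-51-style action labels to score against the clip.
labels = ["brushing hair", "riding a bike", "playing guitar"]

inputs = processor(text=labels, videos=[video], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Video-to-text similarity scores; softmax gives per-label probabilities.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```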
Model Features
Few-shot Learning Capability
This model was trained with only 2 labeled examples per class (K=2) on the HMDB-51 dataset, demonstrating strong few-shot learning capability.
Video-Text Contrastive Learning
Trained via contrastive learning, it can relate video content to text descriptions; see the embedding sketch after this feature list.
Multi-task Support
Supports zero-shot, few-shot, and fully supervised video classification tasks, as well as applications like video-text retrieval.
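The contrastive objective places videos and captions in a shared embedding space. The sketch below reuses the processor, model, and video objects from the previous snippet, embeds the clip and a few captions separately, and ranks the captions by cosine similarity; the caption strings are illustrative, not part of the library API:

```python
# Hedged sketch of the shared video-text embedding space learned contrastively.
import torch

captions = ["a person doing a cartwheel", "someone pouring a drink"]

text_inputs = processor(text=captions, return_tensors="pt", padding=True)
video_inputs = processor(videos=[video], return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)    # (num_captions, dim)
    video_embeds = model.get_video_features(**video_inputs) # (1, dim)

# Cosine similarity in the shared embedding space.
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
video_embeds = video_embeds / video_embeds.norm(dim=-1, keepdim=True)
similarity = video_embeds @ text_embeds.T
print(similarity)
```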
Model Capabilities
Video Classification
Video-Text Retrieval
Few-shot Learning
Zero-shot Inference
Use Cases
Video Understanding
Action Recognition
Recognize human actions in videos
Achieved 53.0% top-1 accuracy on the HMDB-51 dataset
Video Content Retrieval
Retrieve relevant video clips based on text descriptions
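For retrieval, a single text query can be scored against several candidate clips via the model's logits_per_text output. A hedged sketch, again reusing the processor, model, and num_frames defined above, with random placeholder clips standing in for real decoded videos:

```python
# Illustrative text-to-video retrieval sketch (not a prescribed pipeline):
# score one query against several clips and return them ranked.
import numpy as np
import torch

query = "a person climbing stairs"
clips = [
    [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(num_frames)]
    for _ in range(3)
]

inputs = processor(text=[query], videos=clips, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_videos); higher means a better match.
scores = outputs.logits_per_text[0]
ranking = torch.argsort(scores, descending=True)
print("clips ranked by relevance to the query:", ranking.tolist())
```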