X-CLIP Base Patch16 UCF 4-Shot
X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning with (video, text) pairs.
Release Time: 9/7/2022
Model Overview
The X-CLIP model (base-sized, 16×16 patch resolution) was trained on UCF101 in a few-shot setting (K=4). It can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval.
Model Features
Few-shot Learning
The model is trained on the UCF101 dataset in a few-shot setting (K=4 labeled examples per class), making it suitable for scenarios with limited labeled data.
Video-Text Contrastive Learning
Trained via contrastive learning with (video, text) pairs, supporting video-text matching tasks.
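The matching step behind this contrastive setup can be illustrated with a minimal NumPy sketch: both the video and the candidate texts are assumed to already be encoded into embeddings, which are L2-normalized and scored with a temperature-scaled softmax (CLIP-style; the temperature value and all embedding values below are toy assumptions, not model outputs).

```python
import numpy as np

def contrastive_similarity(video_emb, text_embs, temperature=0.07):
    """Cosine similarity between one video embedding and a batch of
    text embeddings, turned into a probability distribution via a
    temperature-scaled softmax (CLIP-style scoring)."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = t @ v / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings: the video is closest to the first text description.
video = np.array([1.0, 0.0, 0.0])
texts = np.array([[0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
probs = contrastive_similarity(video, texts)
print(probs.argmax())  # → 0
```

For classification, the text embeddings would come from prompts naming each action category, and the argmax picks the predicted class.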
General Video Recognition
The model can be used for zero-shot, few-shot, or fully supervised video classification and video-text retrieval tasks.
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
Use Cases
Video Understanding
Video Classification
Classify video content, applicable to the 101 action categories in the UCF101 dataset.
Top-1 accuracy reaches 83.4%
Video-Text Retrieval
Retrieve relevant videos based on text descriptions or generate matching text descriptions based on video content.
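The text-to-video direction of this retrieval task can be sketched as a nearest-neighbor search over embeddings: given a text query embedding and a gallery of video embeddings (all values below are toy assumptions), rank the gallery by cosine similarity and return the best matches.

```python
import numpy as np

def retrieve_videos(text_emb, video_embs, top_k=2):
    """Rank video embeddings by cosine similarity to a text query
    embedding; return the indices of the top_k best matches."""
    q = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity per video
    return np.argsort(-sims)[:top_k]  # highest similarity first

# Toy gallery of three video embeddings; the query is nearest to index 2.
query = np.array([0.0, 1.0, 1.0])
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.7, 0.7]])
print(retrieve_videos(query, gallery))  # → [2 1]
```

The video-to-text direction is symmetric: swap the roles of the query and the gallery.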